Amazon Redshift is a massively-scalable data warehouse platform that works well for the ad-hoc analysis demanded by business intelligence. Unfortunately, it lacks geospatial features that you expect from an enterprise SQL database engine.
Python UDF support provides an avenue for users to manually implement geospatial functions, though this approach requires knowledge of geospatial libraries and/or complex mathematical equations. It also results in queries that suffer from performance issues, impacting scalability.
Proximity search
One function that is quite useful in geospatial analytics involves matching, ordering or filtering data based on proximity from a given point of interest, or a set of neighbouring points in the dataset.
Putting aside the fact that the earth is not flat, or round, or even a perfect ellipsoid, and that geospatial coordinates are typically represented in latitude and longitude (ie WSG84), an approximation between (relatively) close points is typically good enough for most real world use-cases. We can then avoid complex tensor math and put aside our geospatial datum.
In most scenarios, the exact distance between points is not required and a relative figure is an adequate comparator. In many use-cases a small margin of error is also quite acceptable.
Pythagoras to the rescue!
If you remember back to trigonometry class, you'll know that the distance between two Cartesian points can be calculated using the distance formula, which is derived from Pythagorean theorem.
Basic distance between two points:
Furthermore, if we are only interested in a relative distance for ordering results, the difference between latitude and longitude will provide enough information for a comparison. This also means that the square-root function is no longer required:
If slightly better performance is more important than accuracy, we can further optimise the calculation by replacing squares with absolute values. However, when tested with small datasets, this optimisation resulted in very little performance improvement (measured using explain).
How does this formula translate to SQL code? Below are some tips to implement this approach in Amazon Redshift.
Redshift Implementation
The following table contains a handful of points of interest, which we wish to order by distance from the Brisbane General Post Office (GPO).
ID | Name | Latitude | Longitude |
---|---|---|---|
1 | Brisbane GPO | -27.4679 | 153.0279 |
2 | The Gabba | -27.4858 | 153.0381 |
3 | Mt Cootha | -27.4678 | 152.9390 |
4 | Surfers Paradise | -28.0018 | 153.4284 |
5 | South Bank | -27.4766 | 153.0222 |
6 | Toowoomba | -27.5598 | 151.9507 |
7 | Brisbane Airport | -27.3942 | 153.1218 |
8 | Suncorp Stadium | -27.4648 | 153.0095 |
Using PostGIS as a benchmark
Using PostGIS, we benchmark the distance from each point to the Brisbane
GPO via the ST_Distance_Spheroid
function. The results are
listed in the below table, ordered from the closest point to the furthest.
PostGIS Spheroid Distance (WSG 84) to Brisbane GPO | ||||
ID | Name | Latitude | Longitude | km |
---|---|---|---|---|
1 | Brisbane GPO | -27.4679 | 153.0279 | 0.0 |
5 | South Bank | -27.4766 | 153.0222 | 1.12 |
8 | Suncorp Stadium | -27.4648 | 153.0095 | 1.85 |
2 | The Gabba | -27.4858 | 153.0381 | 2.22 |
3 | Mt Cootha | -27.4678 | 152.9390 | 8.79 |
7 | Brisbane Airport | -27.3942 | 153.1218 | 12.36 |
4 | Surfers Paradise | -28.0018 | 153.4284 | 71.13 |
6 | Toowoomba | -27.5598 | 151.9507 | 106.91 |
create table geo.bne_points (
id serial not null
constraint bne_points_pkey primary key,
name varchar,
latitude double precision,
longitude double precision
);
create unique index bne_points_id_uindex on bne_points(id);
insert into geo.bne_points(id, name, latitude, longitude)
values (1,'Brisbane GPO', -27.4679, 153.0279),
(2,'The Gabba', -27.4858, 153.0381),
(3,'Mt Cootha', -27.4678, 152.9390),
(4,'Surfers Paradise', -28.0018, 153.4284),
(5,'South Bank', -27.4766, 153.0222),
(6,'Toowoomba', -27.5598, 151.9507),
(7,'Brisbane Airport', -27.3942, 153.1218),
(8,'Suncorp Stadium', -27.4648, 153.0095);
Loading into Redshift and ordering
Being based on PostgreSQL, Redshift uses similar syntax to load the sample data:
create table geo.bne_points(
id integer,
name varchar,
latitude double precision,
longitude double precision
);
insert into geo.bne_points(id, name, latitude, longitude)
values (1,'Brisbane GPO', -27.4679, 153.0279),
(2,'The Gabba', -27.4858, 153.0381),
(3,'Mt Cootha', -27.4678, 152.9390),
(4,'Surfers Paradise', -28.0018, 153.4284),
(5,'South Bank', -27.4766, 153.0222),
(6,'Toowoomba', -27.5598, 151.9507),
(7,'Brisbane Airport', -27.3942, 153.1218),
(8,'Suncorp Stadium', -27.4648, 153.0095);
The following query will order the points of interest using the relative distance formula:
select id, name, p1.latitude, p1.longitude,
sqrt( -- square root is optional
power(p2.latitude - p1.latitude, 2) +
power(p2.longitude - p1.longitude, 2)
) as d_relative
from (
select id, name, latitude, longitude
from geo.bne_points
) p1
join (
select -27.4679 as latitude, 153.0279 as longitude
) p2 on true -- append BNE GPO to all rows
order by d_relative
limit 100;
This method results in identical ordering to the PostGIS benchmark when using the sample dataset:
Redshift Relative Distance Ordering (to Brisbane GPO) | ||||
ID | Name | Latitude | Longitude | Distance (relative) |
---|---|---|---|---|
1 | Brisbane GPO | -27.4679 | 153.0279 | 0.0000 |
5 | South Bank | -27.4766 | 153.0222 | 0.0104 |
8 | Suncorp Stadium | -27.4648 | 153.0095 | 0.01866 |
2 | The Gabba | -27.4858 | 153.0381 | 0.0206 |
3 | Mt Cootha | -27.4678 | 152.9390 | 0.0889 |
7 | Brisbane Airport | -27.3942 | 153.1218 | 0.1194 |
4 | Surfers Paradise | -28.0018 | 153.4284 | 0.6674 |
6 | Toowoomba | -27.5598 | 151.9507 | 1.0811 |
Nearest Neighbour?
Proximity ordering can easily be adapted to find the nearest neighbouring point for all points of interest in the sample dataset.
Unfortunately, this is an exponential computation, but should perform fine for small datasets. Converting to a view or parameterised query may provide a suitable alternative for larger datasets, where computing the entire result-set isn't required and where lazy-evaluation is acceptible.
select distinct id, name,
first_value(p2_name) over (
partition by id
order by d_rel
rows between unbounded preceding and unbounded following
) as nearest_to
from (
select p1.id, p1.name, p1.latitude, p1.longitude,
p2.name as p2_name,
sqrt(
power(p2.latitude - p1.latitude, 2) +
power(p2.longitude - p1.longitude, 2)
) as d_rel
from (
select id, name, latitude, longitude
from geo.bne_points
) p1
join (
select id, name, latitude, longitude
from geo.bne_points
) p2 on p1.id != p2.id
)
order by id
limit 100;
The sample query uses the first_value
window function to
select the first neighbour for each id
, when sorted by the
relative distance to that point (d_rel
).
The results of this query are provided in the table below:
Redshift Nearest Neighbour (first_value) | ||
ID | Name | Nearest To |
---|---|---|
1 | Brisbane GPO | South Bank |
2 | The Gabba | South Bank |
3 | Mt Cootha | Suncorp Stadium |
4 | Surfers Paradise | The Gabba |
5 | South Bank | Brisbane GPO |
6 | Toowoomba | Mt Cootha |
7 | Brisbane Airport | Brisbane GPO |
8 | Suncorp Stadium | South Bank |
Based on these results, South Bank would be the best place to meet for a group of people travelling from all of the points in the sample dataset.
If you have any queries or feedback related to this article, or would like some assistance with your data project, please Get in Touch to discuss.