Amazon Redshift is a massively-scalable data warehouse platform that works well for the ad-hoc analysis demanded by business intelligence. Unfortunately, it lacks geospatial features that you expect from an enterprise SQL database engine.
Python UDF support provides an avenue for users to manually implement geospatial functions, though this approach requires knowledge of geospatial libraries and/or complex mathematical equations. It also results in queries that suffer from performance issues, impacting scalability.
One function that is quite useful in geospatial analytics involves matching, ordering or filtering data based on proximity from a given point of interest, or a set of neighbouring points in the dataset.
Putting aside the fact that the earth is not flat, or round, or even a perfect ellipsoid, and that geospatial coordinates are typically represented in latitude and longitude (ie WSG84), an approximation between (relatively) close points is typically good enough for most real world use-cases. We can then avoid complex tensor math and put aside our geospatial datum.
In most scenarios, the exact distance between points is not required and a relative figure is an adequate comparator. In many use-cases a small margin of error is also quite acceptable.
Pythagoras to the rescue!
If you remember back to trigonometry class, you'll know that the distance between two Cartesian points can be calculated using the distance formula, which is derived from Pythagorean theorem.
Basic distance between two points:
Furthermore, if we are only interested in a relative distance for ordering results, the difference between latitude and longitude will provide enough information for a comparison. This also means that the square-root function is no longer required:
If slightly better performance is more important than accuracy, we can further optimise the calculation by replacing squares with absolute values. However, when tested with small datasets, this optimisation resulted in very little performance improvement (measured using explain).
How does this formula translate to SQL code? Below are some tips to implement this approach in Amazon Redshift.
The following table contains a handful of points of interest, which we wish to order by distance from the Brisbane General Post Office (GPO).
Using PostGIS as a benchmark
Using PostGIS, we benchmark the distance from each point to the Brisbane
GPO via the
ST_Distance_Spheroid function. The results are
listed in the below table, ordered from the closest point to the furthest.
|PostGIS Spheroid Distance (WSG 84) to Brisbane GPO|
create table geo.bne_points ( id serial not null constraint bne_points_pkey primary key, name varchar, latitude double precision, longitude double precision ); create unique index bne_points_id_uindex on bne_points(id); insert into geo.bne_points(id, name, latitude, longitude) values (1,'Brisbane GPO', -27.4679, 153.0279), (2,'The Gabba', -27.4858, 153.0381), (3,'Mt Cootha', -27.4678, 152.9390), (4,'Surfers Paradise', -28.0018, 153.4284), (5,'South Bank', -27.4766, 153.0222), (6,'Toowoomba', -27.5598, 151.9507), (7,'Brisbane Airport', -27.3942, 153.1218), (8,'Suncorp Stadium', -27.4648, 153.0095);
Loading into Redshift and ordering
Being based on PostgreSQL, Redshift uses similar syntax to load the sample data:
create table geo.bne_points( id integer, name varchar, latitude double precision, longitude double precision ); insert into geo.bne_points(id, name, latitude, longitude) values (1,'Brisbane GPO', -27.4679, 153.0279), (2,'The Gabba', -27.4858, 153.0381), (3,'Mt Cootha', -27.4678, 152.9390), (4,'Surfers Paradise', -28.0018, 153.4284), (5,'South Bank', -27.4766, 153.0222), (6,'Toowoomba', -27.5598, 151.9507), (7,'Brisbane Airport', -27.3942, 153.1218), (8,'Suncorp Stadium', -27.4648, 153.0095);
The following query will order the points of interest using the relative distance formula:
select id, name, p1.latitude, p1.longitude, sqrt( -- square root is optional power(p2.latitude - p1.latitude, 2) + power(p2.longitude - p1.longitude, 2) ) as d_relative from ( select id, name, latitude, longitude from geo.bne_points ) p1 join ( select -27.4679 as latitude, 153.0279 as longitude ) p2 on true -- append BNE GPO to all rows order by d_relative limit 100;
This method results in identical ordering to the PostGIS benchmark when using the sample dataset:
|Redshift Relative Distance Ordering (to Brisbane GPO)|
Proximity ordering can easily be adapted to find the nearest neighbouring point for all points of interest in the sample dataset.
Unfortunately, this is an exponential computation, but should perform fine for small datasets. Converting to a view or parameterised query may provide a suitable alternative for larger datasets, where computing the entire result-set isn't required and where lazy-evaluation is acceptible.
select distinct id, name, first_value(p2_name) over ( partition by id order by d_rel rows between unbounded preceding and unbounded following ) as nearest_to from ( select p1.id, p1.name, p1.latitude, p1.longitude, p2.name as p2_name, sqrt( power(p2.latitude - p1.latitude, 2) + power(p2.longitude - p1.longitude, 2) ) as d_rel from ( select id, name, latitude, longitude from geo.bne_points ) p1 join ( select id, name, latitude, longitude from geo.bne_points ) p2 on p1.id != p2.id ) order by id limit 100;
The sample query uses the
first_value window function to
select the first neighbour for each
id, when sorted by the
relative distance to that point (
The results of this query are provided in the table below:
|Redshift Nearest Neighbour (first_value)|
|1||Brisbane GPO||South Bank|
|2||The Gabba||South Bank|
|3||Mt Cootha||Suncorp Stadium|
|4||Surfers Paradise||The Gabba|
|5||South Bank||Brisbane GPO|
|7||Brisbane Airport||Brisbane GPO|
|8||Suncorp Stadium||South Bank|
Based on these results, South Bank would be the best place to meet for a group of people travelling from all of the points in the sample dataset.
If you have any queries or feedback related to this article, or would like some assistance with your data project, please Get in Touch to discuss.