#### Emulating Geospatial Proximity Ordering in Amazon Redshift

July 24, 2017 Posted by: Amazon Redshift is a massively-scalable data warehouse platform that works well for the ad-hoc analysis demanded by business intelligence. Unfortunately, it lacks geospatial features that you expect from an enterprise SQL database engine.

Python UDF support provides an avenue for users to manually implement geospatial functions, though this approach requires knowledge of geospatial libraries and/or complex mathematical equations. It also results in queries that suffer from performance issues, impacting scalability.

##### Proximity search

One function that is quite useful in geospatial analytics involves matching, ordering or filtering data based on proximity from a given point of interest, or a set of neighbouring points in the dataset.

Putting aside the fact that the earth is not flat, or round, or even a perfect ellipsoid, and that geospatial coordinates are typically represented in latitude and longitude (ie WSG84), an approximation between (relatively) close points is typically good enough for most real world use-cases. We can then avoid complex tensor math and put aside our geospatial datum.

In most scenarios, the exact distance between points is not required and a relative figure is an adequate comparator. In many use-cases a small margin of error is also quite acceptable.

##### Pythagoras to the rescue!

If you remember back to trigonometry class, you'll know that the distance between two Cartesian points can be calculated using the distance formula, which is derived from Pythagorean theorem.

###### Basic distance between two points:

$D=\sqrt{\phantom{\rule{1em}{0ex}}{\mathrm{\left({x}_{2}-{x}_{1}\right)}}^{2}+{\mathrm{\left({y}_{2}-{y}_{1}\right)}}^{2}}\phantom{\rule{0ex}{0ex}}\text{where:}\phantom{\rule{0ex}{0ex}}{P}_{1}=\left({x}_{1},{y}_{1}\right)\text{,}\phantom{\rule{1em}{0ex}}{P}_{2}=\left({x}_{2},{y}_{2}\right)$

Furthermore, if we are only interested in a relative distance for ordering results, the difference between latitude and longitude will provide enough information for a comparison. This also means that the square-root function is no longer required:

${d}_{relative}={\mathrm{\left({x}_{2}-{x}_{1}\right)}}^{2}+{\mathrm{\left({y}_{2}-{y}_{1}\right)}}^{2}$

If slightly better performance is more important than accuracy, we can further optimise the calculation by replacing squares with absolute values. However, when tested with small datasets, this optimisation resulted in very little performance improvement (measured using explain).

${d}_{approx}=|{x}_{2}-{x}_{1}|+|{y}_{2}-{y}_{1}|$

How does this formula translate to SQL code? Below are some tips to implement this approach in Amazon Redshift.

### Redshift Implementation

The following table contains a handful of points of interest, which we wish to order by distance from the Brisbane General Post Office (GPO).

IDNameLatitudeLongitude
1Brisbane GPO-27.4679153.0279
2The Gabba-27.4858153.0381
3Mt Cootha-27.4678152.9390
5South Bank-27.4766153.0222
6Toowoomba-27.5598151.9507
7Brisbane Airport-27.3942153.1218

#### Using PostGIS as a benchmark

Using PostGIS, we benchmark the distance from each point to the Brisbane GPO via the ST_Distance_Spheroid function. The results are listed in the below table, ordered from the closest point to the furthest.

ID Name Latitude Longitude km PostGIS Spheroid Distance (WSG 84) to Brisbane GPO 5 South Bank -27.4766 153.0222 1.12 8 Suncorp Stadium -27.4648 153.0095 1.85 2 The Gabba -27.4858 153.0381 2.22 3 Mt Cootha -27.4678 152.9390 8.79 7 Brisbane Airport -27.3942 153.1218 12.36 4 Surfers Paradise -28.0018 153.4284 71.13 6 Toowoomba -27.5598 151.9507 106.91
create table geo.bne_points (
id serial not null
constraint bne_points_pkey primary key,
name varchar,
latitude double precision,
longitude double precision
);
create unique index bne_points_id_uindex on bne_points(id);

insert into	geo.bne_points(id, name, latitude, longitude)
values	(1,'Brisbane GPO', -27.4679, 153.0279),
(2,'The Gabba', -27.4858, 153.0381),
(3,'Mt Cootha', -27.4678, 152.9390),
(4,'Surfers Paradise', -28.0018, 153.4284),
(5,'South Bank', -27.4766, 153.0222),
(6,'Toowoomba', -27.5598, 151.9507),
(7,'Brisbane Airport', -27.3942, 153.1218),
(8,'Suncorp Stadium', -27.4648, 153.0095);


Being based on PostgreSQL, Redshift uses similar syntax to load the sample data:

create table geo.bne_points(
id integer,
name varchar,
latitude double precision,
longitude double precision
);

insert into geo.bne_points(id, name, latitude, longitude)
values  (1,'Brisbane GPO', -27.4679, 153.0279),
(2,'The Gabba', -27.4858, 153.0381),
(3,'Mt Cootha', -27.4678, 152.9390),
(4,'Surfers Paradise', -28.0018, 153.4284),
(5,'South Bank', -27.4766, 153.0222),
(6,'Toowoomba', -27.5598, 151.9507),
(7,'Brisbane Airport', -27.3942, 153.1218),
(8,'Suncorp Stadium', -27.4648, 153.0095);


The following query will order the points of interest using the relative distance formula:

${d}_{relative}={\mathrm{\left({x}_{2}-{x}_{1}\right)}}^{2}+{\mathrm{\left({y}_{2}-{y}_{1}\right)}}^{2}$

select id, name, p1.latitude, p1.longitude,
sqrt( -- square root is optional
power(p2.latitude - p1.latitude, 2) +
power(p2.longitude - p1.longitude, 2)
) as d_relative
from (
select id, name, latitude, longitude
from geo.bne_points
) p1
join (
select -27.4679 as latitude, 153.0279 as longitude
) p2 on true -- append BNE GPO to all rows
order by d_relative
limit 100;


This method results in identical ordering to the PostGIS benchmark when using the sample dataset:

ID Name Latitude Longitude Distance (relative) Redshift Relative Distance Ordering (to Brisbane GPO) 5 South Bank -27.4766 153.0222 0.0104 8 Suncorp Stadium -27.4648 153.0095 0.01866 2 The Gabba -27.4858 153.0381 0.0206 3 Mt Cootha -27.4678 152.9390 0.0889 7 Brisbane Airport -27.3942 153.1218 0.1194 4 Surfers Paradise -28.0018 153.4284 0.6674 6 Toowoomba -27.5598 151.9507 1.0811

### Nearest Neighbour?

Proximity ordering can easily be adapted to find the nearest neighbouring point for all points of interest in the sample dataset.

Unfortunately, this is an exponential computation, but should perform fine for small datasets. Converting to a view or parameterised query may provide a suitable alternative for larger datasets, where computing the entire result-set isn't required and where lazy-evaluation is acceptible.

select distinct id, name,
first_value(p2_name) over (
partition by id
order by d_rel
rows between unbounded preceding and unbounded following
) as nearest_to
from (
select p1.id, p1.name, p1.latitude, p1.longitude,
p2.name as p2_name,
sqrt(
power(p2.latitude - p1.latitude, 2) +
power(p2.longitude - p1.longitude, 2)
) as d_rel
from (
select id, name, latitude, longitude
from geo.bne_points
) p1
join (
select id, name, latitude, longitude
from geo.bne_points
) p2 on p1.id != p2.id
)
order by id
limit 100;


The sample query uses the first_value window function to select the first neighbour for each id, when sorted by the relative distance to that point (d_rel).

The results of this query are provided in the table below:

ID Name Nearest To Redshift Nearest Neighbour (first_value) 1 Brisbane GPO South Bank 2 The Gabba South Bank 3 Mt Cootha Suncorp Stadium 4 Surfers Paradise The Gabba 5 South Bank Brisbane GPO 6 Toowoomba Mt Cootha 7 Brisbane Airport Brisbane GPO 8 Suncorp Stadium South Bank

Based on these results, South Bank would be the best place to meet for a group of people travelling from all of the points in the sample dataset.

If you have any queries or feedback related to this article, or would like some assistance with your data project, please Get in Touch to discuss.