Emulating Geospatial Proximity Ordering in Amazon Redshift

July 24, 2017 Posted by: Rob Turner Tags: Analytics, Geospatial, Redshift, Amazon AWS
Pythagoras

Amazon Redshift is a massively-scalable data warehouse platform that works well for the ad-hoc analysis demanded by business intelligence. Unfortunately, it lacks geospatial features that you expect from an enterprise SQL database engine.

Python UDF support provides an avenue for users to manually implement geospatial functions, though this approach requires knowledge of geospatial libraries and/or complex mathematical equations. It also results in queries that suffer from performance issues, impacting scalability.

Proximity search

One function that is quite useful in geospatial analytics involves matching, ordering or filtering data based on proximity from a given point of interest, or a set of neighbouring points in the dataset.

Putting aside the fact that the earth is not flat, or round, or even a perfect ellipsoid, and that geospatial coordinates are typically represented in latitude and longitude (ie WSG84), an approximation between (relatively) close points is typically good enough for most real world use-cases. We can then avoid complex tensor math and put aside our geospatial datum.

In most scenarios, the exact distance between points is not required and a relative figure is an adequate comparator. In many use-cases a small margin of error is also quite acceptable.

Pythagoras to the rescue!

If you remember back to trigonometry class, you'll know that the distance between two Cartesian points can be calculated using the distance formula, which is derived from Pythagorean theorem.

Basic distance between two points:

D = x2 - x1 2 + y2 - y1 2 where: P1= x1 y1 , P2= x2 y2

Furthermore, if we are only interested in a relative distance for ordering results, the difference between latitude and longitude will provide enough information for a comparison. This also means that the square-root function is no longer required:

drelative = x2 - x1 2 + y2 - y1 2

If slightly better performance is more important than accuracy, we can further optimise the calculation by replacing squares with absolute values. However, when tested with small datasets, this optimisation resulted in very little performance improvement (measured using explain).

dapprox = | x2 - x1 | + | y2 - y1 |

How does this formula translate to SQL code? Below are some tips to implement this approach in Amazon Redshift.

Redshift Implementation

The following table contains a handful of points of interest, which we wish to order by distance from the Brisbane General Post Office (GPO).

IDNameLatitudeLongitude
1Brisbane GPO-27.4679153.0279
2The Gabba-27.4858153.0381
3Mt Cootha-27.4678152.9390
4Surfers Paradise-28.0018153.4284
5South Bank-27.4766153.0222
6Toowoomba-27.5598151.9507
7Brisbane Airport-27.3942153.1218
8Suncorp Stadium-27.4648153.0095

Using PostGIS as a benchmark

Using PostGIS, we benchmark the distance from each point to the Brisbane GPO via the ST_Distance_Spheroid function. The results are listed in the below table, ordered from the closest point to the furthest.

PostGIS Spheroid Distance (WSG 84) to Brisbane GPO
IDNameLatitudeLongitude km
1Brisbane GPO-27.4679153.02790.0
5South Bank-27.4766153.02221.12
8Suncorp Stadium-27.4648153.00951.85
2The Gabba-27.4858153.03812.22
3Mt Cootha-27.4678152.93908.79
7Brisbane Airport-27.3942153.121812.36
4Surfers Paradise-28.0018153.428471.13
6Toowoomba-27.5598151.9507106.91
create table geo.bne_points (
	id serial not null
		constraint bne_points_pkey primary key,
	name varchar,
	latitude double precision,
	longitude double precision
);
create unique index bne_points_id_uindex on bne_points(id);

insert into	geo.bne_points(id, name, latitude, longitude)
values	(1,'Brisbane GPO', -27.4679, 153.0279),
		(2,'The Gabba', -27.4858, 153.0381),
		(3,'Mt Cootha', -27.4678, 152.9390),
		(4,'Surfers Paradise', -28.0018, 153.4284),
		(5,'South Bank', -27.4766, 153.0222),
		(6,'Toowoomba', -27.5598, 151.9507),
		(7,'Brisbane Airport', -27.3942, 153.1218),
		(8,'Suncorp Stadium', -27.4648, 153.0095);
											

Loading into Redshift and ordering

Being based on PostgreSQL, Redshift uses similar syntax to load the sample data:

create table geo.bne_points(
	id integer,
	name varchar,
	latitude double precision,
	longitude double precision
);

insert into geo.bne_points(id, name, latitude, longitude)
values  (1,'Brisbane GPO', -27.4679, 153.0279),
        (2,'The Gabba', -27.4858, 153.0381),
        (3,'Mt Cootha', -27.4678, 152.9390),
        (4,'Surfers Paradise', -28.0018, 153.4284),
        (5,'South Bank', -27.4766, 153.0222),
        (6,'Toowoomba', -27.5598, 151.9507),
        (7,'Brisbane Airport', -27.3942, 153.1218),
        (8,'Suncorp Stadium', -27.4648, 153.0095);
											

The following query will order the points of interest using the relative distance formula:

drelative = x2 - x1 2 + y2 - y1 2

select id, name, p1.latitude, p1.longitude,
  sqrt( -- square root is optional
    power(p2.latitude - p1.latitude, 2) + 
    power(p2.longitude - p1.longitude, 2)
  ) as d_relative
from (
  select id, name, latitude, longitude
  from geo.bne_points
) p1
join (
  select -27.4679 as latitude, 153.0279 as longitude
) p2 on true -- append BNE GPO to all rows
order by d_relative
limit 100;
											

This method results in identical ordering to the PostGIS benchmark when using the sample dataset:

Redshift Relative Distance Ordering (to Brisbane GPO)
IDNameLatitudeLongitude Distance (relative)
1Brisbane GPO-27.4679153.02790.0000
5South Bank-27.4766153.02220.0104
8Suncorp Stadium-27.4648153.00950.01866
2The Gabba-27.4858153.03810.0206
3Mt Cootha-27.4678152.93900.0889
7Brisbane Airport-27.3942153.12180.1194
4Surfers Paradise-28.0018153.42840.6674
6Toowoomba-27.5598151.95071.0811

Nearest Neighbour?

Proximity ordering can easily be adapted to find the nearest neighbouring point for all points of interest in the sample dataset.

Unfortunately, this is an exponential computation, but should perform fine for small datasets. Converting to a view or parameterised query may provide a suitable alternative for larger datasets, where computing the entire result-set isn't required and where lazy-evaluation is acceptible.

select distinct id, name,
  first_value(p2_name) over (
    partition by id
    order by d_rel
    rows between unbounded preceding and unbounded following
  ) as nearest_to
from (
  select p1.id, p1.name, p1.latitude, p1.longitude,
    p2.name as p2_name,
    sqrt(
      power(p2.latitude - p1.latitude, 2) +
      power(p2.longitude - p1.longitude, 2)
    ) as d_rel
  from (
    select id, name, latitude, longitude
    from geo.bne_points
  ) p1
  join (
    select id, name, latitude, longitude
    from geo.bne_points
  ) p2 on p1.id != p2.id
)
order by id
limit 100;
											

The sample query uses the first_value window function to select the first neighbour for each id, when sorted by the relative distance to that point (d_rel).

The results of this query are provided in the table below:

Redshift Nearest Neighbour (first_value)
IDNameNearest To
1Brisbane GPOSouth Bank
2The GabbaSouth Bank
3Mt CoothaSuncorp Stadium
4Surfers ParadiseThe Gabba
5South BankBrisbane GPO
6ToowoombaMt Cootha
7Brisbane AirportBrisbane GPO
8Suncorp StadiumSouth Bank

Based on these results, South Bank would be the best place to meet for a group of people travelling from all of the points in the sample dataset.

If you have any queries or feedback related to this article, or would like some assistance with your data project, please Get in Touch to discuss.