Geospatial Proximity Ordering - Amazon Redshift

Amazon Redshift is a massively-scalable data warehouse platform that works well for the ad-hoc analysis demanded by business intelligence. Unfortunately, it lacks geospatial features that you expect from an enterprise SQL database engine.

Python UDF support provides an avenue for users to manually implement geospatial functions, though this approach requires knowledge of geospatial libraries and/or complex mathematical equations. It also results in queries that suffer from performance issues, impacting scalability.

Proximity search

One function that is quite useful in geospatial analytics involves matching, ordering or filtering data based on proximity from a given point of interest, or a set of neighbouring points in the dataset.

Putting aside the fact that the earth is not flat, or round, or even a perfect ellipsoid, and that geospatial coordinates are typically represented in latitude and longitude (ie WSG84), an approximation between (relatively) close points is typically good enough for most real world use-cases. We can then avoid complex tensor math and put aside our geospatial datum.

In most scenarios, the exact distance between points is not required and a relative figure is an adequate comparator. In many use-cases a small margin of error is also quite acceptable.

Pythagoras to the rescue!

If you remember back to trigonometry class, you'll know that the distance between two Cartesian points can be calculated using the distance formula, which is derived from Pythagorean theorem.

Basic distance between two points:

D = \sqrt{{(x_{2} - x_{1})}^{2} + {(y_{2} - y_{1})}^{2}} where: P_{1} = (x_{1}, y_{1}), P_{2} = (x_{2}, y_{2})

Furthermore, if we are only interested in a relative distance for ordering results, the difference between latitude and longitude will provide enough information for a comparison. This also means that the square-root function is no longer required:

$d_{relative} = {(x_{2} - x_{1})}^{2} + {(y_{2} - y_{1})}^{2}$

If slightly better performance is more important than accuracy, we can further optimise the calculation by replacing squares with absolute values. However, when tested with small datasets, this optimisation resulted in very little performance improvement (measured using explain).

$d_{approx} = | x_{2} - x_{1} | + | y_{2} - y_{1} |$

How does this formula translate to SQL code? Below are some tips to implement this approach in Amazon Redshift.

Redshift Implementation

The following table contains a handful of points of interest, which we wish to order by distance from the Brisbane General Post Office (GPO).

ID	Name	Latitude	Longitude
1	Brisbane GPO	-27.4679	153.0279
2	The Gabba	-27.4858	153.0381
3	Mt Cootha	-27.4678	152.9390
4	Surfers Paradise	-28.0018	153.4284
5	South Bank	-27.4766	153.0222
6	Toowoomba	-27.5598	151.9507
7	Brisbane Airport	-27.3942	153.1218
8	Suncorp Stadium	-27.4648	153.0095

Using PostGIS as a benchmark

Using PostGIS, we benchmark the distance from each point to the Brisbane GPO via the ST_Distance_Spheroid function. The results are listed in the below table, ordered from the closest point to the furthest.

ID	Name	Latitude	Longitude	km
PostGIS Spheroid Distance (WSG 84) to Brisbane GPO
1	Brisbane GPO	-27.4679	153.0279	0.0
5	South Bank	-27.4766	153.0222	1.12
8	Suncorp Stadium	-27.4648	153.0095	1.85
2	The Gabba	-27.4858	153.0381	2.22
3	Mt Cootha	-27.4678	152.9390	8.79
7	Brisbane Airport	-27.3942	153.1218	12.36
4	Surfers Paradise	-28.0018	153.4284	71.13
6	Toowoomba	-27.5598	151.9507	106.91

create table geo.bne_points (
	id serial not null
		constraint bne_points_pkey primary key,
	name varchar,
	latitude double precision,
	longitude double precision
);
create unique index bne_points_id_uindex on bne_points(id);

insert into	geo.bne_points(id, name, latitude, longitude)
values	(1,'Brisbane GPO', -27.4679, 153.0279),
		(2,'The Gabba', -27.4858, 153.0381),
		(3,'Mt Cootha', -27.4678, 152.9390),
		(4,'Surfers Paradise', -28.0018, 153.4284),
		(5,'South Bank', -27.4766, 153.0222),
		(6,'Toowoomba', -27.5598, 151.9507),
		(7,'Brisbane Airport', -27.3942, 153.1218),
		(8,'Suncorp Stadium', -27.4648, 153.0095);

Loading into Redshift and ordering

Being based on PostgreSQL, Redshift uses similar syntax to load the sample data:

create table geo.bne_points(
	id integer,
	name varchar,
	latitude double precision,
	longitude double precision
);

insert into geo.bne_points(id, name, latitude, longitude)
values  (1,'Brisbane GPO', -27.4679, 153.0279),
        (2,'The Gabba', -27.4858, 153.0381),
        (3,'Mt Cootha', -27.4678, 152.9390),
        (4,'Surfers Paradise', -28.0018, 153.4284),
        (5,'South Bank', -27.4766, 153.0222),
        (6,'Toowoomba', -27.5598, 151.9507),
        (7,'Brisbane Airport', -27.3942, 153.1218),
        (8,'Suncorp Stadium', -27.4648, 153.0095);

The following query will order the points of interest using the relative distance formula:

$d_{relative} = {(x_{2} - x_{1})}^{2} + {(y_{2} - y_{1})}^{2}$

select id, name, p1.latitude, p1.longitude,
  sqrt( -- square root is optional
    power(p2.latitude - p1.latitude, 2) + 
    power(p2.longitude - p1.longitude, 2)
  ) as d_relative
from (
  select id, name, latitude, longitude
  from geo.bne_points
) p1
join (
  select -27.4679 as latitude, 153.0279 as longitude
) p2 on true -- append BNE GPO to all rows
order by d_relative
limit 100;

This method results in identical ordering to the PostGIS benchmark when using the sample dataset:

ID	Name	Latitude	Longitude	Distance (relative)
Redshift Relative Distance Ordering (to Brisbane GPO)
1	Brisbane GPO	-27.4679	153.0279	0.0000
5	South Bank	-27.4766	153.0222	0.0104
8	Suncorp Stadium	-27.4648	153.0095	0.01866
2	The Gabba	-27.4858	153.0381	0.0206
3	Mt Cootha	-27.4678	152.9390	0.0889
7	Brisbane Airport	-27.3942	153.1218	0.1194
4	Surfers Paradise	-28.0018	153.4284	0.6674
6	Toowoomba	-27.5598	151.9507	1.0811

Nearest Neighbour?

Proximity ordering can easily be adapted to find the nearest neighbouring point for all points of interest in the sample dataset.

Unfortunately, this is an exponential computation, but should perform fine for small datasets. Converting to a view or parameterised query may provide a suitable alternative for larger datasets, where computing the entire result-set isn't required and where lazy-evaluation is acceptible.

select distinct id, name,
  first_value(p2_name) over (
    partition by id
    order by d_rel
    rows between unbounded preceding and unbounded following
  ) as nearest_to
from (
  select p1.id, p1.name, p1.latitude, p1.longitude,
    p2.name as p2_name,
    sqrt(
      power(p2.latitude - p1.latitude, 2) +
      power(p2.longitude - p1.longitude, 2)
    ) as d_rel
  from (
    select id, name, latitude, longitude
    from geo.bne_points
  ) p1
  join (
    select id, name, latitude, longitude
    from geo.bne_points
  ) p2 on p1.id != p2.id
)
order by id
limit 100;

The sample query uses the first_value window function to select the first neighbour for each id, when sorted by the relative distance to that point (d_rel).

The results of this query are provided in the table below:

ID	Name	Nearest To
Redshift Nearest Neighbour (first_value)
1	Brisbane GPO	South Bank
2	The Gabba	South Bank
3	Mt Cootha	Suncorp Stadium
4	Surfers Paradise	The Gabba
5	South Bank	Brisbane GPO
6	Toowoomba	Mt Cootha
7	Brisbane Airport	Brisbane GPO
8	Suncorp Stadium	South Bank

Based on these results, South Bank would be the best place to meet for a group of people travelling from all of the points in the sample dataset.

If you have any queries or feedback related to this article, or would like some assistance with your data project, please Get in Touch to discuss.

Redshift Analytics :: Geospatial Proximity Ordering

Emulating Geospatial Proximity Ordering in Amazon Redshift