If you have a query that sorts by a derived field, and then takes a limited number of the results, it can be a real dog.  Here’s how I optimized a situation like this.  Imagine this table.

create table office_building (
id int primary key,
latitude float not null,
longitude float not null,
rent int,
address varchar(20),
picture_url varchar(255)
);

If you want to find the nearest 100 office buildings to a point on a map, you run a query something like this (plug your lat/lng into the question marks):

explain select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1) +
((longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building order by distance limit 100

(See here for an explanation of the 69.1 and 53 constants–basically they convert roughly from lat/lng to miles.) Unfortunately, you are ordering by a derived field, and mysql can no longer do order by optimization.

This means that you’ll be doing a filesort (which does not actually have anything to do with the filesystem, but is just a sort not on an index).  And this, in turn means that your performance will suck if you have any large number of rows returned.

You can help things out a bit by limiting your office building query to a box of a certain size around the point.  Here’s the query with a 5 mile box:

select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1 ) +
( (longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building
where latitude < ?  + (1/69.1)*5 and latitude > ? - (1/69.1)*5 and longitude < ? + (1/53)*5 and longitude > ? - (1/53)*5
order by distance limit 100

But if you still have too many results, the sorting on distance will be slow.  Also, even if you have an index on latitude and longitude, (such as create index idx_nearby on office_building (latitude,longitude)) because you are not using equality, only the first column will be used.

This is worth repeating, because it took me a while to understand.  If you have an index: create index idx on tbl (col1,col2,col3,col4,col5) and you run a query like select count(*) from tbl where col1 = 1 and col2 > 2 and col3 < 3 and col4 > 4 only col1 and col2 will be used from the index.  Mysql goes to the table data files for col3 and beyond (assuming no other indices on the table).  This makes sense when you think about how indices are created and stored, but I didn’t really understand it until I’d been beaten over the head with it.

As stated here: “[mysql] will use the fields [in the index], from left to right, as long as the WHERE clause has “=”. Once it hits a ‘range’ (>, IN, BETWEEN, …), it stops with that field.”  I don’t know why it is not in the mysql index documentation–maybe it is obvious?

The solution I found was to separate what I wanted to find in the select clause from how I find it, in the where and order by clause:

select select_clause.*,
round( sqrt( ( ( where_clause.latitude - ?) * (where_clause.latitude - ? ) *  69.1 * 69.1 ) +
( (where_clause.longitude - ? ) *(where_clause.longitude - ? ) * 53 * 53 ) ) ) as distance
from office_building where_clause, office_building select_clause
where where_clause.latitude < ? + (1/69.1)*5 and where_clause.latitude > ? - (1/69.1)*5
and where_clause.longitude < ? + (1/53)*5 and where_clause.longitude > ? - (1/53)*5
and where_clause.id = select_clause.id
order by distance
limit 100

You also need to add an index:

create index idx_nearby on office_building (latitude,longitude,id);

Then, when you run the query, you still have the filesort, but you also see the magic ‘Using index’ in your explain plan.  You never have to go to the table to do the sort!  You also have a join now, but it’s on the primary key, and you only need to go to the table for the 100 rows that you know you want.

Using this query had an effect on one live system of one to two orders of magnitude increase in query speed, depending on the query.  This not only works for distance queries, but anytime you want to order by a calculated value.

More useful links: geo search suggestions, index explanation

Technorati Tags: , ,

2 thoughts on “Optimizing a distance calculation in a mysql query

  1. Brian Timoney says:

    Dan:

    Spatial calculations in MySQL make me wince, since a fully robust geospatial implementation is one of the great features of PostgreSQL/PostGIS. (Yeah, I know, MySQL has some spatial functions, but it’s a limited implementation).

    Anyway, it’s fairly straightforward to create a POINT geometry object from lat/long columns. The key is to then create a spatial index on the geometry column–without it the spatial comparisons (e.g. intersection tests) are super slow.

    Finally, I’ll freely admit that cheap web hosting with PostgreSQL/PostGIS is not abundant, but I’ve had decent success with A2Hosting.

    In short, if you’re doing non-trivial spatial operations, or you need to bring linear or area features (polygons) into the mix, one should take a serious look at PostGIS.

    BT

  2. moore says:

    Brian,

    Agreed that if you know you need to do a lot of spatial operations, PostGIS is a better starting point. However, if you only need to know the nearest 100 points to a given lat/lng and have an existing investment in mysql, these contortions make sense.

    BTW, I haven’t even looked at the MySQL spatial extensions: http://dev.mysql.com/doc/refman/5.0/en/spatial-extensions.html
    agan, because of the extremely limited GIS needs of the situation.

Comments are closed.


© Moore Consulting, 2003-2019