DynamoDb: What’s left to manage

AWS recently announced that DynamoDb will now scale read and write capacity automatically.  While there was already a lot of database administration that DynamoDb took care of (backups, underlying infrastructure provisioning), setting the proper capacities initially, and updating them as your application changed, was a key task that fell to the user. No more.

I posted a link to the news to a discussion channel I participate in, and someone asked: “what’s left to manage?”. Drawing from that discussion, here are a few items remaining:

  • Appropriate partition keys.  Make sure they are spread uniformly.
  • Choosing the right primary key. Since you typically want to avoid table scans and can only query by primary key, making sure you pick the right one is important.  (I would also call this “data model design”.)
  • Enforcing data integrity, initially and through time.  This is a challenge with every nosql solution.
  • Creating the appropriate secondary global indices for your application.
  • Securing and controlling access to your data.

These are all still important tasks, but DynamoDb is getting easier and easier to use for high performance applications for which nosql is a good fit.  (And for which you don’t mind being tied to AWS.)

Restoring a single table from an Amazon RDS backup

material-icon-1307676_640When you use SQL, how do you write delete statements at the database prompt?

A delete statement typically looks like this: delete from table_name where column_name = 'foo';. I usually write it in this order:

  1. delete
  2. delete where column_name = 'foo';
  3. delete from table_name where column_name = 'foo';

Even though this is a pain because you have to move back and forth (I really need to look into vi keybindings for mysql), it prevents you from making sending this command by accident: delete from table_name; which deletes all the data in your table.  (Another alternative is to never use the interactive client and always write out your delete statements in a file and run that file to delete data.)

But, recently, I did exactly that, because I forgot.  I deleted all the data from one table in our production database.  It was billing data, so rather important.  Luckily, I am using Amazon RDS and had set up backup retention.

I wanted to outline what I did to recover from this.

  • I took a deep breath.
  • I wrote a message on the slack channel documenting what had happened and the possible customer impact.
  • Depending on which data is removed, it’s possible you will want to put the application in maintenance mode and/or inform your customers of the issues.  What I deleted was used rarely enough that I didn’t have to take these steps.
  • I looked at how to restore an Amazon RDS backup.
  • I restored the missing data.
  • I communicated that things were back to normal to internal stakeholders.

Unfortunately, it wasn’t clear how to restore a single table.  I’m used to being able to download a .sql file and hand edit it, but that’s not an option.  Stackoverflow wasn’t super helpful.   But if there’s anytime you want clarity, it’s when you are restoring production data.  You don’t want to compound the problem by screwing up something else.

So, here’s how to restore a single table from an Amazon RDS backup:

  • Note the time just before you deleted the data.  (Another reason the slack message is nice.  chatops ftw.)
  • Start up another instance from that moment.  I named it something obvious like ‘has-data-from-tablename’.
  • Twiddle your thumbs anxiously while the new instance starts up.
  • The instance is put into your default security group (as of this writing) which probably doesn’t allow mysql access.  Make sure you modify this security group to allow access.
  • When the instance is up, do a dump of the table you need: mysqldump -t --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h has-data-from-tablename.c1m7x25w24qor.us-east-1.rds.amazonaws.com -P3306 database_name tablename > restore-table_name.sql; (-t omits the create database/table statements.)
  • If your table is has had writes since you deleted everything, you may need to manually pull down the current data from the production system and merge it into restore-table_name.sql; I was able to avoid this step.
  • Load the data using mysql mysql --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h production.c1m7x25w24qor.us-east-1.rds.amazonaws.com -P3306 database_name < restore-table_name.sql;
  • Review to make sure the data is correct.
  • Test the application.
  • Update the slack channel, and do any other notifications you need to (customers, internal contacts, etc).
  • Revoke the default security group access you allowed above.
  • Delete the ‘has-data-from-tablename’ instance.

Note this only works if you caught your mistake within the backup retention window. (Make sure you set that up.)  We aren’t multi AZ or clustered, so I’m not sure how that would affect things.

Happy deep breathing!

Why Use an ETL Tool?

transformation photo

Photo by AlicePopkorn

I’m a big fan of ETL tools.  The one with which I am most familiar is Kettle, aka Pentaho Data Integration.  When I was working for 8z, we used it heavily to pull data from other systems, process it, and update our databases.  While ETL systems are not without their flaws, I think their strengths are such that everyone who is moving data around should consider them.  This is more true now than in the past because there is a lot more data flowing everywhere, and there are several viable open source ETL tools, so you don’t have to spend thousands or tens of thousands of dollars to get started.

What are the benefits of ETL tools?

  • There are pre-built components for common data tasks (connecting to a database, parsing a flat file) that have been tested and debugged by many many people.  It’s hard to over emphasize how much time this can save, allowing you to focus on business logic.
  • You operate at a higher level of abstraction.
  • There is support for other performance features like parallel jobs that you can configure.
  • The GUI makes data flow obvious.
  • You can write your own components that leverage existing libraries.

What are the detriments?

  • Possible to version control, impossible to merge.
  • Limits of components mean you sometimes have to contort your data flows, or drop down to write your own component.
  • Some components (at least for Kettle) are not open source.
  • You have to roll your own testing framework.  I did.
  • You have to learn another tool.

Don’t re-invent the wheel!  Your data movement problem may very well be a super special snowflake, but chances are it isn’t.  Every line of code you write is another you have to maintain.  When you are confronted with a data movement problem, take a look at an ETL tool like Kettle and see if you can stand on the shoulders of giants.  Here’s a list of open source ETL tools to evaluate.

Cassandra Dev Day Denver

data photo

Photo by justgrimes

Most of my datastore experience is with RDBMS like mysql, oracle and postgresql (though I did work with some key value stores like berkleydb back in the day). So when a full day, free intro to Cassandra was offered, I jumped on it, even though it was in Centennial. You can view the schedule, speakers and talk synopses for the day. There were two tracks, beginner and advanced. Since I didn’t know anything about Cassandra, I followed the beginner track.

First, though, it was amazing how many people were there. The two main companies behind the talk, Pearson Education and DataStax, a vendor providing a commercial, supported version of Cassandra, ended up having to provide two overflow rooms, and it was still standing room only for some of the talks. Quite a nice turnout, and I think the sponsors were pleasantly shocked. I was also surprised by the number of folks from Boulder. I happened to sit next to two folks from Westminster and Superior, and ended up having a common friend or colleague with each. Small world.

I learned a ton about Cassandra, from its internals, to its topology (the ring’s the thing) to abstractions that let you query it (CQL, which is a subset of SQL) to data modeling to using the java driver, which makes accessing Cassandra almost as easy as using JDBC. While there are some SQL concepts that appear to map fairly well to Cassandra, I put quotes around them below to remind myself of the fact that a Cassandra ‘table’ isn’t the same as a RDBMS table, ditto for ‘row’, ‘primary key’ and other important concepts.

I think the biggest takeaway for me was that Cassandra is a “write many, read once” system. Because you can only query efficiently on one or two keys, if you have multiple queries, you want to write the data multiple times in a denormalized system, one ‘table’ for each query. Because of this, Cassandra shines in use cases where you are doing a lot of inserts, have known queries, and need speed and availability (sensor data was mentioned several times).

How does this actually work? Here’s an example (as best I understand it–here are some others from people who actually have experience using this technology):

If you have click stream data, in standard apache format, and you want to be able to have it stored in a database and highly available for a few specific queries, Cassandra might be a good choice. Here’s a line of my clickstream, from my blog: - - [15/Oct/2014:08:03:30 -0600] "GET /wordpress/archives/date/2007/07 HTTP/1.1" 200 66258 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

This is time based data, and has some valuable information, and some not so valuable information. What things might I be interested in querying on? Well, I might care about the user agent, request time, ip address, status code, or URL path requested. I probably don’t care about the HTTP method, HTTP version or the bytes served. But, for the sake of this example, let’s say my application needs to show the location of the most recent 1000 users for a given country, for a fancy widget for my website. I will use maxmind or a similar service for mapping ip addresses to country. That’s all I care about. (Yes, I know this is contrived–I had to revise this example a couple of times to make it fit with Cassandra’s usage model.) I would set up in this ‘table’ in Cassandra.

country text,
time timestamp,
user_agent text,
status_code int,
path text,
PRIMARY KEY (text, time)

In this table, country is the partition key, and time is the clustering key. That means that this query: select location from location where country = 'USA' limit 1000; will be screaming fast. If I wanted to look at paths requested by time, I would not add an index to this table, but instead create another whole table, say request_path. Then my insertion code would write to both location and request_path. And then clients that wanted path information would use the specific table. Yup, denormalization is the name of the game.

This means that Cassandra has certain specific use cases, and that trying to use it as a general purpose data storage and query engine is foolish. Several presenters mentioned that Cassandra plus Apache Spark for general queries was a good solution.

Denormalization as standard operating procedure isn’t the only mind bending facet of Cassandra. Others:

  • the presenters also talked about the high availability and replication of Cassandra–you can actually configure it to be data center aware so that it automatically replicates data across different data centers.
  • For each keyspace (a set of tables, so similar to a schema), you specify how a replication factor–how many times each piece of data is stored.
  • For every read or write, you specify how many nodes must either agree or accept the data, respectively.
  • Adding nodes is the preferable way to deal with scale. You can add a node easily, and Cassandra will auto partition data and spread existing data across the new node.
  • The biggest Cassandra setup they mentioned was 75K nodes.
  • Each ‘row’ can have up to 2 billion records. If a row stretches across nodes, you’ll kill performance.
  • There’s a process called ‘compaction’ which is similar to Java’s garbage collection, and just like GC, you have to pay attention to how compaction works, because it will affect performance.

All in all, very interesting day, and I appreciated the experience. One more (interesting) tool to add to the toolbox.

Java REST API Framework Options

resting photo

Photo by shioshvili

I’ve been working with a couple of REST API solutions that exist in the Java tech stack.  I haven’t seen any great analysis of REST API solutions (though Matt Raible does mention some in this exhaustive slide deck about Java frameworks [pdf]), so wanted to share my on the ground experience.

First up is restSQL.  This framework makes it easy to get data from a database to a JSON or XML REST API and back.  If you have a servlet container available, you write two configuration files, one with a SQL query and one with db connection information, and you have a RESTful API.  For prototyping and database access, it is hard to beat.


  • Quick to set up
  • Only SQL knowledge is required
  • No programming required
  • Allows simple mapping of db table to resource, but can include one to one and one to many mappings
  • Supports all four REST operations out of the box
  • Supports XML as well as JSON
  • Is an embeddable java library as well as a standalone framework
  • Project maintainer is engaged and the project is moving forward


  • Requires a servlet engine, and you have to restart it for changes to your configuration to be picked up
  • Output format has limited customization
  • Only works with mysql and postgresql databases (though there is some experimental support for Oracle and MS SQL)
  • Doesn’t work with views
  • The security model, while fine grained, isn’t modern/OAuth (can be solved with an API gateway (like 3scale, Tyk or ApiAxle) or proxy

The next framework I have experience with is Dropwizard.  This is a powerful framework that creates uberjars that you can run on any port as a standalone service.  It’s not limited to providing a JSON representation of database tables–if you can create a Java object, Dropwizard can serve it up as a JSON resource.


  • Community support
  • Extreme output formatting flexibility, but be prepared to write a custom deserializer if you want to handle anything other than reads of custom formatted objects
  • Supports any database that hibernate supports
  • Built in testing support
  • Brings together ‘best of breed’ tools like Jersey, Jackson and Hibernate, so you don’t have to do the integration yourself
  • Great documentation


  • Have to roll your own deployment solution (tarball, chef, puppet)
  • No services startup script provided
  • Shading can slow down development
  • Not yet at 1.0 release

The last one I don’t have familiarity with, but a colleague used it in the past.  It is Sparkjava.  This is a lightweight framework that fits when you have an existing Java library with functionality you want to expose.  I’m not competent to write pros/cons for this framework, but wanted to mention it.

The gorilla in the room that I haven’t had experience with (in terms of writing RESTful webs services) is Spring.  I would definitely include this in any greenfield solutions review.

Small scale data migrations

So, I’ve recently been involved in another data migration, the second one in three years.  These are small migrations, with thousands records.  One person to take care of this size of data migration with some effort, but the amount of data is still large enough that manual data re-entry isn’t really an option–the error rate and the cost and the management difficulty mean that software is the better option.

Here are some lessons I learned from these data migrations.

Learn as much as you can about the data models–both the old and the new–as you can.  This includes, in preferred order, talkingto any people familiar with the old system, talking to any people familiar with the new system, looking at the databases via a sql client, reading documentation (if any is written), and looking at code.  I spent some time thrashing around in old system code for a while.  Then I asked the developer for a tour, and learned more in that hour than I had in the previous day of looking at code.

Map entities and concepts as early as you can.  Take special note of any that are in the old and not in the new (and what you are planning to do with them).  Those that are in the new and not in the old aren’t as big of an issue.  Also, attributes of entities are as important as entities, so note discrepancies there.  Early on I noticed that one of the two primary entities in the old system did not exist in the new system.  This led to some interesting conversations with the business users that saved me work.

As above, talk to people who are going to be using the new system, and who use the old system, throughout the migration process.  An entity or attribute that will be a royal pain to migrate may not be used anymore!  Or, the business person might have some good ideas on how to map something in the old system into the new system.  Someone who uses the software you are migrating has more domain expertise than you.  Let them try the new system with migrated data as soon as some data is moved. Make sure to guide their experience so they don’t spin their wheels looking in corners of the system that not yet migrated.

Start a spreadsheet of tasks, so that everytime you uncover something that needs to be done while you are in the process of doing something else, you can note it and keep on your original task.  My spreadsheets are simple; three columns are enough: task name, completed (with an X for completion, blank for still open) and notes (for possible implementation solutions, people to talk to, relevant urls, or any other text that will help me complete the task).

Document all the migration steps, preferably to the point you can cut and paste commands.  Include any discrepancies discovered, special commands to run, access to all needed systems, names of relevant people, areas that need further investigation, and basically anything else you would want handed to you if you were starting on this project.  This helps immensely if you need to pass off the project, or come back to it later (even just a few days), and provides documentation of entities on the old and new system.

Write scripts wherever possible, but don’t try to script the whole process–access to different servers can be hard to automate.  Use whatever language you feel is best for these scripts.  I’ve used bash, sql, perl, and awk/sed, but I don’t shy away from a compiled language like java, especially if a library exists that can save me time.  Make sure to put these scripts into version control, and document the purpose with comments at the top and a good name.  I wouldn’t worry too much about unit testing or refactoring this software, because chances are it will be seldom used once the migration occurs.

Get familiar with the concatenate function of your database.  Using queries to write DDL for the new system based on data from the old system can save you writing a script in an imperative language.  When migrating from Expression Engine to WordPress, I used a statement like <code>select concat(‘update wp_comments set comment_author_email = ”’,email,”’ where comment_author = ”’,name,”’;’) from exp_comments where name in (select distinct(name) from exp_comments);</code> to generate an update statement for WordPress for each comment author in the EE database.

Think about data types and representations.  Especially if you are moving from one database to another.  When I was moving from MSSQL to MySQL, date fields were particularly thorny.

Realize that these types of projects are typically difficult slogs.  There were moments where I despaired of ever getting through the migration in a timely fashion.  To do it right, you need a fantastic attention to detail, an understanding of the business needs, and an ability to drive things through to the finish.  All of this can be pretty draining–I find it far more draining than bug fixing or building new features.

Control the old and new systems–try to not have new capabilities added during the migration.  If you can’t guarantee that, can the migration wait until the new and old systems stabilize?  If not, checkpoint the migration against the new capabilities during the process, and realize that you are introducing a lot of extra work and complexity into an already complex process.

Have a staging system where you can practice your migrations without affecting anyone.  Plan to go through at least two or three of these new staging systems so that you can get the migration steps solid before you touch production.  Start from a clean slate each time so no time is spent chasing phantom bugs from a previous migration that didn’t finish or wasn’t entirely correct.  This is what makes the migration documentation you write so important.  Be aware that the new stage system and the new production system will not necessarily be the same.

Lastly, avoid committing to a schedule if at all possible.  And if you must, pad it and only commit after you’ve done a thorough analysis.  Because there are so many hidey holes and areas of the old system that you won’t understand, there is a high probability that you’ll be discovering new issues and data you need to migrate halfway through the project.  (This is a special case of the requirements nightmare known as ‘build system B that acts exactly like system A’.)  Communicate progress to the business.

While this is not my favorite type of project, when done well it can have tremendous business value.  Combining newer, more flexible systems with rich older data, without re keying the data, can make system users much happier.  In some cases, if there is no migration, the newer system simply can’t be used.

Moving data from MS-SQL to mysql

I recently worked on a project where I needed to port data from a MS-SQL database to a mysql database.  There are programs, both payware and freeware that will help with this process, but I didn’t have ODBC access to the MS-SQL database, which these programs require.  All I had were a bunch of insert statements that looked like this:

[online],[gender]) VALUES(463,N'15',N'7',N'2009',N'9:13:52 AM',N'Homer',
N'234 Main',N'Alaska',N'Springfield',NULL,N'homer@thesimpsons.com'),
CAST(0x9AE90000 AS SmallDateTime),
CAST(0x00009CD700000000 AS DateTime),N'Man');

I wrote a perl script to turn that dialect of SQL into a mysql friendly dialect (feel free to download it).

The most interesting parts were those CAST statements.  This forum post and this blog post helped me turn those casts into real dates.  (After loading the inserts into mysql, I did some post processing, using the helpful case statement and str_to_date function to rationalize some of the data.)

Technorati Tags: , ,

Hibernate Boolean/Integer ClassCastException

I ran into an issue the other day with Hibernate configuration.

I have a bean that maps to a table in the database.  It has a column, featured, that only has values of 0 or 1.  For legacy reasons, we map that to an integer (turning it into a boolean is on the List Of Things To Do).

I ran into an issue, and was getting this error in the logs:

Caused by: java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.lang.Integer
at org.hibernate.type.IntegerType.set(IntegerType.java:41)
at org.hibernate.type.NullableType.nullSafeSet(NullableType.java:136)
at org.hibernate.type.NullableType.nullSafeSet(NullableType.java:116)
at org.hibernate.loader.Loader.bindPositionalParameters(Loader.java:1698)
at org.hibernate.loader.Loader.bindParameterValues(Loader.java:1669)
at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1554)
at org.hibernate.loader.Loader.doQuery(Loader.java:661)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:224)
at org.hibernate.loader.Loader.doList(Loader.java:2211)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2095)
at org.hibernate.loader.Loader.list(Loader.java:2090)
at org.hibernate.loader.criteria.CriteriaLoader.list(CriteriaLoader.java:94)
at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1569)
at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:283)

Here's the relevant section of the mapping file:

<property name="featured" type="integer" columnn="featured" />

I tried explicitly laying out the type of the SQL column:

<property name="featured" type="integer">
<column sql-type="INTEGER" name="featured" />

But neither of these worked; I received the same error.

I took a step back and looked at the bean.

Integer featured;

public Integer getFeatured() {
return featured;
public void setFeatured(Integer featured) {
this.featured = featured;


and realized that I had added a convenience method that was confusing the Hibernate code:

public boolean isFeatured() {
return ...

When I changed the signature of the convenience method to:

public boolean isFeaturedListing() {
return ...

the exception went away.  Thought I'd share for anyone searching for this stacktrace.

Technorati Tags:

Tips: Deploying a web application to the cloud

I am wrapping up helping a client with a build out of a drupal site to ec2. The site itself is a pretty standard CMS implementation–custom content types, etc. The site is an extension to an existing brand, and exists to collect email addresses and send out email newsletters. It was a team of three technical people (there were some designers and other folks involved, but I was pretty much insulated from them by my client) and I was lucky enough to do a lot of the infrastructure work, which is where a lot of the challenge, exploration and experimentation was.

The biggest attraction of the cloud was the ability to spin up and spin down extra servers as the expected traffic on the site increased or decreased. We choose Amazon’s EC2 for hosting. They seem a bit like the IBM of the cloud–no one ever got fired, etc. They have a rich set of offerings and great documentation.

Below are some lessons I learned from this project about EC2. While it was a drupal project, I believe many of these lessons are applicable to anyone who is building a similar system in the cloud. If you are building an video processing super computer, maybe not so much.

Fork your AMI

Amazon EC2 running instances are instantiations of a machine image (AMI). Anyone can create a machine image and make it available for others to use. If you start an instance off an image, and then the owner of the image deletes the image (or otherwise removes it), your instance continues to run happily, but, if you ever need to spin up a second instance off the same AMI, you can’t. In this case, we were leveraging some of the work done by Chapter Three called Project Mercury. This was an evolving project that released several times while we were developing with it. Each time, there was a bit of suspense to see if what we’d done on top of it worked with the new release.

This was suboptimal, of course, but the solution is easy. Once you find an AMI that works, you can start up an instance, and then create your own AMI from the running instance. Then, you use that AMI as a foundation for all your instances. You can control your upgrade cycle. Unless you are running against a very generic AMI that is unlikely to go away, forking is highly recommended.

Use Capistrano

For remote deployment, I haven’t seen or heard of anything that compares to Capistrano. Even if you do have to learn a new scripting language (Ruby), the power you get from ‘cap’ is fantastic. There’s pretty good EC2 integration, though you’ll want to have the EC2 response XML documentation close by when you’re trying to parse responses. There’s also some hassle involved in getting cap to run on EC2. Mostly it involves making sure the right set of ssh keys is in the correct place. But once you’ve got it up and running, you’ll be happy. Trust me.

There’s also a direct capistrano/EC2 integration project, but I didn’t use that. It might be worth a look too.


If you are doing any kind of database driven website, there’s really no substitute for persistent storage. Amazon’s Elastic Block Storage (EBS) is relatively cheap. Here’s an article explaining setting up MySQL on EBS. I do have a friend who is using EC2 in a different manner that is very write intensive, that is having some performance issues with his database on EBS, but for a write seldom, read often website, like this one, EBS seems plenty fast.

EC2 Persistence

Some of the reasons to use Capistrano are that it forces you to script everything, and makes it easy to keep everything in version control. The primary reason to do that is that EC2 instances aren’t guaranteed to be persistent. While there is an SLA around overall EC2 availability, individual instances don’t have any such assurances. That’s why you should use EBS. But, surprisingly, the EC2 instances that we are using for the website haven’t bounced at all. I’m not sure what I was expecting, but they (between three and eight instances) have been up and running for over 30 days, and we haven’t seen a single failure.

Use ElasticFox

This is a FireFox extension that lets you do every workaday task, and almost every conceivable operation, to your EC2 instances. Don’t delay, use this today.

Consider CloudFront

For distributed images, CloudFront is a natural fit. Each instance can then reference the image, without you needing to sync files across instances. You could use this for other files as well.

Use Internal Network Addressing where possible

When you start an EC2 instance, Amazon assigns it two IP addresses–an external name that can be used to access it from the internet, and an internal name. For most contexts, the external name is more useful, but when you are communicating within the cloud (pushing files around, or a database connection), prefer the internal DNS. It looks like there are some performance benefits, but there are definitely pricing benefits. “Always use the internal address when you are communicating between Amazon EC2 instances. This ensures that your network traffic follows the highest bandwidth, lowest cost, and lowest latency path through our network.” We actually used the internal DNS, but it makes more sense to use the IP address, as you don’t get any abstraction benefits from the internal DNS, which you don’t control–that takes a bit of mental adjustment for me.

Consider reserved instances

If you are planning to use Amazon for hosting, make sure you explore reserved instance pricing. For an upfront cost, you get significant savings on your runtime costs.

On Flexibility

You have a lot of flexibility with EC2–AMIs are essentially yours to customize as you want, starting up another node takes about 5 minutes, you control your own DNS, etc. However, there are some things that are set at startup time. Make sure you spend some time thinking about security groups (built in firewall rules)–they fall into this category. Switching between AMIs requires starting up a new instance. Right now we’re using DNS round robin to distribute load across multiple nodes, but we are planning to use elastic IPs which allow you to remap a routable ip address to a new instance without waiting for DNS timeouts. EBS volumes and instances they attach to must be in the same availability zone. None of these are groundbreaking news, it’s really just a matter of reading all the documentation, especially the FAQs.


Be aware that there are a ton of documentation, one set for each API release, for EC2 and the other web services that Amazon provides. Rather than starting with Google, which often leads you to an outdated version of documentation, you should probably start at the AWS documentation center. This is especially true if you’re working with any of the systems that are newer with perhaps not as stable an API.

In the end

Remember that, apart from new tools and a few catches, using EC2 is not that different than using a managed server where you don’t have access to the hardware. The best document I found on deploying drupal to EC2 doesn’t talk about EC2 at all–it focuses on the architecture of drupal (drupal 5 at that) and how to best scale that with additional servers.

Technorati Tags: , ,

Optimizing a distance calculation in a mysql query

If you have a query that sorts by a derived field, and then takes a limited number of the results, it can be a real dog.  Here’s how I optimized a situation like this.  Imagine this table.

create table office_building (
id int primary key,
latitude float not null,
longitude float not null,
rent int,
address varchar(20),
picture_url varchar(255)

If you want to find the nearest 100 office buildings to a point on a map, you run a query something like this (plug your lat/lng into the question marks):

explain select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1) +
((longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building order by distance limit 100

(See here for an explanation of the 69.1 and 53 constants--basically they convert roughly from lat/lng to miles.) Unfortunately, you are ordering by a derived field, and mysql can no longer do order by optimization.

This means that you'll be doing a filesort (which does not actually have anything to do with the filesystem, but is just a sort not on an index).  And this, in turn means that your performance will suck if you have any large number of rows returned.

You can help things out a bit by limiting your office building query to a box of a certain size around the point.  Here's the query with a 5 mile box:

select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1 ) +
( (longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building
where latitude < ?  + (1/69.1)*5 and latitude > ? - (1/69.1)*5 and longitude < ? + (1/53)*5
and longitude > ? - (1/53)*5
order by distance limit 100

But if you still have too many results, the sorting on distance will be slow.  Also, even if you have an index on latitude and longitude, (such as create index idx_nearby on office_building (latitude,longitude)) because you are not using equality, only the first column will be used.

This is worth repeating, because it took me a while to understand.  If you have an index: create index idx on tbl (col1,col2,col3,col4,col5) and you run a query like select count(*) from tbl where col1 = 1 and col2 > 2 and col3 < 3 and col4 > 4 only col1 and col2 will be used from the index.  Mysql goes to the table data files for col3 and beyond (assuming no other indices on the table).  This makes sense when you think about how indices are created and stored, but I didn't really understand it until I'd been beaten over the head with it.

As stated here: "[mysql] will use the fields [in the index], from left to right, as long as the WHERE clause has "=". Once it hits a 'range' (>, IN, BETWEEN, ...), it stops with that field."  I don't know why it is not in the mysql index documentation--maybe it is obvious?

The solution I found was to separate what I wanted to find in the select clause from how I find it, in the where and order by clause:

select select_clause.*,
round( sqrt( ( ( where_clause.latitude - ?) * (where_clause.latitude - ? ) *  69.1 * 69.1 ) +
( (where_clause.longitude - ? ) *(where_clause.longitude - ? ) * 53 * 53 ) ) ) as distance
from office_building where_clause, office_building select_clause
where where_clause.latitude < ? + (1/69.1)*5
and where_clause.latitude > ? - (1/69.1)*5
and where_clause.longitude < ? + (1/53)*5 and
where_clause.longitude > ? - (1/53)*5
and where_clause.id = select_clause.id
order by distance
limit 100

You also need to add an index:

create index idx_nearby on office_building (latitude,longitude,id);

Then, when you run the query, you still have the filesort, but you also see the magic 'Using index' in your explain plan.  You never have to go to the table to do the sort!  You also have a join now, but it's on the primary key, and you only need to go to the table for the 100 rows that you know you want.

Using this query had an effect on one live system of one to two orders of magnitude increase in query speed, depending on the query.  This not only works for distance queries, but anytime you want to order by a calculated value.

More useful links: geo search suggestions, index explanation

Technorati Tags: , ,

© Moore Consulting, 2003-2017 +