Skip to content

Always break rails migrations into smallest chunks possible, and other lessons learned

So this was a bit of a sticky wicket that I recently extracted myself from and I wanted to make notes so I didn’t make the same mistake again. I was adding a new table that related two existing tables and added the following code

class CreateTfcListingPeople < ActiveRecord::Migration
  def change
    create_table :tfc_listing_people do |t|
      t.integer :listing_id, index: true
      t.string :person_id, limit: 22, index: true

      t.timestamps null: false

    add_foreign_key :tfc_listing_people, :people
    add_foreign_key :tfc_listing_people, :listings


However, I didn’t notice that the datatype of the column (which is a varchar) was `id` varchar(22) COLLATE utf8_unicode_ci NOT NULL

This led to the following error popping up in one of the non production environments:

2018-02-27T17:10:05.277434+00:00 app[web.1]: App 132 stdout: ActionView::Template::Error (Mysql2::Error: Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8_general_ci,IMPLICIT) for operation '=': SELECT COUNT(*) FROM `people` INNER JOIN `tfc_listing_people` ON `people`.`id` = `tfc_listing_people`.`person_id` WHERE `tfc_listing_people`.`listing_id` = 42):

I was able to fix this with the following alter statement (from this SO post): ALTER TABLE `tfc_listing_people` CHANGE `person_id` `person_id` VARCHAR( 22 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL.

But in other environments, there was no runtime error. There was, however, a partially failed migration, that had been masked by some other test failures and some process failures, since there was a team handoff that masked it. The create table statement had succeeded, but the add_foreign_key :tfc_listing_people, :people migration had failed.

I ran this migration statement a few times (pointer on how to do that): ActiveRecord::Migration.add_foreign_key :tfc_listing_people, :people and, via this SO answer, I was able to find the latest foreign key error message:

2018-03-06 13:23:29 0x2b1565330700 Error in foreign key constraint of table sharetribe_production/#sql-2c93_4a44d:
 FOREIGN KEY (person_id)  REFERENCES people (id): Cannot find an index in the referenced table where the
referenced columns appear as the first columns, or column types in the table and the referenced table do not match for constraint. Note that the internal storage type of ENUM and SET changed in tables created with >= InnoDB-4.1.12, and such columns in old tables cannot be referenced by such columns in new tables.
Please refer to for correct foreign key definition.

So, again, just running the alter statement to change the collation of the tfc_listing_people table worked fine. However, while I could handcraft the fix on both staging and production and did so, I needed a way to have this change captured in a migration or two. I split apart the first migration into two migrations. The first created the tfc_listing_people table, and the second looked like this:

class ModifyTfcListingPeople < ActiveRecord::Migration
  def up
    execute <<-SQL
      ALTER TABLE  `tfc_listing_people` CHANGE  `person_id`  `person_id` VARCHAR( 22 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL

    add_foreign_key :tfc_listing_people, :people
    add_foreign_key :tfc_listing_people, :listings
  def down
    drop_foreign_key :tfc_listing_people, :people
    drop_foreign_key :tfc_listing_people, :listings

Because I’d hand crafted the fixes on staging and production, I manually inserted a value for this migration into the schema_migrations table to indicate that the migration had been run in those environments. If I hadn’t had two related but different migration actions, I might not have had to go through these manual gyrations.

My lessons from this episode:

  • pay close attention to any errors and failed tests, no matter how innocuous. This is a variation of the “broken window theory”
  • break migrations into small pieces, which are easier to debug and to migrate back and forth
  • knowing SQL and having an understanding of how database migrations work (they are cool, but they aren’t magic, and sometimes they leak) was crucial to debugging this issue

Posting to REST APIs from mysql triggers

This was a fun crazy idea that turned out not to be so good in practice. If you have a mysql table you are monitoring for changes, you can use a trigger to do so (as long as you have a semi-modern version of msyql).

Sometimes you might want to notify another service of any change (remote logging service, message queue, etc). For instance, at 8z, when the price of a listing changes, this is an interesting event that other software should be notified of.

The first step is to install the mysql http UDF (all commands below are for centos).

$ sudo yum install curl-devel

$ CPPFLAGS="-I/usr/include/mysql"  ./configure --with-mysql=/etc/my.cnf   --enable-shared

$ make

This gives us a .la file (built with libtool) but luckily there’s a .so file hiding:

sudo cp .libs/ /usr/lib64/mysql/plugin/

Then, your remote rest service will probably want JSON, so you’ll want to visit the mysql udf hub. There lives a JSON function. I just grabbed the C file from github and compiled it: gcc -fPIC -shared -o -I/usr/include/mysql/  lib_mysqludf_json.c

After these are compiled you have to copy the .so files to the appropriate directory and then add functions as a privileged user, per usual UDF convention.

Now you can create a trigger that calls these functions.

delimiter $
    IF NEW.amount > 0 THEN 
      set @json = select json_object(account_id,amount) 
      select http_post('',@json); 
    END IF; 


Note that I didn’t actually implement this, so the code up above is based on my memory. If you try this and have some corrections, please leave them in the comments and I’ll update this post.

Why didn’t I actually implement this, after doing the better part of a day’s worth of research? The database is not the best place for this kind of logic. Error handling for triggers is weak–if the trigger failed for any reason (like the remote service was down), you would need to build an error logging system, or send an email. Also, if there are a number of updated rows, which all trigger outbound http calls, you might run into performance issues which would be difficult to replicate or analyze, and, most importantly, might impact your database’s ability to act as a database. The three tier architecture exists for a reason.

But, it was fun to investigate, and, as my colleague said, would have been cool if it had worked. If you are still interested, there’s more on this topic here.

Update Jan 2021: The HTTP UDF is now here and a reader pointed out they had trouble with the above instructions, but that a golang UDF implementation worked fine.

Simple REST database solutions

I’ve been looking for a simple solution to generate a REST api for a mysql database, with minimal to no coding.  I need to be able to do simple inserts and updates, and some simple querying (including for number ranges and or clauses).

The options are few and none seems to entirely fit the bill.

  • RestSQL is a java layer that has a simple XML configuration.  It supports some advanced features (triggers for biz logic, table composition) but doesn’t yet have numeric comparisons for gets.
  • DBSlayer is a C program that doesn’t do REST, but does JSON results for SQL queries over HTTP
  • PHPRest is a RESTful interface.  From the docs, it is not clear if it supports JSON or any queries more complex than by primary key.

Turning on hibernate query caching

Sometimes I write a blog post because I want to say something to others, other times because I want to help others, and sometimes I write one for myself.  This is one of the latter type of posts.

I find that nothing drives home a lesson I learned like writing out the solution on my blog, and more than one time I’ve searched my blog because I remember I had written about a topic of current interest in the past.

If you want hibernate, a java ORM tool, to cache your query results, there’s a section in the hibernate manual. Like most of the hibernate documentation, it’s quite good.  One thing that caught me up, though, is that I thought that all queries were cached, as soon as you had set up the correct configuration in your hibernate xml files and marked your objects as cacheable.

However, this is not the case,  as I learned when I turned on the mysql query log and looked at the calls the interesting web application was making.  I did quite a bit of searching, but the answer was in the same section that told me how to set up the configuration (as well as in this forum post).

…most queries do not benefit from caching or their results. So by default, individual queries are not cached even after enabling query caching. To enable results caching for a particular query, call org.hibernate.Query.setCacheable(true).

Doh!  Following these instructions decreased the page load time for some heavy query pages from 8 seconds to 0.5 seconds (the second time the page was loaded).  Impressive indeed, though the production environment might not see such drastic improvements.

One additional note–you can use query caching for SQLQueries (like group operations) by telling hibernate what the type is of each returned value.  If you don’t use addScalar, but you do mark a SQLQuery as cacheable, you’ll get a java.lang.ArrayIndexOutOfBoundsException when that code executes.  More here.

These articles were helpful to me as I navigated Hibernate caching, above and beyond the reference documentation:

Moving data from MS-SQL to mysql

I recently worked on a project where I needed to port data from a MS-SQL database to a mysql database.  There are programs, both payware and freeware that will help with this process, but I didn’t have ODBC access to the MS-SQL database, which these programs require.  All I had were a bunch of insert statements that looked like this:

[online],[gender]) VALUES(463,N'15',N'7',N'2009',N'9:13:52 AM',N'Homer',
N'234 Main',N'Alaska',N'Springfield',NULL,N''),
CAST(0x9AE90000 AS SmallDateTime),
CAST(0x00009CD700000000 AS DateTime),N'Man');

I wrote a perl script to turn that dialect of SQL into a mysql friendly dialect (feel free to download it).

The most interesting parts were those CAST statements.  This forum post and this blog post helped me turn those casts into real dates.  (After loading the inserts into mysql, I did some post processing, using the helpful case statement and str_to_date function to rationalize some of the data.)

[tags]mssql, mysql, data migration[/tags]

Hadoop logging case study

Here’s a fascinating case study of the evolution of a (large!) log analysis toolset. It follows Rackspace as they strive to allow searching logfiles from an ever growing number of servers, starting from manual ssh/grep, through various incarnations of a MySQL database, and ending up using Hadoop/Lucene/Solr.  There’s (some) more info on the rackspace blog.  Cool stuff.

Via George Fairbanks.

[tags]lucene, solr,case study,hadoop[/tags]

Optimizing a distance calculation in a mysql query

If you have a query that sorts by a derived field, and then takes a limited number of the results, it can be a real dog.  Here’s how I optimized a situation like this.  Imagine this table.

create table office_building (
id int primary key,
latitude float not null,
longitude float not null,
rent int,
address varchar(20),
picture_url varchar(255)

If you want to find the nearest 100 office buildings to a point on a map, you run a query something like this (plug your lat/lng into the question marks):

explain select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1) +
((longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building order by distance limit 100

(See here for an explanation of the 69.1 and 53 constants–basically they convert roughly from lat/lng to miles.) Unfortunately, you are ordering by a derived field, and mysql can no longer do order by optimization.

This means that you’ll be doing a filesort (which does not actually have anything to do with the filesystem, but is just a sort not on an index).  And this, in turn means that your performance will suck if you have any large number of rows returned.

You can help things out a bit by limiting your office building query to a box of a certain size around the point.  Here’s the query with a 5 mile box:

select *, round( sqrt( ( ( (latitude - ?) * (latitude - ?) ) *  69.1 * 69.1 ) +
( (longitude - ?) * (longitude - ?) * 53 * 53 ) ) ) as distance
from office_building
where latitude < ?  + (1/69.1)*5 and latitude > ? - (1/69.1)*5 and longitude < ? + (1/53)*5 and longitude > ? - (1/53)*5
order by distance limit 100

But if you still have too many results, the sorting on distance will be slow.  Also, even if you have an index on latitude and longitude, (such as create index idx_nearby on office_building (latitude,longitude)) because you are not using equality, only the first column will be used.

This is worth repeating, because it took me a while to understand.  If you have an index: create index idx on tbl (col1,col2,col3,col4,col5) and you run a query like select count(*) from tbl where col1 = 1 and col2 > 2 and col3 < 3 and col4 > 4 only col1 and col2 will be used from the index.  Mysql goes to the table data files for col3 and beyond (assuming no other indices on the table).  This makes sense when you think about how indices are created and stored, but I didn’t really understand it until I’d been beaten over the head with it.

As stated here: “[mysql] will use the fields [in the index], from left to right, as long as the WHERE clause has “=”. Once it hits a ‘range’ (>, IN, BETWEEN, …), it stops with that field.”  I don’t know why it is not in the mysql index documentation–maybe it is obvious?

The solution I found was to separate what I wanted to find in the select clause from how I find it, in the where and order by clause:

select select_clause.*,
round( sqrt( ( ( where_clause.latitude - ?) * (where_clause.latitude - ? ) *  69.1 * 69.1 ) +
( (where_clause.longitude - ? ) *(where_clause.longitude - ? ) * 53 * 53 ) ) ) as distance
from office_building where_clause, office_building select_clause
where where_clause.latitude < ? + (1/69.1)*5 and where_clause.latitude > ? - (1/69.1)*5
and where_clause.longitude < ? + (1/53)*5 and where_clause.longitude > ? - (1/53)*5
and =
order by distance
limit 100

You also need to add an index:

create index idx_nearby on office_building (latitude,longitude,id);

Then, when you run the query, you still have the filesort, but you also see the magic ‘Using index’ in your explain plan.  You never have to go to the table to do the sort!  You also have a join now, but it’s on the primary key, and you only need to go to the table for the 100 rows that you know you want.

Using this query had an effect on one live system of one to two orders of magnitude increase in query speed, depending on the query.  This not only works for distance queries, but anytime you want to order by a calculated value.

More useful links: geo search suggestions, index explanation

[tags]mysql, performance, query optimization[/tags]

Using phpMyAdmin without the “Show Databases” privilege

phpMyAdmin is a pretty cool piece of software, and a very useful tool. If you haven’t used it before, it’s a full featured web-based interface to MySQL databases. You can use it to add data, update data, delete data, export data, and basically any other management of the database you might need.

I ran into an issue the other day. I was trying to allow phpMyAdmin to access a database on a remote host. The user that phpMyAdmin was connecting as didn’t have the “show databases” privilege (I imagine this is common in shared hosting environments, which is what this was). This, apparently, is what phpMyAdmin uses to populate the drop-down of databases on the left-hand side after you login. Since it didn’t display that drop-down, there is no way of selecting the database to which this user did have access.

I searched for a while, but couldn’t find anyone else with this problem. So I thought I would post the solution I found.

The solution is to hard code authentication data for the remote server in my file.  Then, you append the server and the database that you want to connect to the phpMyAdmin url.

In the file:
$cfg['Servers'][$i]['host'] = '';
$cfg['Servers'][$i]['user'] = 'user';
$cfg['Servers'][$i]['password'] = 'pass';

In the url that normal users use to access the database:

The left hand side still isn’t populated with any databases. But, this allows you to go directly to the database to which you do have access, and perform whatever tasks you want (and have permissions for). I tried adding the “db” parameter to the file, but that didn’t seem to work. I didn’t try using any other form of authentication than hardcoded, as I wanted to make this as simple as possible for the phpMyAdmin users.

I was running an older version of phpMyAdmin (; I didn’t test this on newer versions.  If you do, please post a comment and let me know if this still works.

[tags]phpmyadmin,mysql privileges,remote database access[/tags]

MySQL and memcached: notes from a webinar

Just listened to another mysql webinar. This time it was about mysql and memcached. I got in a little late, but here are my notes.

Memcached is a distributed caching mechanism. Not as robust as Coherence, for example, but free, open source, and widely used and supported.

There is a project that hooks mysql UDFs to memecached.

The main speaker worked at Grazr, a feed manipulation company. He presented some lessons and stated that some seemed obvious, but we all know how it is, you learn mistakes the hard way.

  1. Beware architecture momentum–early decisions influence later ones.
  2. Your scaling plan is wrong in some way!–grazr overbuilt.
  3. Testing is limited: they had issues with replication speed. Users are better at testing site than any perl script.
  4. Replication is fast, until it isn’t. It’s asynchronous nature can make it hard to program to.
  5. Memcached is your friend
  6. Sphinx, an open source SQL full text search engine (like Lucene), is your friend. It also has a mysql storage engine–virtual table that you can join with other data sources. This sounds cool.
  7. Do bulk insert and lazy writes–if you don’t need to immediately read it, do the write later.
  8. Emphasis on scaling hurt user experience. Fast, data correct transaction are available with two options: one server, or a sync buffer like memcached.
  9. Use instrumentation–he mentioned tools like nagios and cacti (cacti has mysql plugin [which I couldn’t find]) to help scaling–close the loop: when you make a change, find out what that change did.
  10. Try new things–for example, the MySQL blackhole storage engine for data that is only read on slaves, not on the master.
  11. Everyone has same problems–avoid Not Invented Here syndrome. Not just the obvious stuff (who writes a web server these days–oh wait), for example, there’s a batch job processor like gearman. Note that he also said they had to Invent a few things before they learned this lesson.
  12. Accept change–design for now+1, not now+100
  13. Listen–even folks not intimately familiar with your technology may have good ideas, and more brains are better.

End notes

Looks like a couple of interesting whitepapers are available (you’ll have to register and get pinged via email a few times by sales folks): “Designing and Implementing Scalable Applications with Memcached and MySQL”. And there are future webinars on the MySQL/memcached topic, one on SAAS applications and one on Scalable web architecture.

In a meta note, I enjoy the webinars because:

  1. Easy to get to
  2. Free
  3. High quality
  4. One can bail if you want (which I did a few weeks ago when MySQL proxy didn’t turn out to be all that interesting to me).
  5. The give me something to blog about 🙂

[tags]webinar, memcached, users are the best testers[/tags]

My experience at the MySQL Performance Coding Webinar

Last Tuesday, I attended a MySQL webinar. I registered on the MySQL website (with a site-specific email address, of course) and periodically am invited to these webinars. I’d tried to attend in the past, but something else (usually billable) always interfered. Not this time!

The talk was titled “Performance Coding for MySQL” and the author, Jay Pipes, did a fantastic job. (He is also the co-author of Pro MySQL.) The slides from the presentation are up, and he also answered questions sent to him during the presentation in some detail as well. His presentation, about an hour in length, covered both basics (like, normalize first (slide 4), think in sets rather than iterators (slides 20-23)–basic, but not intuitive), and under the hood intricacies (like, think about the size of your primary keys and consider record size (slide 6), avoiding deletes with MyISAM (slide 27) and vertical partitioning to take advantage of the query cache (slides 9-11) ) . He also pointed to a script that he wrote to find useless indices.

Well worth my time. Thanks MySQL and Jay, for making a resource like this available and free! There’s a whole lot more, so I’d recommend downloading the slides and giving them a run through, if you interact with MySQL as a DBA or a developer (or, as is often the case, both).

(On that note, I’d like to recommend the MySQL DBA blog for your perusal–apparently recently renamed the ‘Senior MySQL DBA’ blog, heh.)

[tags]mysql dba, think in sets, webinar[/tags]