Skip to content

Small scale data migrations

So, I’ve recently been involved in another data migration, the second one in three years.  These are small migrations, with thousands of records. One person could take care of this size of data migration with effort, but the amount of data is still large enough that manual data re-entry isn’t really an option–the error rate and the cost and the management difficulty mean that a software solution is the better option.

Here are some lessons I learned from these data migrations.

Learn as much as you can about the data models–both the old and the new–as you can.  This includes, in preferred order, talking to any people familiar with the old system, talking to any people familiar with the new system, looking at the databases via a sql client, reading documentation (if any is written), and looking at code.  I spent some time thrashing around in old system code for a while.  Then I asked the developer for a tour, and learned more in that hour than I had in the previous day of looking at code.

Map entities and concepts as early as you can.  Take special note of any that are in the old and not in the new (and what you are planning to do with them).  Those that are in the new and not in the old aren’t as big of an issue.  Also, attributes of entities are as important as entities, so note discrepancies there.  Early on I noticed that one of the two primary entities in the old system did not exist in the new system.  This led to some interesting conversations with the business users that saved me work.

As above, talk to people who are going to be using the new system, and who use the old system, throughout the migration process.  An entity or attribute that will be a royal pain to migrate may not be used anymore!  Or, the business person might have some good ideas on how to map something in the old system into the new system.  Someone who uses the software you are migrating has more domain expertise than you.  Let them try the new system with migrated data as soon as some data is moved. Make sure to guide their experience so they don’t spin their wheels looking in corners of the system that not yet migrated.

Start a spreadsheet of tasks. Doing so means that every time you uncover something that needs to be done while you are in the process of doing something else, you can note it and keep on your original task.  My spreadsheets are simple; three columns are enough: task name, completed (with an X for completion, blank for still open) and notes (for possible implementation solutions, people to talk to, relevant URLs, or any other text that will help me complete the task).

Document all the migration steps, preferably to the point you can cut and paste commands.  Include any discrepancies discovered, special commands to run, access to all needed systems, names of relevant people, areas that need further investigation, and basically anything else you would want handed to you if you were starting on this project.  This helps immensely if you need to pass off the project, or come back to it later (even just a few days), and provides documentation of entities on the old and new system.

Write scripts wherever possible, but don’t try to script the whole process. Access to different servers can be hard to automate.  Use whatever language you feel is best for these scripts.  I’ve used bash, sql, perl, and awk/sed, but I don’t shy away from a compiled language like java, especially if a library exists that can save me time.  Make sure to put these scripts into version control, and document the purpose with comments at the top and a good name.  I wouldn’t worry too much about unit testing or refactoring this software, because chances are it will be seldom used once the migration occurs.

Get familiar with the concatenate function of your database.  Using queries to write DDL for the new system based on data from the old system can save you writing a script in an imperative language.  When migrating from Expression Engine to WordPress, I used a statement like <code>select concat(‘update wp_comments set comment_author_email = ”’,email,”’ where comment_author = ”’,name,”’;’) from exp_comments where name in (select distinct(name) from exp_comments);</code> to generate an update statement for WordPress for each comment author in the EE database.

Think about data types and representations.  Especially if you are moving from one database to another.  When I was moving from MSSQL to MySQL, date fields were particularly thorny.

Realize that these types of projects are typically difficult slogs.  There were moments where I despaired of ever getting through the migration in a timely fashion.  To do it right, you need a fantastic attention to detail, an understanding of the business needs, and an ability to drive things through to the finish.  All of this can be pretty draining–I find it far more draining than bug fixing or building new features.

Control the old and new systems–try to not have new capabilities added during the migration.  If you can’t guarantee that, can the migration wait until the new and old systems stabilize?  If not, checkpoint the migration against the new capabilities during the process, and realize that you are introducing a lot of extra work and complexity into an already complex process.

Have a staging system where you can practice your migrations without affecting anyone.  Plan to go through at least two or three of these new staging systems so that you can get the migration steps solid before you touch production.  Start from a clean slate each time so no time is spent chasing phantom bugs from a previous migration that didn’t finish or wasn’t entirely correct.  This is what makes the migration documentation you write so important.  Be aware that the new stage system and the new production system will not necessarily be the same.

Lastly, avoid committing to a schedule if at all possible.  And if you must, pad it and only commit after you’ve done a thorough analysis.  Because there are so many hidey holes and areas of the old system that you won’t understand, there is a high probability that you’ll be discovering new issues and data you need to migrate halfway through the project.  (This is a special case of the requirements nightmare known as ‘build system B that acts exactly like system A’.)  Communicate progress to the business.

While this is not my favorite type of project, when done well it can have tremendous business value.  Combining newer, more flexible systems with rich older data, without re keying the data, can make system users much happier.  In some cases, if there is no migration, the newer system simply can’t be used.