Skip to content

Small scale data migrations

So, I’ve recently been involved in another data migration, the second one in three years.  These are small migrations, with thousands records.  One person to take care of this size of data migration with some effort, but the amount of data is still large enough that manual data re-entry isn’t really an option–the error rate and the cost and the management difficulty mean that software is the better option.

Here are some lessons I learned from these data migrations.

Learn as much as you can about the data models–both the old and the new–as you can.  This includes, in preferred order, talkingto any people familiar with the old system, talking to any people familiar with the new system, looking at the databases via a sql client, reading documentation (if any is written), and looking at code.  I spent some time thrashing around in old system code for a while.  Then I asked the developer for a tour, and learned more in that hour than I had in the previous day of looking at code.

Map entities and concepts as early as you can.  Take special note of any that are in the old and not in the new (and what you are planning to do with them).  Those that are in the new and not in the old aren’t as big of an issue.  Also, attributes of entities are as important as entities, so note discrepancies there.  Early on I noticed that one of the two primary entities in the old system did not exist in the new system.  This led to some interesting conversations with the business users that saved me work.

As above, talk to people who are going to be using the new system, and who use the old system, throughout the migration process.  An entity or attribute that will be a royal pain to migrate may not be used anymore!  Or, the business person might have some good ideas on how to map something in the old system into the new system.  Someone who uses the software you are migrating has more domain expertise than you.  Let them try the new system with migrated data as soon as some data is moved. Make sure to guide their experience so they don’t spin their wheels looking in corners of the system that not yet migrated.

Start a spreadsheet of tasks, so that everytime you uncover something that needs to be done while you are in the process of doing something else, you can note it and keep on your original task.  My spreadsheets are simple; three columns are enough: task name, completed (with an X for completion, blank for still open) and notes (for possible implementation solutions, people to talk to, relevant urls, or any other text that will help me complete the task).

Document all the migration steps, preferably to the point you can cut and paste commands.  Include any discrepancies discovered, special commands to run, access to all needed systems, names of relevant people, areas that need further investigation, and basically anything else you would want handed to you if you were starting on this project.  This helps immensely if you need to pass off the project, or come back to it later (even just a few days), and provides documentation of entities on the old and new system.

Write scripts wherever possible, but don’t try to script the whole process–access to different servers can be hard to automate.  Use whatever language you feel is best for these scripts.  I’ve used bash, sql, perl, and awk/sed, but I don’t shy away from a compiled language like java, especially if a library exists that can save me time.  Make sure to put these scripts into version control, and document the purpose with comments at the top and a good name.  I wouldn’t worry too much about unit testing or refactoring this software, because chances are it will be seldom used once the migration occurs.

Get familiar with the concatenate function of your database.  Using queries to write DDL for the new system based on data from the old system can save you writing a script in an imperative language.  When migrating from Expression Engine to WordPress, I used a statement like <code>select concat(‘update wp_comments set comment_author_email = ”’,email,”’ where comment_author = ”’,name,”’;’) from exp_comments where name in (select distinct(name) from exp_comments);</code> to generate an update statement for WordPress for each comment author in the EE database.

Think about data types and representations.  Especially if you are moving from one database to another.  When I was moving from MSSQL to MySQL, date fields were particularly thorny.

Realize that these types of projects are typically difficult slogs.  There were moments where I despaired of ever getting through the migration in a timely fashion.  To do it right, you need a fantastic attention to detail, an understanding of the business needs, and an ability to drive things through to the finish.  All of this can be pretty draining–I find it far more draining than bug fixing or building new features.

Control the old and new systems–try to not have new capabilities added during the migration.  If you can’t guarantee that, can the migration wait until the new and old systems stabilize?  If not, checkpoint the migration against the new capabilities during the process, and realize that you are introducing a lot of extra work and complexity into an already complex process.

Have a staging system where you can practice your migrations without affecting anyone.  Plan to go through at least two or three of these new staging systems so that you can get the migration steps solid before you touch production.  Start from a clean slate each time so no time is spent chasing phantom bugs from a previous migration that didn’t finish or wasn’t entirely correct.  This is what makes the migration documentation you write so important.  Be aware that the new stage system and the new production system will not necessarily be the same.

Lastly, avoid committing to a schedule if at all possible.  And if you must, pad it and only commit after you’ve done a thorough analysis.  Because there are so many hidey holes and areas of the old system that you won’t understand, there is a high probability that you’ll be discovering new issues and data you need to migrate halfway through the project.  (This is a special case of the requirements nightmare known as ‘build system B that acts exactly like system A’.)  Communicate progress to the business.

While this is not my favorite type of project, when done well it can have tremendous business value.  Combining newer, more flexible systems with rich older data, without re keying the data, can make system users much happier.  In some cases, if there is no migration, the newer system simply can’t be used.

Tips for using your phone as a wifi hotspot

I recently was working remotely.  Typically when I do this, I try to find a coffee shop or library–someplace I can park for free or cheap and get access to wifi.

On this trip, for a variety of reasons, finding such a place was going to be more of a hassle than usual.  I had noticed my phone had wifi hotspot capability in the past, so I decided to try it out.

I have AT&T service, and an Android phone (an HTC Inspire running Android 2.3.3).  With my existing 2GB/month data plan, I could not turn on wifi hotspot cability.

However, a quick call to AT&T customer service revealed that the only obstacle was upgrading to a 5GB/month data plan, and that I could downgrade easily if I wasn’t happy.  The change would take place immediately (or at the next billing cycle if I desired). After a day of deliberation, I placed the order, and started using my phone as a wifi hotspot.

Pluses

  • great to work from anywhere where I can get a decent signal
  • able to use go to meeting, video, browse the web, email, skype–there was really no online tool I was not able to use
  • you can control number of users and apply access control

Minuses

  • Speed was like DSL (I had 2 bars most of the time and was on the 4G network, but the phone was a 3G hotspot–not sure what that means)
  • A variety of things could cause internet access to stop: incoming phone calls, downloading lots of data (photoblogs), having three or more browser tabs downloading data, trying to upload a photo and download data in another tab
  • Internet access would stop working every thirty minutes or so with normal usage, requiring me to ‘reset the router’ by turning off the hotspot and mobile data, then turning them back on

Tips

  • Use an offline mail client (thunderbird is my favorite, even though it is no longer being developed).  This will let you respond to emails without using notepad or some other desktop app to capture your thoughts.
  • Practice patience, and if you absolutely must have bulletproof network access, go to a coffee shop.
  • Focus browsing on crucial pages.
  • Having another phone available (for web meetings or regular phone calls) is really helpful
  • Have someone else host skype calls–trying to host didn’t work, but I could participate just fine if someone else hosted

Even though the experience wasn’t optimal, frankly it is pretty amazing that I can use my computer and have reasonable access to the internets from anywhere that has a decent phone signal.

Review of Emergent One

A few weeks ago I sat down with some folks at Emergent One and got a demo of their product.  The reason I reached out to them is because I heard great things about their demo at GlueCon.  My company is considering building a mobile app (who isn’t, right!) and I thought that what EmergentOne offered was a great way to accelerate the server side development of that app.

We are slightly outside of their target market–our API would be solely for internal use of a small team, while it seems like they are aiming at companies who want to make an API available to a larger audience (either external or a larger internal development staff).

Regardless of target market, they have a slick product.  They have a self service web application which can generate APIs directly from database tables.  You allow the app select access to your database.  (I believe only mysql and postgresql are supported at the moment, but I know things are moving fast over there as well.)  You then work within the app to build an API based on the tables in your database.  You can have derived fields as well as fields that map to columns in your database directly.  You can filter your data (so if you only want to expose a subset of your data, you can create an endpoint that only displays that: “users over 40”, for example).  You can also add comments to fields.

After you define as many API endpoints as you want, you can manage access to them with application keys or usernames and passwords.  Automatic documentation with datatypes and whatever comments you have added is generated, and there is a developer portal where it is easy to play around with the APIs and see what you missed.

What they showed me is great, but this product is still in private beta.  That means there are some rough edges.  The biggest hole (a fix for which the demoer promised was coming very soon in their development plan) is that you can’t search against the API.  So, if you have an API exposing your pets table, you can create an endpoint to retrieve all pets, and you can retrieve one pet based on id.  You can create a ‘dogs-only’ endpoint, and get all dogs or one dog.  But you can’t query the dogs-only endpoint for dogs that weigh over 25 pounds and have short hair (or any other type of querying).  I only played with read-only APIs, so I’m not sure how the write-access APIs work.

There’s also always the issue of introducing another vendor into the system.  Since we are looking at this for mobile apps, performance is very important.  It seemed like the demoer was well aware of this issue.  He mentioned an SLA would be likely when they went public, and also talked about some of the steps they are taking to make sure their app uses indexes and other metadata about the tables being exposed to execute as quickly as possible.

I haven’t built an API with any of the other tools out there, so I can’t compare the ease of Emergent One with, say, tools like jboss resteasy that work with a java layer, or usergrid, which autogenerates an API but requires moving data into it.  But I can say that this was a very easy way to go database to API in less than an hour (with iptables troubleshooting mixed in!).  If they get searching right–making it easy to use and performant–this will be a fantastic product.

Why isn’t everyone a developer?

I was at a party recently kvetching with some fellow developers and managers of developers about how hard it is to find good talent here in Colorado.  One fellow said he feared a return of the last days of the 1990s, when the hiring bar, and thus the quality of code produced, was drastically lowered.

On the way home, my SO asked why, in a time where many are having difficulty finding jobs, why everyone wasn’t a software developer.  At first, I thought, why not ask “why isn’t everyone a doctor; we have a shortage of those”, but then I realized that it is not applicable.  Especially if you are willing to be a junior developer, the low educational and licensure hurdles to becoming a developer, the remuneration, and the flexibility and intellectual challenges combine to make it a very attractive job.  (Given I am a developer, I may be biased.)

I think there are five attributes you need to have, that progressively fewer people meet.  (I’m purposely ignoring experience programming, because everyone was junior at one time.)

  1. Willingness to work in an office–it’s not everyone who wants to drive (or otherwise commute to an office), sit on their butt every day, deal with office politics, and miss out on natural light.  Even work in a home office or a coffee shop, while alleviating a some of the above issues, is a lot of sitting.
  2. Computer literacy–a developer has to have a basic level of comfort with a computer.  This is a filter that will probably become less relevant for middle class Americans as younger people who grew up with a computer reach the workforce, but for a substantial number of workers, computer interaction is tedious, confusing and irritating.
  3. A logical mind.  Computers aren’t very squishy (fuzzy logic aside) and while logic can be learned by anyone, it takes work to think of the world in the kind of systems that can be modeled in software.
  4. Initiative.  You’ll notice that throughout this, I talk about a developer, not a programmer.  That’s because a programmer, someone who takes a set of specifications and codes them, is a job that is rapidly disappearing.  Developers, who have a broader skill set, including, if need be, the entire stack from business analysis to coding to testing to deployment to support, need to have initiative.  Initiative is, incidentally, a precious attribute for any position.
  5. Lastly, I come to the actual skills of the job–coding, business analysis, testing, etc.  This is the last hurdle, and one of the smaller ones.

So, there are many good reasons why everyone isn’t a software developer, even though there’s plenty of software to be written.

Iterating over a range in Google Spreadsheets

Using the macro facility in Google Spreadsheets can be quite powerful, but I’ve found the documentation difficult to navigate.

For future reference, here is how to iterate over a range of cells that has been predefined (using a named range or in some other manner).

var rangeName = 'mailingdate';
var range = SpreadsheetApp.getActiveSpreadsheet().getRangeByName(mailingDateRangeName);
var arr = mailingDateRange.getValues();
for (var i = 0; i < arr.length; i++) {
var rowVal = arr[i][0];
if (rowVal.length < 2) { // 2 because of spaces
continue;
}
//... process non empty rows.
}

Note that this only processes a range with one column (hence the arr[i][0]. Ranges with multiple columns are left as an exercise for the reader.

What is the difference between a developer and a programmer?

In my mind, developers have skills that are a superset of the skills of a programmer.

Programmers:

  • understand how software works
  • think about the best way to solve problems they are presented with
  • write good code
  • learn new techniques, languages and manners of creating software
  • write documentation
  • write tests

Developers:

  • do all of the above
  • work with the business to define requirements
  • test beyond basic unit tests, up to and including user acceptance testing
  • understand basic system administration
  • do basic database tuning
  • deploy software and solutions
  • can do some UX design
  • can manipulate graphics
  • support and maintain what they’ve written

In other words, developers can span the gamut of needs–they are jack of all trades.  Not all positions are fit for developers because they can be a mile wide and an inch deep, but in general having someone who can handle the entire software lifecycle is more useful in more situations than having someone who is handed a spec and can crank out code.

(Note that I’ve never worked in a huge company for longer than a few months, so perhaps that is where programmers are more useful.)

How do you decide to pull the plug on a project?

Over the past six months, I’ve been involved with two projects that, to put it politely, didn’t go exactly as planned.

Project A was cancelled, after a significant amount of preparatory work had been done, and project B was just released, after a much more significant amount of effort was expended.

What caused one to be cancelled and the other to be continued?

  • Timing: A had a tight schedule, with a deadline that made sense from the business’ perspective.  When it became clear that the deadline would be missed, cancellation was a logical option.  B had goals but no deadline, which made it harder to make a go/no go decision
  • Risk: A required some server rejiggering that would have made it extremely difficult to roll back.  B, on the other hand, would have been a simple software re-release to roll back to previous, somewhat working, software.
  • Business impact: A was a ‘nice to have’ project, whereas B was addressing something that had caused employees and users pain for years.
  • Sunk costs: A had fewer sunk costs than B, which paradoxically made it easier to cancel (paradoxically, because you should not consider sunk costs when looking at an investment).

What is the difference between pushing through to finish a difficult project and polishing a turd that you really should abandon?  It’s a fine balance, and as project B dragged on, I wasn’t sure at times which path we were headed down.  As someone who loves to ship, it’s hard for me to give up on a project that I had a hand in building, but no business writes software in a vacuum, and those business needs can and do serve as valuable checkpoints on the software process.  Because of the huge business value, it made sense to push through on project B.

Sometimes business priorities change, or, as in the case of project A, deadlines impose a different set of priorities (beyond the purely technical).

Regardless, I think that the more I consider any projects I have dragging on through the lenses of timing, risk, return and sunk costs, the easier it will be to make a go/no go decision.

Real estate public records data sources

If you are looking for public records data sources (deeds, sales transactions, assessor records) in an electronic format, you have four options.  Normally this data is kept at the county level, so there haven’t been too many companies that have rolled it up into a nationwide database.  Even for the companies that have, there are timing and data integrity issues that you should be aware of before you commit.

I listed options here because I had to do some digging to find them, and wanted to spare you that.

First, here are some possible solutions that don’t work.

So, what are your four options?

  1. If you are focused on a narrow geographic area, or want to take on a huge project, you can contact each county and see if they have an offering.  I contacted one local county and they wanted $700 for a CD of assessor data.  This has some obvious downsides, but may be enough for your needs.  (I expect some counties won’t offer electronic versions, but that is speculation.)
  2. Core Logic
  3. Data Quick
  4. LPS SiteX

In the latter three cases, you want to call and ask to talk to someone about ‘public record licensing’ if you want to do anything interesting with the data (remix it, aggregate it, display it to your customers, etc).  Public record licensing is not as straighforward as MLS IDX data licensing (and that is saying something!).  It looks like each deal is semi-custom.

You’ll want to speak to a sales representative for full details like pricing, update timing and geographic coverage.  Hope this helps!

Leanpub: write a ‘lean book’

Ever wanted to write a book?  Me too!

Leanpub lets you write a book, but with a twist.  They’ve built a system where you can write portions of a book, and easily publish to the major ebook formats (PDF, .mobi, epub).  You write the book with Markdown, and can include code samples, images, tables, and sections.  Nothing really new there, though.

The revolutionary idea of Leanpub, and the reason it is ‘lean’, is that you can build your book piece by piece, and sell it whenever you have ‘minimum viable content’.  (You can also include sample content to let people see what they are buying.)  Once someone purchases your book, they receive all further updates.  This type of interaction with users can be very helpful–it spurs you on to finish your book (after all, someone paid for it) and also lets you know if your book idea has traction in the marketplace (did anyone buy the book), and builds an audience for your book slowly over time (going to a publisher with a list of people who’ve already bought your book is a lot stronger than going there with a first draft).

After you finish your book, you can submit it to all the other epublishing vendors (Amazon, B&N, etc) or a print on demand store.  (They also have support for building a book directly out of a blog, if you have written anything that structured.)

I have built 4 books on Leanpub.  Using CakePHP to generate Markdown, I was able to combine some articles about CSA I had written with a directory of relevant farms.  You can see the Denver guide.  I found the process fun, if not lucrative.

This is a great concept.  It really makes you, the author, focus on two areas of publishing that technology can’t help with: writing and marketing.  The best book in the world won’t have any sales if people don’t find it, and the most promoted book won’t sell if you don’t write it.

I did see some minor issues around footnoting, and not everyone will enjoy writing in plain text and ‘compiling’ the book to PDF (a process that takes around 30 seconds each time you do it), but the platform seems to be evolving (one of the founders is pretty active in the support google group).  All in all I think Leanpub is worth a look, especially if you are writing a book where a chapter or two will save people time in their job.

If you want to know more about lean publishing ideas, check out the manifesto.