I recently had to find a way to parse a street address into its component parts, and thought I’d share my adventure.
The idea is to take a string like “123 S Main Street” and break it apart into the street number (123), the street direction (S), the street name (Main) and the street type (Street).
At first, I thought that regular expressions would work, but the sheer variety of legal postal street addresses quickly dissuaded me, as did my boss’s misgivings.
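To make that concrete, here is a minimal sketch of the naive regex approach in Java. It is not what we ended up using: the class name `NaiveAddressParser`, the pattern, and the map keys are all made up for illustration, and it assumes the address is exactly "number, optional abbreviated direction, name, type".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: handles the simplest address shape and nothing else.
public class NaiveAddressParser {

    // number, optional abbreviated direction, lazily matched name, single-word type
    private static final Pattern SIMPLE = Pattern.compile(
            "^(\\d+)\\s+(?:(NE|NW|SE|SW|N|S|E|W)\\s+)?(.+?)\\s+(\\S+)$",
            Pattern.CASE_INSENSITIVE);

    /** Returns the parsed parts, or null when the address doesn't fit the pattern. */
    public static Map<String, String> parse(String address) {
        Matcher m = SIMPLE.matcher(address.trim());
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("number", m.group(1));
        parts.put("direction", m.group(2)); // null when no direction is present
        parts.put("name", m.group(3));
        parts.put("type", m.group(4));
        return parts;
    }

    public static void main(String[] args) {
        // The happy path from the example above
        System.out.println(parse("123 S Main Street"));
        // A unit number gets swallowed: name becomes "Main Street Apt", type becomes "4"
        System.out.println(parse("123 Main Street Apt 4"));
        // No leading street number, so no match at all
        System.out.println(parse("PO Box 123"));
    }
}
```

The first address splits cleanly, but the second puts "Apt" into the street name and reports "4" as the street type, and the third doesn't match at all. Every fix for cases like these means another branch in the pattern, which is roughly where I gave up on this route.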
Stack Overflow has a nice discussion of the problem, which gave me some additional pointers. There’s a commercial solution, available as a COM component or a web service, which I didn’t try. There is also a free web service provided by a university (you have to apply for access and give attribution) that did a great job (thanks, California taxpayers). This solution is also available in a for-pay variant.
Neither of these was desirable because we needed to parse a lot of addresses quickly, and calling out over the web can be slow. Some more digging turned up this Stack Overflow question and JGeocoder, which has a fairly robust address parser. It’s not perfect, but it’s free and open source. I am not sure if it is still in development (the author didn’t respond to my email), but it does what we need it to do.
As an added bonus, we’re using Pentaho for the data processing, and you can call Java classes directly from your data processing steps, so I didn’t even have to wrap the Java call in a shell script or anything.