I recently had to find a way to parse a street address into its component parts, and thought I’d share my adventure.
The idea is to take a string like “123 S Main Street” and break it apart into the street number (123), the street direction (S), the street name (Main) and the street type (Street).
At first, I thought that regular expressions would work, but the sheer variety of legal postal street addresses quickly dissuaded me, as did my boss’s misgivings.
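To make the regex problem concrete, here is a minimal sketch of the naive approach. The pattern, the short list of street types, and the `naiveSplit` name are all illustrative assumptions, not something taken from a library or from production code.

```javascript
// Naive pattern: number, optional one- or two-letter direction, name, type.
// The street types listed here are a tiny illustrative sample, not a full set.
var NAIVE_ADDRESS =
  /^(\d+)\s+(?:([NSEW]{1,2})\s+)?(.+?)\s+(Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\.?$/i;

// Hypothetical helper: returns the component parts, or null if nothing matches.
function naiveSplit(address) {
  var m = NAIVE_ADDRESS.exec(address);
  if (m === null) {
    return null;
  }
  return {
    number: m[1],
    direction: m[2] || null,
    name: m[3],
    type: m[4]
  };
}

// Works for the simple case from the example above.
var simple = naiveSplit("123 S Main Street");

// Fails (returns null) on perfectly legal addresses.
var noType = naiveSplit("100 Broadway");
var withUnit = naiveSplit("123 Main Street Apt 4");
```

Each failure can be patched with another alternation or optional group, but the pattern grows with every new address style, which is roughly why a dedicated parser looked more attractive.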
Stack Overflow has a nice discussion of the problem, which gave me some additional pointers. There’s a commercial solution, which is available as a COM component or a web service; I didn’t try this. There is also a free web service (it requires an application and attribution) provided by a university that did a great job (thanks, California taxpayers). This solution is also available in a for-pay variant.
Neither of these was desirable because we needed to parse a lot of addresses quickly, and calling out over the web can be slow. Some more digging turned up this Stack Overflow question and JGeocoder, which has a fairly robust address parser. It’s not perfect, but it was free and open source. I am not sure if it is still in development (the author didn’t respond to my email) but it does what we need it to do.
As an added bonus, we’re using Pentaho for the data processing, and you can call Java classes directly from your data processing steps, so I didn’t even have to wrap the Java call in a shell script or anything.
I’m working with Pentaho Data Integration 4 and need to parse addresses. Like you, I found JGeocoder, which looks promising. Unlike you, I’m not well versed in Java. I wondered if you might be willing to share how you’ve used JGeocoder with Pentaho. Thanks!
Hi Mitch,
I think you need to just put the jar file in the libext directory, as specified here:
http://forums.pentaho.com/showthread.php?77190-Custom-Plugin-External-JARs
Then I used a JavaScript step:
try {
    var results = net.sourceforge.jgeocoder.us.AddressParser.parseAddress(address);
    if (results != null) {
        splitStreetNumber = results.get(net.sourceforge.jgeocoder.AddressComponent.NUMBER);
        splitStreetDir = results.get(net.sourceforge.jgeocoder.AddressComponent.PREDIR);
        splitStreetName = results.get(net.sourceforge.jgeocoder.AddressComponent.STREET);
        splitStreetType = results.get(net.sourceforge.jgeocoder.AddressComponent.TYPE);
        splitUnitNumber = results.get(net.sourceforge.jgeocoder.AddressComponent.LINE2);
    } else {
        writeToLog("unable to parse this address: " + address);
    }
} catch (e) {
    writeToLog("exception trying to process this address: " + address + ", e:" + e);
}
The split fields are then added to the Pentaho stream. You’ll want to watch out for nulls in the address field you give to JGeocoder; it isn’t great about handling them.
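One way to guard against those nulls is to check the field before the parser ever sees it. This is only a sketch: `parseOrNull` is a hypothetical helper name, and the parser is passed in as a function so the guard itself stays independent of JGeocoder.

```javascript
// Hypothetical guard: skip null, undefined, and blank addresses so the
// parser is only ever called with a non-empty, trimmed string.
function parseOrNull(address, parse) {
  if (address == null) {
    return null; // covers both null and undefined
  }
  // A regex trim is used rather than String.prototype.trim, on the
  // assumption that an older embedded script engine may not provide it.
  var trimmed = String(address).replace(/^\s+|\s+$/g, "");
  if (trimmed.length === 0) {
    return null;
  }
  return parse(trimmed);
}

// Stand-in parser for demonstration; it just counts calls and echoes input.
var calls = 0;
function fakeParse(a) {
  calls += 1;
  return a;
}

var fromNull = parseOrNull(null, fakeParse);
var fromBlank = parseOrNull("   ", fakeParse);
var fromReal = parseOrNull("  123 S Main Street ", fakeParse);
```

In the step above, the second argument would be a small function that wraps the `AddressParser.parseAddress` call, and a null result would go down the same "unable to parse" branch.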
Hi Dan …
I too stumbled across this three-year-old project and am attempting to integrate it into my Java application. I’m finding the solution to be less than ideal, with a lot of failures in places I would expect it to succeed. I’m curious … what’s your success rate with this software? I mean, of the addresses JGeocoder gets wrong, for what percentage would you look at the raw data and think “Wow, I’m shocked it got this wrong”?
Hi Steve,
Well, I’m not using the JGeocoder classes to geocode, just to split addresses into component parts. As I recall, it was pretty good about that, although some street types like ‘heights’ confuse it.
As far as geocoding options go, I’d suggest looking at Google’s geocoding: http://code.google.com/apis/maps/documentation/geocoding/ which is free up to a limit, but only usable if you are going to put the results on a Google map. We actually use TeleAtlas: http://geocode.com/ which is a bit pricey but is generally pretty good (except in very new developments). We started out using the TIGER/Line database: http://www.census.gov/geo/www/tiger/ which is free (and what JGeocoder uses: http://jgeocoder.sourceforge.net/ ) but the quality wasn’t very good.
Dan
See duoshare.com; they have solutions via the web and web services for single-address and batch processing.