I recently downloaded Apache Spark. After working with HBase for a bit on my last project, Spark was a joy to use, even though I know little Scala. I also downloaded some data from the Census Bureau (the 2010 Business Patterns, a pipe-delimited file containing information on the business activity of the USA). I used the prepackaged data sources, so I didn’t have to use the Census API, which I have written about previously.
I was able to quickly start up Apache Spark on my workstation (thanks, quickstart!), and then ask some interesting questions of the data. I did all this within the Spark shell using Scala. The dataset I downloaded has approximately 3M rows, so it is large enough to be interesting, but not large enough to need to actually use Spark.
So, what kinds of questions can you ask? Well, given I downloaded the economic activity survey from 2010, I was interested in knowing about different kinds of professions. I looked primarily at MSAs (which are “geographical region[s] with a relatively high population density at [their] core and close economic ties throughout the area”). I did this because it was easy to pick them out with a string match, so I didn’t have to dig into the code mapping tables that smaller geographic regions would have required.
First, how many different professions are there in the Boulder, Colorado MSA? This code:
val datfile = sc.textFile("../data/CB1000A1.dat")
val split_lines = datfile.map(_.split("\\|"))
val boulder = split_lines.filter(arr => arr(7).contains("Boulder, CO Metropolitan"))
val boulder_jobs = boulder.map(arr => arr(10));
boulder_jobs.count();
tells me in 2010 there were 1079 different types of jobs in the Boulder MSA.
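One caveat about the snippet above: count() returns the number of matching rows, which equals the number of job types only if each type appears once per MSA. Here is a minimal sketch of the same split, filter, and map steps, using plain Scala collections in place of the RDD so it runs without Spark. The sample rows and the row helper are invented for illustration and are not taken from the Census file. The only assumption carried over from the code above is that field 7 holds the area name and field 10 the job type.

```scala
object BoulderSketch {
  // Hypothetical helper: builds a pipe-delimited line with the area in
  // field 7 and the job type in field 10, as the code above assumes.
  def row(area: String, job: String): String =
    s"f0|f1|f2|f3|f4|f5|f6|$area|f8|f9|$job"

  // Invented sample rows, not real Census data.
  val sampleLines: Seq[String] = Seq(
    row("Boulder, CO Metropolitan Statistical Area", "Bakeries"),
    row("Boulder, CO Metropolitan Statistical Area", "Breweries"),
    row("Boulder, CO Metropolitan Statistical Area", "Bakeries"),
    row("Denver-Aurora-Broomfield, CO Metropolitan Statistical Area", "Bakeries")
  )

  // Same steps as the Spark version: split on the pipe, keep Boulder rows,
  // then pull out the job column.
  val splitLines: Seq[Array[String]] = sampleLines.map(_.split("\\|"))
  val boulderJobs: Seq[String] = splitLines
    .filter(arr => arr(7).contains("Boulder, CO Metropolitan"))
    .map(arr => arr(10))

  // Number of matching rows versus number of distinct job types.
  val rowCount: Int = boulderJobs.size
  val distinctCount: Int = boulderJobs.distinct.size

  def main(args: Array[String]): Unit =
    println(s"rows: $rowCount, distinct job types: $distinctCount")
}
```

In the Spark shell, the equivalent check would be to compare boulder_jobs.count() with boulder_jobs.distinct().count(). If the two agree, the row count is already a count of distinct job types.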
Then I wanted to know which MSAs had the most types of jobs. Thanks to this SO post and the word count example, I was able to put together this query:
val countsbymsa = split_lines.map(arr => arr(7))
.filter(location => location.contains("Metropolitan Statistical Area"))
.map(location => (location,1)).reduceByKey(_+_,1).map(item => item.swap)
.sortByKey(true, 1).map(item => item.swap);
countsbymsa.saveAsTextFile("../data/msacounts");
From the output, I found out that the Los Angeles-Long Beach-Santa Ana, CA MSA had the most different jobs, at 2084 (nosing ahead of NYC by 14 jobs), and the Hinesville-Fort Stewart, GA MSA had the fewest at 700 (at least in 2010).
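The query follows the word count pattern: pair each location with 1, add up the 1s per key, then sort by the total. The swap calls are there because sortByKey sorts on the first element of each pair, so the count has to be moved into the key position and then moved back. A minimal sketch of the same idea with plain Scala collections, so it runs without Spark, is below. The area names are invented placeholders, and groupBy plus sortBy stand in for reduceByKey and sortByKey.

```scala
object MsaCountSketch {
  // Invented placeholder names standing in for field 7 of the file.
  val locations: Seq[String] = Seq(
    "Area A Metropolitan Statistical Area",
    "Area B Metropolitan Statistical Area",
    "Area A Metropolitan Statistical Area",
    "Area C Micropolitan Statistical Area",
    "Area A Metropolitan Statistical Area",
    "Area B Metropolitan Statistical Area"
  )

  // Keep only the MSAs by string match, count rows per area, then sort
  // ascending by count so the largest area ends up last.
  val countsByMsa: Seq[(String, Int)] = locations
    .filter(_.contains("Metropolitan Statistical Area"))
    .groupBy(identity)
    .map { case (location, rows) => (location, rows.size) }
    .toSeq
    .sortBy { case (_, count) => count }

  def main(args: Array[String]): Unit =
    countsByMsa.foreach { case (location, count) => println(s"$location: $count") }
}
```

Because the sort is ascending, the area with the fewest rows comes first and the one with the most comes last, which matches reading the saved output from top to bottom.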
I didn’t end up using the XML utilities I found here, but I found the wiki full of useful tips.