{"id":1939,"date":"2015-01-05T11:51:04","date_gmt":"2015-01-05T17:51:04","guid":{"rendered":"http:\/\/www.mooreds.com\/wordpress\/?p=1939"},"modified":"2014-12-22T22:13:29","modified_gmt":"2014-12-23T04:13:29","slug":"fun-with-apache-spark-and-census-data","status":"publish","type":"post","link":"https:\/\/www.mooreds.com\/wordpress\/archives\/1939","title":{"rendered":"Fun with Apache Spark And Census Data"},"content":{"rendered":"<figure style=\"width: 150px\" class=\"wp-caption alignleft\"><img decoding=\"async\" class=\"alignleft\" title=\"the sparks are flying by Enlightment Photography\" src=\"http:\/\/www.mooreds.com\/wordpress\/wp-content\/uploads\/2014\/12\/6809662987_6d18bed331_q_spark.jpg\" alt=\"spark photo\" width=\"150\" \/><figcaption class=\"wp-caption-text\"><small>Photo by <a href=\"http:\/\/www.flickr.com\/photos\/53104485@N02\/6809662987\" target=\"_blank\">Enlightment Photography<\/a> <a title=\"Attribution-NoDerivs License\" href=\"http:\/\/creativecommons.org\/licenses\/by-nd\/2.0\/\" target=\"_blank\" rel=\"nofollow\"><img decoding=\"async\" src=\"http:\/\/www.mooreds.com\/wordpress\/wp-content\/plugins\/wp-inject\/images\/cc.png\" alt=\"\" \/><\/a><\/small><\/figcaption><\/figure>\n<p>I recently downloaded <a href=\"http:\/\/spark.apache.org\/\">Apache Spark<\/a>.\u00a0 After working with HBase for a bit on my last project, it was a joy, even though I know little Scala.\u00a0 I also downloaded some data from the Census Bureau (the 2010 Business Patterns, a pipe delimited file containing <a href=\"http:\/\/factfinder.census.gov\/faces\/affhelp\/jsf\/pages\/metadata.xhtml?lang=en&amp;type=program&amp;id=program.en.BP\">information on the business activity<\/a> of the USA). 
I used the prepackaged data sources, so I didn&#8217;t have to use the Census API, which <a href=\"\/wordpress\/archives\/963\">I have written about previously<\/a>.<\/p>\n<p>I was able to quickly start up Apache Spark on my workstation (thanks, <a href=\"http:\/\/spark.apache.org\/docs\/latest\/quick-start.html\">quickstart<\/a>!), and then ask some interesting questions of the data. I did all this within the Spark shell using Scala.\u00a0 The dataset I downloaded has approximately 3M rows, so it is large enough to be interesting, but not large enough to actually need Spark.<\/p>\n<p>So, what kinds of questions can you ask?\u00a0 Well, given I downloaded the economic activity survey from 2010, I was interested in knowing about different kinds of professions.\u00a0 I looked primarily at MSAs (which are <a href=\"http:\/\/en.wikipedia.org\/wiki\/Metropolitan_statistical_area\">&#8220;geographical region[s] with a relatively high population density at [their] core and close economic ties throughout the area&#8221;<\/a>).\u00a0 I did this because it was easy to pick them out with a string match, and therefore I didn&#8217;t have to look at any kind of code mapping table, which I would need in order to dig into smaller geographic regions.<\/p>\n<p>First, how many different professions are there in the Boulder, Colorado MSA? This code:<\/p>\n<blockquote><p><code>val datfile = sc.textFile(\"..\/data\/CB1000A1.dat\")<br \/>\nval split_lines = datfile.map(_.split(\"\\\\|\"))<br \/>\nval boulder = split_lines.filter(arr =&gt; arr(7).contains(\"Boulder, CO Metropolitan\"))<br \/>\nval boulder_jobs = boulder.map(arr =&gt; arr(10));<br \/>\nboulder_jobs.count();<\/code><\/p><\/blockquote>\n<p>tells me that in 2010 there were 1079 different types of jobs in the Boulder MSA.<\/p>\n<p>Then I wanted to know which MSAs had the most jobs. 
Thanks to <a href=\"http:\/\/stackoverflow.com\/questions\/24656696\/spark-get-collection-sorted-by-value\">this SO post<\/a> and <a href=\"http:\/\/spark.apache.org\/examples.html\">the word count example<\/a>, I was able to put together this query:<\/p>\n<blockquote><p><code>val countsbymsa = split_lines.map(arr =&gt; arr(7))<br \/>\n.filter(location =&gt; location.contains(\"Metropolitan Statistical Area\"))<br \/>\n.map(location =&gt; (location,1)).reduceByKey(_+_,1).map(item =&gt; item.swap)<br \/>\n.sortByKey(true, 1).map(item =&gt; item.swap);<br \/>\ncountsbymsa.saveAsTextFile(\"..\/data\/msacounts\");<\/code><\/p><\/blockquote>\n<p>I found out that the Los Angeles-Long Beach-Santa Ana, CA MSA had the most different jobs, at 2084 (nosing ahead of NYC by 14 jobs), and the Hinesville-Fort Stewart, GA MSA had the fewest at 700 (at least in 2010).<\/p>\n<p>I didn&#8217;t end up using the <a href=\"https:\/\/github.com\/elsevierlabs\/spark-xml-utils\/\">XML utilities I found here<\/a>, but I found the wiki full of <a href=\"https:\/\/github.com\/elsevierlabs\/spark-xml-utils\/wiki\/helpful_tips\">useful tips<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently downloaded Apache Spark.\u00a0 After working with HBase for a bit on my last project, it was a joy, even though I know little Scala.\u00a0 I also downloaded some 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[69,5],"tags":[],"class_list":["post-1939","post","type-post","status-publish","format-standard","hentry","category-big-data","category-java"],"_links":{"self":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts\/1939","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/comments?post=1939"}],"version-history":[{"count":1,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts\/1939\/revisions"}],"predecessor-version":[{"id":1941,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts\/1939\/revisions\/1941"}],"wp:attachment":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/media?parent=1939"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/categories?post=1939"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/tags?post=1939"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}