Big Data

I was playing around recently with some data (from the US Patent and Trademark Office), trying to import it into S3 and then to Athena. Athena is a serverless big data query engine with schema on read semantics. The data was not available on the AWS public dataset repo. Things didn’t go as well as planned. Here’s how I wanted them to go:

download some data
transform it into CSV (because Athena doesn’t support currently XML and I didn’t want to go full EMR, even though hive supports XML)
upload it to s3 bucket
create a table based on the data
run some interesting queries using Athena
possibly pull some of the data in Amazon Machine Learning to do some predictions
possibly put some of the data in an s3 bucket as JSON and use datatables to create a nice user interface

Like pretty much every development project I’ve ever been part of, there were surprises. What was different is that I had a fixed amount of time since this was an exploratory project, I set a timebox. I didn’t complete much of what I wanted to get done, but wanted to document what I did.

I was able to get through step 5 with a small portion of data (13k rows). I ended up working a lot on windows because I didn’t want to boot up a vagrant box. I spent a lot of time re-learning XSLT in order to pull the data I wanted out of the XML. I used a tool called xmlstarlet for this, which worked pretty well with the small dataset. Here’s the command I ran to pull out some of the attributes of the XML dataset (you can see that I also learned about batch file arguments):

xml sel -T -t -m //case-file -v "concat(serial-number,',',registration-number,',',case-file-header/registration-date,'\n

,',case-file-header/status-code,',',case-file-header/attorney-name)" -n %filename% > %outfile%

And here’s the Athena schema I created:

CREATE EXTERNAL TABLE trademark_csv ( serialnumber STRING, registrationnumber STRING, registrationdate STRING, statuscode INT, attorneyname STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n' LOCATION 's3://aml-mooreds/athena/trademark/';

After I had done the quick prototype, I foolishly moved on to downloading the full dataset. This caused some issues with disk storage and ended up taking a long time (the full dataset was ~300 files from 500M to 2GB in size, each containing about 150k records). I learned that I should have pulled down one large file and worked it through my process rather than making automating each step as I went. For one, xmlstarlet hasn’t been updated for years, and I couldn’t find a linux package. When tried to compile it, it was looking for libxml, which was installed on my ec2 instance already. I didn’t bother to head further down this path. But I ran into a different issue. When I ran xmlstarlet against a 500MB uncompressed XML file, it completed. But any of the larger files caused it to give an ‘out of memory’ error. I saw one reference in the bugtracker, but it didn’t seem to apply.

So, back to the drawing board. Luckily, many languages have support for event based parsing of XML. I was hoping to find a command line tool that could run XSLT in order to reuse some of my logic, but it doesn’t appear to exist (found this interesting discussion and this one). python seemed like it might work well.

Then I ran out of time. Oh well, maybe some other time. It is fun to think about how I can automate all of this. I was definitely seeing where lambda functions and some other AWS features could have fit in nicely. I also think that using RDS might have made more sense than Athena, given the rate of update and the amount of data.

Lessons learned:

what works for 13k records won’t necessarily work when you have 10x, let along 100x, that number
work through the entire pipeline with real world data before automating any part of it
use EC2 whenever you need to download a lot of data
make sure your buckets and athena are in the same region. I wasn’t, and there was no warning. That’s fine with small data, but could have hurt from a financial viewpoint if I’d been successful at loading the whole dataset
it can be fun to play around with this type of stuff, but having a timebox keeps you from going down the rabbit hole too far

spark photo — Photo by Enlightment Photography

I recently downloaded Apache Spark. After working with HBase for a bit on my last project, it was a joy, even though I know little Scala. I also downloaded some data from the Census Bureau (the 2010 Business Patterns, a pipe delimited file containing information on the business activity of the USA). I used the prepackaged data sources, so I didn’t have to use the Census API, which I have written about previously.

I was able to quickly start up Apache Spark on my workstation (thanks, quickstart!), and then ask some interesting questions of the data. I did all this within the Spark shell using Scala. The dataset I downloaded has approximately 3M rows, so it is large enough to be interesting, but not large enough to need to actually use Spark.

So, what kinds of questions can you ask? Well, given I downloaded the economic activity survey from 2010, I was interesting in knowing about different kinds of professions. I looked at primarily at MSAs (which are “geographical region[s] with a relatively high population density at [their] core and close economic ties throughout the area”). I did this because it was easy to filter them out with a string match, and therefore I didn’t have to look at any kind of code mapping table which I would have to dig into smaller geographic regions.

First, how many different professions are there in the Boulder Colorado MSA? This code:

val datfile = sc.textFile("../data/CB1000A1.dat") val split_lines = datfile.map(_.split("\\|")) val boulder = split_lines.filter(arr => arr[7].contains("Boulder, CO Metropolitan")) val boulder_jobs = boulder.map(arr => arr(10)); boulder_jobs.count();

tells me in 2010 there were 1079 different types of jobs in the Boulder MSA.

Then I wanted to know which MSAs had the most jobs. Thanks to this SO post and the word count example, I was able to put together this query:

val countsbymsa = split_lines.map(arr => arr(7)) .filter(location => location.contains("Metropolitan Statistical Area")) .map(location => (location,1)).reduceByKey(_+_,1).map(item => item.swap) .sortByKey(true, 1).map(item => item.swap); countsbymsa.saveAsTextFile("../data/msacounts");

And find out that the Los Angeles-Long Beach-Santa Ana, CA MSA has the most different jobs, at 2084 (nosing ahead of NYC by 14 jobs), and the Hinesville-Fort Stewart, GA MSA had the fewest at 700 (at least in 2010).

I didn’t end up using the XML utilities I found here, but found the wiki full of useful tips.

Aborted Adventures with Amazon Athena and US PTO data

Fun with Apache Spark And Census Data

Letters to a New Developer

Pages

Subscribe

Socials

Categories

Archives