{"id":3144,"date":"2018-09-10T14:56:33","date_gmt":"2018-09-10T20:56:33","guid":{"rendered":"http:\/\/www.mooreds.com\/wordpress\/?p=3144"},"modified":"2021-11-21T12:38:40","modified_gmt":"2021-11-21T18:38:40","slug":"aborted-adventures-with-amazon-athena-and-us-pto-data","status":"publish","type":"post","link":"https:\/\/www.mooreds.com\/wordpress\/archives\/3144","title":{"rendered":"Aborted Adventures with Amazon Athena and US PTO data"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"alignright size-medium wp-image-3146\" src=\"http:\/\/www.mooreds.com\/wordpress\/wp-content\/uploads\/2018\/09\/mythology-1099255_640-300x235.jpg\" alt=\"Goddesses\" width=\"300\" height=\"235\" srcset=\"http:\/\/edit.mooreds.com\/wordpress\/wp-content\/uploads\/2018\/09\/mythology-1099255_640-300x235.jpg 300w, http:\/\/edit.mooreds.com\/wordpress\/wp-content\/uploads\/2018\/09\/mythology-1099255_640.jpg 640w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/>I was playing around recently with some data (from the <a href=\"https:\/\/bulkdata.uspto.gov\/\">US Patent and Trademark Office<\/a>), trying to import it into S3 and then to Athena. <a href=\"https:\/\/aws.amazon.com\/athena\/\">Athena<\/a> is a serverless big data query engine with schema on read semantics. The data was not available on the <a href=\"https:\/\/registry.opendata.aws\/\">AWS public dataset repo<\/a>. Things didn&#8217;t go as well as planned. Here&#8217;s how I wanted them to go:<\/p>\n<ol>\n<li>download some data<\/li>\n<li>transform it into CSV (because <a href=\"https:\/\/stackoverflow.com\/questions\/49694215\/how-to-extract-xml-data-stored-in-amazon-athena\">Athena doesn&#8217;t support currently XML<\/a> and I didn&#8217;t want to go full EMR, even though <a href=\"https:\/\/github.com\/dvasilen\/Hive-XML-SerDe\/wiki\/XML-data-sources\">hive supports XML<\/a>)<\/li>\n<li>upload it to s3 bucket<\/li>\n<li>create a table based on the data<\/li>\n<li>run some interesting queries using Athena<\/li>\n<li>possibly pull some of the data in Amazon Machine Learning to do some predictions<\/li>\n<li>possibly put some of the data in an s3 bucket as JSON and use <a href=\"https:\/\/datatables.net\/\">datatables<\/a> to create a nice user interface<\/li>\n<\/ol>\n<p>Like pretty much every development project I&#8217;ve ever been part of, there were surprises. What was different is that I had a fixed amount of time since this was an exploratory project, I set a timebox. I didn&#8217;t complete much of what I wanted to get done, but wanted to document what I did.<\/p>\n<p>I was able to get through step 5 with a small portion of data (13k rows). I ended up working a lot on windows because I didn&#8217;t want to boot up a vagrant box. I spent a lot of time re-learning XSLT in order to pull the data I wanted out of the XML. I used a tool called <a href=\"http:\/\/xmlstar.sourceforge.net\/\">xmlstarlet<\/a> for this, which worked pretty well with the small dataset. Here&#8217;s the command I ran to pull out some of the attributes of the XML dataset (you can see that I also learned about batch file arguments):<\/p>\n<p><code>xml sel -T -t -m \/\/case-file -v \"concat(serial-number,',',registration-number,',',case-file-header\/registration-date,'\\n<\/code><\/p>\n<p><code>,',case-file-header\/status-code,',',case-file-header\/attorney-name)\" -n %filename% &gt; %outfile%<\/code><\/p>\n<p>And here&#8217;s the Athena schema I created:<\/p>\n<p><code><br \/>\nCREATE EXTERNAL TABLE trademark_csv (<br \/>\nserialnumber STRING,<br \/>\nregistrationnumber STRING,<br \/>\nregistrationdate STRING,<br \/>\nstatuscode INT,<br \/>\nattorneyname STRING<br \/>\n)<br \/>\nROW FORMAT DELIMITED<br \/>\nFIELDS TERMINATED BY ','<br \/>\nESCAPED BY '\\\\'<br \/>\nLINES TERMINATED BY '\\n'<br \/>\nLOCATION 's3:\/\/aml-mooreds\/athena\/trademark\/';<br \/>\n<\/code><\/p>\n<p>After I had done the quick prototype, I foolishly moved on to downloading the full dataset. This caused some issues with disk storage and ended up taking a long time (the full dataset was ~300 files from 500M to 2GB in size, each containing about 150k records). I learned that I should have pulled down one large file and worked it through my process rather than making automating each step as I went. For one, xmlstarlet hasn&#8217;t been updated for years, and I couldn&#8217;t find a linux package. When tried to compile it, it was looking for libxml, which was installed on my ec2 instance already. I didn&#8217;t bother to head further down this path. But I ran into a different issue. When I ran xmlstarlet against a 500MB uncompressed XML file, it completed. But any of the larger files caused it to give an &#8216;out of memory&#8217; error. I saw <a href=\"https:\/\/sourceforge.net\/p\/xmlstar\/bugs\/107\/\">one reference in the bugtracker<\/a>, but it didn&#8217;t seem to apply.<\/p>\n<p>So, back to the drawing board. Luckily, many languages have support for event based parsing of XML. I was hoping to find a command line tool that could run XSLT in order to reuse some of my logic, but it doesn&#8217;t appear to exist (found <a href=\"https:\/\/news.ycombinator.com\/item?id=1930229\">this interesting discussion<\/a> and <a href=\"https:\/\/stackoverflow.com\/questions\/30305724\/how-to-do-command-line-xpath-queries-in-huge-xml-files\">this one<\/a>). python seemed like <a href=\"https:\/\/stackoverflow.com\/questions\/7171140\/using-python-iterparse-for-large-xml-files#7171543\">it might work well<\/a>.<\/p>\n<p>Then I ran out of time. Oh well, maybe some other time. It is fun to think about how I can automate all of this. I was definitely seeing where lambda functions and some other AWS features could have fit in nicely. I also think that using RDS might have made more sense than Athena, given the rate of update and the amount of data.<\/p>\n<p>Lessons learned:<\/p>\n<ul>\n<li>what works for 13k records won&#8217;t necessarily work when you have 10x, let along 100x, that number<\/li>\n<li>work through the entire pipeline with real world data before automating any part of it<\/li>\n<li>use EC2 whenever you need to download a lot of data<\/li>\n<li>make sure your buckets and athena are in the same region. I wasn&#8217;t, and there was no warning. That&#8217;s fine with small data, but could have hurt from a financial viewpoint if I&#8217;d been successful at loading the whole dataset<\/li>\n<li>it can be fun to play around with this type of stuff, but having a timebox keeps you from going down the rabbit hole too far<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I was playing around recently with some data (from the US Patent and Trademark Office), trying to import it into S3 and then to Athena. Athena is a serverless big [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[79,69],"tags":[],"class_list":["post-3144","post","type-post","status-publish","format-standard","hentry","category-aws","category-big-data"],"_links":{"self":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts\/3144","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/comments?post=3144"}],"version-history":[{"count":4,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts\/3144\/revisions"}],"predecessor-version":[{"id":3149,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/posts\/3144\/revisions\/3149"}],"wp:attachment":[{"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/media?parent=3144"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/categories?post=3144"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mooreds.com\/wordpress\/wp-json\/wp\/v2\/tags?post=3144"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}