
AWS Compute Savings Plans

I was doing a project for a hackfest at FusionAuth recently (big fan of hackfests and have been for years).

I was looking at ways to decrease our hosting bill (we host with AWS). There are, of course, a variety of ways to accomplish this, but I spent some time looking at EC2 Savings Plans and RDS reserved instances.

After the short presentation I gave at the end of the day, someone asked “What about Compute Savings Plans?”

I looked at them briefly, and they seem like a no-brainer.

With this option, you commit to a certain amount of compute spend (any amount you choose, basically) for a certain period of time (one year or three years) with a certain payment plan (all up front, partial upfront, or no upfront). Then AWS gives you the discount in exchange for this guaranteed income.

Plans are controllable via API and are purely a billing construct. Being on a plan doesn’t affect the performance or availability of your compute resources, unlike other ways to save money such as spot instances.
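If you want to poke at this from code, here’s a minimal sketch using boto3 (my assumption of how the savingsplans and Cost Explorer clients are called; credentials and error handling are left out):

import boto3

# List any Savings Plans the account already owns.
sp = boto3.client("savingsplans")
for plan in sp.describe_savings_plans().get("savingsPlans", []):
    print(plan.get("savingsPlanId"), plan.get("state"), plan.get("commitment"))

# Ask Cost Explorer what a one year, no upfront Compute Savings Plan commitment
# might save, based on the last thirty days of usage. Cost Explorer lives in
# us-east-1; the enum strings are what the API expects.
ce = boto3.client("ce", region_name="us-east-1")
rec = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)
print(rec.get("SavingsPlansPurchaseRecommendation", {}).get("SavingsPlansPurchaseRecommendationSummary"))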

A Compute Savings Plan, notably, is not tied to a region, compute option, or instance type. It covers compute spend across EC2, Lambda, and Fargate. It’s quite broad. Other options tie you to a region or an instance family.

Therefore, this seems like an obvious choice to save money while retaining some of the flexibility that is such a key selling point of the cloud.

Anything this obvious raises suspicion in my mind. So I thought I’d look at it the other way and consider why you wouldn’t use a Compute Savings Plan. Here are the reasons I thought of:

  • First, you might not know about it. I didn’t. Hopefully this blog post does a small bit to educate you.
  • Second, you might not be running anything on AWS EC2, Lambda or Fargate. In that case, a plan is useless, of course.
  • You might have highly variable compute needs, which frequently go to zero. In this case, committing to a certain hourly spend might not be worth it. It depends on how much compute you’ll actually use and what the savings would be.
  • You might be planning to leave AWS for any number of reasons. Depending on how serious you are, you could lose money on the commitment. That said, most migrations take longer than you think, and you can probably lower the commitment to the point where it still makes sense.
  • You don’t have the money to pay upfront. In this case, you can still use the “no upfront” option and save some money. Less, but still some.
  • If you have extremely stable compute needs, you can use Reserved Instances or EC2 Savings Plans (or, frankly, other hosting providers offering less flexibility, such as data center providers) to save more money. Totally possible, but I haven’t run across it. I’ve heard tell of them, though.
  • You prefer Amazon to have your money rather than you. If this resonates and you are open to changing who you give your money to, please contact me. I can use more money.

What am I missing? Are there other reasons why a Compute Savings Plan for AWS is not a good idea?

Finally, here’s more information from The Duckbill Group about Savings Plans, in particular how the savings are calculated and applied.

The most underappreciated AWS service

Three letters for you.

V. P. C.

Amazon Web Services’ Virtual Private Cloud is often the first real hurdle for developers and others looking to understand cloud systems. I know it was for me; I’d not had much networking experience when I first encountered it. To really grok VPC, you need to have at least a mild understanding of network architecture, including subnets and gateways. In my experience, such knowledge isn’t par for the course with software developers.

Most software developers can quickly understand EC2 (oh, a virtual machine) and S3 (ah, a gigantic disk drive). VPC’s networking abstractions are tougher. And yet VPC is magic.

Let’s talk a bit more about this. AWS built a performant software defined networking layer. But they didn’t just port all the concepts from the physical world into the cloud. At least, when I read the Unix and Linux Sysadmin’s Guide, which dives deep into such things, that’s how it looks to me.

Instead, AWS followed the 80/20 rule, giving developers and architects flexibility to build real network architectures. These architectures can support real world applications, with real world security needs and compliance concerns.
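To make the abstraction concrete, here’s a minimal sketch, using boto3, of the kind of network separation you assemble in a VPC: an address space, a subnet carved out of it, and a gateway plus route that make the subnet public. The CIDR blocks are placeholders, and a real setup would add private subnets, NAT, and security groups.

import boto3

ec2 = boto3.client("ec2")

# A VPC is just an isolated address space you choose.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]

# Carve a subnet out of that space.
subnet = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24")["Subnet"]

# An internet gateway plus a default route is what makes the subnet "public".
igw = ec2.create_internet_gateway()["InternetGateway"]
ec2.attach_internet_gateway(InternetGatewayId=igw["InternetGatewayId"], VpcId=vpc["VpcId"])

rt = ec2.create_route_table(VpcId=vpc["VpcId"])["RouteTable"]
ec2.create_route(RouteTableId=rt["RouteTableId"],
                 DestinationCidrBlock="0.0.0.0/0",
                 GatewayId=igw["InternetGatewayId"])
ec2.associate_route_table(RouteTableId=rt["RouteTableId"], SubnetId=subnet["SubnetId"])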

Yes, you can also achieve such separation using different AWS constructs (and should!). IAM and Organizations spring to mind. But for many, making sure that they could easily port their current security posture to AWS made a cloud transition easier.

VPC simplifies the networking layer enough that even developers with little networking experience can understand it (like me!). AWS even provides network logging (VPC Flow Logs) so that, should you need to delve deep into your networking layer for troubleshooting or auditing, you can.

VPC is also fundamental to other services. EC2 instances are still the majority of AWS budgets, according to Corey Quinn’s 2020 Re:Invent rebuttal.

VPCs are where those “machines” exist. The fancy managed services AWS wants you to use so they can lock you in? Often they are accessed by dropping ENIs into your VPC.

If not, these services run in VPCs managed by AWS, and are therefore inaccessible to you except through tightly managed interfaces. See, the security works!

To top it all off, VPC is free and ubiquitous.

AWS’s VPC is the water to cloud engineers’ fish; we don’t even see it, let alone appreciate it.

Use managed services. Please.

“Use managed services.”

If there was one piece of advice I wish I could shout from the mountains to all cloud engineers, this would be it.

Operations, especially operations at scale, are a hard problem. Edge cases become commonplace. Failure is rampant. Automation and standardization are crucial. People with experience running software and hardware at this scale tend to be rare and expensive. The mistakes they’ve made and situations they’ve learned from aren’t easy to pick up.

When you use a managed service from one of the major cloud vendors, you’re getting access to all the wisdom of their teams and the power of their automation and systems, for the price of the service.

A managed service is a service like AWS Relational Database Service (RDS), Google Cloud SQL or Azure SQL Database. With all three of these, you’re getting best-of-breed configuration and management for a relational database system. There’s still configuration needed on your part, but hard or tedious tasks like setting up replication or backups can be done quickly and easily (take this from someone who fed and cared for a MySQL replication system for years). Depending on your cloud vendor and needs, you can get managed services for key components of modern software systems like:

  • File storage
  • Object caches
  • Message queues
  • Stream processing software
  • ETL tools
  • And more

(Note that these are all components of your application, and will still require developer time to thread together.)
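To illustrate the “hard or tedious tasks done quickly” point above, here’s a hedged boto3 sketch of standing up a MySQL RDS instance with automated backups, a standby, and a read replica. The identifiers, sizes, and password are placeholders, not recommendations.

import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="myapp-db",        # placeholder name
    Engine="mysql",
    DBInstanceClass="db.t3.medium",         # placeholder size
    AllocatedStorage=100,                   # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",         # use secrets management in real life
    BackupRetentionPeriod=7,                # automated backups, kept 7 days
    MultiAZ=True,                           # synchronous standby in another AZ
)

# A read replica (the kind of replication I used to babysit by hand) is one call:
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="myapp-db-replica",
    SourceDBInstanceIdentifier="myapp-db",
)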

You should use managed services for three reasons.

  • It’s going to be operated well. The expertise that the cloud providers can provide and the automation they can afford to implement will likely surpass what you can do, especially across multiple services.
  • It’s going to be cheaper. Especially when you consider employee costs. The most expensive AWS RDS instance is approximately $100k/year (full price). It’s not an apples to apples comparison, but in many countries you can’t get a database architect for that salary.
  • It’s going to be faster for development. Developers can focus on connecting these pieces of infrastructure rather than learning how to set them up and run them.

A managed service doesn’t work for everyone. If you need to be able to tweak every setting, a managed service won’t let you. You may have stringent performance or security requirements that a managed service can’t meet. You may also start out with a managed service and grow out of it. (Congrats!)

Another important consideration is lock-in. Some managed services are compatible with alternatives (Kubernetes services are a good example); if that’s the case, you can move clouds. Others are proprietary and will require substantial reworking of your application if you need to migrate.

If you are working in the cloud and you need a building block for your application like a relational database or a message queue, start with a managed service (and self host if it doesn’t meet your needs). Leverage the operational excellence of the cloud vendors, and you’ll be able to build more, faster.

When in doubt, test it out

When I taught AWS certification courses, I’d often get questions about how a service behaved under load or other unusual circumstances. Frequently I could answer from personal experience or by asking other instructors; occasionally class members provided their insights. Sometimes I could dig up relevant vendor documentation.

However, my default answer was:

“Test it for yourself. There’s no substitute for testing.”

This is one of the great advantages of the cloud. When you have a question about the performance or behavior of a service or system, spin it up and test it. This will cost you money and some time configuring the system, but certainly will be cheaper than ordering hardware, racking it and then also configuring the system. When you’re done with your testing, you can tear down the infrastructure and never worry about it again. Sure beats shipping a server back to the manufacturer.

Of course, no testing scenario can replicate production perfectly. But you can get pretty close (especially if you can reuse production traffic).

When you do test, start by documenting what you want to achieve. What is the question you are trying to answer? Make sure to seek feedback from other team members and/or search online, as it’s possible someone has already answered your question. If you do find answers, understand under what circumstances the tests were performed, as the cloud and the offered services change over time.

Some examples of cloud infrastructure questions you might want to answer:

  • How do EBS volumes of different sizes and types perform under load?
  • When a Kubernetes cluster running on GKE is under load, what happens when you add an additional node? An additional pod?
  • What happens when you turn off a NAT gateway while a file is being uploaded to S3 from an EC2 instance in a private subnet (without an S3 VPC endpoint)?
  • What is the cold start time for an empty Azure Function? What about a function loading your DLLs?

Think about what steps you are going to take to try to answer the question.

With your question and methodology spelled out, spin up your testing environment. Having your infrastructure represented as code will make this quick, especially if you have a complicated environment. If you are creating the test environment manually, record settings and other configuration in a text file to be able to re-create the environment later.

Run your tests. If you are load testing, find an open source or commercial load testing tool. What you need depends on your goals: you need a different tool to test 100k+ simultaneous users on a website than you do when trying to understand how an internal API handles 100 requests/second.
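For the smaller end of that spectrum, even a short script can answer a “how does my internal API behave at N requests/second” question. A rough sketch (the URL and numbers are placeholders, and a real load test should run from somewhere other than your laptop):

import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://internal-api.example.com/health"  # placeholder endpoint
REQUESTS = 1000
CONCURRENCY = 20

def timed_get(_):
    """Issue one GET and return (success, elapsed seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            resp.read()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(timed_get, range(REQUESTS)))

latencies = sorted(elapsed for _, elapsed in results)
failures = sum(1 for ok, _ in results if not ok)
print(f"failures: {failures}/{REQUESTS}")
print(f"p50: {statistics.median(latencies):.3f}s")
print(f"p95: {latencies[int(len(latencies) * 0.95)]:.3f}s")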

Review the data to see if your questions are answered. More questions or areas of interest may appear. Adjust your tests to answer them.

Once you have your answers to the desired level of certainty, tear down your testing infrastructure.

Document what you tested, how you tested and your results. Circulate this internally to help your team. If possible, publish it on your company blog to both help others in the same boat and to boost your company’s standing in the community.

All the vendor documentation in the world is no substitute for rolling up your sleeves and testing.

Heading to AWS re:Invent 2019

I’m excited to be heading to AWS re:Invent this week. I’ve never been to Las Vegas (other than stopping at a Chipotle on the outskirts on the way to SoCal), so I’m looking forward to seeing the Strip. I’ve heard it’s a bit of a madhouse, but I did go to the Kentucky Derby this year, so we’ll see how it compares.

I’m also excited to re-connect with people I’ve met at other conferences or only online. There are a number of AWS instructors that I interacted with only over email and Slack that I hope to meet face to face. (If you want to meet up with me, feel free to connect via Twitter.) This is also my first conference “behind the booth”. I have been to plenty of conferences where I was the one wandering around the expo, kicking tires and talking to vendors, so I’m interested to be on the other side.

Finally, I’m excited to get feedback on the new direction Transposit is taking. We’ll be showing off a new tool we’ve built to decrease incident downtime. I wish I’d had this tool when I was on-call, so I’m really looking forward to seeing what people think.

Terraform with multiple workspaces and environments

I recently was setting up a couple of AWS environments for a client. This client had a typical web application which talked to an RDS database. There was DNS, a CDN and other components involved. We wanted to use Terraform to maintain traceability and replicability, and to have the same configuration for production and staging, with perhaps small differences like EC2 instance size. We also wanted to separate the components into their own Terraform workspaces to limit the blast radius (so that if one component had changes that caused issues or state corruption, it wouldn’t affect the others). Finally, we wanted each environment to have its own Terraform backend, again to separate the environments.

I wasn’t able to complete this project due to external factors (I left the position before testing could be completed), but wanted to share the concepts. Obviously I can’t share the working code, but I set up an example project which is simpler. That’s the project I’ll be examining in this post. I also want to be clear that while I’ve tested this as much as I could and have validated the ideas with others who have more Terraform experience, this hasn’t been run in production. You have been warned. (Here’s the Terraform docs about setting up modules, workspaces and repositories.)

Using a tool like Terraform is great for a number of reasons, but my favorite is that it lets you track changes to cloud infrastructure. More than once I’ve wandered into an AWS account and wondered why certain resources were set up in the way they were, and what might break if I changed them. There are occasionally comments, but it is far better to examine a commit. Even better to review the set of commits and see the customer request or bug tied to it. (Bonus link: learn more about Terraform and other cloudy tools in this podcast episode with the creator of Terraform.)

So this simpler example project has a lambda that writes to an SQS queue. For now, it just writes the date of invocation, but obviously you could have it reach out to an external API, read from a database, or do some kind of calculation. The SQS queue could then be read from by an EC2 instance, which processes the message and perhaps updates a database. You have three components of the system:

  • The lambda function
  • The SQS queue
  • The EC2 instance (implementation of which is left as an exercise for the reader)

The SQS queue is shared infrastructure and needs to be accessed by both of the other systems. However, the SQS system doesn’t need to know about either the lambda or the EC2 instance. Using Terraform, we can create each of these components as their own workspace. Each of the subsidiary systems can evolve or change (for instance, the EC2 instance could be replaced with an autoscaling group) with minimal impact on other systems. They could be managed by different teams as well if that made sense.

To enforce this separation, set up each component as a separate Terraform workspace. (All code is on GitHub here.) I use remote state so that more than one person can manage the Terraform state, and use the S3/DynamoDB backend because we are targeting AWS and want a free, scalable solution. This post assumes you know how to set up Terraform using S3/DynamoDB as remote state storage.

Here’s the outputs of the SQS system:

output "queue_url" {
  value = "${aws_sqs_queue.myqueue.id}"
}

output "queue_arn" {
  value = "${aws_sqs_queue.myqueue.arn}"
}

I explicitly define the output variables so I can pull them in from the lambda and EC2 workspaces. This is how you can do that.

...
data "terraform_remote_state" "sqs" {
  backend = "s3"
  config = {
    bucket = "${var.terraform_bucket}"
    key = "sqs/terraform.tfstate"
    encrypt = true
    dynamodb_table = "terraform-remote-state-locks"
    profile = "${var.aws_profile}"
    region = "us-east-2"
  }
}
...
resource "aws_lambda_function" "mylambda" {
...
  environment {
    variables = {
      sqs_url = "${data.terraform_remote_state.sqs.outputs.queue_url}"
    }
  }
}

The terraform_remote_state block defines the location of the previously defined sqs workspace, and the ${data.terraform_remote_state.sqs.outputs.queue_url} references that url. That is then injected as an environment variable into the lambda, which reads it and uses the url to create an SQS client. It can then post whatever message it wants.

You can see how this would work with any number of configuration parameters. If you have a typical three-tier, database-driven application with a separate caching layer, you can create each of these major components as a workspace and inject the values into either the environment (for lambda) or the user data (for EC2). I’m not sure I’d use this with a microservices architecture, because a service registry might be more appropriate there.
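For the EC2 side, that injection might look something like this sketch. The AMI and instance type variables are hypothetical, and it assumes the same terraform_remote_state "sqs" data source shown above for the lambda.

resource "aws_instance" "worker" {
  ami           = "${var.ami_id}"          # hypothetical variable
  instance_type = "${var.instance_type}"   # hypothetical variable

  # Hand the shared queue URL to the instance at boot via user data.
  user_data = <<EOF
#!/bin/bash
echo "SQS_QUEUE_URL=${data.terraform_remote_state.sqs.outputs.queue_url}" >> /etc/environment
EOF
}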

Note that the lambda component has a rudimentary lambda function (you have to define something). It also uses Terraform to deploy the lambda code. That’s fine for the toy example, but for production you will want to use a real CI/CD system to deploy your lambdas.

Now, suppose you want to run production and staging environments, because you are ready to launch. Here are the constraints you’d want:

  • Production and staging run the same config (except when staging is changing, of course)
  • Production and staging may differ in a few details (the size of the EC2 instance, for example)
  • Production and staging execute in different AWS accounts to limit access and issues. You don’t want an error in staging to affect production. This is handled by creating different profiles which have access to different accounts.
  • Production and staging execute in different Terraform backends for the same reason as the separate AWS accounts.

Staging and production can use the same git repository, but when pulled down they are kept in two places on the filesystem. This is because you need to specify the profile and the bucket when using terraform init. So you end up running something like these two commands:

git clone git@github.com:mooreds/terraform-remote-state-example.git # staging
git clone git@github.com:mooreds/terraform-remote-state-example.git production-terraform-remote-state-example # production

I set up the project so that staging can be managed by normal Terraform commands (since that will happen more often), and production uses either special incantations or a script. Initializing the production Terraform environment looks like: terraform init -backend-config="profile=trsproduction" -backend-config="bucket=bucketname". For staging, it’s just terraform init. I didn’t have a lot of luck switching between these two Terraform backends in the same filesystem location, so having two trees was a straightforward workaround.

Any settings that differ between production and staging are pulled out into variables, with the staging values as the defaults. Then each workspace has a script which applies the Terraform configuration to the production environment, setting the variables to the correct production values. Here’s an example for the lambda workspace:

terraform apply -var aws_profile=trsproduction -var terraform_bucket="mooreds-terraform-remote-state-example-production" -var env_indicator="production" -var lambda_memory_size=256

We pass in the production terraform_bucket in case any references need to be made to the remote state (to pull in the SQS queue url, for example). We also pass in an increased lambda memory size because, hey, it’s production. Other things that might vary between environments: for example, VPC or subnet ids, API endpoints, and S3 bucket names.
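The variable definitions themselves are the boring part, but for completeness, here’s a sketch with staging defaults (the values are illustrative):

variable "aws_profile" {
  default = "trsstaging"    # production overrides this with -var
}

variable "terraform_bucket" {
  default = "mooreds-terraform-remote-state-example"   # illustrative staging bucket
}

variable "env_indicator" {
  default = "staging"
}

variable "lambda_memory_size" {
  default = 128    # bumped to 256 for production
}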

For simplicity, we just use two profiles for staging and production (in ~/.aws/credentials), but any way of getting credentials that works with Terraform will work:

[trsstaging]
aws_access_key_id = ...
aws_secret_access_key = ...

[trsproduction]
aws_access_key_id = ...
aws_secret_access_key = ...

This lets us separate out who has production access. Some users can have both staging and production profiles (perhaps operations), and others can have only staging profiles (perhaps developers). You can pass region values in via variables as well.

Using this system, the workflow for a change would be:

  • Check out the terraform git repository
  • Create a feature branch (including an issue identifier)
  • Pull request and approval
  • Run terraform apply to apply to staging
  • Run any additional tests
  • Merge to master
  • Run prodapply.sh (sketched below)
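For reference, prodapply.sh is nothing fancy; this sketch (for the lambda workspace) just bakes in the production backend configuration and variable overrides shown earlier:

#!/bin/sh
# prodapply.sh (sketch): point Terraform at the production backend and
# override the staging defaults.
terraform init -backend-config="profile=trsproduction" -backend-config="bucket=mooreds-terraform-remote-state-example-production"
terraform apply -var aws_profile=trsproduction -var terraform_bucket="mooreds-terraform-remote-state-example-production" -var env_indicator="production" -var lambda_memory_size=256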

Again, I want to be clear that I’ve implemented this partially, but I didn’t get a chance to run this fully in production. I tested all these concepts with the simple system mentioned above (and you can stand up your own using the code on github). There will be issues that I haven’t experienced. But I hope that this post helps illuminate the complexity of managing multiple workspaces and environments within a single Terraform github repository.

AWS Advent Amazon Machine Learning Post

Last year I wrote about AWS Advent, which is an exploration of the vast reaches of AWS in the first 24 days of December. This year, I submitted a post for it. And that post is now up and available. I wrote about Machine Learning (ML) generally and Amazon Machine Learning (AML) specifically. From the post:

AML Models are immutable, as are data sources. If you need to incorporate ongoing data into your model, which is generally a good idea, you need to automate your datasource and model building process so they are repeatable. Then, when you have new data, you can rebuild your model, test it out, and then change which model is “in production” and making predictions.

Plenty more stuff in there about the ethics of ML and the various pieces of the AML pipeline. Check it out.

Moving a database driven side project to Netlify

I recently moved a side project to Netlify. Netlify, if you aren’t familiar with it, is a static file hosting service with a great workflow, free SSL certificates, and a built-in CDN. I made the move because the side project was a database driven site, but didn’t really use other server side interactivity (no user generated content, etc). I haven’t invested much in the side project over the past couple of years, but it still gets tens of hits a day and is useful to users. It’s a top hit on Google for certain keywords. I didn’t want to spend time updating the underlying application, but wanted it to be more secure. The site is only updated over a month or so every year; it is then read only for the next eleven months, as it provides information on a very seasonal service.

Moving to Netlify (or, frankly, any other static site provider) provides the following benefits:

  • extremely low cost (Netlify is free for the level of traffic I have).
  • faster browsing experience for the end user (due to being behind a CDN).
  • free SSL certificates, with no hassle of setting up letsencrypt myself.
  • indirection–the source site can now be anywhere. I could put it on an EC2 instance that I only boot up when I need to push a new build, or could even host it locally. This means that I don’t need to worry about updating the source application.

I could have chosen to do this with S3/CloudFront/AWS Certificate Manager/Lambda (or using technologies from some other cloud provider). But even though I’m familiar with all the steps, it would have been a lot of work. Netlify was free and had already done the grunt work of tying these technologies (or similar ones; I’m not familiar with the tech behind their service, but they do write about it a bit) together.

What steps did I take to move a database driven directory site to Netlify? Enough that I wanted to document this for my future self.

  • Change the DNS for the site to have a lower TTL. During the transition from your site to Netlify, users may see versions served from either the old site or the new site, and you want to minimize that.
  • Remove all forms and other server side interactivity that your site has. You can replace them with javascript driven interactivity (like Disqus for comments). Netlify has some support for functions, but I didn’t explore that.
  • Set up Netlify using a default URL (they provide it, it is something like ‘foo-bar-123.netlify.com’). I deploy from Bitbucket. (I know some people hate on Bitbucket, and I don’t like all of their UI, but they are free for private repos, which is a win for me.) This let me understand how to deploy a simple one page site.
  • Download the site. I used wget: wget -mk http://mysite.com/ . The switches set up mirroring and convert all links to be relative.
  • Move the site to a subdirectory (I used web) and configure Netlify to deploy the HTML from that subdirectory. This lets you have other scripts outside of the webroot that can help you build the site.
  • Check in the site with the downloaded content and push it up to Bitbucket.
  • Watch the site be deployed to the Netlify default URL.
  • Access the site via SSL: ‘https://foo-bar-123.netlify.com’ and look in the console for “mixed active content” errors. Fix those as applicable. (If this was for a client, rather than a side project, I’d solve all of these.) Note that if the source site is http only, you may need to hardcode https URLs.
  • See that “pretty” html pages like ‘https://foo-bar-123.netlify.com/about’ are rendered as text.
  • Ask on Stackoverflow about the problem.
  • End up solving it by generating a _headers file after downloading the site. Update Stackoverflow question with the answer.
  • Test again that the site at the default URL looks good.
  • Update the webserver config and DNS for the database driven site. I wanted it to answer at both the old addresses (mysite.com and www.mysite.com) and a new address (generator.mysite.com).
  • Update DNS to point mysite.com and www.mysite.com to the Netlify site. Instructions here.
  • Wait for DNS to propagate. When it does, check to make sure that SSL works.
  • Add a password for generator.mysite.com
  • Test the download process with the password (the wget command changes to wget --user=USER --password=PASS -mk http://generator.mysite.com/).
  • Write a script to do the download process.
#!/bin/sh
# Mirror the password-protected generator site and convert links to relative.
wget -mk --user=USER --password=PASS http://generator.mysite.com/
mv generator.mysite.com mysitecom
cd mysitecom
# Build a Netlify _headers file so extensionless "pretty" URLs are served as HTML.
find `pwd` -type f |grep -v \\. |sed "s#`pwd`/##" > list
for i in `cat list`; do echo "/$i" >> _headers; echo "  Content-Type: text/html" >> _headers; done
rm list
cd ..
# Replace the old webroot with the freshly downloaded copy.
rm -rf web
mv mysitecom web
git add web
git commit -m "pull in latest from generator.mysite.com"
# could do a git push here as well if you wanted

Now, any time that I make changes to the site, I need to run this script and push to Netlify. I could put it in cron or a scheduled lambda if I thought I was going to make changes often. For now, I’m content to run it manually. One downside is that I self-host my analytics code and that site is not SSL enabled. So I’ll need to make a change if I want to continue to get statistics.

So, in the end I have a fast, secure site that is hosted outside my infrastructure and that is easy to update. This process, while not trivial, was easy enough that it has me thinking about where else I could use it (this blog, for instance). Highly recommend.

Amazon Alexa

I had a lot of fun working on a one day ‘hackfest’ project with Amazon Alexa. I learned a lot about voice UX and Alexa implementation details. It’s an interesting platform, especially if you have broad brand recognition and can deliver high level, valuable information via short chunks of text.

From my blog post on the Culture Foundry site:

The multi step interaction is a bit clunky, but I think it’s a great way to avoid collisions between different skills. Basically, the user calls out an ‘invocation’ like ‘open color picker’. Interactions with Alexa after that are sent directly to that particular skill until an end point is reached in the interaction tree. Each of these interactions is triggered by a different voice command, and is handled by something called an ‘intent’. Intents can have multiple triggering commands (‘what is my favorite color’ vs ‘what is my color’, for example). There’s also lightweight, session level storage while the entire invocation is occurring, which means you can easily pass data between intents without reaching out to more persistent data storage.

You can read the whole post over there.
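To give a flavor of the code side, here’s a hedged sketch of a Lambda-style intent handler in Python. The intent and attribute names are made up; the request and response shapes follow the Alexa custom skill JSON format as I understand it.

def lambda_handler(event, context):
    request = event["request"]
    session_attrs = event.get("session", {}).get("attributes", {}) or {}

    if request["type"] == "IntentRequest" and request["intent"]["name"] == "FavoriteColorIntent":
        # Pull a value saved earlier in this invocation from session storage.
        color = session_attrs.get("favorite_color", "unknown")
        speech = "Your favorite color is " + color
        end_session = True
    else:
        speech = "Tell me your favorite color."
        end_session = False

    return {
        "version": "1.0",
        "sessionAttributes": session_attrs,
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": end_session,
        },
    }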

Aborted Adventures with Amazon Athena and US PTO data

I was playing around recently with some data from the US Patent and Trademark Office, trying to import it into S3 and then query it with Athena. Athena is a serverless big data query engine with schema-on-read semantics. The data was not available in the AWS public dataset repo. Things didn’t go as well as planned. Here’s how I wanted them to go:

  1. download some data
  2. transform it into CSV (because Athena doesn’t currently support XML and I didn’t want to go full EMR, even though Hive supports XML)
  3. upload it to s3 bucket
  4. create a table based on the data
  5. run some interesting queries using Athena
  6. possibly pull some of the data in Amazon Machine Learning to do some predictions
  7. possibly put some of the data in an s3 bucket as JSON and use datatables to create a nice user interface

Like pretty much every development project I’ve ever been part of, there were surprises. What was different is that I had a fixed amount of time: since this was an exploratory project, I set a timebox. I didn’t complete much of what I wanted to get done, but I wanted to document what I did.

I was able to get through step 5 with a small portion of the data (13k rows). I ended up working a lot on Windows because I didn’t want to boot up a Vagrant box. I spent a lot of time re-learning XSLT in order to pull the data I wanted out of the XML. I used a tool called xmlstarlet for this, which worked pretty well with the small dataset. Here’s the command I ran to pull out some of the attributes of the XML dataset (you can see that I also learned about batch file arguments):

xml sel -T -t -m //case-file -v "concat(serial-number,',',registration-number,',',case-file-header/registration-date,',',case-file-header/status-code,',',case-file-header/attorney-name)" -n %filename% > %outfile%

And here’s the Athena schema I created:


CREATE EXTERNAL TABLE trademark_csv (
serialnumber STRING,
registrationnumber STRING,
registrationdate STRING,
statuscode INT,
attorneyname STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://aml-mooreds/athena/trademark/';
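The queries in step 5 are then plain SQL. Something like this (illustrative, not from my actual session) counts filings by status:

SELECT statuscode, count(*) AS filings
FROM trademark_csv
GROUP BY statuscode
ORDER BY filings DESC
LIMIT 10;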

After I had done the quick prototype, I foolishly moved on to downloading the full dataset. This caused some issues with disk storage and took a long time (the full dataset was ~300 files ranging from 500MB to 2GB in size, each containing about 150k records). I learned that I should have pulled down one large file and worked it through my entire process rather than automating each step as I went. For one thing, xmlstarlet hasn’t been updated for years, and I couldn’t find a Linux package. When I tried to compile it, it was looking for libxml, which was already installed on my EC2 instance. I didn’t bother to head further down this path. But I ran into a different issue: when I ran xmlstarlet against a 500MB uncompressed XML file, it completed, but any of the larger files caused it to give an ‘out of memory’ error. I saw one reference in the bug tracker, but it didn’t seem to apply.

So, back to the drawing board. Luckily, many languages have support for event-based parsing of XML. I was hoping to find a command line tool that could run XSLT in order to reuse some of my logic, but one doesn’t appear to exist (I found this interesting discussion and this one). Python seemed like it might work well.
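The event-based approach I had in mind looks roughly like this sketch, which streams a large file with ElementTree’s iterparse instead of loading it all into memory. It’s untested against the real USPTO schema; the element names are copied from the xmlstarlet command above.

import csv
import sys
import xml.etree.ElementTree as ET

def xml_to_csv(xml_path, csv_path):
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        # iterparse yields elements as their closing tags are seen
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "case-file":
                header = elem.find("case-file-header")
                writer.writerow([
                    elem.findtext("serial-number", ""),
                    elem.findtext("registration-number", ""),
                    header.findtext("registration-date", "") if header is not None else "",
                    header.findtext("status-code", "") if header is not None else "",
                    header.findtext("attorney-name", "") if header is not None else "",
                ])
                elem.clear()  # free memory for the processed subtree

if __name__ == "__main__":
    xml_to_csv(sys.argv[1], sys.argv[2])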

Then I ran out of time. Oh well, maybe some other time. It is fun to think about how I can automate all of this. I was definitely seeing where lambda functions and some other AWS features could have fit in nicely. I also think that using RDS might have made more sense than Athena, given the rate of update and the amount of data.

Lessons learned:

  • what works for 13k records won’t necessarily work when you have 10x, let alone 100x, that number
  • work through the entire pipeline with real world data before automating any part of it
  • use EC2 whenever you need to download a lot of data
  • make sure your buckets and Athena are in the same region. Mine weren’t, and there was no warning. That’s fine with small data, but it could have hurt from a financial viewpoint if I’d been successful at loading the whole dataset
  • it can be fun to play around with this type of stuff, but having a timebox keeps you from going down the rabbit hole too far