Using AWS for load testing experimentation

Someone with heavy weightThe cloud is amazing for load testing your system. If you design your system to be behind a load balancer (which, in many applications, means pushing state to a database and having stateless compute nodes), you can easily switch out those nodes in different scenarios.

I just load tested a system I’m working on and changing out the compute nodes was fairly easy. Once I’d built a number of servers (something I scripted partially but didn’t fully automate because the return wasn’t there) and troubleshot some horizontal scaling issues that popped up in the application, I was able to:

  • take a server out of service behind the load balancer
  • stop it
  • change the instance type
  • start it
  • re-run any needed config changes on the server
  • update DNS if needed (depending on if you have a pinned IP address or not)
  • add it back to the load balancer

Swap out a few instances and you have a new setup for your load test. When you are done, follow the process in reverse to save yourself some money.

Incidentally, increasing the number or size of compute nodes didn’t have the desired effect of being able to handle more load.

What turned out to be the root issue? The database was pegged, both in terms of CPU and connections. Just goes to show that when you’re load testing, you really need to be looking at different aspects of the system, thinking about where your weak bottlenecks are, and use the scientific method of hypothesis, experiment, result.

Follow the money, cloud edition

Clouds in the sky

No, not that kind of cloud

This post was really eye opening and lets you know who are the real players in the public cloud space. I especially enjoyed the metric of capex as percent of revenue. From the post:

As I keep repeating, CAPEX is both a prerequisite to play in the big boy cloud and confirmation of customer success. Both IBM and Oracle are tens of billions of dollars in cloud infrastructure CAPEX behind Amazon, Google, and Microsoft. Oracle’s spending has at least ticked up, but their spending is not enough to keep pace, much less to have any hope of catching up to the infrastructure of the big three.

The whole post is worth reading if you are interested in public cloud providers in any way.

Obstacles to building high availability software systems

Open sign

Is your system available?

I saw a discussion on a slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Note that in any mention of high availability systems would be remiss if I didn’t mention the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations of a system that will hinder high availability fall into two categories.

The first category is actions that you don’t take, but could take:

  • single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
  • missing or incomplete automation: if you need human beings to resurrect failed parts of your system, it will meaningful amounts of time and will be error prone.
  • failing to build in elasticity and scalability of resources: when usage increases, new resources should be automatically brought online. Failure to do so will impact system performance and that could impact system availability
  • missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
  • application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

  • data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, this can hinder building the system.
  • tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
  • data models and authority requirements: poorly performing data models can impact performance. If your application requires certain operations must be from the source of record (permissions checks, for example) then a poorly performing source data model can impact performance which can impact availability.
  • latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
  • cost: high availability systems, because they have no single points of failure, cost more.

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!

Who’s Afraid of Continuous Deployment?

Fish leaping to a larger pool

Leaping to larger pool

So, who’s afraid of continuous deployment? I am, for one. And I’m not alone. I taught hundreds of people in AWS courses over the past two years. We often discussed continuous delivery and deployment and I asked if this was practiced at their places of work. I’d say about 5-10% of folks said yes. I conducted a very informal survey across two technical slacks as well. Unfortunately I had my terms wrong and asked about continuous delivery:

Wanted to do a quick poll. Can you please give a thumbs up to this message if you or your team does continuous delivery of your software product, and a thumbs down if you don’t. And a :penguin: if it doesn’t apply?

The results were:

  • Did CD: 27
  • Did not do CD: 25
  • Does not apply: 3

In the poll, I defined continuous delivery as “if a change is merged to the mainline branch and passes all the tests, it is deployed to production (or whatever environment your customers see) without human involvement”. This was actually a source of discussion, as some folks were very close to this (they deployed to beta environments where only a few customers saw it, or required one human to push a button to actually release, but everything up to that point was automated). Also, someone shared this link about the difference between continuous delivery and continuous deployment. Turns out I was using the term continuous delivery incorrectly. What I defined as continuous delivery was actually continuous deployment. Whoops!

That said, it was interesting that a large number of folks did not deploy code automatically, almost half (note that I believe the poll had a bias because I asked in one slack on the #devops channel. The numbers from the other slack had less than half doing continuous deployment). I’ve worked at a number of small startups, some without paying customers, and I’ve never worked in a place with continuous deployment. I’ve been in jobs with continuous integration and continuous delivery (and this provides a lot of value) but not continuous deployment. I wanted to talk about some reasons why.

The first reason is that continuous deployment simply doesn’t apply. If you are building software that is deployed to customer sites (on-prem), or is tied to hardware, then it doesn’t make sense to work toward CD because there will always be a manual delivery component. Another reason why it might not apply is legal compliance. Folks in the slacks pointed out that in some regulatory regimes you legally are required to have a human ‘push a button’ to deploy because more than one person needed to be involved in a code deploy to satisfy the law and the auditors. These are totally legitimate reasons for not doing continuous deployment.

Next, let’s discuss the reasons based on fear or lack of software hygiene (automated tests or a robust type system). Before I step into this, I want to acknowledge that there may be times in the life of your business where such software hygiene is detrimental to your chances of survival–you need to get an MVP out and test your value in the market, for example. However, in my years of experience I find that following proper software hygiene is far easier to do if adhered to from the beginning. If you don’t, eventually the difficulty of changing the system will grow along with its complexity. You can bolt on testing later, but it is difficult.

I also want to emphasize that I’ve been in all these situations myself. In some ways this blog post is a warning for future me when I try to shirk these practices.

  • If you don’t have automated test coverage, continuous deployment is reckless. This often happens in systems where the testing was bolted on after the system had been developed for a while. The solution is to work towards having enough test coverage to give yourself confidence (it swaddles your code).
  • A system may have configuration deeply tied to a database. Many content management systems are in this boat, which makes it very difficult to roll new configuration forward automatically.
  • Not having an automated rollback strategy. If you are going to continuously deploy, you need to have a way to rollback with confidence, with one script. If you are on heroku, heroku rollbacks help here. If you are running rails code, you can use db:rollback but you’ll need to know how many steps to rollback (I couldn’t find anything that rolled all migrations back to a given timestamp) and you’ll want to be careful about losing data. It may make more sense to run migrations in a different release, and always have the code be backward compatible. Lots of interesting reading about that strategy in strong_migration’s docs. This solution will vary from application to application.
  • Not having enough users to safely canary. One way to know if your new release has problems is to do a blue/green deployment and send just a fraction of your traffic there (you could use a weighted DNS round robin solution). But if you only have a small number of users, the canary userbase won’t adequately run through all the code paths.
  • Fear of breaking key user flows. At a recent company we did basic manual regression tests just before deployment. These could have been easily automated via selenium and would have made sure that at least basic functionality was available. Also see this post from 2013 on smoke testing.

All of these are not really technical issues, they’re prioritization issues. At this point in time most web applications can be continuously deployed. The tooling and the knowledge is out there, given the business and technology teams commitment.

However, this in some ways sidesteps the real question. Why is continuous deployment a goal worth prioritizing, especially when the team has to spend time supporting that instead of giving customers more features? CD is extra work to set up, but once it is running then you can deliver features at a very rapid pace, and you never have a feature sitting around waiting for other orthogonal features. So, in a way, it will actually lead to more features and better development. There’s also the long term benefits of software hygiene for the ability of the system to evolve.

Software infrastructure configuration options

I ran across this great article when I was reading up on Terraform.

It does a good job of running through the options (puppet, cloudformation, etc) on how to set up your infrastructure via software. Here’s a great quote on why they chose Terraform:

On the other hand, with the kind of declarative approach used in Terraform, the code always represents the latest state of your infrastructure. At a glance, you can tell what’s currently deployed and how it’s configured, without having to worry about history or timing. This also makes it easy to create reusable code, as you don’t have to manually account for the current state of the world. Instead, you just focus on describing your desired state, and Terraform figures out how to get from one state to the other automatically.

Serverless Framework

I had coffee with an acquaintance who is doing a lot of event driven data processing. Whereas ten years ago to tackle this problem you might use an ETL tool like Pentaho or Talend, now his process runs entirely on AWS Lambda functions. He is leveraging the Serverless framework to manage and deploy these applications. As I understand it there is a thin shim layer between the business logic and the lambda event handler, but the business logic is isolated and knows nothing about its environment. That makes the business logic very testable.

His description of the Serverless framework intrigued me. As he described it, the framework is driven by a simple yaml file and takes care of, among other tasks, the complicated infrastructure set up to tie Lambda functions to a variety of AWS events. I haven’t done it myself, but I’ve heard that setting up a lambda to API Gateway link is a real bear. Doing so allows a lambda function respond to a web requests without any AWS authentication, and is a key use case.

You can write and deploy lambda functions in any language that AWS Lambda supports (unfortunately, not java 9 at the moment). Here’s a java/maven/serverless tutorial. It also supports multiple cloud providers, though I haven’t done much beyond note that the documentation exists.

However, using Serverless does require writing code. If evaluating a a complicated ETL process which non developers needed to be able to understand and support, Serverless would not be a good fit. I’m not aware of any abstraction layers on top of it, though I guess you could run, for example, Pentaho Kettle jobs within lambda. There’s also an issue around cold start times–when your code hasn’t been invoked for a while, it can take longer to start up when a request or event occurs. Apparently there are partial solutions, but your lambdas still get cycled every few hours regardless.

I worked through some of the tutorials and was impressed at just how easy it was to get started. If I had a simple API or data processing pipeline to build, Serverless would definitely be on my short list of possible implementation options. It is very inexpensive, scales easily and encourages encapsulation.

Incidentally, my acquaintance’s company is hosting a lunch and learn on this technology at the end of the month. More details here.

“The future is already here, but it’s only available as a managed AWS service”

This entire post about how Kubernetes could become the distributed operating system of choice is worth reading.  But one statement really struck me:

Well, as they say, the future is already here, but it’s only available as an AWS managed service.

The “they” in this is apparently not William Gibson, as I thought.  More details here.

For the past couple of years the cloud providers have matured and moved from offering infrastructure as a service (disk, compute) to platform as a service offerings (sqs, which is a managed message queue like activemq, or kinesis, a managed data ingestion system like kafka, etc).  Whenever you think about installing a proprietary or open source package, you should include the cloud provider offerings in your evaluation matrix.  Of course, the features you need may not be there, or the cost may be prohibitive, but including them in an evaluation makes sense because of the speed of deployment and the scaling available.

If you think a system architecture can benefit from a message queuing system, do you want to spend time setting up and maintaining such a system, or do you want to spin up an SQS queue in a few minutes?

And the cost may not be prohibitive, depending on the skillset of your internal team and your team’s desire to run such plumbing services.  It can be really hard to estimate running costs of infrastructure services, though you can estimate it by looking at internal teams and seeing similar services they run and how much money it takes.  The nice thing about cloud services is that the costs are very transparent.  The kinesis data streams pricing example walks through a scenario and concludes:

For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.

Another AWS instructor made the point that AWS and other cloud services invert the running costs of IT infrastructure.  In a typical enterprise, the running costs of your data center and infrastructure are like an iceberg–10% is explicit (server costs, electricity, etc) and 90% is implicit (payroll, time spent upgrading and integrating systems).  In the cloud world those numbers are reversed and far more of your infrastructure cost is explicitly laid out for you and your organization.

Truly the future.

The UNIX and Linux System Administration Handbook

I’ve been reading the UNIX and Linux System Administration Handbook.  It’s a real tome, with about 1500 pages.  It’s got five authors and some great cartoons, and covers everything from shell scripts to disk to email to system management daemons (check out the table of contents).  No one should ever read this book cover to cover.  That would be just silly.

I’ve been really enjoying picking and choosing chapters to read, however.  The sheer breadth of this book means that anyone with an interest in modern software development can find something useful in it.

Given my interest in AWS, I read all the sections about cloud computing.  These were high level and not super interesting to me, but I think they’d be great if you were a novice about cloud computing, and they did have a great survey of the major public cloud providers and when it made sense to use each of them.

Then I moved on to the networking sections.  I honestly can say that I didn’t understand fundamental routing protocols before I read that section.  This is obviously closer to the heart of system administration, and the authors did a great job with concepts and hands on knowledge of networking.

After that I moved on to containers.  Did you know that Docker is the new hotness?  I had heard of it, but didn’t understand why.  Now I do.  It’s hot for much the same reason as the ‘fat jar’ deployment is preferred in java land.  Having one single artifact that rolls up code and dependencies is a way to simplify deployments of production code, including rollbacks.  The authors focus on the fundamentals of containers, primarily Docker, but they also cover various orchestration layers like Mesos and Kubernetes.

I’m now in the middle of a chapter about continuous integration and continuous deployment, where they are discussing the concepts as well as Jenkins, one of the key technologies (see, I told you everyone could find something in this book).  After that, I look forward to reading about configuration management.

If you work in software at all and are involved in production systems, you’ll be able to find something in the UNIX and Linux System Administration Handbook (and if you aren’t, I’d be interested in knowing who owns that responsibility).

Restoring a single table from an Amazon RDS backup

material-icon-1307676_640When you use SQL, how do you write delete statements at the database prompt?

A delete statement typically looks like this: delete from table_name where column_name = 'foo';. I usually write it in this order:

  1. delete
  2. delete where column_name = 'foo';
  3. delete from table_name where column_name = 'foo';

Even though this is a pain because you have to move back and forth (I really need to look into vi keybindings for mysql), it prevents you from making sending this command by accident: delete from table_name; which deletes all the data in your table.  (Another alternative is to never use the interactive client and always write out your delete statements in a file and run that file to delete data.)

But, recently, I did exactly that, because I forgot.  I deleted all the data from one table in our production database.  It was billing data, so rather important.  Luckily, I am using Amazon RDS and had set up backup retention.

I wanted to outline what I did to recover from this.

  • I took a deep breath.
  • I wrote a message on the slack channel documenting what had happened and the possible customer impact.
  • Depending on which data is removed, it’s possible you will want to put the application in maintenance mode and/or inform your customers of the issues.  What I deleted was used rarely enough that I didn’t have to take these steps.
  • I looked at how to restore an Amazon RDS backup.
  • I restored the missing data.
  • I communicated that things were back to normal to internal stakeholders.

Unfortunately, it wasn’t clear how to restore a single table.  I’m used to being able to download a .sql file and hand edit it, but that’s not an option.  Stackoverflow wasn’t super helpful.   But if there’s anytime you want clarity, it’s when you are restoring production data.  You don’t want to compound the problem by screwing up something else.

So, here’s how to restore a single table from an Amazon RDS backup:

  • Note the time just before you deleted the data.  (Another reason the slack message is nice.  chatops ftw.)
  • Start up another instance from that moment.  I named it something obvious like ‘has-data-from-tablename’.
  • Twiddle your thumbs anxiously while the new instance starts up.
  • The instance is put into your default security group (as of this writing) which probably doesn’t allow mysql access.  Make sure you modify this security group to allow access.
  • When the instance is up, do a dump of the table you need: mysqldump -t --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h -P3306 database_name tablename > restore-table_name.sql; (-t omits the create database/table statements.)
  • If your table is has had writes since you deleted everything, you may need to manually pull down the current data from the production system and merge it into restore-table_name.sql; I was able to avoid this step.
  • Load the data using mysql mysql --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h -P3306 database_name < restore-table_name.sql;
  • Review to make sure the data is correct.
  • Test the application.
  • Update the slack channel, and do any other notifications you need to (customers, internal contacts, etc).
  • Revoke the default security group access you allowed above.
  • Delete the ‘has-data-from-tablename’ instance.

Note this only works if you caught your mistake within the backup retention window. (Make sure you set that up.)  We aren’t multi AZ or clustered, so I’m not sure how that would affect things.

Happy deep breathing!

Bare minimum of ops tasks for heroku

Awesome, you are a CTO or founding engineer of a newborn startup.  You have an web app up on Heroku and someone is paying you money for it!  Nice job.

Now, you need to think about supporting it.  Heroku makes things way easier (no racking and stacking, no purchasing hardware, no configuring apache) but you still to set up some operations.

Here is the bare minimum you need to do to make sure you can sleep at night.  (Based on a couple of years of heroku projects, and being really really cheap.)

  • Have a staging environment
    • You don’t want to push code direct to prod, do you?
    • This can be a free dyno, depending on the complexity of your app.
    • Pipelines are nice, as is preboot.
    • Cost: free
  • Have a one line deploy.
    • Or, if you like CD/CI, an automatic deploy or a one click deploy.  But make it really easy to deploy.
    • Have a deploy script that goes straight to production for emergencies.
    • Cost: free
  •  Backups
    • User data.  If you aren’t using a shared object store like S3, make sure you are doing a backup.
    • Database.  Both heroku postgresql and amazon RDS have point and click solutions.  All you have to do is set them up.  (Test them, at least once.)
    • Cost: freeish, depending on the solution.  But, user data is worth spending money on.
  • Alerting
    • Heroku has options if you are running professional dynos.
    • Uptimerobot is a great free third party service that will check ports every 5 minutes and has a variety of alert options.  If you want SMS, you have to pay for it, but it’s not outrageous.
    • Cost: free
  • Logging
    • Use a logging framework (like slf4j or the rails logger, and mark error conditions with a string that will be easy to search for.
    • Yes, you can use heroku logs but having a log management solution like papertrail will make you much happier.  Plus, it’s free for 2 days of logfiles.
    • Set up alerts with papertrail as well.  These can be more granular.
    • Cost: free
  • Create a list of third party dependencies.
    • Sign up for status alerts from these.  If you have pro slack, you can have them push an email to a channel.  If you don’t, create an alias that receives them.  You want to be the person that tells your clients about outages, not the other way around.
    • Cost: free
  • Communication
    • Internal
      • a devops_alert slack channel is my preferred solutions.  All deploys and other alerts go there.
    • External
      • create a mailing list for your clients so you can inform them of issues easily.  Google groups is fine, but use whatever other folks are using.  Don’t use an alias in your email–you’ll forget to add new clients.
      • do not use this mailing list for marketing purposes, unless you want to offload the burden of keeping the list up to date to the marketing department.
      • do make sure when you gain or lose clients you keep this up to date
    • Run through a disaster in your mind and make notes on how you would communicate the issue, both internally and externally.  How often do you update your team?  How often do you update your clients?  What about an internal issue (some of your code screwed up) vs an external issue.  This doesn’t need to be exhaustive, but thinking about it ahead of time and making some notes will help you in the crisis.
    • Cost: free

All of this is probably a four hour project, max.

But once this is done, you’ll rest easier at night, knowing you have what you need to troubleshoot and recover from production issues.

© Moore Consulting, 2003-2017 +