Rails Views Cached In Production Environment

Railroad tracksI was troubleshooting a data issue in a production environment. It wasn’t heroku, rather a rails environment hosted on AWS. It was Rails 4.2, ruby 2.2.3.

First off, it’s worth noting that there were two or three bugs that were commingled and causing issues for our client. A number of folks had spent a long time trying to troubleshoot the issue. At this point, I was tasked with taking a look and had access to all the environments. The problem only seemed to appear on production, and appeared to be a data issue. I was editing views directly on production to track down where the data issue appeared, as well as running queries on the production database and using the rails console to see what rails thought was happening. In other words, it was a hot mess. However, this debugging story isn’t the point of this post. Rather, I ran into the most peculiar situation and wanted to document it so that if I ever ran into it in the future, I would remember it.

Basically, I had a view that looked something like this:

<% cache('[key]') %>
other text
<% end %>

I changed text to be new text which included some useful debugging information. Debugged the problem and went on my merry way. The next day, early, I realized that I hadn’t changed it back, so logged back into prod and changed it back to text. Reloaded the page and didn’t see the change. What? Tried to clear the cache using the rails console and Rails.cache.delete(). No change.

After lots of googling, I realized that the view text, outside of cache tags, is cached in some other fashion. I finally figured out how to reset the cache by following these steps:

  • edit config/environments/production.rb
  • set config.cache_classes=false
  • restart passenger by touching tmp/restart.txt (see here for more on that)
  • reload the page, and now I could see text instead of new text
  • set config.cache_classes=true
  • restart passenger by touching tmp/restart.txt

This only happens when you both have a mutable production environment and are changing the view files in that environment. This won’t occur if you were using a platform like Heroku, or if you never troubleshot on production.

Load Testing Weirdness With AWS Aurora

Confused personSo I was doing a load test and saw behavior that reminded me that sometimes you just need to test.

Ran a test with 1500 requests/second with multiple servers (20ish) and smaller number of bigger servers (2-3). Saw some weird behavior with a number of 500 errors (bad gateway). Didn’t see these errors under a lower load.

Looked at the database (an aurora cluster with a single read and a single write instance) and saw that it was maxed out (cpu pegged, connections at max, couldn’t even connect at times.

Thought I need to upgrade the database. I upgraded the write instance. It was late and I failed to notice that that upgrade flipped the read and the write instances. So now the read instance was at the bigger server size and the write instance was at the smaller (original) server size. Then I re-ran the load test and everything went swimmingly (response time under 500 ms, where before it had spiked to 100 secs or more).

Great, problem solved. The larger instance size solved it.

But wait, it didn’t. The app was connecting to the primary endpoint, which is the master write node. I didn’t believe it, so I double checked and matched test times against connection spikes to the db.

So somehow, the flipping of the database to have a different primary Aurora instance (but no change in db size) caused a radical change in system behavior under heavyish loadfor a distributed php application.


Using AWS for load testing experimentation

Someone with heavy weightThe cloud is amazing for load testing your system. If you design your system to be behind a load balancer (which, in many applications, means pushing state to a database and having stateless compute nodes), you can easily switch out those nodes in different scenarios.

I just load tested a system I’m working on and changing out the compute nodes was fairly easy. Once I’d built a number of servers (something I scripted partially but didn’t fully automate because the return wasn’t there) and troubleshot some horizontal scaling issues that popped up in the application, I was able to:

  • take a server out of service behind the load balancer
  • stop it
  • change the instance type
  • start it
  • re-run any needed config changes on the server
  • update DNS if needed (depending on if you have a pinned IP address or not)
  • add it back to the load balancer

Swap out a few instances and you have a new setup for your load test. When you are done, follow the process in reverse to save yourself some money.

Incidentally, increasing the number or size of compute nodes didn’t have the desired effect of being able to handle more load.

What turned out to be the root issue? The database was pegged, both in terms of CPU and connections. Just goes to show that when you’re load testing, you really need to be looking at different aspects of the system, thinking about where your weak bottlenecks are, and use the scientific method of hypothesis, experiment, result.

Follow the money, cloud edition

Clouds in the sky

No, not that kind of cloud

This post was really eye opening and lets you know who are the real players in the public cloud space. I especially enjoyed the metric of capex as percent of revenue. From the post:

As I keep repeating, CAPEX is both a prerequisite to play in the big boy cloud and confirmation of customer success. Both IBM and Oracle are tens of billions of dollars in cloud infrastructure CAPEX behind Amazon, Google, and Microsoft. Oracle’s spending has at least ticked up, but their spending is not enough to keep pace, much less to have any hope of catching up to the infrastructure of the big three.

The whole post is worth reading if you are interested in public cloud providers in any way.

Fixing the RubyGems “Too Many Requests 429” error

lots of gemsA server on which I am working runs this command: /usr/bin/gem install --no-rdoc --no-ri aws-sdk to get the aws-sdk. I was seeing this error message:

Error: Execution of '/usr/bin/gem install --no-rdoc --no-ri aws-sdk' returned 1: ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    bad response Too Many Requests 429 (https://api.rubygems.org/api/v1/dependencies?gems=aws-sdk-elasticloadbalancingv2)

Every time I ran it I’d see a different gem that triggered the 429 response. There wasn’t much out there when searching, other than a note that I should update to a new version of bundler (which I wasn’t using).

Finally, I figured out how to get past this. What I did was manually run /usr/bin/gem install -f --no-rdoc --no-ri aws-sdk multiple times, and each time the command would get a little further. Finally all the dependencies had been downloaded. Then I was able to run it without the -f switch after that.

Obstacles to building high availability software systems

Open sign

Is your system available?

I saw a discussion on a slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Note that in any mention of high availability systems would be remiss if I didn’t mention the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations of a system that will hinder high availability fall into two categories.

The first category is actions that you don’t take, but could take:

  • single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
  • missing or incomplete automation: if you need human beings to resurrect failed parts of your system, it will meaningful amounts of time and will be error prone.
  • failing to build in elasticity and scalability of resources: when usage increases, new resources should be automatically brought online. Failure to do so will impact system performance and that could impact system availability
  • missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
  • application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

  • data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, this can hinder building the system.
  • tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
  • data models and authority requirements: poorly performing data models can impact performance. If your application requires certain operations must be from the source of record (permissions checks, for example) then a poorly performing source data model can impact performance which can impact availability.
  • latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
  • cost: high availability systems, because they have no single points of failure, cost more.

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!

Let AWS RDS handle database scutwork

Amazon DatabasesRDS is a service I’ve mentioned in the past, but it’s fantastic. You can outsource large chunks of database administration to AWS. Tasks you can forget about include backups, failover, read only replicas, and OS and DB upgrades.

This is a great fit for spinning up databases for small scale to large scale systems and prototyping.

Things to keep in mind if you start using RDS:

  • The database is launched into a VPC and will have a security group around it. You’ll need to allow IP addresses or security groups access to the port the database is living on or your connections will time out.
  • The database RDS creates is a normal database that you can manage like you can any other database you have set up and installed, but there are certain limitations (for example, no MySQL UDFs). Read the documentation and understand the limitations, but be aware they are constantly changing. I suggest subscribing to the AWS Database blog RDS category for updates.
  • RDS uses EBS under the covers and has the performance constraints of that technology. For the largest scale production systems you’ll want to test before jumping in whole hog.
  • If you are using MySQL or PostgreSQL and are running into concurrency problems, Aurora may be worth evaluating.
  • If you want to have backups past thirty five days for peace of mind of compliance concerns, you’ll need manual snapshots.
  • RDS only supports certain RDBMS and limits databases to certain sizes. If you want to run anything else on AWS, you will need to self manage your DB on EC2 or look at other data management solutions. Here are some other gotchas.
  • When using RDS you aren’t freed from all database administration tasks. There are still users to manage, indices to add, and queries to tune. Most of your RDBMS skillset is applicable to RDS, however. You’ll also need to determine when to schedule DB and OS upgrades, backups and how to size your instances. You still need to set up the optimal architecture of an RDS system including standbys and read only replicas and do other configuration both at the network and database level.
  • You can manage RDS system attributes via cloudformation, terraform and the CLI in the same way you can manage other AWS infrastructure. That said, the RDS system is stateful so you can’t treat it entirely as “cattle”.

You can learn more about RDS in the extensive documentation.

AWS Quick Starts

StopwatchIf you are looking to stand up an application quickly, I often recommend the AWS marketplace. This service has thousands of vendor maintained solutions and is a great way to get going quickly. Note that some of the solutions have extra per hour charges, and if that is the case per second billing won’t apply. These solutions are focused on individual AWS EC2 instance images (so you can quickly stand up a phpbb instance or a redmine server, for example).

However, another good option is AWS Quick Starts. These are recipes for deployments, possibly of multiple virtual machines, and are aimed at handling larger business problems. There are over 80 listed on the Quick Start page right now, ranging from creating a data lake to a HIPAA reference architecture to running devops tools like consul and bitbucket. These solutions may or may not carry additional charges, so make sure review licensing and billing information as well as functionality.

If you are thinking about setting up a complex system in AWS, it’s worth some time to see if someone has put a reference Quick Start together. It may not fit your needs perfectly, but can be a good place to begin.

AWS documentation now open source and on Github

TypewriterThis was announced recently. The AWS docs are now available on Github for everyone to review and improve. I love documentation (have for years). I think it’s great that AWS is now allowing PRs against their documentation. Some products have not yet uploaded their docs, ahem. It can only improve the speed of change.

I think it will also give a good glimpse into usage stats of AWS services. If a service doesn’t have any PRs or issues opened, it’s unlikely to be widely used (or, alternatively, it could be totally stable, or have users that don’t use Github). It’d be a fun project to pull the number of contributions to these repos via the Github API and publish that data.

I still feel that guides like og-aws have a place in the world of AWS documentation–opinions and real world stories fit better there than they do in official AWS documentation. And this is still too new to know if PRs and fixes will be pulled into the docs in a timely manner. But it’s great to see the AWS teams experimenting with ways to improve their documentation at scale.

Software infrastructure configuration options

I ran across this great article when I was reading up on Terraform.

It does a good job of running through the options (puppet, cloudformation, etc) on how to set up your infrastructure via software. Here’s a great quote on why they chose Terraform:

On the other hand, with the kind of declarative approach used in Terraform, the code always represents the latest state of your infrastructure. At a glance, you can tell what’s currently deployed and how it’s configured, without having to worry about history or timing. This also makes it easy to create reusable code, as you don’t have to manually account for the current state of the world. Instead, you just focus on describing your desired state, and Terraform figures out how to get from one state to the other automatically.

© Moore Consulting, 2003-2017 +