
How to solve the “this application is slow” type of problem

How do you feel when your boss says “the application is slow, please speed it up”?

Personally, my heart sinks and then I get excited. Though they can be frustrating, this kind of thorny performance issue is fun if you look at it the right way.

Whenever I tackle a problem like this, the first thing I do is define the start and the finish line as precisely as I can.

This requires understanding the application’s behavior and architecture. Where is the data stored? How is it presented? How is it modified? Are there types of operations that happen regularly? What exactly is “slow” about the application?

Try to avoid jumping to conclusions here. I understand the temptation to make a change as soon as you think of one that might help, but it’s better to approach this systematically.

Suppose you find out the issue is the database. Operations are too slow and the CPU is not pegged. You know the type and version of the database, how the application calls it, and more.

Here, a good finish line might be “we need to be able to handle updating the main table with 50k items in 1 second”. If you don’t have a precise finish line, this type of work can be endless and frustrating. After all, it is almost always possible to “make it faster”, but you will reach the point of diminishing returns.

If possible, set up a test scenario/system that you can run through repeatedly as you make changes. If not, figure out some other way to test that changes have a positive impact.
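
Here is a minimal sketch of such a repeatable scenario, assuming the finish line from the example above (50k items in 1 second). The `run_scenario` helper and the placeholder workload are hypothetical; in real use the block would perform the actual update against a test database.

```ruby
require "benchmark"

# Hypothetical finish line taken from the example above.
TARGET_ITEMS = 50_000
TARGET_SECONDS = 1.0

# Times one run of the scenario and reports whether it met the finish line.
# The block stands in for whatever operation you are trying to speed up.
def run_scenario(items, target_seconds)
  elapsed = Benchmark.realtime { yield items }
  { elapsed: elapsed, passed: elapsed <= target_seconds }
end

# Placeholder workload; swap in the real update you are measuring.
items = Array.new(TARGET_ITEMS) { |i| i }
result = run_scenario(items, TARGET_SECONDS) do |batchable|
  batchable.each_slice(1_000) { |batch| batch.sum }
end

puts format("elapsed: %.3fs, finish line met: %s", result[:elapsed], result[:passed])
```

Running the same script after every change gives you a number to compare against the finish line, instead of a feeling.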

Next, brainstorm possible solutions and assign two numbers to each: level of effort (LOE) and hoped-for improvement. These don’t need to be precise; a scale of 1-5 is fine. Here’s an example for a database bottleneck:

  • upgrading the size of the database: LOE low, impact medium
  • increasing the disk speed: LOE low, impact high
  • running explain plans and adding suggested indices: LOE medium, impact high
  • offloading operations to read replicas: LOE high, impact high
  • …etc, etc

Then start doing low effort, high impact changes. Run through your scenario and tests after each one. See if you get closer to the finish line. Rinse and repeat.
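
The ordering step can be sketched in a few lines. The 1-5 scores below are my illustrative translation of the low/medium/high ratings in the list above, and `Candidate` and `prioritize` are hypothetical names, not part of any tool.

```ruby
# Candidate fixes scored on a 1-5 scale (illustrative numbers only).
Candidate = Struct.new(:name, :effort, :impact)

CANDIDATES = [
  Candidate.new("upgrade database size", 1, 3),
  Candidate.new("increase disk speed", 1, 5),
  Candidate.new("explain plans + indices", 3, 5),
  Candidate.new("offload to read replicas", 5, 5),
]

# Highest impact per unit of effort first; ties go to the cheaper change.
def prioritize(candidates)
  candidates.sort_by { |c| [-(c.impact.to_f / c.effort), c.effort] }
end

prioritize(CANDIDATES).each do |c|
  puts format("%-26s effort=%d impact=%d", c.name, c.effort, c.impact)
end
```

The ratio is a crude heuristic. Its only job is to give you a defensible order in which to try things.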

This type of performance issue is a case where hiring an outside consultant/contractor can make sense. You don’t have to spend a lot, since the scope of work can be limited. They can work with you to define the start and finish lines as well as possible steps if you don’t have the time or knowledge to do so. Then have an internal team take the specific actions and test each change to see if it helps.

When in doubt, test it out

When I taught AWS certification courses, I’d often get questions about how a service behaved under load or other unusual circumstances. Frequently I could answer from personal experience or by asking other instructors; occasionally class members provided their insights. Sometimes I could dig up relevant vendor documentation.

However, my default answer was:

“Test it for yourself. There’s no substitute for testing.”

This is one of the great advantages of the cloud. When you have a question about the performance or behavior of a service or system, spin it up and test it. This will cost you money and some time configuring the system, but certainly will be cheaper than ordering hardware, racking it and then also configuring the system. When you’re done with your testing, you can tear down the infrastructure and never worry about it again. Sure beats shipping a server back to the manufacturer.

Of course, no testing scenario can replicate production perfectly. But you can get pretty close (especially if you can reuse production traffic).

When you do test, start by documenting what you want to achieve. What is the question you are trying to answer? Make sure to seek feedback from other team members and/or search online, as it’s possible someone has already answered your question. If you do find answers, understand under what circumstances the tests were performed, as the cloud and the offered services change over time.

Some examples of cloud infrastructure questions you might want to answer:

  • How do EBS volumes of different sizes and types perform under load?
  • When a Kubernetes cluster running on GKE is under load, what happens when you add an additional node? An additional pod?
  • What happens when you turn off a NAT gateway while a file is being uploaded to S3 from an EC2 instance in a private subnet (without an S3 VPC endpoint)?
  • What is the cold start time for an empty Azure function? What about a function loading your dlls?

Think about what steps you are going to take to try to answer the question.

With your question and methodology spelled out, spin up your testing environment. Having your infrastructure represented as code will make this quick, especially if you have a complicated environment. If you are creating the test environment manually, record settings and other configuration in a text file to be able to re-create the environment later.

Run your tests. If you are load testing, find an open source or commercial load testing tool. What you need depends on your goals: you need a different tool to test 100k+ simultaneous users on a website than you do when trying to understand how an internal API handles 100 requests/second.

Review the data to see if your questions are answered. More questions or areas of interest may appear. Adjust your tests to answer them.
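
For the small end of that range (an internal API at around 100 requests/second), a rough sketch of running a load and summarizing the results might look like the following. It is not a substitute for a real load testing tool. `run_load` and `percentile` are hypothetical helpers, and the `request` lambda is a stand-in for an actual HTTP call to the system under test.

```ruby
# Starts `total` calls of `request` at roughly `rate` per second and returns
# the latency (in seconds) of each call. `request` is any callable.
def run_load(rate:, total:, request:)
  interval = 1.0 / rate
  threads = Array.new(total) do |i|
    sleep(interval) if i > 0
    Thread.new do
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      request.call
      Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    end
  end
  threads.map(&:value)
end

# Nearest-rank percentile of a list of samples.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

# Stand-in request that just sleeps for a few milliseconds.
latencies = run_load(rate: 100, total: 50, request: -> { sleep(rand * 0.01) })
puts format("p50=%.4fs p95=%.4fs max=%.4fs",
            percentile(latencies, 50), percentile(latencies, 95), latencies.max)
```

Looking at percentiles rather than the average matters here: a handful of very slow requests can hide behind a healthy-looking mean.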

Once you have your answers to the desired level of certainty, tear down your testing infrastructure.

Document what you tested, how you tested and your results. Circulate this internally to help your team. If possible, publish it on your company blog to both help others in the same boat and to boost your company’s standing in the community.

All the vendor documentation in the world is no substitute for rolling up your sleeves and testing.

Load Testing Weirdness With AWS Aurora

So I was doing a load test and saw behavior that reminded me that sometimes you just need to test.

Ran a test at 1500 requests/second with multiple servers (20ish) and with a smaller number of bigger servers (2-3). Saw some weird behavior with a number of 500 errors (bad gateway). Didn’t see these errors under a lower load.

Looked at the database (an Aurora cluster with a single read and a single write instance) and saw that it was maxed out (CPU pegged, connections at max, couldn’t even connect at times).

Thought I needed to upgrade the database, so I upgraded the write instance. It was late and I failed to notice that the upgrade flipped the read and the write instances. So now the read instance was at the bigger server size and the write instance was at the smaller (original) server size. Then I re-ran the load test and everything went swimmingly (response time under 500 ms, where before it had spiked to 100 secs or more).

Great, problem solved. The larger instance size solved it.

But wait, it didn’t. The app was connecting to the primary endpoint, which is the master write node. I didn’t believe it, so I double checked and matched test times against connection spikes to the db.

So somehow, the flipping of the database to have a different primary Aurora instance (but no change in db size) caused a radical change in system behavior under heavyish load for a distributed PHP application.

Mysteries.

Who’s Afraid of Continuous Deployment?

Leaping to larger pool

So, who’s afraid of continuous deployment? I am, for one. And I’m not alone. I taught hundreds of people in AWS courses over the past two years. We often discussed continuous delivery and deployment and I asked if this was practiced at their places of work. I’d say about 5-10% of folks said yes. I conducted a very informal survey across two technical slacks as well. Unfortunately I had my terms wrong and asked about continuous delivery:

Wanted to do a quick poll. Can you please give a thumbs up to this message if you or your team does continuous delivery of your software product, and a thumbs down if you don’t. And a :penguin: if it doesn’t apply?

The results were:

  • Did CD: 27
  • Did not do CD: 25
  • Does not apply: 3

In the poll, I defined continuous delivery as “if a change is merged to the mainline branch and passes all the tests, it is deployed to production (or whatever environment your customers see) without human involvement”. This was actually a source of discussion, as some folks were very close to this (they deployed to beta environments where only a few customers saw it, or required one human to push a button to actually release, but everything up to that point was automated). Also, someone shared this link about the difference between continuous delivery and continuous deployment. Turns out I was using the term continuous delivery incorrectly. What I defined as continuous delivery was actually continuous deployment. Whoops!

That said, it was interesting that almost half of the respondents did not deploy code automatically. (I believe the poll was biased because in one slack I asked on the #devops channel; in the other slack, fewer than half were doing continuous deployment.) I’ve worked at a number of small startups, some without paying customers, and I’ve never worked in a place with continuous deployment. I’ve been in jobs with continuous integration and continuous delivery (and this provides a lot of value) but not continuous deployment. I wanted to talk about some reasons why.

The first reason is that continuous deployment simply doesn’t apply. If you are building software that is deployed to customer sites (on-prem), or is tied to hardware, then it doesn’t make sense to work toward CD because there will always be a manual delivery component. Another reason why it might not apply is legal compliance. Folks in the slacks pointed out that in some regulatory regimes you are legally required to have a human ‘push a button’ to deploy because more than one person needs to be involved in a code deploy to satisfy the law and the auditors. These are totally legitimate reasons for not doing continuous deployment.

Next, let’s discuss the reasons based on fear or lack of software hygiene (automated tests or a robust type system). Before I step into this, I want to acknowledge that there may be times in the life of your business where such software hygiene is detrimental to your chances of survival–you need to get an MVP out and test your value in the market, for example. However, in my years of experience I find that following proper software hygiene is far easier to do if adhered to from the beginning. If you don’t, eventually the difficulty of changing the system will grow along with its complexity. You can bolt on testing later, but it is difficult.

I also want to emphasize that I’ve been in all these situations myself. In some ways this blog post is a warning for future me when I try to shirk these practices.

  • If you don’t have automated test coverage, continuous deployment is reckless. This often happens in systems where the testing was bolted on after the system had been developed for a while. The solution is to work towards having enough test coverage to give yourself confidence (it swaddles your code).
  • A system may have configuration deeply tied to a database. Many content management systems are in this boat, which makes it very difficult to roll new configuration forward automatically.
  • Not having an automated rollback strategy. If you are going to continuously deploy, you need a way to roll back with confidence, with one script. If you are on Heroku, Heroku rollbacks help here. If you are running Rails code, you can use db:rollback, but you’ll need to know how many steps to roll back (I couldn’t find anything that rolled all migrations back to a given timestamp) and you’ll want to be careful about losing data. It may make more sense to run migrations in a different release, and always have the code be backward compatible. There is lots of interesting reading about that strategy in the strong_migrations docs. This solution will vary from application to application.
  • Not having enough users to safely canary. One way to know if your new release has problems is to do a blue/green deployment and send just a fraction of your traffic there (you could use a weighted DNS round robin solution). But if you only have a small number of users, the canary userbase won’t adequately run through all the code paths.
  • Fear of breaking key user flows. At a recent company we did basic manual regression tests just before deployment. These could have been easily automated via selenium and would have made sure that at least basic functionality was available. Also see this post from 2013 on smoke testing.
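
To make the canary idea concrete, here is a small sketch of sending a fixed fraction of users to the new release. Note that this swaps in application-level hashing rather than the weighted DNS approach mentioned above. The `canary?` and `pool_for` helpers and the `user-N` ids are hypothetical.

```ruby
require "zlib"

# Deterministically assigns a user to the canary release. Hashing the user id
# keeps each user on the same side across requests.
def canary?(user_id, canary_percent)
  Zlib.crc32(user_id.to_s) % 100 < canary_percent
end

# Maps a user to the existing (:blue) or the new (:green) deployment.
def pool_for(user_id, canary_percent)
  canary?(user_id, canary_percent) ? :green : :blue
end

# Check what share of a made-up user base would land on the canary at 5%.
users = (1..10_000).map { |n| "user-#{n}" }
share = users.count { |u| canary?(u, 5) } / users.length.to_f
puts format("canary share: %.1f%%", share * 100)
```

With only a few hundred users, 5% is a couple of dozen people, which illustrates the problem in the bullet above: they are unlikely to exercise every code path.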

None of these are really technical issues; they’re prioritization issues. At this point in time most web applications can be continuously deployed. The tooling and the knowledge are out there, given the business and technology teams’ commitment.

However, this in some ways sidesteps the real question. Why is continuous deployment a goal worth prioritizing, especially when the team has to spend time supporting that instead of giving customers more features? CD is extra work to set up, but once it is running you can deliver features at a very rapid pace, and you never have a feature sitting around waiting for other orthogonal features. So, in a way, it will actually lead to more features and better development. There are also the long-term benefits of software hygiene for the system’s ability to evolve.

Pact Testing

I attended the Google Developer Group meetup last week and enjoyed many of the talks. It was a lightning session, so there were ten speakers. In particular I really enjoyed “Pact Contract Testing” by Claire Chen. The idea behind Pact Testing, which has been around since 2013 and has had four major specification releases, is to formalize the contract between an API consumer and producer and allow each side of the API conversation to be developed independently. You can record the interactions between each consumer and producer and re-play them during testing to verify that no regressions have occurred. It’s really designed for a situation where you control both the consumer and the producer and want to verify that there are no breaking changes when either of them evolves.

So, this seems like mocks and stubs on steroids with the additional benefit of being cross platform (many languages are supported) and exercising the entire producer or consumer independently. You can also run an external server to maintain all the pacts independently.
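
To show the shape of the idea, here is a hand-rolled sketch of checking a recorded interaction against a producer response. It does not use the Pact library or its API. The contract format, the `verify` helper, the `/listings/42` path and the fake provider are all hypothetical.

```ruby
require "json"

# A recorded interaction: what the consumer sends and the minimum it relies on
# in the response. All names and fields here are made up for illustration.
CONTRACT = {
  request: { method: "GET", path: "/listings/42" },
  response: { status: 200, body: { "id" => 42, "title" => String } },
}

# Returns a list of mismatches between the contract and an actual response.
# Extra fields in the actual body are allowed; missing or mismatched ones are not.
def verify(contract, status:, body:)
  expected = contract[:response]
  problems = []
  problems << "status #{status} != #{expected[:status]}" unless status == expected[:status]
  expected[:body].each do |key, matcher|
    if !body.key?(key)
      problems << "missing field #{key}"
    elsif !(matcher === body[key])
      problems << "field #{key} does not match #{matcher.inspect}"
    end
  end
  problems
end

# Stand-in for the real producer, so the sketch runs on its own.
provider = lambda do |_method, _path|
  [200, JSON.generate({ "id" => 42, "title" => "Kitchen", "city" => "Boulder" })]
end

status, raw = provider.call(*CONTRACT[:request].values_at(:method, :path))
puts verify(CONTRACT, status: status, body: JSON.parse(raw)).inspect
```

An empty list means the producer still honors what this consumer depends on. The real tooling adds what this sketch lacks: recording the interactions from consumer tests, sharing them, and replaying them against the producer.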

If you are running a microservices architecture, I’d strongly recommend taking a look at this. Next time I’m involved in an API consumer/producer project, I’ll definitely be using this, and will report back then.

See also “convince me that Pact Testing is a good idea” and “what is Pact not good for?”.

Always break rails migrations into smallest chunks possible, and other lessons learned

So this was a bit of a sticky wicket that I recently extracted myself from and I wanted to make notes so I didn’t make the same mistake again. I was adding a new table that related two existing tables and added the following code:

class CreateTfcListingPeople < ActiveRecord::Migration
  def change
    create_table :tfc_listing_people do |t|
      t.integer :listing_id, index: true
      t.string :person_id, limit: 22, index: true

      t.timestamps null: false
    end

    add_foreign_key :tfc_listing_people, :people
    add_foreign_key :tfc_listing_people, :listings

  end
end

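However, I didn’t notice that the people.id column is a varchar with an explicit collation: `id` varchar(22) COLLATE utf8_unicode_ci NOT NULL. The new person_id column picked up a different default collation (utf8_general_ci, per the error below).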

This led to the following error popping up in one of the non production environments:

2018-02-27T17:10:05.277434+00:00 app[web.1]: App 132 stdout: ActionView::Template::Error (Mysql2::Error: Illegal mix of collations (utf8_unicode_ci,IMPLICIT) and (utf8_general_ci,IMPLICIT) for operation '=': SELECT COUNT(*) FROM `people` INNER JOIN `tfc_listing_people` ON `people`.`id` = `tfc_listing_people`.`person_id` WHERE `tfc_listing_people`.`listing_id` = 42):

I was able to fix this with the following alter statement (from this SO post): ALTER TABLE `tfc_listing_people` CHANGE `person_id` `person_id` VARCHAR( 22 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL.

But in other environments, there was no runtime error. There was, however, a partially failed migration, which had been masked by other test failures and by process failures around a team handoff. The create table statement had succeeded, but the add_foreign_key :tfc_listing_people, :people migration had failed.

I ran this migration statement a few times (pointer on how to do that): ActiveRecord::Migration.add_foreign_key :tfc_listing_people, :people and, via this SO answer, I was able to find the latest foreign key error message:

2018-03-06 13:23:29 0x2b1565330700 Error in foreign key constraint of table sharetribe_production/#sql-2c93_4a44d:
 FOREIGN KEY (person_id)  REFERENCES people (id): Cannot find an index in the referenced table where the
referenced columns appear as the first columns, or column types in the table and the referenced table do not match for constraint. Note that the internal storage type of ENUM and SET changed in tables created with >= InnoDB-4.1.12, and such columns in old tables cannot be referenced by such columns in new tables.
Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-foreign-key-constraints.html for correct foreign key definition.

So, again, just running the alter statement to change the collation of the tfc_listing_people table worked fine. However, while I could handcraft the fix on both staging and production and did so, I needed a way to have this change captured in a migration or two. I split apart the first migration into two migrations. The first created the tfc_listing_people table, and the second looked like this:

class ModifyTfcListingPeople < ActiveRecord::Migration
  def up
    # Match the collation of people.id so the foreign key can be created.
    execute <<-SQL
      ALTER TABLE `tfc_listing_people` CHANGE `person_id` `person_id` VARCHAR( 22 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL
    SQL

    add_foreign_key :tfc_listing_people, :people
    add_foreign_key :tfc_listing_people, :listings
  end

  def down
    # Drop the foreign keys added in `up`; the column change is left in place.
    remove_foreign_key :tfc_listing_people, :people
    remove_foreign_key :tfc_listing_people, :listings
  end
end

Because I’d hand crafted the fixes on staging and production, I manually inserted a value for this migration into the schema_migrations table to indicate that the migration had been run in those environments. If I hadn’t had two related but different migration actions, I might not have had to go through these manual gyrations.

My lessons from this episode:

  • pay close attention to any errors and failed tests, no matter how innocuous. This is a variation of the “broken windows theory”
  • break migrations into small pieces, which are easier to debug and to migrate back and forth
  • knowing SQL and having an understanding of how database migrations work (they are cool, but they aren’t magic, and sometimes they leak) was crucial to debugging this issue

Speed up development by catching your mail locally

Have you ever been developing some kind of application that sends email? You need to test how the email looks, so you have to have access to an external SMTP server and you have to configure your application to use that. You can definitely set up sendgrid or another MTA to send email from your local computer and then use a real email address as your target. However, then to develop this portion of the application you need to be online.

Another option that I’ve found is the Mailcatcher gem. This is a small ruby program that you can easily configure as your SMTP endpoint. Then when your development environment sends mail, mailcatcher catches it. Then you can visit a URL on your local computer and view received emails. As soon as mailcatcher shuts down, the emails are lost, however.

Even though this is a ruby gem, you can use the app with different languages–as long as you can configure the application to point to an SMTP server, you’re good (in the readme, there are examples for Django and PHP).

One note about it being a gem. Don’t put it in your Gemfile if you are building a rails app, because of possible conflicts. This means that if you manage your ruby environments via rvm you’ll need to re-install mailcatcher every time you change your ruby version.

Bonus: mailcatcher even has an API so you can use it in your integration test environment to verify that certain actions in your application caused certain emails to be sent.
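
As a rough sketch of both halves (sending mail to mailcatcher and reading it back through the API), assuming mailcatcher is running with its documented defaults of SMTP on port 1025 and HTTP on port 1080. The helper names and the addresses are hypothetical.

```ruby
require "json"
require "net/http"

# MailCatcher's documented defaults; adjust if you start it with other options.
MAILCATCHER_HOST = "127.0.0.1"
SMTP_PORT = 1025
HTTP_PORT = 1080

# Builds a minimal plain-text message. Addresses are placeholders.
def build_message(from:, to:, subject:, body:)
  [
    "From: #{from}",
    "To: #{to}",
    "Subject: #{subject}",
    "",
    body,
  ].join("\r\n")
end

# Sends a message through the local MailCatcher SMTP endpoint.
def deliver(message, from:, to:)
  require "net/smtp" # loaded lazily; ships as a bundled gem on recent Rubies
  Net::SMTP.start(MAILCATCHER_HOST, SMTP_PORT) do |smtp|
    smtp.send_message(message, from, to)
  end
end

# Fetches the list of caught messages (parsed JSON) from the HTTP API.
def caught_messages
  uri = URI("http://#{MAILCATCHER_HOST}:#{HTTP_PORT}/messages")
  JSON.parse(Net::HTTP.get(uri))
end

# Example use, with mailcatcher running locally:
#   msg = build_message(from: "app@example.test", to: "user@example.test",
#                       subject: "Welcome", body: "Hello!")
#   deliver(msg, from: "app@example.test", to: "user@example.test")
#   puts caught_messages.length
```

In an integration test, the same `caught_messages` call lets you assert that an action in your application produced the email you expected.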

Serverless Framework

I had coffee with an acquaintance who is doing a lot of event driven data processing. Whereas ten years ago to tackle this problem you might use an ETL tool like Pentaho or Talend, now his process runs entirely on AWS Lambda functions. He is leveraging the Serverless framework to manage and deploy these applications. As I understand it there is a thin shim layer between the business logic and the lambda event handler, but the business logic is isolated and knows nothing about its environment. That makes the business logic very testable.
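
Here is a sketch of what that separation can look like, written in Ruby for consistency with the rest of this blog (I don’t know what language my acquaintance uses). The `handler(event:, context:)` signature is what AWS Lambda’s Ruby runtime calls. The `summarize_orders` function and the event shape, a JSON string under a `body` key, are assumptions for illustration.

```ruby
require "json"

# Business logic: plain Ruby with no knowledge of Lambda.
# The order record shape is hypothetical.
def summarize_orders(orders)
  { count: orders.length, total_cents: orders.sum { |o| o.fetch("amount_cents") } }
end

# Thin shim: unpack the event, call the business logic, shape the response.
def handler(event:, context:)
  orders = JSON.parse(event.fetch("body")).fetch("orders")
  { statusCode: 200, body: JSON.generate(summarize_orders(orders)) }
rescue KeyError, JSON::ParserError => e
  { statusCode: 400, body: JSON.generate({ error: e.message }) }
end
```

Because `summarize_orders` takes and returns plain data, it can be unit tested without any AWS tooling. Only the few lines in `handler` depend on the environment.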

His description of the Serverless framework intrigued me. As he described it, the framework is driven by a simple yaml file and takes care of, among other tasks, the complicated infrastructure set up to tie Lambda functions to a variety of AWS events. I haven’t done it myself, but I’ve heard that setting up a lambda to API Gateway link is a real bear. Doing so allows a lambda function to respond to web requests without any AWS authentication, and is a key use case.

You can write and deploy lambda functions in any language that AWS Lambda supports (unfortunately, not java 9 at the moment). Here’s a java/maven/serverless tutorial. It also supports multiple cloud providers, though I haven’t done much beyond note that the documentation exists.

However, using Serverless does require writing code. If I were evaluating a complicated ETL process which non-developers needed to be able to understand and support, Serverless would not be a good fit. I’m not aware of any abstraction layers on top of it, though I guess you could run, for example, Pentaho Kettle jobs within lambda. There’s also an issue around cold start times–when your code hasn’t been invoked for a while, it can take longer to start up when a request or event occurs. Apparently there are partial solutions, but your lambdas still get cycled every few hours regardless.

I worked through some of the tutorials and was impressed at just how easy it was to get started. If I had a simple API or data processing pipeline to build, Serverless would definitely be on my short list of possible implementation options. It is very inexpensive, scales easily and encourages encapsulation.

Incidentally, my acquaintance’s company is hosting a lunch and learn on this technology at the end of the month. More details here.

The power of automated testing

It took me a long time to understand the power of automated testing.  After all, it can end up being a large portion of your codebase and can be brittle.  Sometimes it feels like writing tests “gets in the way” of getting things done.  At one project I worked on, a colleague complained that it felt like you spent 5 minutes changing the production code and an hour changing the tests.  (And to be fair, sometimes that’s true, and there’s a balance to be struck between test code coverage and speed of development.  This can also indicate you need to spend time refactoring your tests, as you have multiple different test components testing the same production code.)

I like to think of tests like a gentle swaddling of your code.  It conforms as the body of your code changes, but changing that code does require some re-work of the tests.  And, if your code fails, it fails into the gentle swaddling, as opposed to the cruel outside world (bleeding all over your production users).  Alright, maybe the analogy fails :).

I write this today because I’m in the middle of a refactor of one of the scariest bits of The Food Corridor.  (Given we’re so young, it’s not that scary, but it’s quite complex–handling the creation and updating of bookings.)  There are many many paths through the code and if I didn’t have automated testing, I’d be far more worried about the changes I’m making.

So, consider this blog post to be a thank you to past me for making future me’s life easier by writing a comprehensive automated test suite.  If you don’t have one, you should.

Leverage

As a software developer, and especially as a senior (expensive) software developer, you need leverage. Leverage makes you more productive.  I find it also makes the job more fun.

Some forms of leverage:

  • test suite. A suite provides leverage both by serving as a living form of documentation (allowing others to understand the code) and as a regression suite, so that changes to underlying code can be made with assurance that external code behavior doesn’t change.
  • libraries and frameworks, like Rails. By solving common problems, libraries, open source or not, can accelerate the building of your product. Depending on the maturity of the library or framework, they may cover edge cases that you would have to discover via user feedback.
  • iaas solutions, like AWS EC2. By giving you IT infrastructure that you can manipulate via software, you can apply software engineering techniques to ensure validity of your infrastructure and make deployments replicable.
  • paas solutions, like Heroku. These may force your application to conform to certain limitations, but take a whole host of operations tasks off your plate (deployments, patching servers). When a new bug comes out affecting Nginx, you don’t have to spend time checking your servers–your provider does. When you reach a certain scale, those limitations may come back to bite you, but in the early days of a project or company, having operations tasks off your plate allows you to focus on business logic.
  • saas solutions, like Google Apps or Delighted. You can have an entire business solution available for a monthly fee. These can be large in scope, like Google Apps, or small in scope, like Delighted, but either way they solve an entire business problem. You can trade time for money.
  • experience, aka the mistakes you’ve made on someone else’s dime. This allows you leverage by pruning the universe of possibilities for solving problems, based on what’s worked in the past. You don’t spend time doing exploration or spikes. Note that experience may guide you toward or away from any of the points of leverage mentioned above. And that experience needs to be tempered with learning, as the software world changes.
  • team. There’s only so much software you can write yourself. A team can help, both in terms of executing against a software design/architecture and improving it via their own experience.

Leverage allows you to be more productive and the more experienced you get, the more you should seek it.