Load Testing Weirdness With AWS Aurora

So I was doing a load test and saw behavior that reminded me that sometimes you just need to test.

I ran a test at 1500 requests/second with multiple servers (20ish) and with a smaller number of bigger servers (2-3). I saw some weird behavior, including a number of 500-level errors (bad gateway). I didn’t see these errors under a lower load.

I looked at the database (an Aurora cluster with a single read and a single write instance) and saw that it was maxed out (CPU pegged, connections at max, couldn’t even connect at times).

I thought I needed to upgrade the database, so I upgraded the write instance. It was late and I failed to notice that the upgrade flipped the read and the write instances. So now the read instance was at the bigger server size and the write instance was at the smaller (original) server size. Then I re-ran the load test and everything went swimmingly (response time under 500 ms, where before it had spiked to 100 seconds or more).

Great, problem solved. The larger instance size solved it.

But wait, it didn’t. The app was connecting to the primary endpoint, which is the master write node. I didn’t believe it, so I double checked and matched test times against connection spikes to the db.
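A minimal sketch of that kind of double check, assuming you already have a response shaped like the one the RDS DescribeDBClusters API returns (for example from boto3's `describe_db_clusters`). The function name and the cluster and instance identifiers are hypothetical, made up for illustration.

```python
def find_writer(describe_db_clusters_response, cluster_id):
    """Return the instance identifier of the cluster's current writer, or None.

    Expects a dict shaped like the RDS DescribeDBClusters response, where each
    cluster lists its members and flags the writer with IsClusterWriter.
    """
    for cluster in describe_db_clusters_response.get("DBClusters", []):
        if cluster.get("DBClusterIdentifier") != cluster_id:
            continue
        for member in cluster.get("DBClusterMembers", []):
            if member.get("IsClusterWriter"):
                return member["DBInstanceIdentifier"]
    return None


# Hypothetical response for a two-instance cluster after a failover.
sample = {
    "DBClusters": [
        {
            "DBClusterIdentifier": "loadtest-cluster",
            "DBClusterMembers": [
                {"DBInstanceIdentifier": "loadtest-instance-1", "IsClusterWriter": False},
                {"DBInstanceIdentifier": "loadtest-instance-2", "IsClusterWriter": True},
            ],
        }
    ]
}

print(find_writer(sample, "loadtest-cluster"))
```

Comparing the writer's identifier and instance size before and after an upgrade would have shown the flip right away.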

So somehow, flipping the database to have a different primary Aurora instance (but no change in db size) caused a radical change in system behavior under heavyish load for a distributed PHP application.

Mysteries.


Using AWS for load testing experimentation

The cloud is amazing for load testing your system. If you design your system to be behind a load balancer (which, in many applications, means pushing state to a database and having stateless compute nodes), you can easily switch out those nodes in different scenarios.

I just load tested a system I’m working on and changing out the compute nodes was fairly easy. Once I’d built a number of servers (something I scripted partially but didn’t fully automate because the return wasn’t there) and troubleshot some horizontal scaling issues that popped up in the application, I was able to:

  • take a server out of service behind the load balancer
  • stop it
  • change the instance type
  • start it
  • re-run any needed config changes on the server
  • update DNS if needed (depending on if you have a pinned IP address or not)
  • add it back to the load balancer

Swap out a few instances and you have a new setup for your load test. When you are done, follow the process in reverse to save yourself some money.
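The steps above can be sketched as a single function. This is a rough outline under assumptions, not the script used for the test described here: it assumes the servers are EC2 instances registered by instance ID in an Application or Network Load Balancer target group, and that `ec2` and `elbv2` are boto3 clients passed in by the caller. The function name and arguments are hypothetical. Config changes and DNS updates are left to the caller because they depend on the application.

```python
def swap_instance_type(ec2, elbv2, target_group_arn, instance_id, new_type):
    """Resize one instance behind a load balancer.

    ec2 and elbv2 are expected to be boto3 clients, e.g. boto3.client("ec2")
    and boto3.client("elbv2"). They are parameters so the function can be
    exercised with stand-ins.
    """
    target = [{"Id": instance_id}]

    # 1. Take the server out of service behind the load balancer.
    elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=target)

    # 2. Stop it and wait until it is fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 3. Change the instance type (only allowed while stopped).
    ec2.modify_instance_attribute(
        InstanceId=instance_id, InstanceType={"Value": new_type}
    )

    # 4. Start it and wait until it is running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # 5. Re-run config changes and update DNS here if your setup needs it.

    # 6. Add it back to the load balancer.
    elbv2.register_targets(TargetGroupArn=target_group_arn, Targets=target)


# Example call (requires boto3 and AWS credentials; values are placeholders):
# import boto3
# swap_instance_type(boto3.client("ec2"), boto3.client("elbv2"),
#                    "arn:aws:elasticloadbalancing:...", "i-0123456789abcdef0",
#                    "c5.2xlarge")
```

Calling it once per instance gives you the new setup, and calling it again with the original instance type reverses the change when the test is over.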

Incidentally, increasing the number or size of compute nodes didn’t have the desired effect of being able to handle more load.

What turned out to be the root issue? The database was pegged, both in terms of CPU and connections. Just goes to show that when you’re load testing, you really need to be looking at different aspects of the system, thinking about where your bottlenecks are, and using the scientific method of hypothesis, experiment, result.


Follow the money, cloud edition

No, not that kind of cloud

This post was really eye opening and lets you know who the real players in the public cloud space are. I especially enjoyed the metric of capex as a percent of revenue. From the post:

As I keep repeating, CAPEX is both a prerequisite to play in the big boy cloud and confirmation of customer success. Both IBM and Oracle are tens of billions of dollars in cloud infrastructure CAPEX behind Amazon, Google, and Microsoft. Oracle’s spending has at least ticked up, but their spending is not enough to keep pace, much less to have any hope of catching up to the infrastructure of the big three.

The whole post is worth reading if you are interested in public cloud providers in any way.


Farewell Boulder New Tech Meetup

I, along with many others, received this email last week:

Participating and watching the Colorado tech community evolve has been an amazing experience. Over the past 12 years we have had so many people engage and support our efforts. This includes the attendee’s, presenters, organizers, and sponsors. Give yourselves a big hand, IMO you are the reason Colorado has such a vibrant startup ecosystem.

I’m saddened, but it is time for me step away and stop organizing/hosting the Boulder chapter of New Tech Colorado. I’ve attempted to find a replacement over the past year, but no one has stepped up. I think thats ok, we have many other pitch events happening throughout the front range, including other New Tech events.

The Boulder New Tech, starting with the June event will no longer accept reservations and I’m going to shut down BDNT.org.

Thank you for giving me your attention and for sharing your experiences over the past 12 years. See you around town.

It was from Robert Reich, the moving force behind the New Tech Meetups here in Colorado. After over a decade, there will be no more Boulder meetups (though it looks like other cities are going strong, at least from the meetup page). I totally understand where Robert is coming from. I’ve been to many of these meetups, but over the last couple of years attendance was definitely down. However, every time I attended I met interesting people and saw a different slice of the Boulder ecosystem. I will say that it seemed like BDNT was a welcoming initial introduction to that ecosystem, but once a newcomer understood the landscape, I think they were better served by a more focused meetup. I know Robert experimented with a number of different formats and concepts–I hope he writes them all down for future meetup organizers.

I also had the opportunity to speak a few times at BDNT. Once I presented on GWT, which was my first experience talking to a group of over one hundred people (note to self, don’t present a technology at a pitch night 🙂 ). I also spoke in Denver with Brian Timoney–that was fun because of the 3D Google Earth submarine navigation demo and because Josh Fraser met with us and gave us some tips. And in 2016 I presented the startup of which I was a co-founder. Each time the community was very supportive and helpful.

I want to thank a few folks:

  • Robert and all the other organizers over the years.
  • The speakers, who made the night interesting every time I went. I never left without learning something new.
  • The hosts. I know the event was hosted for many years by Silicon Flatirons, but I also attended events at Galvanize Boulder. I also appreciated the snacks!
  • The community. Always supportive and present.

Institutions don’t have to live forever (especially those that survive on the efforts of volunteers). It’s OK for them to end. I will miss the BDNT event, but I know the community of support for entrepreneurship in Boulder and the front range is larger than ever.

Fare thee well, Boulder New Tech.


Greenfield webapp data storage decision tree

Choices choices

I saw a discussion about storing data in one of my Slack channels and saw a line too good to leave in Slack.

Here’s a data storage decision tree for 95% of applications. Do you have data to durably store?

1. Use an open source relational database

(The original poster specified PostgreSQL, but MySQL/MariaDB are viable alternatives in my mind. Each has different strengths.)

The modern, open source RDBMS is flexible and scalable (even web scale [warning, video has cursing]). It’s free from licensing fees (though you can pay for support). There are turnkey solutions for managing it in the cloud. There are plenty of developers who know how to use it (even more developers know how to use SQL) and many DBAs and sysadmins who know how to tune it. You can scale it out by using read replicas and scale up by buying better hardware (or VMs). Every language has a library that talks to an RDBMS. The database will maintain data integrity. Many tools that non-technical users favor can connect to it (even Excel, if you install the right ODBC driver).

There are plenty of other solutions out there (filesystem, NoSQL variants, XML databases, data lakes). They exist for a reason. For certain problems and at certain scale they are better solutions than the Swiss army knife of an RDBMS. But the default decision should always be an RDBMS, and the onus should be on the other solution to justify its presence. For 95% of problems, your default should be MySQL/MariaDB or Postgres.


Fixing the RubyGems “Too Many Requests 429” error

A server on which I am working runs this command: /usr/bin/gem install --no-rdoc --no-ri aws-sdk to get the aws-sdk. I was seeing this error message:

Error: Execution of '/usr/bin/gem install --no-rdoc --no-ri aws-sdk' returned 1: ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    bad response Too Many Requests 429 (https://api.rubygems.org/api/v1/dependencies?gems=aws-sdk-elasticloadbalancingv2)

Every time I ran it I’d see a different gem that triggered the 429 response. There wasn’t much out there when searching, other than a note that I should update to a new version of bundler (which I wasn’t using).

Finally, I figured out how to get past this. What I did was manually run /usr/bin/gem install -f --no-rdoc --no-ri aws-sdk multiple times, and each time the command would get a little further. Finally all the dependencies had been downloaded. Then I was able to run it without the -f switch after that.
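The manual re-running can be wrapped in a small retry loop. This is only a sketch of the idea: the function name, the attempt limit, and the pause between attempts are my assumptions (the pause is an addition that was not part of the manual fix above), and the `run` and `sleep` parameters exist only so the loop can be tried out without actually calling gem.

```python
import subprocess
import time


def install_with_retries(cmd, max_attempts=10, delay_seconds=5.0,
                         run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0; return the number of attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Assumed pause to give the rate limit a chance to reset.
            sleep(delay_seconds)
    raise RuntimeError(f"{cmd!r} still failing after {max_attempts} attempts")


# Example call, using the same command as above (not executed here):
# install_with_retries(
#     ["/usr/bin/gem", "install", "-f", "--no-rdoc", "--no-ri", "aws-sdk"]
# )
```

Each attempt picks up where the previous one left off because gems that were already downloaded stay installed, which matches the behavior of getting a little further each time.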


Obstacles to building high availability software systems

Is your system available?

I saw a discussion on a Slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Note that I would be remiss in any discussion of high availability systems if I didn’t mention the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations of a system that will hinder high availability fall into two categories.

The first category is actions that you don’t take, but could take:

  • single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
  • missing or incomplete automation: if you need human beings to resurrect failed parts of your system, it will take meaningful amounts of time and will be error prone.
  • failing to build in elasticity and scalability of resources: when usage increases, new resources should be automatically brought online. Failure to do so will impact system performance and that could impact system availability.
  • missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
  • application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

  • data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, which can hinder building it.
  • tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
  • data models and authority requirements: poorly performing data models can impact performance. If your application requires that certain operations go to the source of record (permissions checks, for example), then a poorly performing source data model can impact performance, which can impact availability.
  • latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
  • cost: high availability systems, because they have no single points of failure, cost more.
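To put rough numbers on the single point of failure and cost items, here is a small worked example. It relies on the standard simplification that components fail independently, which real systems often violate, and the 99% figures are made up for illustration. A request that needs every component in a chain succeeds only if all of them are up, so availabilities multiply; redundant copies of a component fail only if all copies are down at once.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain where every component must be up."""
    return prod(components)


def redundant_availability(single, copies):
    """Availability of N interchangeable copies where any one is enough."""
    return 1 - (1 - single) ** copies


# Hypothetical system: one web tier and one database, each 99% available.
single_of_each = serial_availability([0.99, 0.99])

# Same system with two copies of each tier (twice the hardware to pay for).
two_of_each = serial_availability(
    [redundant_availability(0.99, 2), redundant_availability(0.99, 2)]
)

print(f"one of each: {single_of_each:.4%}")
print(f"two of each: {two_of_each:.4%}")
```

Under these assumptions, one of each lands at about 98%, and doubling every tier gets to roughly 99.98%, at about double the infrastructure cost.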

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!


Hipster Hosting at BSW, Tomorrow Only

Lady with computer mouse

She doesn’t look like she needs hosting, does she?

I’m doing a short presentation with a few other people at Boulder Startup Week on hosting. Tomorrow, Thur, at 10am MT.

Would love to see you there. Feel free to heckle.

If you can’t make it, here is the salient point of my presentation: startups are hard, so you should host your code and infrastructure at the highest level of abstraction that you can, so that your developers can focus on delivering business value through new features rather than doing ops. In practice, prefer hosting options in this order:

  • serverless
  • platform specific hosting (WP Engine, etc)
  • general purpose PaaS (Heroku, Elastic Beanstalk)
  • cloud VMs
  • colo
  • server in the closet

Of course, all advice is context dependent. My advice is aimed at small startups, and the more flexibility your developers need around aspects of technology, the lower on the list you’ll have to go.

Anyway, looking forward to a good discussion.


Trust the compiler

I loved this post not because I love reading assembler but because it just illustrates so perfectly how often, when writing software, we can easily stand on the shoulders of giants.

My point is not that we should take what we’ve learnt from the LLVM-generated code and write a new version of our hand-rolled assembly. The point is that optimising compilers are really good. There are very smart people working on them and computers are really good at this kind of optimisation problem (in the mathematic sense) in a way that humans find quite difficult. It’s the job of language designers to give us the tools we need to inform the optimiser as best we can as to what our true intent is, and larger integer sizes are another step towards that.


Boulder Startup Week Beginneth!

Boulder Startup Week is this week. If you haven’t been, it’s a great opportunity for a number of reasons. You can get a flavor of the Boulder tech community (though it’s worth remembering that there are numerous firms that don’t play in the startup world that are in Boulder). You can learn a lot about startups from folks who are actively building one, or have built one in the past. You can learn about new technologies and trends that are up and coming, including data science, blockchain and cannabis. And you can meet a lot of great folks.

I’m a bit burned out on startups at the moment, but am still planning to attend a few sessions, mostly on the development track. I’m especially excited for the Boulder Ruby Meetup on Wednesday, where experts will speak about interviewing. I’m also speaking at a session on hosting.

My tips for Boulder Startup Week:

  • go to at least one session in a different area of focus than you normally would.
  • arrive 5-10 minutes early and plan to stay 5-10 minutes after. Use this time to chat with folks. (This is hard for me, but I’ve found that having a canned opening line like “what interesting talks have you seen” or “is this your first time at BSW” is a good way to break the ice.)
  • the above tip will prevent you from trying to attend too many sessions back to back to back. This is a Good Thing(tm).
  • bring business cards, or prepare to exchange emails.
  • thank a volunteer and/or sponsor when you see them. There’s a tremendous amount of effort that goes into this week.
  • be prepared to help someone you meet out, with an intro, feedback on an idea, or even just an interesting article.
  • if a session is full, I’d get on the waitlist and then I’d show up anyway. Because every session is free, I’ve found that oftentimes folks are … over committed and there’s often space for other people.

If your interest has been piqued, please check out the schedule. Hope to see you out there.



© Moore Consulting, 2003-2017 +