Obstacles to building high availability software systems

I saw a discussion on a slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Note that in any mention of high availability systems would be remiss if I didn’t mention the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations of a system that will hinder high availability fall into two categories.

The first category is actions that you don’t take, but could take:

single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
missing or incomplete automation: if you need human beings to resurrect failed parts of your system, it will meaningful amounts of time and will be error prone.
failing to build in elasticity and scalability of resources: when usage increases, new resources should be automatically brought online. Failure to do so will impact system performance and that could impact system availability
missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, this can hinder building the system.
tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
data models and authority requirements: poorly performing data models can impact performance. If your application requires certain operations must be from the source of record (permissions checks, for example) then a poorly performing source data model can impact performance which can impact availability.
latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
cost: high availability systems, because they have no single points of failure, cost more.

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!

Obstacles to building high availability software systems

One thought on “Obstacles to building high availability software systems”

Letters to a New Developer

Pages

Subscribe

Socials

Categories

Archives