Obstacles to building high availability software systems

Open sign

Is your system available?

I saw a discussion on a slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Note that in any mention of high availability systems would be remiss if I didn’t mention the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations of a system that will hinder high availability fall into two categories.

The first category is actions that you don’t take, but could take:

  • single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
  • missing or incomplete automation: if you need human beings to resurrect failed parts of your system, it will meaningful amounts of time and will be error prone.
  • failing to build in elasticity and scalability of resources: when usage increases, new resources should be automatically brought online. Failure to do so will impact system performance and that could impact system availability
  • missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
  • application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

  • data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, this can hinder building the system.
  • tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
  • data models and authority requirements: poorly performing data models can impact performance. If your application requires certain operations must be from the source of record (permissions checks, for example) then a poorly performing source data model can impact performance which can impact availability.
  • latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
  • cost: high availability systems, because they have no single points of failure, cost more.

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!


Hipster Hosting at BSW, Tomorrow Only

Lady with computer mouse

She doesn’t look like she needs hosting, does she?

I’m doing a short presentation with a few other people at Boulder Startup Week on hosting. Tomorrow, Thur, at 10am MT.

Would love to see you there. Feel free to heckle.

If you can’t make it, here is the salient point of my presentation: startups are hard, so you should host your code and infrastructure at the highest level of abstraction that you can, so that your developers can focus on delivering business value through new features rather than doing ops. In practice, prefer hosting options in this order:

  • serverless
  • platform specific hosting (wpengine, etc)
  • general purpose PAAS (heroku, elastic beanstalk)
  • cloud VMs
  • colo
  • server in the closet

Of course, all advice is context dependent; my advice is aimed at small startups and the more flexibility your developers need around aspects of technology the lower on the list you’ll have to go.

Anyway, looking forward to a good discussion.



Imposter syndrome

This article resonated with me. I became familiar with imposter syndrome when my SO spoke on it several times (she’s available to speak to your group if you’d like).

When you are deep in a discipline, it can be very easy to “know what you don’t know” and downplay your expertise. I often am asked to support desktop computers because I work in software (a la this post). But I know how little I know about the problem.

I think the issue is also exacerbated by the continuous flow of information that we are all offered by the internet. This makes it very easy to compare ourselves with what other folks choose to share (typically, though not always, their best side and successes). This makes me, I will be honest, feel inadequate. Why didn’t I learn more about k8s? Why haven’t I built a successful saas business? Why haven’t I worked at scale like that? Why haven’t I built a react native app? And so on and so on.

And when someone asks me “can you do that?” I always have that moment of fear and have to force myself to say yes.

My answer is to breathe, take chances, remember that failure is an option, and recall that while we see other people’s successes, we rarely see their failures. It isn’t fair to me to compare my “inside” with someone else’s “outside”.


Qualifying “leads” with two simple questions

Toddler girl

Every “lead” started out as one of these.

I use the word “lead” carefully, because every lead is actually a person with desires and hopes and dreams and fears. And it’s worth humanizing them.

But, a “lead” is also a prospect for business. When I ran my consulting company, I was always happy to take coffee because you never knew what could turn up. However, I enjoyed this medium post about how Seamus qualifies leads for his consulting business by asking two simple questions. I also like that he’s explicit about projects that aren’t a fit. It’s hard and scary to niche and yet so worthwhile.

From the post:

I can’t control how I’m introduced to people or how OTL Ventures has been described. So I have found it helpful to be upfront about what OTL Ventures does. This also gives the person who wants to meet with me an opportunity to self-select out of the meeting if they aren’t a good fit. I’ve been doing this by including my answer to the same two questions in my response. It only seems fair.

When you think about it, having this kind of prep conversation is good for both sides. It makes everyone think about what kind of value they bring and can get from a meeting.


Navigating new systems

A mazeHere are some tips and tricks I have for navigating new software systems, which can sometimes be like navigating a maze. If you’re truly unlucky, it’s a maze, but you’re blindfolded and the walls are covered in randomly placed razors.

The first is to get a clear set of expectations. Will I own the system? Who owns it now? How long have they owned it? How often is it modified? When will it need to be modified again? Is it shaky or stable? Getting these questions answered helps me understand and refine further steps.

The next step is to gain access. There are a lot of different pieces of most modern systems, so access can mean different things. Here are some kinds of access which it may be worth seeking:

  • shell
  • version control
  • database
  • ftp
  • http
  • app level (admin, user)
  • documentation
  • project planning
  • different CI environments (prod, UA, staging)
  • build system
  • admin users
  • end users

After I have access, I like to look at the front end and the back end. By the front end, I mean the user interface. And by the back end I mean the data store. Just looking around and seeing what tables and pages an application has can help.

If the system in question is not entirely custom built, googling for the user guide for the default version of the application can be helpful. Finding that user guide and skimming through it can give me more high level understanding, as well as teaching me key nomenclature. Of course if there is any local documentation, that’s helpful too, but I read that with a skeptical eye, as it doesn’t always keep pace with the system.

I also like to look at logfiles. This can help me determine something as simple as if I’m on the correct server (if I reload the page and the access log file doesn’t change, I am looking at the wrong log file or am on the the wrong server). Even better if the system aggregates logs into something like an ELK system or papertrail.

Setting up a local development environment can help. Again, this lets me gain an understanding of the big picture components, and also lets me poke at various parts of a system, possibly breaking them, without affecting other developers or, worse, customers.

Asking questions is really important, but this can be hard because often the folks with the most knowhow are the busiest.

I also like to see what files or database tables change as I move through the system. With a modest sized database, I do this by taking a database dump before taking some action and then after. Then I diff the files, sometimes using sed to break the dump file apart even further (replacing all commas with commas and newlines, for example). If using mysqldump, you can target individual tables and make sure not to use extended inserts, as that makes diffing harder.

For the filesystem, it’s even easier. I touch a file (ts) and then take the action, then run find . -newer ts -print. This command will show me all the files the system has written that are newer than ts.

Hopefully some of these tips will be helpful to you as you navigate your next new system.


Big hammer or small hammer?

HammerI was talking to a friend the other day about startup vs big company life and he used an analogy so good I’m going to steal it (and expand upon it).

If you think about the problem you are trying to solve as a rock, and the business you are in that is trying to solve as a hammer with which to chisel or otherwise transform said rock, you can choose a brick hammer (which is a small hammer) or a sledge hammer (a large, heavy hammer), or anything in between.

The smaller the hammer, the more effort you have to put into the swing. However, it’s fairly easy to pick up, to manipulate and to re-orient if you decide you need to approach the rock from a different viewpoint.

If you, on the other hand, choose the sledgehammer, then when you swing you are wielding a lot of force. It becomes easier to make progress on your initial approach, but if you need to switch up your emphasis, it’s going to take some time, because of the weight of the hammer.

The larger the business, the more leverage and power you have to attack a single problem. I’ve worked at large companies in the past, and I can tell you the size and scope of problems they were able to work on, often in parallel, were amazing. However, there was a lot of time and effort spent on coordinating those efforts, and and a lot of bureaucracy and red tape if there was a process improvement needed. (There was also dead weight at some of these companies.)

At a smaller company or a startup, as I’ve worked at in the past, I didn’t have the bandwidth to take on multiple large projects. Doing more than one or two major projects was a recipe for distraction and impotence. However, when focusing on one effort, it was easy to try different approaches, work really hard and be super flexible when incorporating feedback from customers and iterating.

There are strengths in both the small and big hammer approaches. The important thing is to choose what is a good fit for both the problem you are trying to solve and your working style (which may change over time).


AWS Quick Starts

StopwatchIf you are looking to stand up an application quickly, I often recommend the AWS marketplace. This service has thousands of vendor maintained solutions and is a great way to get going quickly. Note that some of the solutions have extra per hour charges, and if that is the case per second billing won’t apply. These solutions are focused on individual AWS EC2 instance images (so you can quickly stand up a phpbb instance or a redmine server, for example).

However, another good option is AWS Quick Starts. These are recipes for deployments, possibly of multiple virtual machines, and are aimed at handling larger business problems. There are over 80 listed on the Quick Start page right now, ranging from creating a data lake to a HIPAA reference architecture to running devops tools like consul and bitbucket. These solutions may or may not carry additional charges, so make sure review licensing and billing information as well as functionality.

If you are thinking about setting up a complex system in AWS, it’s worth some time to see if someone has put a reference Quick Start together. It may not fit your needs perfectly, but can be a good place to begin.


How would you build a simple CRUD app in 2018?

Interesting discussion on HN about how you’d build a simple CRUD app in 2018. (CRUD stands for Create, Read, Update and Delete of records in a system.) As you’d expect, no shortage of opinions, with a lot of specifics. Also lots of recommendations to use what you know, which is always a good idea. CRUD apps, where you are simply gathering data via a web interface, run the gamut of complexity and usability, but it’s hard to believe how powerful having a centralized repository of data can be.

And the nice thing is that once you get the data into the backed of  a CRUD app (which is likely a SQL compatible database, but could be any other kind of datastore) you can either combine that data with other systems, expose the data via an API, or both.

Here was my answer to the question:

I’d use Rails today. But I can see the value in Django or Express or ASP.Net or Spring Boot.

 

Criteria I’d think about:

 

What is going to be easy for my organization to support (both operationally and with future code changes).

 

What do I know/want to learn? If the CRUD app is complicated, then I want a tech I know well. If the CRUD app is simple, then I may want to experiment with a different technology (again, within the organizational support guardrails).

 


Look for the value of the param, not its existence

We are rewriting the front end of the application on which I work. Several of the forms which gather data that drives the application previously had checkboxes, which would set a value to true or false in the database.

Some of the checkboxes were ancillary, however, and didn’t result in a database value. Instead, they were used to calculate a different result depending on whether the checkbox was checked or not. That result would then be saved to the database. This was all tested via rspec controller tests.

In at least one case, the calculation depended on the existence of the parameter. Checkbox parameters are sent when they are checked and omitted when they are not.

Then we changed the forms to radio buttons.

Uh oh.

Instead of sending the parameter if the box was checked yes or not sending the parameter if it was checked no, we sent yes if the radio button was checked and no if it was not.

Uh oh.

The tests continued to pass, because they hadn’t been updated to the new values.

Uh oh.

So the calculation was incorrect. Which meant that the functionality that depended on the calculation was incorrect. Unfortunately, that functionality was related to billing. So I just spent the last few days backing that out, checking with clients to see what they actually intended to do, and in general fixing the issue.

The underlying code was an easy enough fix, but this story just goes to show that software is complicated. Lessons learned:

  • favor a test for an exact parameter value rather than parameter existence
  • when changing forms, consider writing tests with the new input values
  • integration tests that drive a browser would have caught this issue. These are much slower so need to be used judiciously, but are still a valuable automated test

 



© Moore Consulting, 2003-2017 +