Terraform with multiple workspaces and environments

I was recently setting up a couple of AWS environments for a client. This client had a typical web application which talked to an RDS database; there was DNS, a CDN and other components involved. We wanted to use Terraform to maintain traceability and replicability, and to have the same configuration for production and staging, with perhaps small differences like EC2 instance size. We also wanted to separate the components into their own Terraform workspaces to limit the blast radius (so that if changes to one component caused issues or corrupted the Terraform state, the others wouldn’t be affected). Finally, we wanted each environment to have its own Terraform backend, again to keep the environments separate.

I wasn’t able to complete this project due to external factors (I left the position before testing could be completed), but I wanted to share the concepts. Obviously I can’t share the working code, but I set up a simpler example project, and that’s the project I’ll be examining in this post. I also want to be clear that while I’ve tested this as much as I could and have validated the ideas with others who have more Terraform experience, this hasn’t been run in production. You have been warned. (Here are the Terraform docs about setting up modules, workspaces and repositories.)

Using a tool like Terraform is great for a number of reasons, but my favorite is that it lets you track changes to cloud infrastructure. More than once I’ve wandered into an AWS account and wondered why certain resources were set up in the way they were, and what might break if I changed them. There are occasionally comments, but it is far better to examine a commit. Even better to review the set of commits and see the customer request or bug tied to it. (Bonus link: learn more about Terraform and other cloudy tools in this podcast episode with the creator of Terraform.)

So this simpler example project has a lambda that writes to an SQS queue. For now, it just writes the date of invocation, but obviously you could have it reach out to an external API, read from a database, or do some kind of calculation. The SQS queue could then be read from by an EC2 instance, which processes the message and perhaps updates a database. You have three components of the system:

  • The lambda function
  • The SQS queue
  • The EC2 instance (implementation of which is left as an exercise for the reader)

The SQS queue is shared infrastructure and needs to be accessed by both of the other systems. However, the SQS system doesn’t need to know about either the lambda or the EC2 instance. Using Terraform, we can create each of these components as its own workspace. Each of the subsidiary systems can evolve or change (for instance, the EC2 instance could be replaced with an autoscaling group) with minimal impact on the others. They could also be managed by different teams if that made sense.

To enforce this separation, set up each component as a separate Terraform workspace. (All code is on github here.) I use remote state so that more than one person can manage the Terraform state, and use the S3/DynamoDB backend because we are targeting AWS and want a free, scalable solution. This post assumes you know how to set up Terraform using S3/DynamoDB as remote state storage.
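For context, the SQS workspace itself can be very small. Here’s a minimal sketch of what the queue definition might look like; the queue name and the use of an env_indicator variable are my assumptions, not code pulled from the example project:

resource "aws_sqs_queue" "myqueue" {
  # hypothetical queue name; the env_indicator variable is covered later in this post
  name = "myqueue-${var.env_indicator}"
}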

Here’s the outputs of the SQS system:

output "queue_url" {
  value = "${aws_sqs_queue.myqueue.id}"
}

output "queue_arn" {
  value = "${aws_sqs_queue.myqueue.arn}"
}

I explicitly define the output variables so I can pull them in from the lambda and EC2 workspaces. This is how you can do that.

...
data "terraform_remote_state" "sqs" {
  backend = "s3"
  config = {
    bucket = "${var.terraform_bucket}"
    key = "sqs/terraform.tfstate"
    encrypt = true
    dynamodb_table = "terraform-remote-state-locks"
    profile = "${var.aws_profile}"
    region = "us-east-2"
  }
}
...
resource "aws_lambda_function" "mylambda" {
...
  environment {
    variables = {
      sqs_url = "${data.terraform_remote_state.sqs.outputs.queue_url}"
    }
  }
}

The terraform_remote_state block defines the location of the previously defined sqs workspace, and the ${data.terraform_remote_state.sqs.outputs.queue_url} references that url. That is then injected as an environment variable into the lambda, which reads it and uses the url to create an SQS client. It can then post whatever message it wants.

You can see how this would work with any number of configuration parameters. If you have a typical three-tier, database-driven application with a separate caching layer, you can create each of these major components in its own workspace and inject the values into either the environment (for lambda) or the user data (for EC2). I’m not sure I’d use this with a microservices architecture, because a service registry might be more appropriate there.
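For instance, injecting the queue URL into an EC2 instance via user data might look something like this sketch (the worker resource and the ami_id/instance_type variables are hypothetical, not from the example project):

resource "aws_instance" "worker" {
  ami           = "${var.ami_id}"
  instance_type = "${var.instance_type}"

  # write the shared queue URL somewhere the application can read it at boot
  user_data = <<EOF
#!/bin/bash
echo "SQS_URL=${data.terraform_remote_state.sqs.outputs.queue_url}" >> /etc/environment
EOF
}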

Note that the lambda component has a rudimentary lambda function (you have to define something). It also uses Terraform to deploy the lambda code. That’s fine for the toy example, but for production you will want to use a real CI/CD system to deploy your lambdas.

Now, suppose you want to run production and staging environments, because you are ready to launch. Here are the constraints you’d want:

  • Production and staging run the same config (except when staging is changing, of course)
  • Production and staging may differ in a few details (the size of the EC2 instance, for example)
  • Production and staging execute in different AWS accounts to limit access and issues. You don’t want an error in staging to affect production. This is handled by creating different profiles which have access to different accounts.
  • Production and staging execute in different Terraform backends for the same reason as the separate AWS accounts.

Staging and production can use the same git repository, but when pulled down they are kept in two places on the filesystem. This is because you need to specify the profile and the bucket when using terraform init. So you end up running something like these two commands:

git clone git@github.com:mooreds/terraform-remote-state-example.git # staging
git clone git@github.com:mooreds/terraform-remote-state-example.git production-terraform-remote-state-example # production

I set up the project so that staging can be managed by normal Terraform commands (since that will happen more often), while production uses either special incantations or a script. Initializing the production Terraform environment looks like:

terraform init -backend-config="profile=trsproduction" -backend-config="bucket=mooreds-terraform-remote-state-example-production"

For staging, it’s just terraform init. I didn’t have much luck switching between these two Terraform backends in the same filesystem location, so having two trees was a straightforward workaround.

Each value that differs between production and staging is pulled out into a variable, with the staging value as the default. Then each workspace has a script which applies the Terraform configuration to the production environment, setting the variables to the correct production values. Here’s an example for the lambda workspace:

terraform apply -var aws_profile=trsproduction -var terraform_bucket="mooreds-terraform-remote-state-example-production" -var env_indicator="production" -var lambda_memory_size=256

We pass in the production terraform_bucket in case any references need to be made to the remote state (to pull in the SQS queue url, for example). We also pass in an increased lambda memory size because, hey, it’s production. Other things that might vary between environments include VPC or subnet IDs, API endpoints, and S3 bucket names.
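Concretely, the variable declarations for the lambda workspace might look like this sketch; the staging bucket name and the 128 MB memory default are guesses on my part, not values from the project:

variable "aws_profile" {
  default = "trsstaging"
}

variable "terraform_bucket" {
  # hypothetical staging bucket name
  default = "mooreds-terraform-remote-state-example"
}

variable "env_indicator" {
  default = "staging"
}

variable "lambda_memory_size" {
  # assumed staging value; the production script passes 256
  default = 128
}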

For simplicity, we just use two profiles for staging and production (in ~/.aws/credentials), but any way of getting credentials that works with Terraform will work:

[trsstaging]
aws_access_key_id = ...
aws_secret_access_key = ...

[trsproduction]
aws_access_key_id = ...
aws_secret_access_key = ...

This lets us separate out who has production access. Some users can have both staging and production profiles (perhaps operations), and others can have only staging profiles (perhaps developers). You can pass region values in via variables as well.

Using this system, the workflow for a change would be:

  • Check out the terraform git repository
  • Create a feature branch (including an issue identifier)
  • Pull request and approval
  • Run terraform apply to apply to staging
  • Run any additional tests
  • Merge to master
  • Run prodapply.sh
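For reference, prodapply.sh can be a thin wrapper around the apply command shown earlier. Here’s a sketch of what it might look like for the lambda workspace (the actual script isn’t shown in this post):

#!/bin/sh
# apply this workspace's configuration to the production environment;
# the values mirror the apply command shown earlier
terraform apply \
  -var aws_profile=trsproduction \
  -var terraform_bucket="mooreds-terraform-remote-state-example-production" \
  -var env_indicator="production" \
  -var lambda_memory_size=256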

Again, I want to be clear that I’ve implemented this partially, but I didn’t get a chance to run this fully in production. I tested all these concepts with the simple system mentioned above (and you can stand up your own using the code on github). There will be issues that I haven’t experienced. But I hope that this post helps illuminate the complexity of managing multiple workspaces and environments within a single Terraform github repository.


Ever felt like your codebase was out of control?

I certainly have. A couple of times in my career the combination of technical debt, business model shift and lack of time for a proper fix have left me feeling out of control.

But reading this post on Hacker News made me realize that it all could have been so so much worse. A couple of “best ofs”:

To give you some examples, I originally came on as a contractor because they had some refactoring they wanted done. The entire system was home built (including the programming language) and there was a file size limit of 32,767 lines. They had many functions that were approaching this limit and they didn’t know what to do, so they hired me.

and:

Once upon a time, there was a search product and one of the data sources that it could search was a Solr/Lucene database. This should be no problem, since search is what Solr does. It should be as simple as passing the user’s query through to Solr and then reading the response. The problem was, it was important to know exactly which parts of any matched records were relevant to the search.


The Guy Before Me™ decided that the best way to implement this would be to split the user’s search into individual words, perform a separate search query through Solr’s HTTP API for each individual word, and then do a bunch of very clever and complex post-processing on the result sets to combine them into a single set of results.

and (last one, I promise):

At my first gig I teamed up with a guy responsible for a gigantic monolith written in Lua. Originally, the project started as a little script running in Nginx. Over the course of several years, it organically grew to epic proportions, by consuming and replacing every piece of software that it interfaced with – including Nginx.


There were two ingredients in the recipe for disaster. The first is that Lua comes “batteries excluded”: the standard library is minimalist and the community and set of available packages out there is small. That’s typically not an issue, as long as one uses Lua in the intended way: small scripts that extend existing programs with custom user logic (e.g. Nginx, Vim, World of Warcraft). The second is that Lua is a dynamic language: it’s dynamically typed, and practically everything can be overridden, monkey patched and hacked, down to the fundamental iterators that allow you to traverse data structures.

shivers. There, but for the grace of God.


Easily extracting conversations from a slack group

Slack is an amazing productivity tool when used correctly. One of the primary uses I’ve seen is for open source projects to provide support (Craft CMS, OG-AWS) or for communities to be built (Techfriends, Denver Devs). If you don’t have the luxury of the owner of your slack being Slack’s VP of engineering, the costs of $x/month/user can cause these types of slacks to remain on the free plan.

Which means that you are limited to the last 10k messages.

And that’s fine for the vast majority of messages. Sometimes, however, a discussion is so good that it deserves to be indexed and shared, which means it needs to be pulled out of the Slack walled garden and onto the web (I also wrote about how to do this with the Facebook Group walled garden last year). Sometimes you might just want to save it beyond the 10k message limit for your own selfish reasons.

You can of course do this extraction manually (I did so here and here). But that’s a lot of work.

Another option is to use Zapier. The Slack integration is trivial to set up and has a number of options. From there you can push to a Google spreadsheet (if you want to do further reification) or directly to WordPress (or any of the other integrations).

The nice part is that the Zapier Slack integration gives you a variety of options that can trigger publishing a message to a spreadsheet:

  • a post of a public message in a specific channel
  • a post of a public message in any channel
  • starring of a message by you
  • attachment of a certain reaction emoji (I picked a floppy disk) to a message, no matter who adds the emoji

I’ve just started doing this but am excited to have a low friction way to pull high value conversations out of slack. Slack is great for synchronous communication and easy discussion. When real knowledge drops, it should be shared with the future and anyone who can type into a search box. Do make sure to let folks know because there may be some expectation of privacy that you’ll want to respect.


Obstacles to building high availability software systems


Is your system available?

I saw a discussion on a slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Any mention of high availability systems would be remiss if it didn’t include the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations that will hinder a system’s high availability fall into two categories.

The first category is actions that you don’t take, but could take:

  • single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
  • missing or incomplete automation: if you need human beings to resurrect failed parts of your system, recovery will take meaningful amounts of time and will be error prone.
  • failing to build in elasticity and scalability of resources: when usage increases, new resources should be brought online automatically. Failure to do so will impact system performance, and that could impact system availability.
  • missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
  • application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

  • data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, which can hinder building it.
  • tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
  • data models and authority requirements: if your application requires that certain operations go through the source of record (permissions checks, for example), then a poorly performing source data model can impact performance, which in turn can impact availability.
  • latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
  • cost: high availability systems, because they have no single points of failure, cost more.

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!


Hipster Hosting at BSW, Tomorrow Only


She doesn’t look like she needs hosting, does she?

I’m doing a short presentation with a few other people at Boulder Startup Week on hosting. Tomorrow, Thur, at 10am MT.

Would love to see you there. Feel free to heckle.

If you can’t make it, here is the salient point of my presentation: startups are hard, so you should host your code and infrastructure at the highest level of abstraction that you can, so that your developers can focus on delivering business value through new features rather than doing ops. In practice, prefer hosting options in this order:

  • serverless
  • platform specific hosting (wpengine, etc)
  • general purpose PAAS (heroku, elastic beanstalk)
  • cloud VMs
  • colo
  • server in the closet

Of course, all advice is context dependent; my advice is aimed at small startups, and the more flexibility your developers need around aspects of technology, the lower on the list you’ll have to go.

Anyway, looking forward to a good discussion.



Imposter syndrome

This article resonated with me. I became familiar with imposter syndrome when my SO spoke on it several times (she’s available to speak to your group if you’d like).

When you are deep in a discipline, it can be very easy to “know what you don’t know” and downplay your expertise. I often am asked to support desktop computers because I work in software (a la this post). But I know how little I know about the problem.

I think the issue is also exacerbated by the continuous flow of information that we are all offered by the internet. This makes it very easy to compare ourselves with what other folks choose to share (typically, though not always, their best side and successes). This makes me, I will be honest, feel inadequate. Why didn’t I learn more about k8s? Why haven’t I built a successful saas business? Why haven’t I worked at scale like that? Why haven’t I built a react native app? And so on and so on.

And when someone asks me “can you do that?” I always have that moment of fear and have to force myself to say yes.

My answer is to breathe, take chances, remember that failure is an option, and recall that while we see other people’s successes, we rarely see their failures. It isn’t fair to me to compare my “inside” with someone else’s “outside”.


Qualifying “leads” with two simple questions


Every “lead” started out as one of these.

I use the word “lead” carefully, because every lead is actually a person with desires and hopes and dreams and fears. And it’s worth humanizing them.

But, a “lead” is also a prospect for business. When I ran my consulting company, I was always happy to take coffee because you never knew what could turn up. However, I enjoyed this medium post about how Seamus qualifies leads for his consulting business by asking two simple questions. I also like that he’s explicit about projects that aren’t a fit. It’s hard and scary to niche and yet so worthwhile.

From the post:

I can’t control how I’m introduced to people or how OTL Ventures has been described. So I have found it helpful to be upfront about what OTL Ventures does. This also gives the person who wants to meet with me an opportunity to self-select out of the meeting if they aren’t a good fit. I’ve been doing this by including my answer to the same two questions in my response. It only seems fair.

When you think about it, having this kind of prep conversation is good for both sides. It makes everyone think about what kind of value they bring and can get from a meeting.


Navigating new systems

Here are some tips and tricks I have for navigating new software systems, which can sometimes be like navigating a maze. If you’re truly unlucky, it’s a maze, but you’re blindfolded and the walls are covered in randomly placed razors.

The first is to get a clear set of expectations. Will I own the system? Who owns it now? How long have they owned it? How often is it modified? When will it need to be modified again? Is it shaky or stable? Getting these questions answered helps me understand and refine further steps.

The next step is to gain access. There are a lot of different pieces of most modern systems, so access can mean different things. Here are some kinds of access which it may be worth seeking:

  • shell
  • version control
  • database
  • ftp
  • http
  • app level (admin, user)
  • documentation
  • project planning
  • different CI environments (prod, UA, staging)
  • build system
  • admin users
  • end users

After I have access, I like to look at the front end and the back end. By the front end, I mean the user interface. And by the back end I mean the data store. Just looking around and seeing what tables and pages an application has can help.

If the system in question is not entirely custom built, googling for the user guide for the default version of the application can be helpful. Finding that user guide and skimming through it can give me more high level understanding, as well as teaching me key nomenclature. Of course if there is any local documentation, that’s helpful too, but I read that with a skeptical eye, as it doesn’t always keep pace with the system.

I also like to look at logfiles. This can help me determine something as simple as whether I’m on the correct server (if I reload the page and the access log file doesn’t change, I am looking at the wrong log file or am on the wrong server). Even better if the system aggregates logs into something like an ELK stack or Papertrail.

Setting up a local development environment can help. Again, this lets me gain an understanding of the big picture components, and also lets me poke at various parts of a system, possibly breaking them, without affecting other developers or, worse, customers.

Asking questions is really important, but this can be hard because often the folks with the most knowhow are the busiest.

I also like to see what files or database tables change as I move through the system. With a modest sized database, I do this by taking a database dump before taking some action and another after. Then I diff the files, sometimes using sed to break the dump file apart even further (replacing all commas with commas and newlines, for example). If using mysqldump, target individual tables and make sure not to use extended inserts, as extended inserts make diffing harder.
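Here’s a sketch of that dump-and-diff workflow, assuming a MySQL database named mydb and a table of interest named mytable (both names are hypothetical):

mysqldump --skip-extended-insert mydb mytable > before.sql
# ... take the action in the application ...
mysqldump --skip-extended-insert mydb mytable > after.sql
# optionally split rows at commas so the diff is easier to read (GNU sed)
sed 's/,/,\n/g' before.sql > before-split.sql
sed 's/,/,\n/g' after.sql > after-split.sql
diff before-split.sql after-split.sql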

For the filesystem, it’s even easier. I touch a file (ts) and then take the action, then run find . -newer ts -print. This command will show me all the files the system has written that are newer than ts.

Hopefully some of these tips will be helpful to you as you navigate your next new system.


Big hammer or small hammer?

I was talking to a friend the other day about startup vs big company life and he used an analogy so good I’m going to steal it (and expand upon it).

If you think about the problem you are trying to solve as a rock, and the business you are in as a hammer with which to chisel or otherwise transform said rock, you can choose a brick hammer (a small hammer), a sledgehammer (a large, heavy hammer), or anything in between.

The smaller the hammer, the more effort you have to put into the swing. However, it’s fairly easy to pick up, to manipulate and to re-orient if you decide you need to approach the rock from a different viewpoint.

If you, on the other hand, choose the sledgehammer, then when you swing you are wielding a lot of force. It becomes easier to make progress on your initial approach, but if you need to switch up your emphasis, it’s going to take some time, because of the weight of the hammer.

The larger the business, the more leverage and power you have to attack a single problem. I’ve worked at large companies in the past, and I can tell you the size and scope of the problems they were able to work on, often in parallel, were amazing. However, there was a lot of time and effort spent on coordinating those efforts, and a lot of bureaucracy and red tape if a process improvement was needed. (There was also dead weight at some of these companies.)

At the smaller companies and startups where I’ve worked, we didn’t have the bandwidth to take on multiple large projects. Doing more than one or two major projects was a recipe for distraction and impotence. However, when focusing on one effort, it was easy to try different approaches, work really hard, and be super flexible when incorporating feedback from customers and iterating.

There are strengths in both the small and big hammer approaches. The important thing is to choose what is a good fit for both the problem you are trying to solve and your working style (which may change over time).


