Software infrastructure configuration options

I ran across this great article when I was reading up on Terraform.

It does a good job of running through the options (puppet, cloudformation, etc) on how to set up your infrastructure via software. Here’s a great quote on why they chose Terraform:

On the other hand, with the kind of declarative approach used in Terraform, the code always represents the latest state of your infrastructure. At a glance, you can tell what’s currently deployed and how it’s configured, without having to worry about history or timing. This also makes it easy to create reusable code, as you don’t have to manually account for the current state of the world. Instead, you just focus on describing your desired state, and Terraform figures out how to get from one state to the other automatically.
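
To make the declarative idea concrete, here’s a toy sketch of the "diff the desired state against the current state" step. This is not how Terraform is implemented; the `plan` function and the resource names are made up for illustration.

```python
def plan(current: dict, desired: dict) -> list:
    """Return the actions needed to move `current` to `desired`."""
    actions = []
    # Anything in the desired state that is missing or different needs work.
    for name, config in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != config:
            actions.append(("update", name))
    # Anything that exists but is no longer described gets removed.
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical resources, purely for illustration.
current = {"web": {"size": "small"}, "old-queue": {"timeout": 30}}
desired = {"web": {"size": "large"}, "db": {"size": "medium"}}
print(plan(current, desired))
```

The point is that you only ever edit `desired`; the history of how `current` came to be doesn’t matter.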


Serverless Framework

I had coffee with an acquaintance who is doing a lot of event driven data processing. Whereas ten years ago to tackle this problem you might use an ETL tool like Pentaho or Talend, now his process runs entirely on AWS Lambda functions. He is leveraging the Serverless framework to manage and deploy these applications. As I understand it there is a thin shim layer between the business logic and the lambda event handler, but the business logic is isolated and knows nothing about its environment. That makes the business logic very testable.
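
Here’s a minimal sketch of what that shim layer might look like in Python. I haven’t seen his code, so the function names and the order data are hypothetical, and the event shape (a JSON string under a `body` key) is an assumption about how the function is triggered.

```python
import json


def summarize_order(order: dict) -> dict:
    """Pure business logic: no AWS imports, no knowledge of its environment."""
    total = sum(item["price"] * item["quantity"] for item in order["items"])
    return {"order_id": order["id"], "total": total}


def handler(event, context):
    """Thin shim: unpack the event, call the logic, wrap the result."""
    # Assumption: the triggering event carries a JSON string under "body".
    order = json.loads(event["body"])
    return {"statusCode": 200, "body": json.dumps(summarize_order(order))}
```

Because `summarize_order` only takes and returns plain dictionaries, you can unit test it without deploying anything or mocking any AWS service.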

His description of the Serverless framework intrigued me. As he described it, the framework is driven by a simple yaml file and takes care of, among other tasks, the complicated infrastructure setup needed to tie Lambda functions to a variety of AWS events. I haven’t done it myself, but I’ve heard that setting up a lambda to API Gateway link is a real bear. Doing so allows a lambda function to respond to web requests without any AWS authentication, and is a key use case.

You can write and deploy lambda functions in any language that AWS Lambda supports (unfortunately, not java 9 at the moment). Here’s a java/maven/serverless tutorial. It also supports multiple cloud providers, though I haven’t done much beyond note that the documentation exists.

However, using Serverless does require writing code. If you are evaluating options for a complicated ETL process that non developers need to be able to understand and support, Serverless would not be a good fit. I’m not aware of any abstraction layers on top of it, though I guess you could run, for example, Pentaho Kettle jobs within lambda. There’s also an issue around cold start times–when your code hasn’t been invoked for a while, it can take longer to start up when a request or event occurs. Apparently there are partial solutions, but your lambdas still get cycled every few hours regardless.

I worked through some of the tutorials and was impressed at just how easy it was to get started. If I had a simple API or data processing pipeline to build, Serverless would definitely be on my short list of possible implementation options. It is very inexpensive, scales easily and encourages encapsulation.

Incidentally, my acquaintance’s company is hosting a lunch and learn on this technology at the end of the month. More details here.


“The future is already here, but it’s only available as a managed AWS service”

This entire post about how Kubernetes could become the distributed operating system of choice is worth reading.  But one statement really struck me:

Well, as they say, the future is already here, but it’s only available as an AWS managed service.

The “they” in this is apparently not William Gibson, as I thought.  More details here.

For the past couple of years the cloud providers have matured and moved from offering infrastructure as a service (disk, compute) to platform as a service offerings (sqs, which is a managed message queue like activemq, or kinesis, a managed data ingestion system like kafka, etc).  Whenever you think about installing a proprietary or open source package, you should include the cloud provider offerings in your evaluation matrix.  Of course, the features you need may not be there, or the cost may be prohibitive, but including them in an evaluation makes sense because of the speed of deployment and the scaling available.

If you think a system architecture can benefit from a message queuing system, do you want to spend time setting up and maintaining such a system, or do you want to spin up an SQS queue in a few minutes?

And the cost may not be prohibitive, depending on the skillset of your internal team and your team’s desire to run such plumbing services.  It can be really hard to estimate running costs of infrastructure services, though you can estimate it by looking at internal teams and seeing similar services they run and how much money it takes.  The nice thing about cloud services is that the costs are very transparent.  The kinesis data streams pricing example walks through a scenario and concludes:

For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.
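
The arithmetic behind that quote is easy to check. The $1.68 and 4MB per second figures come straight from the quoted example; the 1024 MB per GB conversion is my assumption about how they got to 337GB.

```python
MB_PER_SECOND = 4            # from the quoted pricing example
COST_PER_DAY = 1.68          # dollars, from the quoted pricing example
SECONDS_PER_DAY = 24 * 60 * 60

mb_per_day = MB_PER_SECOND * SECONDS_PER_DAY   # 345,600 MB
gb_per_day = mb_per_day / 1024                 # 337.5 GB, assuming 1024 MB per GB
cost_per_gb = COST_PER_DAY / gb_per_day        # roughly half a cent per GB

print(f"{gb_per_day} GB/day at ${cost_per_gb:.4f} per GB ingested")
```

That kind of back of the envelope number is much harder to produce for a queueing or streaming system you run yourself.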

Another AWS instructor made the point that AWS and other cloud services invert the running costs of IT infrastructure.  In a typical enterprise, the running costs of your data center and infrastructure are like an iceberg–10% is explicit (server costs, electricity, etc) and 90% is implicit (payroll, time spent upgrading and integrating systems).  In the cloud world those numbers are reversed and far more of your infrastructure cost is explicitly laid out for you and your organization.

Truly the future.


The UNIX and Linux System Administration Handbook

I’ve been reading the UNIX and Linux System Administration Handbook.  It’s a real tome, with about 1500 pages.  It’s got five authors and some great cartoons, and covers everything from shell scripts to disk to email to system management daemons (check out the table of contents).  No one should ever read this book cover to cover.  That would be just silly.

I’ve been really enjoying picking and choosing chapters to read, however.  The sheer breadth of this book means that anyone with an interest in modern software development can find something useful in it.

Given my interest in AWS, I read all the sections about cloud computing.  These were high level and not super interesting to me, but I think they’d be great if you were a novice about cloud computing, and they did have a great survey of the major public cloud providers and when it made sense to use each of them.

Then I moved on to the networking sections.  I honestly can say that I didn’t understand fundamental routing protocols before I read that section.  This is obviously closer to the heart of system administration, and the authors did a great job with concepts and hands on knowledge of networking.

After that I moved on to containers.  Did you know that Docker is the new hotness?  I had heard of it, but didn’t understand why.  Now I do.  It’s hot for much the same reason as the ‘fat jar’ deployment is preferred in java land.  Having one single artifact that rolls up code and dependencies is a way to simplify deployments of production code, including rollbacks.  The authors focus on the fundamentals of containers, primarily Docker, but they also cover various orchestration layers like Mesos and Kubernetes.

I’m now in the middle of a chapter about continuous integration and continuous deployment, where they are discussing the concepts as well as Jenkins, one of the key technologies (see, I told you everyone could find something in this book).  After that, I look forward to reading about configuration management.

If you work in software at all and are involved in production systems, you’ll be able to find something in the UNIX and Linux System Administration Handbook (and if you aren’t, I’d be interested in knowing who owns that responsibility).


Restoring a single table from an Amazon RDS backup

When you use SQL, how do you write delete statements at the database prompt?

A delete statement typically looks like this: delete from table_name where column_name = 'foo';. I usually write it in this order:

  1. delete
  2. delete where column_name = 'foo';
  3. delete from table_name where column_name = 'foo';

Even though this is a pain because you have to move back and forth (I really need to look into vi keybindings for mysql), it prevents you from sending this command by accident: delete from table_name; which deletes all the data in your table.  (Another alternative is to never use the interactive client and always write out your delete statements in a file and run that file to delete data.)
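
If you do script your deletes, another guard is to run the statement inside a transaction and roll back when it touches more rows than you expected. Here’s a minimal sketch of that idea. It uses Python’s built in sqlite3 module as a stand-in for MySQL so it can run anywhere, and the `invoices` table and `guarded_delete` helper are hypothetical.

```python
import sqlite3


def guarded_delete(conn, sql, params, max_rows):
    """Run a DELETE, but roll back if it touches more rows than expected."""
    cursor = conn.execute(sql, params)
    deleted = cursor.rowcount
    if deleted > max_rows:
        conn.rollback()
        raise RuntimeError(f"refusing to delete {deleted} rows (limit {max_rows})")
    conn.commit()
    return deleted


# Hypothetical table standing in for the billing data.
conn = sqlite3.connect(":memory:")
conn.execute("create table invoices (id integer primary key, status text)")
conn.executemany("insert into invoices (status) values (?)",
                 [("paid",), ("paid",), ("void",)])
conn.commit()

# Expecting to remove a single row; a missing where clause would trip the guard.
print(guarded_delete(conn, "delete from invoices where status = ?", ("void",), 1))
```

This doesn’t help at the interactive prompt, but it makes the "oops, no where clause" case fail loudly instead of silently.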

But, recently, I did exactly that, because I forgot.  I deleted all the data from one table in our production database.  It was billing data, so rather important.  Luckily, I am using Amazon RDS and had set up backup retention.

I wanted to outline what I did to recover from this.

  • I took a deep breath.
  • I wrote a message on the slack channel documenting what had happened and the possible customer impact.
  • Depending on which data is removed, it’s possible you will want to put the application in maintenance mode and/or inform your customers of the issues.  What I deleted was used rarely enough that I didn’t have to take these steps.
  • I looked at how to restore an Amazon RDS backup.
  • I restored the missing data.
  • I communicated that things were back to normal to internal stakeholders.

Unfortunately, it wasn’t clear how to restore a single table.  I’m used to being able to download a .sql file and hand edit it, but that’s not an option.  Stackoverflow wasn’t super helpful.  But if there’s any time you want clarity, it’s when you are restoring production data.  You don’t want to compound the problem by screwing up something else.

So, here’s how to restore a single table from an Amazon RDS backup:

  • Note the time just before you deleted the data.  (Another reason the slack message is nice.  chatops ftw.)
  • Start up another instance from that moment.  I named it something obvious like ‘has-data-from-tablename’.
  • Twiddle your thumbs anxiously while the new instance starts up.
  • The instance is put into your default security group (as of this writing) which probably doesn’t allow mysql access.  Make sure you modify this security group to allow access.
  • When the instance is up, do a dump of the table you need: mysqldump -t --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h has-data-from-tablename.c1m7x25w24qor.us-east-1.rds.amazonaws.com -P3306 database_name tablename > restore-table_name.sql; (-t omits the create database/table statements.)
  • If your table has had writes since you deleted everything, you may need to manually pull down the current data from the production system and merge it into restore-table_name.sql; I was able to avoid this step.
  • Load the data using mysql: mysql --ssl-ca=./amazon-rds-ca-cert.pem -u user -ppassword -h production.c1m7x25w24qor.us-east-1.rds.amazonaws.com -P3306 database_name < restore-table_name.sql;
  • Review to make sure the data is correct.
  • Test the application.
  • Update the slack channel, and do any other notifications you need to (customers, internal contacts, etc).
  • Revoke the default security group access you allowed above.
  • Delete the ‘has-data-from-tablename’ instance.
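
If you’d rather not retype those two commands under stress, the dump and load steps can be sketched as a small script. This is only a sketch under a few assumptions: the helper names are mine, the hosts you pass in are whatever your restored and production instances are called, and I use a bare -p so the clients prompt for the password instead of putting it on the command line as I did above.

```python
import subprocess

CA_CERT = "./amazon-rds-ca-cert.pem"


def dump_command(host, user, database, table):
    """argv for mysqldump of one table; -t omits the create statements."""
    return ["mysqldump", "-t", f"--ssl-ca={CA_CERT}", "-u", user, "-p",
            "-h", host, "-P", "3306", database, table]


def load_command(host, user, database):
    """argv for the mysql client, which will read the dump on stdin."""
    return ["mysql", f"--ssl-ca={CA_CERT}", "-u", user, "-p",
            "-h", host, "-P", "3306", database]


def dump_table(host, user, database, table, dump_path):
    """Write the table's rows from the restored instance to dump_path."""
    with open(dump_path, "w") as out:
        subprocess.run(dump_command(host, user, database, table),
                       stdout=out, check=True)


def load_table(host, user, database, dump_path):
    """Replay the reviewed dump file against the production instance."""
    with open(dump_path) as sql:
        subprocess.run(load_command(host, user, database),
                       stdin=sql, check=True)
```

I kept the dump and the load as separate functions on purpose, so there is a natural pause to review (and, if needed, merge) the .sql file before anything touches production.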

Note this only works if you caught your mistake within the backup retention window. (Make sure you set that up.)  We aren’t multi AZ or clustered, so I’m not sure how that would affect things.

Happy deep breathing!


Bare minimum of ops tasks for heroku

Awesome, you are a CTO or founding engineer of a newborn startup.  You have a web app up on Heroku and someone is paying you money for it!  Nice job.

Now, you need to think about supporting it.  Heroku makes things way easier (no racking and stacking, no purchasing hardware, no configuring apache) but you still need to set up some operations.

Here is the bare minimum you need to do to make sure you can sleep at night.  (Based on a couple of years of heroku projects, and being really really cheap.)

  • Have a staging environment
    • You don’t want to push code direct to prod, do you?
    • This can be a free dyno, depending on the complexity of your app.
    • Pipelines are nice, as is preboot.
    • Cost: free
  • Have a one line deploy.
    • Or, if you like CD/CI, an automatic deploy or a one click deploy.  But make it really easy to deploy.
    • Have a deploy script that goes straight to production for emergencies.
    • Cost: free
  •  Backups
    • User data.  If you aren’t using a shared object store like S3, make sure you are doing a backup.
    • Database.  Both heroku postgresql and amazon RDS have point and click solutions.  All you have to do is set them up.  (Test them, at least once.)
    • Cost: freeish, depending on the solution.  But, user data is worth spending money on.
  • Alerting
    • Heroku has options if you are running professional dynos.
    • Uptimerobot is a great free third party service that will check ports every 5 minutes and has a variety of alert options.  If you want SMS, you have to pay for it, but it’s not outrageous.
    • Cost: free
  • Logging
    • Use a logging framework (like slf4j or the rails logger), and mark error conditions with a string that will be easy to search for.
    • Yes, you can use heroku logs but having a log management solution like papertrail will make you much happier.  Plus, it’s free for 2 days of logfiles.
    • Set up alerts with papertrail as well.  These can be more granular.
    • Cost: free
  • Create a list of third party dependencies.
    • Sign up for status alerts from these.  If you have pro slack, you can have them push an email to a channel.  If you don’t, create an alias that receives them.  You want to be the person that tells your clients about outages, not the other way around.
    • Cost: free
  • Communication
    • Internal
      • a devops_alert slack channel is my preferred solution.  All deploys and other alerts go there.
    • External
      • create a mailing list for your clients so you can inform them of issues easily.  Google groups is fine, but use whatever other folks are using.  Don’t use an alias in your email–you’ll forget to add new clients.
      • do not use this mailing list for marketing purposes, unless you want to offload the burden of keeping the list up to date to the marketing department.
      • do make sure when you gain or lose clients you keep this up to date
    • Run through a disaster in your mind and make notes on how you would communicate the issue, both internally and externally.  How often do you update your team?  How often do you update your clients?  What about an internal issue (some of your code screwed up) vs an external issue.  This doesn’t need to be exhaustive, but thinking about it ahead of time and making some notes will help you in the crisis.
    • Cost: free
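
For the logging item above, here’s a minimal sketch of what "mark error conditions with a searchable string" can look like, using Python’s standard logging module. The marker text, logger name and `log_failure` helper are all hypothetical; pick whatever string is unlikely to show up in normal output.

```python
import logging

# Hypothetical marker: any string that is easy to search and alert on.
ERROR_MARKER = "APP_ERROR"

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)
logger = logging.getLogger("billing")


def log_failure(logger, operation, exc):
    """Log a failure with the marker first so a saved search can match it."""
    logger.error("%s operation=%s error=%r", ERROR_MARKER, operation, exc)


try:
    raise ValueError("card declined")
except ValueError as exc:
    log_failure(logger, "charge_card", exc)
```

A saved search or alert on that one marker then catches every error path, no matter which part of the app logged it.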

All of this is probably a four hour project, max.

But once this is done, you’ll rest easier at night, knowing you have what you need to troubleshoot and recover from production issues.


Heroku drains

So, I’ve learned a lot more than I wanted to about heroku drains. These are sinks to which heroku applications can write their logs.  After the logs are out of heroku, you analyze these logs just as you would in any other application living outside of a PaaS.  Logs are very useful to see long term trends, debug, etc.  (I’ve worked on both a rails3 app and a java spring/camel app that deploy to heroku.)

Here are some things I’ve learned:

  • Heroku drains are well documented.
  • You definitely want them for any production application, because only 1500 lines of heroku logs are retained at any one time.
  • They can go to either syslog (great for applications with a lot of other infrastructure) or https (great for applications without as much infrastructure support).
  • They can’t do any kind of authorization.
  • You can’t know what ip address the logs are coming from, so you can’t limit access by IP.
  • There are third party extensions you can pay for to avoid dealing with drains at all (I’ve heard good things about papertrail.)
  • You can use logstash to pull heroku logs from a syslog drain into elastic search.
  • There are numerous github projects that can drain to databases, etc.  There’s even one that, with echos of Ouroboros, drains to another heroku app.
  • Drains have intelligent behavior if your listener (or listeners) fails.  From heroku support: “The short answer is yes, the drain will drop logs when the sink is not responsive, but this isn’t really the full story. There are a number of undocumented limits and backoff retries that happen when a drain connection is lost.”  And then they go on to explain how the backoff behaviour happens.  I’m not going to cut and paste their entire answer because I assume it is undocumented for a reason (maybe it changes, maybe they don’t want to commit to supporting this behavior).  Ask them yourself 🙂
  • A simple drain can be as easy as <?php error_log(file_get_contents('php://input'), 3, "/var/log/logfile.log"); ?>, but make sure you rotate that log file.
  • You can use puppet to manage drains if you are bringing servers up and down, using the heroku toolbelt and CLI authentication.
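
For comparison with the PHP one-liner above, here’s roughly the same "append whatever is POSTed to a file" drain sketched with Python’s standard library. Treat it as a sketch: the port is arbitrary, the log path is the one from the PHP example, and I’m assuming something else in front of it (a load balancer or proxy) terminates https, since this server only speaks plain http.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "/var/log/logfile.log"  # same file as the PHP example; rotate it


def append_log(body: bytes, path: str = LOG_PATH) -> None:
    """Append the raw request body to the log file."""
    with open(path, "ab") as log_file:
        log_file.write(body)


class DrainHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client says it sent.
        length = int(self.headers.get("Content-Length", 0))
        append_log(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen forever; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), DrainHandler).serve_forever()
```

Calling `serve()` starts the listener. Like the PHP version, it does no authorization at all, which matches the limitation noted in the list above.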

If you are deploying anything beyond a toy app on heroku, don’t forget the ops folks and make sure you set up your drain!


Masterless puppet and CloudFormation

I’ve had some experience with CloudFormation in the past, and recently gained some puppet expertise.  I thought it’d be great to combine the two, working on a new project to set up the ELK stack for a client.

Basically, we are creating an ec2 instance (or a number of them) from a vanilla image using a CloudFormation template, doing a small amount of initialization via the UserData section and then using puppet to configure them further.  However, puppet is used in a masterless context, where the intelligence (of knowing which machine should be configured which way) isn’t in the manifest file, but rather in the code that checks out the modules and manifests. Here’s a great example of a project set up to use masterless puppet.

Before I dive into more details, other solutions I looked at included:

  • doing all the machine setup in UserData
    • This is a bad idea because it forces you to set up and tear down machines each time you want to make a configuration change.  Leads to a longer development cycle, especially at first.  Plus bash is great for small configurations, but when you have dependencies and other complexities, the scripts can get hairy.
  • pulling a bash script from s3/github in UserData
    • puppet is made for configuration management and handles more complexity than a bash script.  I’ll admit, I used puppet with an eye to the future when we had more machines and more types of machines.  I suppose you could do the same with bash, but puppet handles more of typical CM tasks, including setting up cron jobs, making sure services run, and deriving dependencies between services, files and artifacts.
  • using a different CM tool, like ansible or chef
    • I was familiar with puppet.  I imagine the same solution would work with other CM tools.
  • using a puppet master
    • This presentation convinced me to avoid setting up a puppet master.  Cattle not pets.
  • using cloud-init instead of UserData for initial setup
    • I tried.  I couldn’t figure out cloud-init, even with this great post.  It’s been a few months, so I’m afraid I don’t even remember what the issue was, but I remember this solution not working for me.
  • create an instance/AMI with all software installed
    • puppet allows for more flexibility, is quicker to setup, and allows you to manage your configuration in a VCS rather than a pile of different AMIs.
  • use a container instead of AMIs
    • isn’t docker the answer to everything? I didn’t choose this because I was entirely new to containerization and didn’t want to take the risk.

Since I’ve already outlined how the solution works, let’s dive into details.

Here’s the UserData section of the CloudFormation template:


          "Fn::Base64": {
            "Fn::Join": [
              "",
              [
                "#!/bin/bash \n",
                "exec > /tmp/part-001.log 2>&1 \n",
                "date >> /etc/provisioned.date \n",
                "yum install puppet -y \n",
                "yum install git -y \n",
                "aws --region us-west-2 s3 cp s3://s3bucket/auth-files/id_rsa /root/.ssh/id_rsa && chmod 600 /root/.ssh/id_rsa \n",
                "# connect once to github, so we know the host \n",
                "ssh -T -oStrictHostKeyChecking=no git@github.com \n",
                "git clone git@github.com:client/repo.git \n",
                "puppet apply --modulepath repo/infra/puppet/modules repo/infra/puppet/manifests/",
                { "Ref" : "Environment" },
                "/logstash.pp \n",
                "date >> /etc/provisioned.date\n"
              ]
            ]

So, we are using a bash script, but only for a little bit.  The second line (starting with exec) stores output into a logfile for debugging purposes.  We then store off the date and install puppet and git.  The aws command pulls down a private key stored in s3.  This instance has access to s3 because of an IAM setup elsewhere in the CloudFormation template–the access we have is read-only and the private key has already been added to our github repository.  Then we connect to github via ssh to ‘get to know the host’.  Then we clone the repository containing the infrastructure code.  Finally, we apply the manifest, which is partially determined by a parameter to the CloudFormation template.
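
To see how the Environment parameter ends up picking the manifest, here’s a small sketch that mimics what Fn::Join does with an empty delimiter and a Ref. It is only an illustration of the substitution, not anything CloudFormation ships; the `resolve_join` helper and the "staging" value are made up, and I’m assuming the clone lands in a directory called repo.

```python
def resolve_join(join_parts, parameters):
    """Mimic Fn::Join with an empty delimiter, substituting {"Ref": ...} entries."""
    resolved = []
    for part in join_parts:
        if isinstance(part, dict):
            # A {"Ref": "Name"} entry is replaced by the parameter's value.
            resolved.append(parameters[part["Ref"]])
        else:
            resolved.append(part)
    return "".join(resolved)


parts = [
    "puppet apply --modulepath repo/infra/puppet/modules repo/infra/puppet/manifests/",
    {"Ref": "Environment"},
    "/logstash.pp \n",
]

# "staging" is a hypothetical parameter value.
print(resolve_join(parts, {"Environment": "staging"}))
```

So each environment gets its own directory of manifests, and the template parameter decides which one the new instance applies.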

This bash script will run on creation of the EC2 instance.  Once this script is solid, if you are testing adding additional puppet modules, you only have to do a git pull and puppet apply to add more functionality to the modules.  (Of course, at the end you should stand up and tear down via CloudFormation just to test end to end.)  You can also see how it’d be easy to have the name of the logstash.pp manifest be a parameter to the CloudFormation template, which would let you store your configuration for web servers, database servers, etc, in puppet as well.

I’m happy with how flexible this solution is.  CloudFormation manages the machine creation as well as any other resources, puppet manages the software installed in those machines, and git allows you to maintain all that configuration in one place.


Gluecon 2015 takeaways

Is it too early to write a takeaway post before a conference is over? I hope not!

I’m definitely not trying to write an exhaustive overview of Gluecon 2015–for that, check out the agenda. For a flavor of the conversations, check out the twitter stream.


Here are some of my longer term takeaways:

  • Better not to try to attend every session. Make time to chat with random folks in the hallway, and to integrate other knowledge. I attended a bitcoin talk, then tried out the API. (I failed at it, but hey, it was fun to try.)
  • Talks on microservices were plentiful. Lots of challenges there, and the benefits were most clearly espoused by Adrian Cockcroft: they make complexity explicit. But they aren’t a silver bullet and require a certain level of organizational and business model maturity before they make sense.
  • Developer hiring is hard, and it will get worse before it gets better. Some solutions propose starting at the elementary school level with tools like Scratch. I talked to a number of folks looking to hire, and at least one presenter mentioned that as well at the end of his talk. It’s not quite as bad as 2000 because the standards are still high, but I didn’t talk to anyone who said “we have all the developers we need”. Anecdata, indeed.
  • The Denver Boulder area is a small tech community–I had beers last night with two folks that were friends of friends, and both of them knew and were working with former colleagues of mine. Mind that when thinking of burning that bridge.

To conclude, I’m starting to see repeat folks at Gluecon and that’s exciting. It’s great to have such a thought provoking conference which looks at both the forest and the trees of large scale software development.


Thoughts on Amazon CloudFormation

I recently set up Amazon CloudFormation for a fairly complicated application in AWS.  For those unfamiliar with this service, it allows you to specify a number of AWS resources in a declarative way in a JSON document, create them all at once (it’s called a ‘stack’), manage them as one entity, and destroy them.  You are billed just as you would be if you created the resources by hand.  But it’s a versionable, replicable way to create resources.

The distributed application for which I was creating the stack had the following components:

  • queues (SQS)
  • databases (dynamodb, including secondary indices)
  • compute (EC2)
  • alarms (Cloudwatch)
  • storage (S3)
  • a VPC and Subnets
  • event logging (kinesis)
  • hadoop (Elastic Map Reduce)

The last four items were not configured by the CloudFormation template I wrote.  S3, VPC and subnets because I leveraged existing resources, and Kinesis and EMR because they are not supported by CloudFormation.  (Kinesis has some support, but CloudFormation doesn’t allow you to specify a name of a stream, which makes it pretty useless when you want to post or read from a specific stream.)  However, while it would be preferable to have everything specified in CloudFormation, partial stack creation was useful–I just documented the other requirements in the CloudFormation template–because:

  • resource configuration like queue timeouts, names, read throughput, etc can be applied uniformly–consistency is enforced.
  • the infrastructure is defined and documented in one place, allowing a new developer to get up to speed quickly.
  • tags can be applied uniformly.
  • CloudFormation supports parameters, so that you can preface every resource with a deployment environment specific variable (‘stage’, ‘dan-dev’, etc), or have different DynamoDB throughput for different deployments.
  • if different configuration needs to be tested, you can stand up a new stack in minutes and test it.
  • the template can be stored in your version control system, allowing someone to see how things changed over time.  Yay, commit logs!
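
Here’s a minimal sketch of the parameter idea from the list above: one Environment parameter that prefixes a queue name and is also applied as a tag. It is not the template from this project. The `build_template` helper, the WorkQueue resource and the timeout value are made up, and I build the JSON from Python only so it is easy to print and check.

```python
import json


def build_template(visibility_timeout=30):
    """Return a small CloudFormation template as a Python dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Deployment specific prefix such as 'stage' or 'dan-dev'.
            "Environment": {"Type": "String", "Default": "stage"},
        },
        "Resources": {
            "WorkQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {
                    "QueueName": {
                        "Fn::Join": ["-", [{"Ref": "Environment"}, "work-queue"]]
                    },
                    "VisibilityTimeout": visibility_timeout,
                    "Tags": [
                        {"Key": "environment", "Value": {"Ref": "Environment"}}
                    ],
                },
            }
        },
    }


print(json.dumps(build_template(), indent=2))
```

Creating a stack with Environment set to ‘dan-dev’ would then give you a dan-dev-work-queue with the same timeout and tags as every other deployment.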

There were some other possible benefits I just didn’t have time to explore fully before the project wound down.

  • autoscaling groups seemed like they’d be extremely useful.  These aren’t a CloudFormation only tool, but CloudFormation seems an ideal way to define and use them.
  • the ability to create and delete stacks opened up the possibility of creating developer specific environments for debugging issues.

If you are going to start with CloudFormation, I highly recommend setting up an initial environment by hand, and then running CloudFormer, a small application written by Amazon which reads from your existing AWS infrastructure and generates a CloudFormation template.  I used CloudFormer to create a template for everything in our AWS account, and then picked and chose what was pulled over to the new template.  There were a few issues with this though:

  • There was a bug in the CloudFormation documentation for DynamoDB schemas.  You want to use this syntax: "KeySchema": { "HashKeyElement": { "AttributeName": "attrname", "AttributeType": "S" }, ... }.  CloudFormer generated them correctly, however.
  • CloudFormer coerces names of some resources, including VPCs and subnets, to strings, and I had to back those out when I wanted to use existing resources.

Other than not being able to fully define an application (because of dependencies on unsupported AWS tools like Kinesis and EMR), what other downsides does CloudFormation have?

  • it locks you into AWS.  Openstack Heat is an alternative that works across clouds, or so I read.  And, really, once you decide on AWS, is an infrastructure creation script going to be the one thing that keeps you from moving?
  • it is tied to infrastructure creation (though there is resource by resource support for in place updates).  If you want to modify one queue setting, you have to tear down and create anew the entire stack.  I found this to be relatively quick (15 min or so).
  • you are still writing scripts in the UserData section of the EC2 definition to set up your server environment.

After this experience, and reviewing my thoughts above, I believe the sweet spot of CloudFormation is setting up dev and QA environments quickly, and documenting infrastructure choices when you are committed to AWS.



© Moore Consulting, 2003-2017