Fixing the RubyGems “Too Many Requests 429” error

lots of gemsA server on which I am working runs this command: /usr/bin/gem install --no-rdoc --no-ri aws-sdk to get the aws-sdk. I was seeing this error message:

Error: Execution of '/usr/bin/gem install --no-rdoc --no-ri aws-sdk' returned 1: ERROR:  While executing gem ... (Gem::RemoteFetcher::FetchError)
    bad response Too Many Requests 429 (https://api.rubygems.org/api/v1/dependencies?gems=aws-sdk-elasticloadbalancingv2)

Every time I ran it I’d see a different gem that triggered the 429 response. There wasn’t much out there when searching, other than a note that I should update to a new version of bundler (which I wasn’t using).

Finally, I figured out how to get past this. What I did was manually run /usr/bin/gem install -f --no-rdoc --no-ri aws-sdk multiple times, and each time the command would get a little further. Finally all the dependencies had been downloaded. Then I was able to run it without the -f switch after that.


Obstacles to building high availability software systems

Open sign

Is your system available?

I saw a discussion on a slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Note that in any mention of high availability systems would be remiss if I didn’t mention the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations of a system that will hinder high availability fall into two categories.

The first category is actions that you don’t take, but could take:

  • single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
  • missing or incomplete automation: if you need human beings to resurrect failed parts of your system, it will meaningful amounts of time and will be error prone.
  • failing to build in elasticity and scalability of resources: when usage increases, new resources should be automatically brought online. Failure to do so will impact system performance and that could impact system availability
  • missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
  • application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

  • data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, this can hinder building the system.
  • tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
  • data models and authority requirements: poorly performing data models can impact performance. If your application requires certain operations must be from the source of record (permissions checks, for example) then a poorly performing source data model can impact performance which can impact availability.
  • latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
  • cost: high availability systems, because they have no single points of failure, cost more.

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!


Who’s Afraid of Continuous Deployment?

Fish leaping to a larger pool

Leaping to larger pool

So, who’s afraid of continuous deployment? I am, for one. And I’m not alone. I taught hundreds of people in AWS courses over the past two years. We often discussed continuous delivery and deployment and I asked if this was practiced at their places of work. I’d say about 5-10% of folks said yes. I conducted a very informal survey across two technical slacks as well. Unfortunately I had my terms wrong and asked about continuous delivery:

Wanted to do a quick poll. Can you please give a thumbs up to this message if you or your team does continuous delivery of your software product, and a thumbs down if you don’t. And a :penguin: if it doesn’t apply?

The results were:

  • Did CD: 27
  • Did not do CD: 25
  • Does not apply: 3

In the poll, I defined continuous delivery as “if a change is merged to the mainline branch and passes all the tests, it is deployed to production (or whatever environment your customers see) without human involvement”. This was actually a source of discussion, as some folks were very close to this (they deployed to beta environments where only a few customers saw it, or required one human to push a button to actually release, but everything up to that point was automated). Also, someone shared this link about the difference between continuous delivery and continuous deployment. Turns out I was using the term continuous delivery incorrectly. What I defined as continuous delivery was actually continuous deployment. Whoops!

That said, it was interesting that a large number of folks did not deploy code automatically, almost half (note that I believe the poll had a bias because I asked in one slack on the #devops channel. The numbers from the other slack had less than half doing continuous deployment). I’ve worked at a number of small startups, some without paying customers, and I’ve never worked in a place with continuous deployment. I’ve been in jobs with continuous integration and continuous delivery (and this provides a lot of value) but not continuous deployment. I wanted to talk about some reasons why.

The first reason is that continuous deployment simply doesn’t apply. If you are building software that is deployed to customer sites (on-prem), or is tied to hardware, then it doesn’t make sense to work toward CD because there will always be a manual delivery component. Another reason why it might not apply is legal compliance. Folks in the slacks pointed out that in some regulatory regimes you legally are required to have a human ‘push a button’ to deploy because more than one person needed to be involved in a code deploy to satisfy the law and the auditors. These are totally legitimate reasons for not doing continuous deployment.

Next, let’s discuss the reasons based on fear or lack of software hygiene (automated tests or a robust type system). Before I step into this, I want to acknowledge that there may be times in the life of your business where such software hygiene is detrimental to your chances of survival–you need to get an MVP out and test your value in the market, for example. However, in my years of experience I find that following proper software hygiene is far easier to do if adhered to from the beginning. If you don’t, eventually the difficulty of changing the system will grow along with its complexity. You can bolt on testing later, but it is difficult.

I also want to emphasize that I’ve been in all these situations myself. In some ways this blog post is a warning for future me when I try to shirk these practices.

  • If you don’t have automated test coverage, continuous deployment is reckless. This often happens in systems where the testing was bolted on after the system had been developed for a while. The solution is to work towards having enough test coverage to give yourself confidence (it swaddles your code).
  • A system may have configuration deeply tied to a database. Many content management systems are in this boat, which makes it very difficult to roll new configuration forward automatically.
  • Not having an automated rollback strategy. If you are going to continuously deploy, you need to have a way to rollback with confidence, with one script. If you are on heroku, heroku rollbacks help here. If you are running rails code, you can use db:rollback but you’ll need to know how many steps to rollback (I couldn’t find anything that rolled all migrations back to a given timestamp) and you’ll want to be careful about losing data. It may make more sense to run migrations in a different release, and always have the code be backward compatible. Lots of interesting reading about that strategy in strong_migration’s docs. This solution will vary from application to application.
  • Not having enough users to safely canary. One way to know if your new release has problems is to do a blue/green deployment and send just a fraction of your traffic there (you could use a weighted DNS round robin solution). But if you only have a small number of users, the canary userbase won’t adequately run through all the code paths.
  • Fear of breaking key user flows. At a recent company we did basic manual regression tests just before deployment. These could have been easily automated via selenium and would have made sure that at least basic functionality was available. Also see this post from 2013 on smoke testing.

All of these are not really technical issues, they’re prioritization issues. At this point in time most web applications can be continuously deployed. The tooling and the knowledge is out there, given the business and technology teams commitment.

However, this in some ways sidesteps the real question. Why is continuous deployment a goal worth prioritizing, especially when the team has to spend time supporting that instead of giving customers more features? CD is extra work to set up, but once it is running then you can deliver features at a very rapid pace, and you never have a feature sitting around waiting for other orthogonal features. So, in a way, it will actually lead to more features and better development. There’s also the long term benefits of software hygiene for the ability of the system to evolve.


Ephemeralization is the future

In the same vein as “write less software“, this is a great post about ephemeralization. To ephemeralize software is to make it disappear either by pushing it to another provider or just removing the feature. From the post:

Proposals to ephemeralize a component or feature will sometimes be met with emotionally-charged responses from your team. It’s totally reasonable to feel attached to a component everyone has worked hard on and has been important historically. But realize that the component has value because it got you to where you are today, not necessarily because of its ongoing existence in the future.

Path dependence of software is very real. What you’ve done before determines what you can easily do in the future. But you can take large right turns if you are prepared to make the investment. As the above quote notes, it’s not just an investment in time, knowledge and money, it can also be an emotional investment.


Heroku and SSL

Heroku is a great hosting platform. Using it lets you focus on your application and not worry about operations tasks that might otherwise take developer time. Tasks like updating the operating system, patching the web server, and configuring third party services like monitoring. There are limits of course–once you reach a certain size it can be pricey. And if you have a app that has special requirements (example here), you might have to jump through some hoops. But if you are building a typical web app backed by a relational database, I’d highly recommend Heroku.

SSL is the technology which ensures that data sent over the web is transmitted untampered. Heroku has a number of offerings helping to make SSL easier to install; here’s a page to help you decide between the offerings. Last year Heroku announced support for free, automatically renewable SSL certificates. It does have some limits (no wildcard support, for example).

I just finished updating our system from an older SSL solution to the automated certificate management offering. Other than a slight hiccup around DNS being set up incorrectly by me (quickly fixed after a helpful answer from Heroku support), this update went swimmingly. Now rather than having to deal with SSL updates once a year (reviewing my notes, googling around and paying for a third party certificate), the SSL certificate is updated automatically.

This is just another example of Heroku taking something that is key to operating a web application and making it trivial. What a relief.


CircleCI shutting down version 1.0

I’ve been a happy user of CircleCI at multiple companies. Right now we pay them at The Food Corridor and they handle almost all our deployments. (I still deploy to production manually.)

We just got a note that they are shutting down their 1.0 offering and will not support it as of Aug 2018. The 2.0 offering was announced in 2016 and generally available in 2017. So, there will be about one year of overlap. Not too long.

I understand that desire to move forward.Trust me, I do.

I don’t know how much engineering effort it takes to support the two versions, but my guess is that they’ll see some significant customer loss from this. Why? CI is something that you just want to work. You don’t want to think about it. Which is why a SaaS solution makes so much sense. I am happy to just keep paying them month after month for their excellent product.

But, if I have to take some cycles to move from CircleCI 1.0 to CircleCI 2.0, why wouldn’t I take some time and evaluate other solutions too? I assume they’ve run the numbers and the amount of money it takes to support 1.0 must be more than the amount they will lose via churn.

AWS does a good job of this–they never deprecate anything (you can still set up SimpleDB if you want). They just hide it, make other offerings better, and make older offerings more expensive.

In fact, if I were CircleCI, I might offer a ‘legacy’ CircleCI 1.0 plan, where people with significant investments in the older infrastructure can pay more for access to that old codebase. Depending on the amount of support required, that might be some significant free money.

Relatedly, Amy Hoy has a great post on how to get your customers to pay you more money.



© Moore Consulting, 2003-2017 +