
Open source software business model concerns in the age of the public cloud

A few years ago at Gluecon I was sitting at the lunch table with some new acquaintances. The talk turned to business models and someone mentioned the chilling effect of the public cloud providers on open source businesses. Companies whose main product was open source software (OSS) originally had a straightforward business model: sell support and consulting. (I’m leaving aside sponsorships, donations, dual licensing and other smaller streams of income. I’m focusing on what I’ve seen work to build larger, stable companies. If you are a solopreneur you have a different set of options.)

This shifted over time to withholding certain features (the “open core” model) and licensing them to customers who wanted them. Open core was better for companies because selling software is more scalable and profitable than selling labor (which is what support and consulting essentially are). Customers won because a certain set of them needed features, such as single sign-on or auditing, that were likely not core to the software but were critical for their use cases. These customers tended to be businesses with money to spend. Open core, however, has some problems.

The fundamental one is that “what should be OSS and what should be licensed?” becomes a pressing question. Building a product is tough enough in general, but now there’s additional marketing, product and engineering complexity.

The SaaS Business Model Cometh

With the rise of SaaS, a third monetization option appeared: offer the software as a service and run it on behalf of customers. NodeBB, Elasticsearch and others all do this. The customer wins because they have a lower TCO and benefit from continual updates, and the company wins because SaaS combines the margins of open core with recurring revenue. This has even caused some companies to move away from open core. The community remains satisfied because, if they want, they can self-host.

Enter the cloud companies. Cloud companies could and did take open source projects, package them up and offer them as a service for a low monthly fee.

As a customer, and I’ve been such a customer many times, using an offering from these providers has definite benefits:

  • The cloud operators know how to run software at scale.
  • They are already part of the accounting infrastructure. There’s no additional procurement process.
  • They don’t typically deprecate software (killedbygoogle notwithstanding), so as a customer you are less worried about ongoing support and availability.

Adoption by a cloud provider has benefits for the open source project, too.

  • It’s proof the software has reached a significant level of customer demand; otherwise the big clouds wouldn’t bother.
  • Improvements to the software or documentation may be donated back.
  • It increases the number of folks who use the software or know about it.

However, those benefits may not balance out the loss or threat of loss of revenue from SaaS hosting. This was the chilling effect mentioned around that lunch table. When I read or ask folks about this issue, I’ve heard three separate proposed solutions:

  • Be better
  • You should be so lucky
  • Close your source or re-license

Be Better

First, as the prime mover behind the product, a company should be able to be better than the cloud operators. Not necessarily better at running all software, but better at running theirs. After all, they built it. This does assume that the best people to operate a service are the ones who built it. This may or may not be true. I think it is definitely true at a small scale–they are going to know which knobs to tweak, and if they run into an issue they can immediately fix it in the code base. As software gets more and more popular, the ability to run it well diffuses, however.

The company behind a project should also have a better sense of customer needs and the ability to map those to its internal roadmap. Of course, being an open source project means that these advantages are fleeting. If anything is turned into code and released, it is made available to competitors.

An even bigger issue is that this relies on customers being willing to stick with the company on the bleeding edge. For rapidly growing or changing software that may make sense. Where major functionality is being built regularly, upgrades are an easy sell. But OSS isn’t adopted by a cloud provider until it has a large audience, which means it is likely already quite functional for a lot of use cases. This means that the company won’t get as many folks on the upgrade train, which makes the “be better” argument more problematic. If the cloud provider can offer a “good enough” experience, then their other advantages start to dominate.

You Should Be So Lucky

This is the argument that the chances of an application becoming popular enough to be adopted by a public cloud are so low that it’s not worth spending time thinking about. I asked Steve O’Grady about the OSS business model predicament at a webinar once and this was his answer.

There’s a lot of merit to this. The biggest obstacle to an open source company succeeding is not anyone taking the software and running it, it’s the software not being useful or known. Obscurity has killed many a company, far more than AWS adopting their software.

But when a company reaches a certain size, this advice is less a solution and more the naming of a strategic threat. Yes, they are lucky to have their solution be popular. But what can they actually do about the threat to what may be a sizable portion of their revenue? This argument doesn’t help with that.

Close Your Source

Depending on how the company developed the software, it could relicense or close the source. This may require some prep work. First, make sure the software is all owned by the company; have all contributors assign over their copyrights. This will make it more difficult to reap one of the benefits of open source: easy contributions from others.

The company can also choose to relicense only certain portions of the codebase, taking a step towards open core. In any case, prepare to consult an IP lawyer.

The company must also be ready for the blowback from the community. Which hurts more, losing something that was once free or paying for something from the get-go? For me, it’s the former. Many of the marketing, engineering and product benefits gained from being open source will be lost; this is the price to pay for obviating the cloud vendor SaaS threat.

The firm could also dual license the software or use something like the Business Source License, which strikes a different balance, but still has some attributes of an open source license.

Alternatives

None of these sound very fun. I think the answer is to back up and consider whether to make a project open source in the first place. Consider this very carefully at the start to avoid pain later.

The benefits of open source for a business’ core offering are:

  • Lower friction of adoption
  • Contributions from outside an organization (these must be fostered; they may be free in terms of money, but not time)
  • Increased rigor for the eng team (coding in public can make code better)
  • Easier to recruit because devs like working on open source
  • More eyes on the code means shallower bugs
  • Marketing halo

Are these worth the risk? I can’t answer that authoritatively. If they are, another option might be to prepare for SaaS revenue cannibalization; this may mean going open core, focusing on consulting or support, or possibly never even building a SaaS solution.

There’s more than one way to be open. A company can do open development, inviting customers into the product process. Would being more open with the development process (using public GitHub issues, for example, instead of an internal issue tracker) allow a company to get some level of contribution from outside the org? This would also impose that “spotlight level rigor” on the eng team. Could the company make the software “free as in beer”, rather than “free as in speech”, to lower adoption costs?

A firm can also release software as open source without taking any feedback from the community (“throw it over the wall”), which may be a good choice if the software isn’t core. Big companies do this every day. Facebook gladly open sources its data center hardware designs and React, but certainly wouldn’t open source the core Facebook application.

In conclusion, if you are planning to build a company on an open source project, think about the main revenue flows for that company and what you’ll do in the unlikely, but hoped-for, event that the project succeeds.

Use managed services. Please.

“Use managed services.”

If there was one piece of advice I wish I could shout from the mountains to all cloud engineers, this would be it.

Operations, especially operations at scale, are a hard problem. Edge cases become commonplace. Failure is rampant. Automation and standardization are crucial. People with experience running software and hardware at this scale tend to be rare and expensive. The mistakes they’ve made and situations they’ve learned from aren’t easy to pick up.

When you use a managed service from one of the major cloud vendors, you’re getting access to all the wisdom of their teams and the power of their automation and systems, for the low price of their software.

A managed service is a service like AWS Relational Database Service (RDS), Google Cloud SQL or Azure SQL Database. With all three of these services, you’re getting best-of-breed configuration and management for a relational database system. There’s configuration needed on your part, but hard or tedious tasks like setting up replication or backups can be done quickly and easily (take this from someone who fed and cared for a MySQL replication system for years). Depending on your cloud vendor and needs, you can get managed services for key components of modern software systems like:

  • File storage
  • Object caches
  • Message queues
  • Stream processing software
  • ETL tools
  • And more

(Note that these are all components of your application, and will still require developer time to thread together.)
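
As an illustration of how little configuration “quickly and easily” can mean, here is a hedged sketch, in Terraform, of standing up a managed PostgreSQL database with automated backups and a multi-AZ standby. Every name and size below is made up for illustration, not taken from a real system:

# Hypothetical example: a managed PostgreSQL instance with automated backups
# and a standby replica in another availability zone.
variable "db_password" {}  # supplied at apply time, never hardcoded

resource "aws_db_instance" "appdb" {
  identifier              = "example-app-db"
  engine                  = "postgres"
  instance_class          = "db.t3.medium"
  allocated_storage       = 100
  username                = "appuser"
  password                = "${var.db_password}"
  multi_az                = true   # standby replica with automatic failover
  backup_retention_period = 7      # daily backups, kept for a week
  skip_final_snapshot     = false
}

Compare those dozen lines to installing the database, configuring replication and wiring up backups yourself.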

You should use managed services for three reasons.

  • It’s going to be operated well. The expertise that the cloud providers can provide and the automation they can afford to implement will likely surpass what you can do, especially across multiple services.
  • It’s going to be cheaper. Especially when you consider employee costs. The most expensive AWS RDS instance is approximately $100k/year (full price). It’s not an apples to apples comparison, but in many countries you can’t get a database architect for that salary.
  • It’s going to be faster for development. Developers can focus on connecting these pieces of infrastructure rather than learning how to set them up and run them.

A managed service doesn’t work for everyone. If you need to be able to tweak every setting, a managed service won’t let you. You may have stringent performance or security requirements that a managed service can’t meet. You may also start out with a managed service and grow out of it. (Congrats!)

Another important consideration is lock-in. Some managed services are compatible with alternatives (Kubernetes services are a good example). If that is the case, you can move clouds. Others are proprietary and will require substantial reworking of your application if you need to migrate.

If you are working in the cloud and you need a building block for your application like a relational database or a message queue, start with a managed service (and self host if it doesn’t meet your needs). Leverage the operational excellence of the cloud vendors, and you’ll be able to build more, faster.

Terraform with multiple workspaces and environments

I recently was setting up a couple of AWS environments for a client. This client had a typical web application which talked to an RDS database. There was DNS, a CDN and other components involved. We wanted to use Terraform to maintain traceability and replicability, and to have the same configuration for production and staging, with perhaps small differences like EC2 instance size. We also wanted to separate the components into their own Terraform workspaces to limit the blast radius (so that if one component had changes that caused issues or state corruption, it wouldn’t affect the others). Finally, we wanted each environment to have its own Terraform backend, again to separate the environments.

I wasn’t able to complete this project due to external factors (I left the position before testing could be completed), but wanted to share the concepts. Obviously I can’t share the working code, but I set up an example project which is simpler. That’s the project I’ll be examining in this post. I also want to be clear that while I’ve tested this as much as I could and have validated the ideas with others who have more Terraform experience, this hasn’t been run in production. You have been warned. (Here’s the Terraform docs about setting up modules, workspaces and repositories.)

Using a tool like Terraform is great for a number of reasons, but my favorite is that it lets you track changes to cloud infrastructure. More than once I’ve wandered into an AWS account and wondered why certain resources were set up in the way they were, and what might break if I changed them. There are occasionally comments, but it is far better to examine a commit. Even better to review the set of commits and see the customer request or bug tied to it. (Bonus link: learn more about Terraform and other cloudy tools in this podcast episode with the creator of Terraform.)

So this simpler example project has a lambda that writes to an SQS queue. For now, it just writes the date of invocation, but obviously you could have it reach out to an external API, read from a database, or do some kind of calculation. The SQS queue could then be read from by an EC2 instance, which processes the message and perhaps updates a database. You have three components of the system:

  • The lambda function
  • The SQS queue
  • The EC2 instance (implementation of which is left as an exercise for the reader)

The SQS queue is shared infrastructure and needs to be accessed by both of the other systems. However, the SQS system doesn’t need to know about either the lambda or the EC2 instance. Using Terraform, we can create each of these components as their own workspace. Each of the subsidiary systems can evolve or change (for instance, the EC2 instance could be replaced with an autoscaling group) with minimal impact on other systems. They could be managed by different teams as well if that made sense.

To enforce this separation, set up each component as a separate Terraform workspace. (All code is on GitHub here.) I use remote state so that more than one person can manage the Terraform state, and use the S3/DynamoDB backend because we are targeting AWS and want a free, scalable solution. This post assumes you know how to set up Terraform using S3/DynamoDB as remote state storage.
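
For reference, the queue in the sqs workspace is just a plain SQS resource. A minimal version might look something like this (the name suffix and retention period are illustrative, not pulled from the real project):

variable "env_indicator" {
  default = "staging"  # hypothetical default; overridden for production
}

resource "aws_sqs_queue" "myqueue" {
  name                      = "myqueue-${var.env_indicator}"  # e.g. myqueue-staging
  message_retention_seconds = 86400                           # keep messages around for a day
}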

Here’s the outputs of the SQS system:

output "queue_url" {
  value = "${aws_sqs_queue.myqueue.id}"
}

output "queue_arn" {
  value = "${aws_sqs_queue.myqueue.arn}"
}

I explicitly define the output variables so I can pull them in from the lambda and EC2 workspaces. This is how you can do that.

...
data "terraform_remote_state" "sqs" {
  backend = "s3"
  config = {
    bucket = "${var.terraform_bucket}"
    key = "sqs/terraform.tfstate"
    encrypt = true
    dynamodb_table = "terraform-remote-state-locks"
    profile = "${var.aws_profile}"
    region = "us-east-2"
  }
}
...
resource "aws_lambda_function" "mylambda" {
...
  environment {
    variables = {
      sqs_url = "${data.terraform_remote_state.sqs.outputs.queue_url}"
    }
  }
}

The terraform_remote_state block defines the location of the previously defined sqs workspace, and the ${data.terraform_remote_state.sqs.outputs.queue_url} references that url. That is then injected as an environment variable into the lambda, which reads it and uses the url to create an SQS client. It can then post whatever message it wants.

You can see how this would work with any number of configuration parameters. If you have a typical three-tier, database-driven application with a separate caching layer, you can create each of these major components and inject the values into either the environment (for lambda) or the userdata (for EC2). I’m not sure I’d use this with a microservices architecture because a service registry might be more appropriate.
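
For the EC2 side, here’s a hedged sketch of what injecting the queue URL via userdata might look like. It assumes the same terraform_remote_state data source shown above; the AMI, instance type variable and boot script are placeholders:

variable "instance_type" {
  default = "t3.micro"  # hypothetical; pass a larger value for production
}

resource "aws_instance" "worker" {
  ami           = "ami-0123456789abcdef0"  # placeholder AMI id
  instance_type = "${var.instance_type}"

  # Hand the queue URL to whatever process starts on boot.
  user_data = <<-EOF
    #!/bin/bash
    echo "SQS_URL=${data.terraform_remote_state.sqs.outputs.queue_url}" >> /etc/environment
  EOF
}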

Note that the lambda component has a rudimentary lambda function (you have to define something). It also uses Terraform to deploy the lambda code. That’s fine for the toy example, but for production you will want to use a real CI/CD system to deploy your lambdas.

Now, suppose you want to run production and staging environments, because you are ready to launch. Here are the constraints you’d want:

  • Production and staging run the same config (except when staging is changing, of course)
  • Production and staging may differ in a few details (the size of the EC2 instance, for example)
  • Production and staging execute in different AWS accounts to limit access and issues. You don’t want an error in staging to affect production. This is handled by creating different profiles which have access to different accounts.
  • Production and staging execute in different Terraform backends for the same reason as the separate AWS accounts.

Staging and production can use the same git repository, but when pulled down they are kept in two places on the filesystem. This is because you need to specify the profile and the bucket when using terraform init. So you end up running something like these two commands:

git clone git@github.com:mooreds/terraform-remote-state-example.git # staging
git clone git@github.com:mooreds/terraform-remote-state-example.git production-terraform-remote-state-example # production

I set up the project so that staging can be managed by normal terraform commands (since that will happen more often), and that production uses either special incantations or a script. For the initialization of the production Terraform environment, this looks like: terraform init -backend-config="profile=trsproduction" -backend-config="bucket=bucketname". For staging, it’s just terraform init. I didn’t have a lot of luck switching between these two Terraform backends in the same filesystem location, so having two trees was a straightforward workaround.
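
For the curious, the backend block that makes this work has the staging values baked in, and the -backend-config flags override them for production at init time. Roughly (the staging bucket name and state key here are my guesses, not necessarily what’s in the example repository):

terraform {
  backend "s3" {
    bucket         = "mooreds-terraform-remote-state-example"  # staging default, overridden for production
    key            = "lambda/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-remote-state-locks"
    profile        = "trsstaging"                              # staging default, overridden for production
    encrypt        = true
  }
}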

Any changes between production and staging are each pulled out to a variable, with the staging value as the default. Then each workspace has a script which applies the Terraform configuration to the production environment. The script sets variables to be the correct value for production. Here’s an example for the lambda workspace:

terraform apply -var aws_profile=trsproduction -var terraform_bucket="mooreds-terraform-remote-state-example-production" -var env_indicator="production" -var lambda_memory_size=256

We pass in the production terraform_bucket in case any references need to be made to the remote state (to pull in the SQS queue url, for example). We also pass in an increased lambda memory size because, hey, it’s production. Other things might vary between environments as well: VPC or subnet IDs, API endpoints, and S3 bucket names, for example.
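
Concretely, the variable declarations in the lambda workspace might look something like this, with staging values as the defaults (the particular defaults shown are illustrative):

variable "aws_profile" {
  default = "trsstaging"
}

variable "terraform_bucket" {
  description = "S3 bucket holding remote state for the other workspaces"
  default     = "mooreds-terraform-remote-state-example"  # guess at the staging bucket name
}

variable "env_indicator" {
  default = "staging"
}

variable "lambda_memory_size" {
  default = 128
}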

For simplicity, we just use two profiles for staging and production (in ~/.aws/credentials), but any way of getting credentials that works with Terraform will work:

[trsstaging]
aws_access_key_id = ...
aws_secret_access_key = ...

[trsproduction]
aws_access_key_id = ...
aws_secret_access_key = ...

This lets us separate out who has production access. Some users can have both staging and production profiles (perhaps operations), and others can have only staging profiles (perhaps developers). You can pass region values in via variables as well.

Using this system, the workflow for a change would be:

  • Check out the terraform git repository
  • Create a feature branch (including an issue identifier)
  • Pull request and approval
  • Run terraform apply to apply to staging
  • Run any additional tests
  • Merge to master
  • Run prodapply.sh

Again, I want to be clear that I’ve implemented this partially, but I didn’t get a chance to run this fully in production. I tested all these concepts with the simple system mentioned above (and you can stand up your own using the code on github). There will be issues that I haven’t experienced. But I hope that this post helps illuminate the complexity of managing multiple workspaces and environments within a single Terraform github repository.

Using AWS for load testing experimentation

The cloud is amazing for load testing your system. If you design your system to be behind a load balancer (which, in many applications, means pushing state to a database and having stateless compute nodes), you can easily switch out those nodes in different scenarios.

I just load tested a system I’m working on and changing out the compute nodes was fairly easy. Once I’d built a number of servers (something I scripted partially but didn’t fully automate because the return wasn’t there) and troubleshot some horizontal scaling issues that popped up in the application, I was able to:

  • take a server out of service behind the load balancer
  • stop it
  • change the instance type
  • start it
  • re-run any needed config changes on the server
  • update DNS if needed (depending on if you have a pinned IP address or not)
  • add it back to the load balancer

Swap out a few instances and you have a new setup for your load test. When you are done, follow the process in reverse to save yourself some money.

Incidentally, increasing the number or size of compute nodes didn’t have the desired effect of being able to handle more load.

What turned out to be the root issue? The database was pegged, both in terms of CPU and connections. Just goes to show that when you’re load testing, you really need to be looking at different aspects of the system, thinking about where your bottlenecks are, and using the scientific method: hypothesis, experiment, result.

Follow the money, cloud edition

This post was really eye opening and lets you know who the real players in the public cloud space are. I especially enjoyed the metric of CAPEX as a percent of revenue. From the post:

As I keep repeating, CAPEX is both a prerequisite to play in the big boy cloud and confirmation of customer success. Both IBM and Oracle are tens of billions of dollars in cloud infrastructure CAPEX behind Amazon, Google, and Microsoft. Oracle’s spending has at least ticked up, but their spending is not enough to keep pace, much less to have any hope of catching up to the infrastructure of the big three.

The whole post is worth reading if you are interested in public cloud providers in any way.

Obstacles to building high availability software systems


I saw a discussion on a slack about obstacles to high availability systems and wanted to record the edited version for posterity (mostly for future me, as I blog for myself). Note that any mention of high availability systems would be remiss if it didn’t include the Google SRE book, which is slow reading but free and full of great information.

First, what is high availability? I like this definition from Digital Ocean:

In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Design considerations of a system that will hinder high availability fall into two categories.

The first category is actions that you don’t take, but could take:

  • single points of failure: if you have a piece of your system which is unique and it fails (and everything fails, all the time), the entire system’s availability will be affected.
  • missing or incomplete automation: if you need human beings to resurrect failed parts of your system, it will take meaningful amounts of time and will be error prone.
  • failing to build in elasticity and scalability of resources: when usage increases, new resources should be automatically brought online. Failure to do so will impact system performance, and that could impact system availability.
  • missing or incomplete system instrumentation: if you don’t monitor your system, you won’t be able to even know its availability (until you hear from your users).
  • application statefulness (on the compute nodes): this impacts your ability to use elastic resources and to grow parts of your system that are under load. (If you aren’t designing a greenfield system, this may be an externally imposed requirement due to existing software.)

The second is in actions you can’t take because of external requirements on the system:

  • data sovereignty: if you are legally limited to certain data centers, you have fewer options for your system, which can hinder building it for high availability.
  • tenancy: if you need to have single tenancy for security or legal reasons, you may have fewer options for elastic solutions.
  • data models and authority requirements: poorly performing data models can impact performance. If your application requires that certain operations come from the source of record (permission checks, for example), then a poorly performing source data model can impact performance, which in turn can impact availability.
  • latency: if you have a highly latency sensitive system, then you may need to trade availability for decreased latency. Since availability often means geographic dispersion (to avoid disasters impacting multiple pieces of a system), it impacts latency requirements.
  • cost: high availability systems, because they have no single points of failure, cost more.

Again, this was a discussion from a slack of AWS instructors, but the commentary is mine, as are any mistakes. Thanks to Chad, Richard, Jon, Ryan and everyone else!

Who’s Afraid of Continuous Deployment?


So, who’s afraid of continuous deployment? I am, for one. And I’m not alone. I taught hundreds of people in AWS courses over the past two years. We often discussed continuous delivery and deployment and I asked if this was practiced at their places of work. I’d say about 5-10% of folks said yes. I conducted a very informal survey across two technical slacks as well. Unfortunately I had my terms wrong and asked about continuous delivery:

Wanted to do a quick poll. Can you please give a thumbs up to this message if you or your team does continuous delivery of your software product, and a thumbs down if you don’t. And a :penguin: if it doesn’t apply?

The results were:

  • Did CD: 27
  • Did not do CD: 25
  • Does not apply: 3

In the poll, I defined continuous delivery as “if a change is merged to the mainline branch and passes all the tests, it is deployed to production (or whatever environment your customers see) without human involvement”. This was actually a source of discussion, as some folks were very close to this (they deployed to beta environments where only a few customers saw it, or required one human to push a button to actually release, but everything up to that point was automated). Also, someone shared this link about the difference between continuous delivery and continuous deployment. Turns out I was using the term continuous delivery incorrectly. What I defined as continuous delivery was actually continuous deployment. Whoops!

That said, it was interesting that a large number of folks, almost half, did not deploy code automatically (note that I believe the poll had a bias because I asked in one slack on the #devops channel; the numbers from the other slack had less than half doing continuous deployment). I’ve worked at a number of small startups, some without paying customers, and I’ve never worked in a place with continuous deployment. I’ve been in jobs with continuous integration and continuous delivery (and this provides a lot of value) but not continuous deployment. I wanted to talk about some reasons why.

The first reason is that continuous deployment simply doesn’t apply. If you are building software that is deployed to customer sites (on-prem), or is tied to hardware, then it doesn’t make sense to work toward CD because there will always be a manual delivery component. Another reason why it might not apply is legal compliance. Folks in the slacks pointed out that in some regulatory regimes you legally are required to have a human ‘push a button’ to deploy because more than one person needed to be involved in a code deploy to satisfy the law and the auditors. These are totally legitimate reasons for not doing continuous deployment.

Next, let’s discuss the reasons based on fear or lack of software hygiene (automated tests or a robust type system). Before I step into this, I want to acknowledge that there may be times in the life of your business where such software hygiene is detrimental to your chances of survival–you need to get an MVP out and test your value in the market, for example. However, in my years of experience I find that following proper software hygiene is far easier to do if adhered to from the beginning. If you don’t, eventually the difficulty of changing the system will grow along with its complexity. You can bolt on testing later, but it is difficult.

I also want to emphasize that I’ve been in all these situations myself. In some ways this blog post is a warning for future me when I try to shirk these practices.

  • If you don’t have automated test coverage, continuous deployment is reckless. This often happens in systems where the testing was bolted on after the system had been developed for a while. The solution is to work towards having enough test coverage to give yourself confidence (it swaddles your code).
  • A system may have configuration deeply tied to a database. Many content management systems are in this boat, which makes it very difficult to roll new configuration forward automatically.
  • Not having an automated rollback strategy. If you are going to continuously deploy, you need to have a way to roll back with confidence, with one script. If you are on Heroku, Heroku rollbacks help here. If you are running Rails code, you can use db:rollback, but you’ll need to know how many steps to roll back (I couldn’t find anything that rolled all migrations back to a given timestamp) and you’ll want to be careful about losing data. It may make more sense to run migrations in a different release, and always have the code be backward compatible. Lots of interesting reading about that strategy in the strong_migrations docs. This solution will vary from application to application.
  • Not having enough users to safely canary. One way to know if your new release has problems is to do a blue/green deployment and send just a fraction of your traffic there (you could use a weighted DNS round robin solution). But if you only have a small number of users, the canary userbase won’t adequately run through all the code paths.
  • Fear of breaking key user flows. At a recent company we did basic manual regression tests just before deployment. These could have been easily automated via Selenium and would have made sure that at least basic functionality was available. Also see this post from 2013 on smoke testing.

All of these are not really technical issues; they’re prioritization issues. At this point in time most web applications can be continuously deployed. The tooling and the knowledge are out there, given the business and technology teams’ commitment.

However, this in some ways sidesteps the real question. Why is continuous deployment a goal worth prioritizing, especially when the team has to spend time supporting it instead of giving customers more features? CD is extra work to set up, but once it is running you can deliver features at a very rapid pace, and you never have a feature sitting around waiting for other, orthogonal features. So, in a way, it will actually lead to more features and better development. There are also the long-term benefits of software hygiene for the ability of the system to evolve.

Software infrastructure configuration options

I ran across this great article when I was reading up on Terraform.

It does a good job of running through the options (Puppet, CloudFormation, etc.) for how to set up your infrastructure via software. Here’s a great quote on why they chose Terraform:

On the other hand, with the kind of declarative approach used in Terraform, the code always represents the latest state of your infrastructure. At a glance, you can tell what’s currently deployed and how it’s configured, without having to worry about history or timing. This also makes it easy to create reusable code, as you don’t have to manually account for the current state of the world. Instead, you just focus on describing your desired state, and Terraform figures out how to get from one state to the other automatically.
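
A tiny, hypothetical illustration of what that means in practice: you declare that three identical web servers should exist, and Terraform works out whether it needs to create, modify or destroy instances to make that true.

# Declare the desired end state; Terraform computes the steps to get there.
resource "aws_instance" "web" {
  count         = 3                         # "there should be three of these"
  ami           = "ami-0123456789abcdef0"   # placeholder AMI id
  instance_type = "t3.micro"
}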

Serverless Framework

I had coffee with an acquaintance who is doing a lot of event driven data processing. Whereas ten years ago to tackle this problem you might use an ETL tool like Pentaho or Talend, now his process runs entirely on AWS Lambda functions. He is leveraging the Serverless framework to manage and deploy these applications. As I understand it there is a thin shim layer between the business logic and the lambda event handler, but the business logic is isolated and knows nothing about its environment. That makes the business logic very testable.

His description of the Serverless framework intrigued me. As he described it, the framework is driven by a simple yaml file and takes care of, among other tasks, the complicated infrastructure setup required to tie Lambda functions to a variety of AWS events. I haven’t done it myself, but I’ve heard that setting up a lambda to API Gateway link is a real bear. Doing so allows a lambda function to respond to web requests without any AWS authentication, and is a key use case.

You can write and deploy lambda functions in any language that AWS Lambda supports (unfortunately, not Java 9 at the moment). Here’s a java/maven/serverless tutorial. It also supports multiple cloud providers, though I haven’t done much beyond note that the documentation exists.

However, using Serverless does require writing code. If you are evaluating a complicated ETL process which non-developers need to be able to understand and support, Serverless would not be a good fit. I’m not aware of any abstraction layers on top of it, though I guess you could run, for example, Pentaho Kettle jobs within lambda. There’s also an issue around cold start times–when your code hasn’t been invoked for a while, it can take longer to start up when a request or event occurs. Apparently there are partial solutions, but your lambdas still get cycled every few hours regardless.

I worked through some of the tutorials and was impressed at just how easy it was to get started. If I had a simple API or data processing pipeline to build, Serverless would definitely be on my short list of possible implementation options. It is very inexpensive, scales easily and encourages encapsulation.

Incidentally, my acquaintance’s company is hosting a lunch and learn on this technology at the end of the month. More details here.

“The future is already here, but it’s only available as a managed AWS service”

This entire post about how Kubernetes could become the distributed operating system of choice is worth reading.  But one statement really struck me:

Well, as they say, the future is already here, but it’s only available as an AWS managed service.

The “they” in this is apparently not William Gibson, as I thought.  More details here.

For the past couple of years the cloud providers have matured and moved from offering infrastructure as a service (disk, compute) to platform as a service offerings (SQS, which is a managed message queue like ActiveMQ, or Kinesis, a managed data ingestion system like Kafka, etc).  Whenever you think about installing a proprietary or open source package, you should include the cloud provider offerings in your evaluation matrix.  Of course, the features you need may not be there, or the cost may be prohibitive, but including them in an evaluation makes sense because of the speed of deployment and the scaling available.

If you think a system architecture can benefit from a message queuing system, do you want to spend time setting up and maintaining such a system, or do you want to spin up an SQS queue in a few minutes?

And the cost may not be prohibitive, depending on the skillset of your internal team and your team’s desire to run such plumbing services.  It can be really hard to estimate the running costs of infrastructure services, though you can approximate them by looking at similar services internal teams already run and how much they cost.  The nice thing about cloud services is that the costs are very transparent.  The kinesis data streams pricing example walks through a scenario and concludes:

For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.

Another AWS instructor made the point that AWS and other cloud services invert the running costs of IT infrastructure.  In a typical enterprise, the running costs of your data center and infrastructure are like an iceberg–10% is explicit (server costs, electricity, etc) and 90% is implicit (payroll, time spent upgrading and integrating systems).  In the cloud world those numbers are reversed and far more of your infrastructure cost is explicitly laid out for you and your organization.

Truly the future.