Let AWS RDS handle database scutwork

Amazon RDS is a service I’ve mentioned in the past, and it’s fantastic. You can outsource large chunks of database administration to AWS. Tasks you can forget about include backups, failover, read only replicas, and OS and DB upgrades.

This is a great fit for spinning up databases for anything from prototypes to large scale production systems.

Things to keep in mind if you start using RDS:

  • The database is launched into a VPC and will have a security group around it. You’ll need to allow IP addresses or security groups access to the port the database is listening on, or your connections will time out.
  • The database RDS creates is a normal database that you can manage like you can any other database you have set up and installed, but there are certain limitations (for example, no MySQL UDFs). Read the documentation and understand the limitations, but be aware they are constantly changing. I suggest subscribing to the AWS Database blog RDS category for updates.
  • RDS uses EBS under the covers and has the performance constraints of that technology. For the largest scale production systems you’ll want to test before jumping in whole hog.
  • If you are using MySQL or PostgreSQL and are running into concurrency problems, Aurora may be worth evaluating.
  • If you want to keep backups past thirty-five days for peace of mind or compliance concerns, you’ll need manual snapshots.
  • RDS only supports certain RDBMS and limits databases to certain sizes. If you want to run anything else on AWS, you will need to self manage your DB on EC2 or look at other data management solutions. Here are some other gotchas.
  • When using RDS you aren’t freed from all database administration tasks. There are still users to manage, indices to add, and queries to tune; most of your RDBMS skillset is still applicable. You’ll also need to decide when to schedule DB and OS upgrades and backups, and how to size your instances. You still need to design the overall architecture of an RDS system, including standbys and read only replicas, and do other configuration at both the network and database level.
  • You can manage RDS system attributes via CloudFormation, Terraform and the CLI in the same way you manage other AWS infrastructure (see the sketch after this list). That said, an RDS instance is stateful, so you can’t treat it entirely as “cattle”.
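
Under the hood these tools all drive the same RDS API, so a minimal boto3 sketch of creating an instance from code looks like this. The identifier, credentials, and security group id below are placeholders, not values from a real system.

    # Minimal sketch: create an RDS MySQL instance with boto3, the same
    # operation the CLI's "aws rds create-db-instance" performs.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.create_db_instance(
        DBInstanceIdentifier="example-mysql",          # placeholder name
        Engine="mysql",
        DBInstanceClass="db.t2.micro",
        AllocatedStorage=20,                           # in GB
        MasterUsername="admin",
        MasterUserPassword="change-me-please",         # don't hardcode real credentials
        VpcSecurityGroupIds=["sg-0123456789abcdef0"],  # must allow your clients in
    )

    # Wait for the instance to come up, then print the endpoint to connect to.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="example-mysql")
    db = rds.describe_db_instances(DBInstanceIdentifier="example-mysql")["DBInstances"][0]
    print(db["Endpoint"]["Address"], db["Endpoint"]["Port"])

CloudFormation and Terraform layer declarative state management on top of calls like these, which is usually what you want for anything long lived.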

You can learn more about RDS in the extensive documentation.


AWS Quick Starts

If you are looking to stand up an application quickly, I often recommend the AWS marketplace. This service has thousands of vendor maintained solutions and is a great way to get going quickly. Note that some of the solutions have extra per hour charges, and if that is the case per second billing won’t apply. These solutions are focused on individual AWS EC2 instance images (so you can quickly stand up a phpbb instance or a redmine server, for example).

However, another good option is AWS Quick Starts. These are recipes for deployments, possibly of multiple virtual machines, and are aimed at handling larger business problems. There are over 80 listed on the Quick Start page right now, ranging from creating a data lake to a HIPAA reference architecture to running devops tools like Consul and Bitbucket. These solutions may or may not carry additional charges, so make sure to review licensing and billing information as well as functionality.

If you are thinking about setting up a complex system in AWS, it’s worth some time to see if someone has put a reference Quick Start together. It may not fit your needs perfectly, but can be a good place to begin.


AWS documentation now open source and on Github

This was announced recently. The AWS docs are now available on Github for everyone to review and improve. I love documentation (have for years). I think it’s great that AWS is now allowing PRs against their documentation. Some products have not yet uploaded their docs, ahem. It can only improve the speed of change.

I think it will also give a good glimpse into usage stats of AWS services. If a service doesn’t have any PRs or issues opened, it’s unlikely to be widely used (or, alternatively, it could be totally stable, or have users that don’t use Github). It’d be a fun project to pull the number of contributions to these repos via the Github API and publish that data.
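
A rough sketch of that project might look like the following; the repo name is just one example of the docs repos under the awsdocs organization, and this ignores API pagination and rate limits.

    # Sum contribution counts for one AWS docs repo via the public Github API.
    import requests

    repo = "awsdocs/amazon-ec2-user-guide"  # illustrative; substitute any docs repo
    url = "https://api.github.com/repos/{}/contributors".format(repo)

    contributors = requests.get(url, params={"per_page": 100}).json()
    total = sum(c["contributions"] for c in contributors)
    print("{}: {} contributors, {} contributions".format(repo, len(contributors), total))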

I still feel that guides like og-aws have a place in the world of AWS documentation–opinions and real world stories fit better there than they do in official AWS documentation. And this is still too new to know if PRs and fixes will be pulled into the docs in a timely manner. But it’s great to see the AWS teams experimenting with ways to improve their documentation at scale.


Software infrastructure configuration options

I ran across this great article when I was reading up on Terraform.

It does a good job of running through the options (Puppet, CloudFormation, etc.) for setting up your infrastructure via software. Here’s a great quote on why they chose Terraform:

On the other hand, with the kind of declarative approach used in Terraform, the code always represents the latest state of your infrastructure. At a glance, you can tell what’s currently deployed and how it’s configured, without having to worry about history or timing. This also makes it easy to create reusable code, as you don’t have to manually account for the current state of the world. Instead, you just focus on describing your desired state, and Terraform figures out how to get from one state to the other automatically.


Serverless Framework

I had coffee with an acquaintance who is doing a lot of event driven data processing. Whereas ten years ago you might have tackled this problem with an ETL tool like Pentaho or Talend, now his process runs entirely on AWS Lambda functions. He is leveraging the Serverless framework to manage and deploy these applications. As I understand it there is a thin shim layer between the business logic and the lambda event handler, and the business logic is isolated and knows nothing about its environment. That makes the business logic very testable.
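
I haven’t seen his code, but the pattern he described might look something like this minimal sketch (the function names and the event shape are my own invention):

    import json

    def compute_totals(record):
        """Pure business logic: knows nothing about Lambda or AWS events."""
        return {"id": record["id"], "total": sum(record["amounts"])}

    def handler(event, context):
        """Thin shim: unpack the Lambda event and delegate to the business logic."""
        # Assumes an event whose Records carry a JSON body; the exact shape
        # depends on the event source (S3, SNS, Kinesis, etc.).
        records = [json.loads(r["body"]) for r in event.get("Records", [])]
        return [compute_totals(r) for r in records]

    # compute_totals is trivially unit testable with no AWS setup at all:
    # assert compute_totals({"id": 1, "amounts": [2, 3]})["total"] == 5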

His description of the Serverless framework intrigued me. As he described it, the framework is driven by a simple yaml file and takes care of, among other tasks, the complicated infrastructure setup required to tie Lambda functions to a variety of AWS events. I haven’t done it myself, but I’ve heard that wiring a lambda up to API Gateway by hand is a real bear. Doing so allows a lambda function to respond to web requests without any AWS authentication, and is a key use case.

You can write and deploy lambda functions in any language that AWS Lambda supports (unfortunately, not java 9 at the moment). Here’s a java/maven/serverless tutorial. The framework also supports multiple cloud providers, though I haven’t done much beyond noting that the documentation exists.

However, using Serverless does require writing code. If you are evaluating a complicated ETL process that non developers need to be able to understand and support, Serverless would not be a good fit. I’m not aware of any abstraction layers on top of it, though I suppose you could run, for example, Pentaho Kettle jobs within lambda. There’s also an issue around cold start times–when your code hasn’t been invoked for a while, it can take longer to start up when a request or event occurs. Apparently there are partial solutions, but your lambdas still get cycled every few hours regardless.

I worked through some of the tutorials and was impressed at just how easy it was to get started. If I had a simple API or data processing pipeline to build, Serverless would definitely be on my short list of possible implementation options. It is very inexpensive, scales easily and encourages encapsulation.

Incidentally, my acquaintance’s company is hosting a lunch and learn on this technology at the end of the month. More details here.


Levels of abstraction within AWS ML offerings

I went to an interesting talk last night at the AWS Boulder Denver Meetup, where the presenter, Brett Mitchell, had built a sophisticated application using serverless and AWS machine learning technologies to make an animatronic parrot respond to images and speak phrases.  From what he said, the most difficult task was getting the hardware installed into the animatronic parrot–the software side of things was prototyped in a night.

But he also said something that I wanted to capture and expand on.  AWS is all about providing best of breed building blocks that you can pick and choose from to build your application.  Brett mentioned that there are four levels of AWS machine learning abstractions upon which you can build your application.  Here’s my take on what they are and when you should choose them.

  1. Low level compute, like EC2 (maybe even the bare metal offering, if performance is crucial).  At this level, AWS is providing elasticity in terms of compute, but you are entirely in control of the algorithms you implement, the language you use, and how to maintain availability.  With great control comes great responsibility.
  2. Library abstractions like MXNet, Spark ML, and Tensorflow.  You can use a prepackaged AMI, SageMaker, or Amazon EMR.  In each case you are leveraging existing libraries and building on top of them.  You still own most of the model management and service availability if you are using the AMI, but the other alternatives remove much of that management hassle.
  3. Managed general purpose supervised learning, which is Amazon Machine Learning (hey, I recorded a course about that!).  In this case, you don’t have access to the model or the algorithms–it’s a black box.  This is aimed at non machine learning folks who have a general problem: “I have structured data I need to make predictions against”.
  4. Focused product offerings like image recognition and speech recognition, among others.  These services may require you to provide some training data.  Here you do nothing except leverage the service (see the sketch after this list).  You have to work within the limits of what the service provides, including performance and tunability.  The tradeoff is that you get scale and high availability for a great price (in both dollars and effort).
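
As an illustration of the fourth level, here’s a hedged sketch of calling Amazon Rekognition to label an image sitting in S3 (the bucket and key are placeholders). There is no model to build, train, or host; you just call the service.

    import boto3

    rekognition = boto3.client("rekognition", region_name="us-east-1")

    # Ask Rekognition what it sees in an image already stored in S3.
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": "example-bucket", "Name": "parrot.jpg"}},
        MaxLabels=5,
        MinConfidence=80,
    )
    for label in response["Labels"]:
        print(label["Name"], round(label["Confidence"], 1))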

These options are all outlined nicely here.


Publish java artifacts to s3 using maven

If you don’t want to run a maven repository server like Nexus, you can use AWS S3 as the maven repository for your java artifacts.  You can make the repository public in the same way as you’d make a website public.  However, it’s more likely that you’ll want to make it private and authenticate using AWS IAM.  Here are some step by step instructions that appear useful.

(Note there’s another S3 maven wagon, but it does not appear to support authentication.)

Why do this?  It’s one more piece of infrastructure that you don’t have to maintain, update, and run.



“The future is already here, but it’s only available as a managed AWS service”

This entire post about how Kubernetes could become the distributed operating system of choice is worth reading.  But one statement really struck me:

Well, as they say, the future is already here, but it’s only available as an AWS managed service.

The “they” in this is apparently not William Gibson, as I thought.  More details here.

Over the past couple of years the cloud providers have matured and moved from offering infrastructure as a service (disk, compute) to platform as a service offerings (SQS, a managed message queue akin to ActiveMQ; Kinesis, a managed data ingestion system akin to Kafka; and so on).  Whenever you think about installing a proprietary or open source package, you should include the cloud provider offerings in your evaluation matrix.  Of course, the features you need may not be there, or the cost may be prohibitive, but including them in an evaluation makes sense because of the speed of deployment and the scaling available.

If you think a system architecture can benefit from a message queuing system, do you want to spend time setting up and maintaining such a system, or do you want to spin up an SQS queue in a few minutes?
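
For comparison, here is roughly what the SQS path looks like with boto3 (the queue name is a placeholder); there is no broker to install, patch, or monitor.

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")

    # Create a queue, send a message, then read and delete it.
    queue_url = sqs.create_queue(QueueName="example-work-queue")["QueueUrl"]
    sqs.send_message(QueueUrl=queue_url, MessageBody="hello from the producer")

    response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=10)
    for message in response.get("Messages", []):
        print(message["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])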

And the cost may not be prohibitive, depending on the skillset of your internal team and your team’s desire to run such plumbing services.  It can be really hard to estimate the running costs of infrastructure services, though you can approximate them by looking at similar services your internal teams already run and how much money those take.  The nice thing about cloud services is that the costs are very transparent.  The kinesis data streams pricing example walks through a scenario and concludes:

For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.
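
(The arithmetic works out: 4 MB per second times 86,400 seconds in a day is 345,600 MB, or roughly 337 GB per day when counted in 1,024 MB gigabytes.)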

Another AWS instructor made the point that AWS and other cloud services invert the running costs of IT infrastructure.  In a typical enterprise, the running costs of your data center and infrastructure are like an iceberg–10% is explicit (server costs, electricity, etc) and 90% is implicit (payroll, time spent upgrading and integrating systems).  In the cloud world those numbers are reversed and far more of your infrastructure cost is explicitly laid out for you and your organization.

Truly the future.


Using a lambda function to make AML real time predictions

When you are making real time AML predictions against an endpoint, you can run the prediction code (sample code) locally.  However, leveraging AWS Lambda can let you build a system that accesses predictions without any servers.  This system will likely be cheaper and scale better than running on your own servers, and you can also trigger predictions on a wide variety of events without writing any polling code.

Here’s a scenario.  You have a model that predicts income level based on user data, which you are going to use for marketing purposes.  The user data is placed on S3 by a different process at varying intervals.  You want to process each record as it comes in and generate a prediction.  (If you didn’t care about near real time processing, you could run a periodic batch AML job.  That job could also be triggered by lambda.)

So, the plan is to set up a lambda function to monitor the S3 location and whenever a new object is added, run the prediction.

What a record looks like will obviously depend on what your model expects.  For one of the models I built for my AML video course, the record looks like this (data format and details):

22, Local-gov, 108435, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 80, United-States

You will need to enable a real time endpoint for your model.
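
You can do that in the AML console, or from code; here’s a sketch using boto3 (the model id is a placeholder).

    import boto3

    ml = boto3.client("machinelearning", region_name="us-east-1")

    # Turn on the real time endpoint for the model.
    ml.create_realtime_endpoint(MLModelId="ml-EXAMPLEMODELID")

    # The endpoint takes a few minutes to become READY; note the EndpointUrl,
    # since the prediction code needs it.
    info = ml.get_ml_model(MLModelId="ml-EXAMPLEMODELID")["EndpointInfo"]
    print(info["EndpointStatus"], info.get("EndpointUrl"))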

You also need to create IAM policies which allow access to cloudwatch logs, readonly access to s3, and access to the AML model, and associate all of these with an IAM role which your lambda function can assume.  Here are the policies I have associated with my lambda function (the two describe policies can be found in my github repo).

You then create a lambda function which will be triggered when a new file is added to S3.  Here’s a SAM file which defines the trigger (you’ll have to update the reference to the role you created and your bucket name and path).  More about SAM.

Then, you need to write the lambda function which will pull the file content from S3 and run a prediction.  Here’s that code.  It’s similar to prediction code that you might run locally, except for how it gets the value string.  We read the values string from the S3 object on line 31.
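
That linked code is the authoritative version; as a compressed sketch of the same idea, the handler looks roughly like this.  The field names, model id, and endpoint URL here are placeholders, not the real values.

    import boto3

    # Attribute names matching the record format above; your schema may differ.
    FIELDS = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week", "native-country"]

    s3 = boto3.client("s3")
    ml = boto3.client("machinelearning")

    def handler(event, context):
        # The S3 put event identifies the bucket and key that triggered us.
        s3_info = event["Records"][0]["s3"]
        bucket, key = s3_info["bucket"]["name"], s3_info["object"]["key"]

        # Read the values string from the new object and build the AML record.
        line = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8").strip()
        record = dict(zip(FIELDS, (value.strip() for value in line.split(","))))

        response = ml.predict(
            MLModelId="ml-EXAMPLEMODELID",   # placeholder model id
            Record=record,
            PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
        )
        print(response["Prediction"])        # prototype only; a real system would act on this
        return response["Prediction"]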

This code is prototype quality only.  The sample code accesses the prediction and then writes to stdout.  That is fine for sample code, but in a real world scenario you’d obviously want to take further actions.  Perhaps have the lambda function update a database row, add another file to S3 or call an API. You’d also want to have some error handling in case the data in the S3 file wasn’t in the format the model expected.  You’d also want to lock down the S3 access allowed (the policy above allows readonly access to all S3 resources, which is not a good idea for production code).

Finally, S3 is just one possible source of input; others like SNS or Kinesis might be a better fit.  You could also tie into the AWS API Gateway and leverage the features of that service, including authentication, throttling and billing.  In that case, you’d want the lambda function to return the prediction value to the end user.

Using AWS Lambda to access your real time prediction AML endpoint allows you to make predictions against single records in near real time without running any infrastructure of your own.


Online teaching tips for synchronous classrooms

I’ve been teaching AWS courses for the past year or so.  Many have been taught in an online environment.  This opens up the class to more people–there’s less cost in taking a course from your office or living room than in flying an instructor out for an on-site.  However, this learning environment does have challenges.  Below is my set of best practices, some suggested by other instructors.

Pre class:

  • Set up your physical environment.  This includes making sure you have a fast internet connection, a room with no noise, and that your computer and audio equipment are set up.
  • Set up the virtual room.  Load the materials, set up any links or other notes.  I like to run virtual courses entirely with chat (audio conferences are really hard with 20 people) so I make a note about that.
  • Test your sound.  This includes having a friend log in and listen to you beforehand.  This run-through can help make sure your voice (which is your primary engagement tool) is accessible to your students.
  • Email a welcome message to all the students, 2-3 days before class starts.  Include when the class is happening, how to get materials, etc.  I’ve definitely had interactions with students from these emails that led to a better outcome for everyone.

During class:

  • Calculate your latency.  Ask an initial question that requires a response as soon as the question is asked.  Something easy like “where are you from?” or “how many years of AWS experience do you have?”  Note the latency and add it into the time you wait before asking for questions.
  • Ask for questions.   How often can vary based on previous AWS experience, but every 5-10 slides is a good place to start.
  • Answer questions honestly.  If you don’t know, say so.  But then say, “I’ll find out for you.”  (And then, of course, find out.)
  • Allow time for students to read the slide.  At least 15 seconds for each slide.
  • You, however, should not read the slide.
  • Draw or use the pointer tool to help engage the students and pull them into the material.
  • Find out what students want out of the class.  Try to angle some of the content toward those desires.  You may be constrained by knowledge or time or presentation material, but you can at least try.
  • Engage your students.  I like to make corny jokes (“have you heard the one about the two hard problems in computer science?”), refer back to technologies they mention having familiarity with, and talk about internet kitten pictures.
  • Remember your voice and energy are the only things keeping these students engaged.

After class:

  • Follow up on any loose ends.  This could be questions you didn’t get answered or more mundane items like how they can get a certificate of completion.  I had one student who couldn’t get access to the materials and it took a few weeks of bugging customer service reps across organizations before he did.  Not a lot of time on my end, but a big deal for him.

Note that I didn’t cover the content or particular technology at all.  They aren’t really relevant.




© Moore Consulting, 2003-2017