Levels of abstraction within AWS ML offerings

I went to an interesting talk last night at the AWS Boulder Denver Meetup where Brett Mitchell, the presenter, had built a sophisticated application using serverless and AWS machine learning technologies to make an animatronic parrot respond to images and speak phrases.  From what he said, the most difficult task was getting the hardware installed into the animatronic parrot–the speaker said the software side of things was prototyped in a night.

But he also said something that I wanted to capture and expand on.  AWS is all about providing best of breed building blocks that you can pick and choose from to build your application.  Brett mentioned that there are four levels of AWS machine learning abstractions upon which you can build your application.  Here’s my take on what they are and when you should choose them.

  1. Low level compute, like EC2 (maybe even the bare metal offering, if performance is crucial).  At this level, AWS is providing elasticity in terms of compute, but you are entirely in control of the algorithms you implement, the language you use, and how to maintain availability.  With great control comes great responsibility.
  2. Library abstractions like MXNet, Spark ML, and TensorFlow.  You can use a prepackaged AMI, SageMaker, or Amazon EMR.  In each case you are leveraging existing libraries and building on top of them.  You still own most of the model management and service availability if you are using the AMI, but the other alternatives remove much of that management hassle.
  3. Managed general purpose supervised learning, which is Amazon Machine Learning (hey, I recorded a course about that!).  In this case, you don’t have access to the model or the algorithms–it’s a black box.  This is aimed at non machine learning folks who have a general problem: “I have structured data I need to make predictions against”.
  4. Focused product offerings like image recognition and speech recognition, among others.  These services may require you to provide some training data.  Here you do nothing except leverage the service.  You have to work within the limits of what the service provides, including performance and tuneability.  The tradeoff is you get scale and high availability for a great price (in both dollars and effort).

These options are all outlined nicely here.


Using a lambda function to make AML real time predictions

When you are making real time AML predictions against an endpoint, you can run the prediction code (sample code) locally.  However, leveraging AWS Lambda can let you build a system that accesses predictions without any servers.  This system will likely be cheaper and scale better than running on your own servers, and you can also trigger predictions on a wide variety of events without writing any polling code.

Here’s a scenario.  You have a model that predicts income level based on user data, which you are going to use for marketing purposes.  The user data is placed on S3 by a different process at varying intervals. You want to process each record as it comes in and generate a prediction.  (If you didn’t care about near real time processing, you could run a periodic batch AML job.  This job could also be triggered by lambda.)

So, the plan is to set up a lambda function to monitor the S3 location and whenever a new object is added, run the prediction.

A record will obviously depend on what your model expects.  For one of the models I built for my AML video course, the record looks like this (data format and details):

22, Local-gov, 108435, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 80, United-States
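The real time prediction API takes each observation as a map of attribute name to string value rather than as a raw CSV line. Here is a minimal sketch of that conversion. The attribute names are assumptions based on the UCI Adult census layout this record resembles, and they need to match whatever names your own datasource schema uses.

```python
import csv

# Assumed attribute names (UCI Adult census layout); replace them with the
# names from your own datasource schema.
ATTRIBUTE_NAMES = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
]


def line_to_record(line, names=ATTRIBUTE_NAMES):
    """Turn one CSV line into a {name: string value} map."""
    values = next(csv.reader([line], skipinitialspace=True))
    values = [value.strip() for value in values]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, values))


sample = ("22, Local-gov, 108435, Masters, 14, Married-civ-spouse, "
          "Prof-specialty, Husband, White, Male, 0, 0, 80, United-States")
record = line_to_record(sample)
print(record["age"], record["education"])
```

Every value stays a string, including the numeric ones, because that is the form the prediction request expects.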

You will need to enable a real time endpoint for your model.
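The endpoint can be enabled from the console, or through the API. The sketch below assumes `ml_client` is a `boto3.client("machinelearning")` and uses its `get_ml_model` and `create_realtime_endpoint` calls. The helper name is hypothetical, and the response field names reflect my reading of the boto3 documentation, so check them against your SDK version.

```python
def ensure_realtime_endpoint(ml_client, model_id):
    """Return (url, status), creating the endpoint if the model has none."""
    model = ml_client.get_ml_model(MLModelId=model_id)
    info = model.get("EndpointInfo") or {}
    if info.get("EndpointStatus", "NONE") == "NONE":
        created = ml_client.create_realtime_endpoint(MLModelId=model_id)
        info = created["RealtimeEndpointInfo"]
    return info.get("EndpointUrl"), info.get("EndpointStatus")


# Example call (needs AWS credentials and a real model id):
#   import boto3
#   url, status = ensure_realtime_endpoint(
#       boto3.client("machinelearning"), "ml-EXAMPLE")
```

A freshly created endpoint takes a little while to become usable, so expect a status of UPDATING before it reports READY.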

You also need to create IAM policies which allow access to cloudwatch logs, readonly access to s3, and access to the AML model, and associate all of these with an IAM role which your lambda function can assume.  Here are the policies I have associated with my lambda function (the two describe policies can be found in my github repo):

You then create a lambda function which will be triggered when a new file is added to S3.  Here’s a SAM (Serverless Application Model) file which defines the trigger (you’ll have to update the reference to the role you created and your bucket name and path).  More about SAM.

Then, you need to write the lambda function which will pull the file content from S3 and run a prediction. Here’s that code.  It’s similar to prediction code that you might run locally, except for how it gets the value string.  We read the values string from the S3 object on line 31.
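For readers without the linked code to hand, here is a rough sketch of the same shape of function. It is not the code from the repo. It assumes the S3 object holds a single CSV line, and the three environment variable names (`ML_MODEL_ID`, `ML_ENDPOINT_URL`, `ML_ATTRIBUTE_NAMES`) are hypothetical. The clients are passed in as arguments so the logic can be exercised without AWS.

```python
import os
import urllib.parse


def extract_s3_location(event):
    """Return (bucket, key) from the first record of an S3 event."""
    s3_info = event["Records"][0]["s3"]
    # Keys arrive URL-encoded in S3 notifications (spaces become "+").
    key = urllib.parse.unquote_plus(s3_info["object"]["key"])
    return s3_info["bucket"]["name"], key


def predict_from_s3(event, s3_client, ml_client, model_id, endpoint_url, names):
    """Read a one-line CSV object from S3 and request a prediction for it."""
    bucket, key = extract_s3_location(event)
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    values = [value.strip() for value in body.decode("utf-8").split(",")]
    record = dict(zip(names, values))  # values stay strings
    response = ml_client.predict(
        MLModelId=model_id, Record=record, PredictEndpoint=endpoint_url
    )
    return response["Prediction"]


def lambda_handler(event, context):
    """Lambda entry point; the environment variable names are hypothetical."""
    import boto3  # imported here so the helpers above can be tested offline

    prediction = predict_from_s3(
        event,
        boto3.client("s3"),
        boto3.client("machinelearning"),
        os.environ["ML_MODEL_ID"],
        os.environ["ML_ENDPOINT_URL"],
        os.environ["ML_ATTRIBUTE_NAMES"].split(","),
    )
    print(prediction)  # ends up in CloudWatch Logs
    return prediction
```

Decoding the object key matters in practice: a file uploaded as `new record.csv` shows up in the event as `new+record.csv`.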

This code is prototype quality only.  The sample code accesses the prediction and then writes to stdout.  That is fine for sample code, but in a real world scenario you’d obviously want to take further actions.  Perhaps have the lambda function update a database row, add another file to S3 or call an API. You’d also want to have some error handling in case the data in the S3 file wasn’t in the format the model expected.  You’d also want to lock down the S3 access allowed (the policy above allows readonly access to all S3 resources, which is not a good idea for production code).
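As one way to approach the error handling, a cheap format check can run before the prediction call. This is only a sketch: the field count matches the sample record above, but which positions hold numbers is my assumption, and a real check would be driven by your datasource schema.

```python
EXPECTED_FIELDS = 14
NUMERIC_COLUMNS = (0, 2, 4, 10, 11, 12)  # assumed positions of numeric fields


def find_problems(line):
    """Return a list of problems; an empty list means the line looks usable."""
    values = [value.strip() for value in line.split(",")]
    if len(values) != EXPECTED_FIELDS:
        return [f"expected {EXPECTED_FIELDS} fields, got {len(values)}"]
    problems = []
    for index in NUMERIC_COLUMNS:
        try:
            float(values[index])
        except ValueError:
            problems.append(f"field {index} is not numeric: {values[index]!r}")
    return problems


good = ("22, Local-gov, 108435, Masters, 14, Married-civ-spouse, "
        "Prof-specialty, Husband, White, Male, 0, 0, 80, United-States")
print(find_problems(good))
print(find_problems("22, Local-gov"))
```

Returning a list of problems rather than raising on the first one makes it easier to log everything wrong with a file in one pass.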

Finally, S3 is one possible source of input, but others like SNS or Kinesis might be a better fit.  You could also tie into the AWS API Gateway and leverage the features of that service, including authentication, throttling and billing.  In that case, you’d want the lambda function to return the prediction value to the end user.
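For the API Gateway variant, the function has to hand the prediction back in the response instead of printing it. A minimal sketch, assuming Lambda proxy integration (which passes the request body as a string and expects a `statusCode`/`body` dictionary back) and a JSON request body that maps attribute names to values. The function names are hypothetical.

```python
import json


def to_api_response(payload, status=200):
    """Wrap a payload in the shape Lambda proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handle_api_request(event, ml_client, model_id, endpoint_url):
    """Predict from a JSON request body such as {"age": 22, ...}."""
    try:
        payload = json.loads(event.get("body") or "")
    except ValueError:
        payload = None
    if not isinstance(payload, dict):
        return to_api_response({"error": "body must be a JSON object"}, 400)
    # The prediction API wants every value as a string.
    record = {name: str(value) for name, value in payload.items()}
    response = ml_client.predict(
        MLModelId=model_id, Record=record, PredictEndpoint=endpoint_url
    )
    return to_api_response(response["Prediction"])
```

As before, `ml_client` would be a `boto3.client("machinelearning")` in a deployed function.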

Using AWS Lambda to access your real time prediction AML endpoint allows you to make predictions against single records in near real time without running any infrastructure of your own.


Re:Invent Videos

AWS Re:Invent is supposed to be a great conference.  I have thus far been unable to attend, but the videos of the presentations are posted online with about a day’s lag.  So, like most conferences, you really should be networking and meeting people face to face rather than attending the presentations.

Here’s the AWS Youtube channel where you can watch all the videos, or just sample them.

I’ve found the talks to be of varying quality.  Some just rehash the docs, but others, especially the deep dives, discuss interesting aspects of the AWS infrastructure that I haven’t found to be documented anywhere (here’s a great talk about Elastic Block Storage from 2016).  The talks by real customers also give a great viewpoint into how AWS’s offerings are actually implemented to provide business value (here’s a great talk from 2016 about using Amazon Machine Learning to predict real estate transactions).

It’s a sprawling conference, well suited to AWS’s sprawling offering, and I bet no matter what your interest, you will be able to find a video worth watching.


The three stages where you can transform data for Amazon Machine Learning

When creating an AML system, there are three places where you can transform your data. Data transformation and representation are very important for an effective AML system.  I’d suggest watching about five minutes of this re:Invent video (from 29:14 on) to see how they leveraged Redshift to transform purchase data from a format that AML had a hard time “understanding” to one that was “easier” for the system to grok.

The first time to transform your data is before the data ever gets to an AML datasource like s3 or redshift.  You can preprocess the data with whatever technology you want (Redshift/SQL, as above, EMR, bash, python, etc).  Some sample transformations might be:

At this step you have tremendous flexibility, but it requires staging your data.  That may be an issue depending on how much data you have, and may affect which technology you use to do the preprocessing.
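As an illustration of this first stage, here is a small Python pass over a CSV file before it is staged. The specific clean-ups (trimming whitespace, lowercasing values, and blanking out a `?` placeholder for missing data) are my own examples rather than a recommended set, and the `?` convention is an assumption about your data.

```python
import csv
import io


def clean_row(row):
    """Trim whitespace, lowercase values, and blank out '?' placeholders."""
    cleaned = []
    for value in row:
        value = value.strip()
        if value == "?":
            value = ""  # assumed marker for a missing value
        cleaned.append(value.lower())
    return cleaned


def preprocess(source, destination):
    """Copy CSV rows from source to destination, cleaning each one."""
    writer = csv.writer(destination, lineterminator="\n")
    for row in csv.reader(source):
        if row:  # skip blank lines
            writer.writerow(clean_row(row))


source = io.StringIO("22, Local-gov, ?, Masters\n")
destination = io.StringIO()
preprocess(source, destination)
print(destination.getvalue(), end="")
```

With real data, `source` and `destination` would be open files, and the cleaned file is what you would upload to S3.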

The next place you can modify the data is at datasource creation.  You can omit features (but only using the API by providing your own schema with an ‘excludedAttributeNames’ value, not the AWS console), which could speed up processing and lower the total model size.  It could also protect sensitive data.  You do want to provide AML with as much data as you can, however.
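Here is a sketch of what such a schema might look like when built in Python. The attribute names, types, and the choice of `fnlwgt` as the excluded feature are hypothetical. The resulting JSON string is what you would pass as `DataSchema` inside the `DataSpec` argument of boto3's `create_data_source_from_s3`.

```python
import json


def build_schema(attributes, target, excluded=()):
    """Build an AML data schema; attributes is a list of (name, type) pairs."""
    names = {name for name, _ in attributes}
    unknown = [name for name in excluded if name not in names]
    if unknown:
        raise ValueError(f"cannot exclude unknown attributes: {unknown}")
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": False,
        "attributes": [
            {"attributeName": name, "attributeType": kind}
            for name, kind in attributes
        ],
        "excludedAttributeNames": list(excluded),
    }


schema = build_schema(
    [("age", "NUMERIC"), ("fnlwgt", "NUMERIC"),
     ("occupation", "CATEGORICAL"), ("income", "BINARY")],
    target="income",
    excluded=["fnlwgt"],
)
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

An excluded attribute still has to be listed under `attributes`, since the schema describes every column in the file. The exclusion only tells AML not to use it.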

As long as a feature is valid in both types, you can create multiple data sources with different data types for a feature.  The only type of feature that I know of that is valid in multiple AML datatypes is an integer number, which, as long as it only has N values (like human age), could be represented as either a numeric value or a categorical value.

The final place you can modify your data before the model sees it is in the ML recipe. You have about ten or so functions that AML provides that you can apply to your data as it is read from the data source and fed to the model.  You can also create intermediate representations and make them available to your model (lowercase a string feature, for example).

Using a recipe allows you to modify your data before the model sees it, but requires no staging on your part of the source or destination data.  However, the number of transformations is relatively limited.
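A recipe is a JSON document with `groups`, `assignments`, and `outputs` sections. The sketch below builds one in Python. The attribute names (`age`, `comment`) are hypothetical, while `quantile_bin`, `lowercase`, and the `ALL_CATEGORICAL` group are built-ins described in the AML recipe documentation. The JSON string would go in the `Recipe` argument of boto3's `create_ml_model`.

```python
import json

recipe = {
    # Named groups of input attributes; none are needed in this sketch.
    "groups": {},
    # Intermediate variables derived from the inputs.
    "assignments": {
        "binned_age": "quantile_bin(age, 30)",
        "comment_lower": "lowercase(comment)",
    },
    # What the model actually sees.
    "outputs": ["ALL_CATEGORICAL", "binned_age", "comment_lower"],
}


def recipe_to_json(recipe):
    """Serialize a recipe; this sketch insists on all three sections."""
    missing = {"groups", "assignments", "outputs"} - set(recipe)
    if missing:
        raise ValueError(f"recipe is missing sections: {sorted(missing)}")
    return json.dumps(recipe, indent=2)


recipe_json = recipe_to_json(recipe)
print(recipe_json)
```

Anything not named in `outputs` is dropped, so a recipe is also another way to keep a feature away from the model.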

You can of course combine all three of these methods when building AML models, to give you maximum flexibility.  As always, it’s best to try different options and test the results.


Announcing the Introduction to Amazon Machine Learning Video Course

Would you like an easy introduction to machine learning?  Without downloading any open source software, reading documentation and blog posts, and/or installing and configuring the system?

Amazon Machine Learning is a great way to explore machine learning without having to run any infrastructure.  It lets you build high performance cost effective systems to predict outcomes based on past data.

If you’re interested in learning more about Amazon Machine Learning, you can view my video course on O’Reilly Safari.  Over an hour and a half of video talking about all aspects of Amazon Machine Learning.

This course shows you how to build a model using Amazon Machine Learning (Amazon ML) and use it to make predictions. AWS expert Dan Moore covers the basic types of machine learning, how to prepare your data, and how to make your data available to the Amazon Machine Learning processes. You’ll also learn about evaluating a model for accuracy, using it both for batch and real-time predictions, and using tags to manage environments. Designed for developers and technical marketers new to machine learning and for data scientists interested in using the AWS Amazon ML platform, the course provides hands-on experience building a working predictive model using real data. Learners should obtain an AWS account (free from Amazon) and a basic understanding of AWS concepts before beginning the course.


AWS machine learning talk

I enjoyed giving my “Intro to Amazon Machine Learning” talk at the AWS Denver Boulder meetup.   (Shout out to an old friend and colleague who came out to see it.) I didn’t get through the whole pipeline demonstration (I didn’t get a chance to do the batch prediction), but the demo gods were kind and the demo went well.

We also had a good discussion.  A few folks present had used machine learning before, so we talked about where AML made sense (hint, it’s not a fit for every problem).  Also had some good questions about AML, about performance and pricing.  One of the members shared a re:Invent anecdote: the AML team looked at all the machine learning used in Amazon and graphed the use cases and solved for the most common ones.

As usual, I also learned something. OpenRefine is a tool to help you prepare data for machine learning.  And when you change the score cut-off, you need to restart your real-time endpoint.

The “Intro to Amazon Machine Learning” slides are up on SlideShare, and big thanks to the Meetup organizers.



Amazon Machine Learning: An Introduction

From my book, Amazon Machine Learning: An Introduction:

Amazon Machine Learning, or AML, provides you access to widely applicable machine learning algorithms without having to run any servers.  This type of learning is useful for making predictions based on a set of data for which answers are known.  AML supports supervised learning with the stochastic gradient descent algorithm.  The end goal of AML is to create a model, which is what will allow you to make further predictions based on past data.

AML supports three different kinds of predictions.  For binary outcomes, where observations lead to a yes/no result, AML supports binary classification.  An example would be whether or not a prospect is likely to sign up for a new account, given their past interactions with your company.  For multi valued results, where observations lead to one of N results, AML supports multi class classification.  A good example of this would be which product to show a customer, given what they’ve looked at and bought in the past.  And, for numeric values, AML supports regression.  An example of that would be predicting house prices based on sales data and house attributes.
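In the API, these three kinds of prediction correspond to the model types `BINARY`, `MULTICLASS`, and `REGRESSION`, and the shape of the prediction you get back differs between them. The sketch below shows one way to read each. The response layout (a `predictedLabel` plus `predictedScores` for classification, a `predictedValue` for regression) is my understanding of the real time prediction response, and the sample dictionaries are made up.

```python
def summarize(model_type, prediction):
    """Pull the headline answer out of a prediction for each model type."""
    if model_type == "REGRESSION":
        return prediction["predictedValue"]
    if model_type in ("BINARY", "MULTICLASS"):
        label = prediction["predictedLabel"]
        return label, prediction["predictedScores"][label]
    raise ValueError(f"unknown model type: {model_type}")


# Made-up responses, one per model type.
will_sign_up = {"predictedLabel": "1", "predictedScores": {"1": 0.83}}
product = {"predictedLabel": "shoes",
           "predictedScores": {"shoes": 0.6, "hats": 0.3, "belts": 0.1}}
house_price = {"predictedValue": 312500.0}

print(summarize("BINARY", will_sign_up))
print(summarize("MULTICLASS", product))
print(summarize("REGRESSION", house_price))
```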

If you are not trying to use existing data and create predictions out of it using supervised learning, but are trying to instead recognize images or tease out patterns in text, you may want to consider alternatives to AML.


Amazon Machine Learning Video and Book

I’m working on a video series and an ebook about Amazon Machine Learning, or AML.

AML  is a great way to get started with machine learning, since you can focus on the key concepts of building and using a model and not worry about any infrastructure.  AWS takes care of provisioning all the underlying IT infrastructure–you just worry about getting your data to S3, choosing how to build the model, and then using the model.  Which, trust me, is quite enough to tackle if you are a machine learning newbie.

You can use the model to get predictions either in real time (with a default soft limit of 200 requests per second) or via batch processing, where you can upload up to 1TB of predictions to S3.  Like everything in AWS, you can control the entire process via a well documented API or from various SDKs.

AML isn’t a fit for all machine learning needs–it processes text that is in CSV format and supports only supervised learning.  There are other options on AWS (and other places as well).

The book is currently in progress, and I’ll be starting on the video soon. If you’d like to follow along as the book gets written, you can at leanpub: Amazon Machine Learning: An Introduction.



© Moore Consulting, 2003-2017