
A video of My Amazon Machine Learning Talk

I gave a talk at Develop Denver last year about Amazon Machine Learning. They recorded it, and you can now view the video. I feel a bit like a superhero in the shadows, because the lighting was such that you couldn't see both my face and my slides at the same time. But if you want to see what AML is all about, and how it can help you experiment with supervised machine learning in a lightweight, cheap, fast manner, please check it out.

The full video is about 35 minutes long.

AWS Advent Amazon Machine Learning Post

Last year I wrote about AWS Advent, which is an exploration of the vast reaches of AWS in the first 24 days of December. This year, I submitted a post for it. And that post is now up and available. I wrote about Machine Learning (ML) generally and Amazon Machine Learning (AML) specifically. From the post:

AML Models are immutable, as are data sources. If you need to incorporate ongoing data into your model, which is generally a good idea, you need to automate your datasource and model building process so they are repeatable. Then, when you have new data, you can rebuild your model, test it out, and then change which model is “in production” and making predictions.
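To make that rebuild loop concrete, here's a minimal boto3 sketch of the datasource and model creation step. This is my illustration, not code from the post; the ids and S3 paths are made up, and you'd wrap this in whatever automation you use.

```python
import boto3

ml = boto3.client('machinelearning')

# Fresh ids every run, because existing AML datasources and models are immutable.
ds_id = 'census-ds-2018-01-15'        # hypothetical id
model_id = 'census-model-2018-01-15'  # hypothetical id

ml.create_data_source_from_s3(
    DataSourceId=ds_id,
    DataSpec={
        'DataLocationS3': 's3://my-bucket/training/latest.csv',        # hypothetical path
        'DataSchemaLocationS3': 's3://my-bucket/training/schema.json',  # hypothetical path
    },
    ComputeStatistics=True,  # required for datasources used to train models
)

ml.create_ml_model(
    MLModelId=model_id,
    MLModelType='BINARY',
    TrainingDataSourceId=ds_id,
)

# After evaluating the new model, point your prediction code (or endpoint) at model_id.
```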

There's plenty more in the post about the ethics of ML and the various pieces of the AML pipeline. Check it out.

 

Excited to be speaking at Develop Denver

I'm excited to be speaking at Develop Denver. This is a local conference with a wide variety of topics of interest to developers, designers, and in general anyone who works in the interactive industry. From their website, they want to:

[bring] together developers, designers, strategists, and those looking to dive deeper into the interactive world for two days of hands on code & design talks.

I'll be doing two presentations. The first is my talk on Amazon Machine Learning, which I've presented previously. The second is a lightning talk on the awk programming language. I'm excited to be presenting, but I'm also looking forward to interesting talks from other speakers, covering topics such as IoT, software development for the developing world, web scraping, APIs, OAuth, software development, and hiring practices. (That list is tilted toward my interest in development–there's plenty for everyone.)

If you’re able to join, it’s happening in about two weeks in downtown Denver (Oct 18-19 in the RiNo district). Here’s the link for tickets, and here’s the agenda.

Levels of abstraction within AWS ML offerings

I went to an interesting talk last night at the AWS Boulder Denver Meetup, where the presenter, Brett Mitchell, had built a sophisticated application using serverless and AWS machine learning technologies to make an animatronic parrot respond to images and speak phrases. The most difficult task, he said, was getting the hardware installed into the animatronic parrot; the software side of things was prototyped in a night.

But he also said something that I wanted to capture and expand on. AWS is all about providing best-of-breed building blocks that you can pick and choose from to build your application. Brett mentioned that there are four levels of AWS machine learning abstractions you can build on. Here's my take on what they are and when you should choose each.

  1. Low-level compute, like EC2 (maybe even the bare metal offering, if performance is crucial). At this level, AWS is providing elasticity in terms of compute, but you are entirely in control of the algorithms you implement, the language you use, and how to maintain availability. With great control comes great responsibility.
  2. Library abstractions like MXNet, Spark ML, and TensorFlow. You can use a prepackaged AMI, SageMaker, or Amazon EMR. In any of these cases you are leveraging existing libraries and building on top of them. You still own most of the model management and service availability if you are using the AMI, but the other alternatives remove much of that management hassle.
  3. Managed general purpose supervised learning, which is Amazon Machine Learning (hey, I recorded a course about that!). In this case, you don't have access to the model or the algorithms–it's a black box. This is aimed at non-machine-learning folks who have a general problem: "I have structured data I need to make predictions against".
  4. Focused product offerings like image recognition and speech recognition, among others. These services may require you to provide some training data, but otherwise you do nothing except leverage the service. You have to work within the limits of what the service provides, including performance and tunability. The tradeoff is you get scale and high availability for a great price (in both dollars and effort). See the sketch below for what calling one of these services looks like.

These options are all outlined nicely here.
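To give a feel for level four, here's a hedged sketch of calling one of those focused offerings (Amazon Rekognition for image recognition); the bucket and object key are made up:

```python
import boto3

rekognition = boto3.client('rekognition')

# Ask the managed service to label an image sitting in S3; no model of our own involved.
response = rekognition.detect_labels(
    Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'photos/parrot.jpg'}},  # hypothetical object
    MaxLabels=5,
)

for label in response['Labels']:
    print(label['Name'], round(label['Confidence'], 1))
```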

Using a Lambda function to make AML real-time predictions

When you are making real-time AML predictions against an endpoint, you can run the prediction code (sample code) locally. However, leveraging AWS Lambda lets you build a system that accesses predictions without any servers. This system will likely be cheaper and scale better than running on your own servers, and you can also trigger predictions on a wide variety of events without writing any polling code.

Here's a scenario. You have a model that predicts income level based on user data, which you are going to use for marketing purposes. The user data is placed on S3 by a different process at varying intervals. You want to process each record as it comes in and generate a prediction. (If you didn't care about near real-time processing, you could run a periodic batch AML job. This job could also be triggered by Lambda.)

So, the plan is to set up a Lambda function to monitor the S3 location and run the prediction whenever a new object is added.

What a record looks like will obviously depend on what your model expects. For one of the models I built for my AML video course, the record looks like this (data format and details):

22, Local-gov, 108435, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 80, United-States

You will need to enable a real-time endpoint for your model.
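You can do that from the console, or with a quick boto3 call like this sketch (the model id is hypothetical):

```python
import boto3

ml = boto3.client('machinelearning')

# Turn on the real-time endpoint for an existing model (hypothetical id).
response = ml.create_realtime_endpoint(MLModelId='ml-census-income')
print(response['RealtimeEndpointInfo']['EndpointUrl'])
```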

You also need to create IAM policies that allow access to CloudWatch Logs, read-only access to S3, and access to the AML model, and then associate all of these with an IAM role that your Lambda function can assume. The policies I associated with my Lambda function can be found in my GitHub repo.

You then create a Lambda function which will be triggered when a new file is added to S3. Here's a SAM template which defines the trigger (you'll have to update the reference to the role you created, as well as your bucket name and path). More about SAM.

Then you need to write the Lambda function which will pull the file content from S3 and run a prediction. Here's that code. It's similar to prediction code that you might run locally, except for how it gets the values string: we read it from the S3 object on line 31.
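In case you don't want to click through, here's a rough sketch of what such a handler looks like. The model id is hypothetical, and the attribute names are assumptions that mirror the census-style record above; yours must match your datasource schema.

```python
import boto3
from urllib.parse import unquote_plus

ML_MODEL_ID = 'ml-census-income'  # hypothetical model id

# Attribute names must match your datasource schema; these are illustrative only.
ATTRIBUTE_NAMES = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'sex',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
]

s3 = boto3.client('s3')
ml = boto3.client('machinelearning')

def lambda_handler(event, context):
    # Pull the newly added object out of the S3 event notification.
    s3_record = event['Records'][0]['s3']
    bucket = s3_record['bucket']['name']
    key = unquote_plus(s3_record['object']['key'])

    # Read the comma-separated values string from the S3 object.
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    values = [v.strip() for v in body.split(',')]

    # AML expects a dict of attribute name -> string value.
    aml_record = dict(zip(ATTRIBUTE_NAMES, values))

    # Look up the real-time endpoint URL for the model and make the prediction.
    endpoint = ml.get_ml_model(MLModelId=ML_MODEL_ID)['EndpointInfo']['EndpointUrl']
    response = ml.predict(MLModelId=ML_MODEL_ID, Record=aml_record, PredictEndpoint=endpoint)

    # Prototype behavior only: real code would persist or forward this result.
    print(response['Prediction'])
    return response['Prediction']
```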

This code is prototype quality only. The sample code accesses the prediction and then writes to stdout. That is fine for sample code, but in a real-world scenario you'd obviously want to take further action: perhaps have the Lambda function update a database row, add another file to S3, or call an API. You'd also want some error handling in case the data in the S3 file wasn't in the format the model expected, and you'd want to lock down the S3 access allowed (the read-only S3 policy mentioned above covers all S3 resources, which is not a good idea for production code).

Finally, S3 is only one possible source of input; others like SNS or Kinesis might be a better fit. You could also tie into AWS API Gateway and leverage the features of that service, including authentication, throttling, and billing. In that case, you'd want the Lambda function to return the prediction value to the end user.

Using AWS Lambda to access your real-time prediction AML endpoint allows you to make predictions against single records in near real time without running any infrastructure of your own.

Re:Invent Videos

AWS Re:Invent is supposed to be a great conference.  I have thus far been unable to attend, but the videos of the presentations are posted online with about a day’s lag.  So, like most conferences, you really should be networking and meeting people face to face rather than attending the presentations.

Here’s the AWS Youtube channel where you can watch all the videos, or just sample them.

I’ve found the talks to be of varying quality.  Some just rehash the docs, but others, especially the deep dives, discuss interesting aspects of the AWS infrastructure that I haven’t found to be documented anywhere (here’s a great talk about Elastic Block Storage from 2016).  The talks by real customers also give a great viewpoint into how AWS’s offerings are actually implemented to provide business value (here’s a great talk from 2016 about using Amazon Machine Learning to predict real estate transactions).

It’s a sprawling conference, well suited to AWS’s sprawling offering, and I bet no matter what your interest, you will be able to find a video worth watching.

The three stages where you can transform data for Amazon Machine Learning

When you're creating an AML system, there are three places where you can transform your data. Data transformation and representation are very important for an effective AML system. I'd suggest watching about five minutes of this re:Invent video (from 29:14 on) to see how they leveraged Redshift to transform purchase data from a format that AML had a hard time “understanding” to one that was “easier” for the system to grok.

The first place to transform your data is before the data ever gets to an AML datasource location like S3 or Redshift. You can preprocess the data with whatever technology you want (Redshift/SQL as above, EMR, bash, Python, etc.). Some sample transformations might be aggregating many raw rows into a single row per entity, joining in data from other systems, or reworking values into a representation that's easier for the model to learn from.
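As an example of the first kind, here's a hedged pandas sketch of an aggregation-style preprocessing step; the file and column names are invented, and you could just as easily do this in SQL on Redshift, as in the video above.

```python
import pandas as pd

# Hypothetical raw purchase events: one row per purchase.
events = pd.read_csv('raw_purchases.csv')  # columns: customer_id, amount, category

# Collapse the events into one row per customer, a shape that's easier for a model to consume.
per_customer = events.groupby('customer_id').agg(
    total_spend=('amount', 'sum'),
    purchase_count=('amount', 'count'),
    favorite_category=('category', lambda s: s.mode().iat[0]),
)

# Stage the transformed data somewhere an AML datasource can read it (S3, for example).
per_customer.to_csv('aml_training_data.csv')
```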

At this step you have tremendous flexibility, but it requires staging your data.  That may be an issue depending on how much data you have, and may affect which technology you use to do the preprocessing.

The next place you can modify the data is at datasource creation. You can omit features (only via the API, by providing your own schema with an ‘excludedAttributeNames’ value; you can't do this from the AWS console), which could speed up processing and lower the total model size. It could also protect sensitive data. You do want to provide AML with as much data as you can, however.
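Here's a hedged sketch of excluding a feature at datasource creation time via the API; the schema, attribute names, and S3 path are illustrative only.

```python
import json
import boto3

ml = boto3.client('machinelearning')

# Illustrative schema: names and types must match your own CSV.
schema = {
    "version": "1.0",
    "targetAttributeName": "income",
    "dataFormat": "CSV",
    "dataFileContainsHeader": False,
    "attributes": [
        {"attributeName": "age", "attributeType": "NUMERIC"},
        {"attributeName": "education", "attributeType": "CATEGORICAL"},
        {"attributeName": "ssn", "attributeType": "CATEGORICAL"},
        {"attributeName": "income", "attributeType": "BINARY"},
    ],
    # Features listed here are dropped before AML ever sees them.
    "excludedAttributeNames": ["ssn"],
}

ml.create_data_source_from_s3(
    DataSourceId='ds-income-no-ssn',  # hypothetical id
    DataSpec={
        'DataLocationS3': 's3://my-bucket/training/data.csv',  # hypothetical path
        'DataSchema': json.dumps(schema),
    },
    ComputeStatistics=True,
)
```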

You can also create multiple datasources that assign different data types to the same feature, as long as the feature is valid in both types. The only feature type I know of that is valid as multiple AML data types is an integer, which, as long as it takes on only a limited number of distinct values (like human age), could be represented as either a numeric or a categorical value.

The final place you can modify your data before the model sees it is in the ML recipe. AML provides roughly ten functions that you can apply to your data as it is read from the datasource and fed to the model. You can also create intermediate representations and make them available to your model (a lowercased copy of a string feature, for example).

Using a recipe allows you to modify your data before the model sees it, but requires no staging on your part of the source or destination data.  However, the number of transformations is relatively limited.
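To give a flavor, here's a rough sketch of what a recipe might look like; the variable names are made up, and the exact set of available functions is documented by AWS.

```python
import json

# Minimal illustrative recipe: bin some numeric variables and add a lowercased copy of a text feature.
recipe = {
    "groups": {
        "NUM_VARS": "group('age','hours_per_week')"
    },
    "assignments": {
        "description_lower": "lowercase('description')"  # intermediate representation
    },
    "outputs": [
        "ALL_CATEGORICAL",
        "quantile_bin(NUM_VARS,10)",
        "description_lower"
    ]
}

# The recipe is passed as a JSON string when the model is created, e.g.:
# ml.create_ml_model(..., Recipe=json.dumps(recipe))
```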

You can of course combine all three of these methods when building AML models, to give you maximum flexibility.  As always, it’s best to try different options and test the results.

Announcing the Introduction to Amazon Machine Learning Video Course

Would you like an easy introduction to machine learning?  Without downloading any open source software, reading documentation and blog posts, and/or installing and configuring the system?

Amazon Machine Learning is a great way to explore machine learning without having to run any infrastructure. It lets you build high-performance, cost-effective systems to predict outcomes based on past data.

If you're interested in learning more about Amazon Machine Learning, you can view my video course on O'Reilly Safari. It's over an hour and a half of video covering all aspects of Amazon Machine Learning.

This course shows you how to build a model using Amazon Machine Learning (Amazon ML) and use it to make predictions. AWS expert Dan Moore covers the basic types of machine learning, how to prepare your data, and how to make your data available to the Amazon Machine Learning processes. You’ll also learn about evaluating a model for accuracy, using it both for batch and real-time predictions, and using tags to manage environments. Designed for developers and technical marketers new to machine learning and for data scientists interested in using the AWS Amazon ML platform, the course provides hands-on experience building a working predictive model using real data. Learners should obtain an AWS account (free from Amazon) and a basic understanding of AWS concepts before beginning the course.

AWS machine learning talk

I enjoyed giving my “Intro to Amazon Machine Learning” talk at the AWS Denver Boulder meetup.   (Shout out to an old friend and colleague who came out to see it.) I didn’t get through the whole pipeline demonstration (I didn’t get a chance to do the batch prediction), but the demo gods were kind and the demo went well.

We also had a good discussion. A few folks present had used machine learning before, so we talked about where AML makes sense (hint: it's not a fit for every problem). There were also some good questions about AML performance and pricing. One of the members shared a re:Invent anecdote: the AML team looked at all the machine learning used within Amazon, graphed the use cases, and solved for the most common ones.

As usual, I also learned something. OpenRefine is a tool to help you prepare data for machine learning. And when you change the score cut-off, you need to restart your real-time endpoint.
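A hedged boto3 sketch of that cut-off change plus endpoint restart might look like this (the model id is made up, and in real code you'd wait for the delete to finish before recreating):

```python
import boto3

ml = boto3.client('machinelearning')
MODEL_ID = 'ml-census-income'  # hypothetical id

# Move the binary classification score cut-off.
ml.update_ml_model(MLModelId=MODEL_ID, ScoreThreshold=0.7)

# Recycle the real-time endpoint so predictions reflect the new cut-off.
ml.delete_realtime_endpoint(MLModelId=MODEL_ID)
ml.create_realtime_endpoint(MLModelId=MODEL_ID)
```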

The “Intro to Amazon Machine Learning” slides are up on SlideShare, and big thanks to the Meetup organizers.