
The three stages where you can transform data for Amazon Machine Learning

When creating an Amazon Machine Learning (AML) system, there are three places where you can transform your data, and how you transform and represent that data matters a great deal to how well the system performs.  I’d suggest watching about five minutes of this re:Invent video (from 29:14 on) to see how the presenters leveraged Redshift to transform purchase data from a format that AML had a hard time “understanding” into one that was “easier” for the system to grok.

The first place to transform your data is before the data ever gets to an AML datasource such as S3 or Redshift.  You can preprocess the data with whatever technology you want (Redshift/SQL, as above, EMR, bash, Python, etc.).  One sample transformation is sketched below.
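For example, here is a minimal Python/pandas sketch of the kind of reshaping described in the video: pivoting raw purchase events into one row per customer.  The column names (customer_id, category, amount) and file paths are hypothetical, and the same work could just as easily be done in Redshift SQL, EMR, or a shell script.

```python
# A sketch only: column names and file paths are hypothetical.
import pandas as pd

# One row per purchase event: customer_id, category, amount, ...
raw = pd.read_csv("purchases.csv")

# Pivot to one row per customer with total spend per product category,
# a wider layout that is easier for AML to learn from than raw event rows.
wide = (
    raw.pivot_table(index="customer_id",
                    columns="category",
                    values="amount",
                    aggfunc="sum",
                    fill_value=0)
       .reset_index()
)

# Stage the transformed file so it can be uploaded to S3 (or loaded into
# Redshift) and used to create an AML datasource.
wide.to_csv("purchases_by_customer.csv", index=False)
```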

At this step you have tremendous flexibility, but it requires staging your data.  That may be an issue depending on how much data you have, and may affect which technology you use to do the preprocessing.

The next place you can modify the data is at datasource creation.  You can omit features (but only via the API, by providing your own schema with an ‘excludedAttributeNames’ value; the AWS console doesn’t expose this), which could speed up processing and lower the total model size.  Omitting features can also protect sensitive data.  You do want to provide AML with as much data as you can, however.
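Below is a minimal boto3 sketch of that API path, assuming a hypothetical bucket, IDs, and attribute names: the hand-written schema describes every column in the file, and the ‘excludedAttributeNames’ entry tells AML which one to drop.

```python
# A sketch only: bucket, IDs, and attribute names are hypothetical.
import json
import boto3

schema = {
    "version": "1.0",
    "targetAttributeName": "purchased",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "customer_id", "attributeType": "CATEGORICAL"},
        {"attributeName": "age",         "attributeType": "NUMERIC"},
        {"attributeName": "email",       "attributeType": "TEXT"},
        {"attributeName": "purchased",   "attributeType": "BINARY"},
    ],
    # The schema still describes every column in the file; this entry
    # tells AML which ones to leave out of the datasource.
    "excludedAttributeNames": ["email"],
}

ml = boto3.client("machinelearning")
ml.create_data_source_from_s3(
    DataSourceId="ds-purchases-v1",
    DataSourceName="purchases (email excluded)",
    DataSpec={
        "DataLocationS3": "s3://my-bucket/purchases_by_customer.csv",
        "DataSchema": json.dumps(schema),
    },
    ComputeStatistics=True,
)
```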

You can also create multiple datasources that assign different data types to the same feature, as long as the feature is valid as both types.  The only kind of feature I know of that is valid as more than one AML data type is an integer, which, as long as it takes only a limited number of distinct values (like human age), could be represented as either a numeric value or a categorical value.
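In schema terms that just means typing the same (hypothetical) column differently in two schemas, one per datasource:

```python
# The same hypothetical "age" column declared two ways; each declaration goes
# into its own schema and datasource, so one model sees age as a continuous
# number while another sees it as a set of discrete categories.
age_as_numeric = {"attributeName": "age", "attributeType": "NUMERIC"}
age_as_categorical = {"attributeName": "age", "attributeType": "CATEGORICAL"}
```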

The final place you can modify your data before the model sees it is in the ML recipe.  AML provides about ten functions that you can apply to your data as it is read from the datasource and fed to the model.  You can also create intermediate representations and make them available to your model (a lowercased copy of a string feature, for example).
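Here is a minimal recipe sketch passed to create_ml_model via boto3.  The attribute names and IDs are hypothetical; lowercase() and the ALL_* groups are standard recipe constructs.  The assignment builds an intermediate, lowercased copy of a text feature and exposes it to the model alongside the other inputs.

```python
# A sketch only: attribute names and IDs are hypothetical.
import json
import boto3

recipe = {
    "groups": {},
    "assignments": {
        # Intermediate representation: a lowercased copy of a text feature.
        "review_lower": "lowercase(review_text)"
    },
    "outputs": [
        "ALL_NUMERIC",
        "ALL_CATEGORICAL",
        "review_lower",
    ],
}

ml = boto3.client("machinelearning")
ml.create_ml_model(
    MLModelId="ml-reviews-v1",
    MLModelName="model with custom recipe",
    MLModelType="BINARY",
    TrainingDataSourceId="ds-reviews-v1",
    Recipe=json.dumps(recipe),
)
```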

Using a recipe lets you modify your data before the model sees it without any staging of source or destination data on your part.  However, the set of available transformations is relatively limited.

You can of course combine all three of these methods when building AML models, to give you maximum flexibility.  As always, it’s best to try different options and test the results.
