Amazon Machine Learning

Machine Learning, Software Engineering

Update: Amazon has rolled this functionality into their SageMaker product.

AWS ML alert

I was able to test out Amazon's new Machine Learning capability. I already had an AWS account, set up while researching options for an earlier project.

I walked through their banking example, which performs a binary classification. The example has you upload a dataset to S3, create a datasource, create a model, and run predictions. All of this was done through their web application interface, though finer-grained control can be achieved by interacting with their API programmatically.
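For the programmatic route, here is a rough sketch of the same four steps in Python. It assumes a boto3 `machinelearning` client is passed in; the bucket name, file names, and resource IDs are hypothetical placeholders, not values from the walkthrough. The calls only queue work on Amazon's side, so a real script would also wait for each resource to finish before reading results.

```python
def run_binary_pipeline(client, bucket, prefix="banking"):
    """Queue a datasource, a binary model, and a batch prediction.

    `client` is assumed to behave like boto3.client("machinelearning").
    All IDs and S3 paths below are made-up placeholders.
    """
    train_ds = f"ds-{prefix}-train"
    batch_ds = f"ds-{prefix}-batch"
    model_id = f"ml-{prefix}"
    bp_id = f"bp-{prefix}"

    # 1. Datasource for the training CSV that was uploaded to S3.
    client.create_data_source_from_s3(
        DataSourceId=train_ds,
        DataSpec={
            "DataLocationS3": f"s3://{bucket}/{prefix}.csv",
            "DataSchemaLocationS3": f"s3://{bucket}/{prefix}.csv.schema",
        },
        ComputeStatistics=True,  # needed when the datasource is used for training
    )

    # 2. Binary classification model trained on that datasource.
    client.create_ml_model(
        MLModelId=model_id,
        MLModelType="BINARY",
        TrainingDataSourceId=train_ds,
    )

    # 3. Second datasource for the rows we want predictions for.
    client.create_data_source_from_s3(
        DataSourceId=batch_ds,
        DataSpec={
            "DataLocationS3": f"s3://{bucket}/{prefix}-batch.csv",
            "DataSchemaLocationS3": f"s3://{bucket}/{prefix}-batch.csv.schema",
        },
    )

    # 4. Batch prediction; results are written under the output URI.
    client.create_batch_prediction(
        BatchPredictionId=bp_id,
        MLModelId=model_id,
        BatchPredictionDataSourceId=batch_ds,
        OutputUri=f"s3://{bucket}/predictions/",
    )
    return model_id, bp_id


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# run_binary_pipeline(boto3.client("machinelearning"), "my-bucket")
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to try against a stub before pointing it at a real account.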

AWS ML Jobs

Amazon can automatically split your datasource into training and evaluation sets (70/30). After the model is created, you can adjust the score threshold, the cut-off above which a predicted score is classified as a positive.
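To make the two ideas concrete, here is a small sketch in plain Python. It is not Amazon's implementation: the function names are mine, the split is a simple in-order cut, and treating a score equal to the threshold as a positive is an assumption.

```python
def split_70_30(rows, train_percent=70):
    """Split rows in order: the first 70% for training, the rest for evaluation."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def apply_threshold(scores, threshold=0.5):
    """Turn raw scores into 0/1 labels (ties count as 1, an assumption)."""
    return [1 if s >= threshold else 0 for s in scores]


rows = list(range(10))                 # stand-in for ten data rows
train, evaluation = split_70_30(rows)  # 7 rows and 3 rows

scores = [0.20, 0.55, 0.90]            # made-up model scores
default_labels = apply_threshold(scores)        # [0, 1, 1]
strict_labels = apply_threshold(scores, 0.75)   # [0, 0, 1]
```

Raising the threshold makes the model more conservative: fewer rows are labelled positive, which trades false positives for false negatives.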

AWS ML threshold

This particular model (using the provided data) had a quality score (Area Under the Curve, or AUC) of 0.94, which Amazon deemed "extremely good". However, I could not find any indication of which algorithm it used.
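For readers unfamiliar with AUC, one way to read it is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way by brute force; it is an illustration of the metric, not how Amazon calculates it, and the scores and labels are made up.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: three of the four positive/negative pairs are ordered correctly.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which puts 0.94 in context.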

Looking at one of the log files, I can see the model processed my 4,119-row batch data file successfully:

15/04/13 17:06:28 INFO: BATCH_PREDICTION bp-bp-fGzXCjPIo83 started
15/04/13 17:06:28 INFO: Processing file s3://xxx/banking-batch.csv ...
15/04/13 17:06:32 INFO: Number of records processed without error : 4119
15/04/13 17:06:32 INFO: Number of bad records : 0
15/04/13 17:06:32 INFO: BATCH_PREDICTION bp-bp-fGzXCjPIo83 completed.
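If you run many batch jobs, the record counts can be pulled out of a log like this with a couple of regular expressions. This is a minimal sketch that assumes the log lines keep the wording shown above; the function name is mine.

```python
import re


def summarize_batch_log(log_text):
    """Extract the processed and bad record counts from a batch prediction log."""
    ok = re.search(r"records processed without error\s*:\s*(\d+)", log_text)
    bad = re.search(r"bad records\s*:\s*(\d+)", log_text)
    return {
        "ok": int(ok.group(1)) if ok else None,
        "bad": int(bad.group(1)) if bad else None,
    }


# The two count lines from the log above.
log_text = """\
15/04/13 17:06:32 INFO: Number of records processed without error : 4119
15/04/13 17:06:32 INFO: Number of bad records : 0
"""
summary = summarize_batch_log(log_text)  # {'ok': 4119, 'bad': 0}
```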

Here's a snippet of the prediction results, which were generated on S3:

AWS ML prediction results

The rows correspond, in order, to the rows in the submitted data. You can also designate an ID column, which Amazon will include in the results. It's up to the analyst to determine what to do with the results.
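Because the results line up with the input by position, a first step is usually to stitch the two files back together. The sketch below does that with the standard `csv` module. The `bestAnswer` and `score` column names are my assumption about the output header, and the input columns and all values are made up for illustration.

```python
import csv
import io


def attach_predictions(input_csv, predictions_csv):
    """Pair each input row with the prediction on the same line number."""
    inputs = list(csv.DictReader(io.StringIO(input_csv)))
    preds = list(csv.DictReader(io.StringIO(predictions_csv)))
    if len(inputs) != len(preds):
        raise ValueError("input and prediction files have different row counts")
    # Merge the two dicts; prediction columns are appended to the input columns.
    return [{**row, **pred} for row, pred in zip(inputs, preds)]


# Illustrative data only, not taken from the banking example.
input_csv = "age,job\n34,admin\n51,technician\n"
predictions_csv = "bestAnswer,score\n0,0.12\n1,0.87\n"

combined = attach_predictions(input_csv, predictions_csv)
```

Joining by position is fragile if either file is re-sorted or filtered, which is the practical argument for designating an ID column and joining on that instead.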

Amazon ML is an interesting addition to the arsenal of tools a Data Scientist has access to. However, I believe it only becomes truly useful when big data is an issue; at that point, it's better to rely upon Amazon's elastic infrastructure and computing power. The lack of options and the black-box nature of the tool are drawbacks. Most Data Scientists will want to know the details and to control various aspects of the algorithmic pipeline.

Microsoft and Google also offer cloud-based predictive analytics products. I would recommend evaluating each individually, on an as-needed basis. Overall though, I am more likely to lean towards Amazon's cloud services, since they have the lion's share of the cloud market and have been in it the longest.