AWS SageMaker Tutorial: Part 4

Intro

Now we are ready to create a multiple linear regression model using the built in SageMaker algorithms.

Setup

In the setup we need to import our libraries and define the name of our bucket and subfolder for our model

# Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python
# Boto3 allows Python developer to write software that makes use of services like Amazon S3 and Amazon EC2

import sagemaker
import boto3
from sagemaker import Session

# Let's create a Sagemaker session
sagemaker_session = sagemaker.Session()
bucket = 'tutorial-sagemaker-insurance-multiple-linear-learner'
prefix = 'multiple_linear_learner' # prefix is the subfolder within the bucket.

# Let's get the execution role for the notebook instance.
# This is the IAM role that you created when you created your notebook instance. You pass the role to the training job.
# Note that AWS Identity and Access Management (IAM) role that Amazon SageMaker can assume to perform tasks on your behalf (for example, reading training results, called model artifacts, from the S3 bucket and writing training results to Amazon S3).
role = sagemaker.get_execution_role()
print(role)

⚠️ The Bucket and Folder that you reference here must already exist, they will not be created for you. If they don’t already exist you will get an error later

Convert Data to RecordIO Format

current data shape

This is the format that is specified in the documentation

import io # The io module allows for dealing with various types of I/O (text I/O, binary I/O and raw I/O).
import numpy as np
import sagemaker.amazon.common as smac # sagemaker common libary

# Code below converts the data in numpy array format to RecordIO format
# This is the format required by Sagemaker Linear Learner

buf = io.BytesIO() # create an in-memory byte array (buf is a buffer I will be writing to)
smac.write_numpy_to_dense_tensor(buf, X_train, y_train.reshape(-1))
buf.seek(0)
# When you write to in-memory byte arrays, it increments 1 every time you write to it
# Let's reset that back to zero

S3 Bucket

Upload Training Data

import os

# Code to upload RecordIO data to S3

# Key refers to the name of the file
key = 'linear-train-data'

# The following code uploads the data in record-io format to S3 bucket to be accessed later for training
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)

# Let's print out the training data location in s3
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))
uploaded training data location: s3://tutorial-sagemaker-insurance-multiple-linear-learner/multiple_linear_learner/train/linear-train-data

Output Artifact Location

# create an output placeholder in S3 bucket to store the linear learner output

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

Get Algorithm Container

# This code is used to get the training container of sagemaker built-in algorithms
# all we have to do is to specify the name of the algorithm, that we want to use

# Let's obtain a reference to the linearLearner container image
# Note that all regression models are named estimators
# You don't have to specify (hardcode) the region, get_image_uri will get the current region name using boto3.Session

region = boto3.Session().region_name
container = sagemaker.image_uris.retrieve('linear-learner',region)
container
'382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1'

Train the Model

# We have pass in the container, the type of instance that we would like to use for training
# output path and sagemaker session into the Estimator.
# We can also specify how many instances we would like to use for training

linear = sagemaker.estimator.Estimator(container,
                                       role,
                                       instance_count = 1,
                                       instance_type = 'ml.c4.xlarge',
                                       output_path = output_location,
                                       sagemaker_session = sagemaker_session,
                                       # reduce cost with spot instances
                                       use_spot_instances = True,
                                       max_run = 300,
                                       max_wait = 600
                                      )


# We can tune parameters like the number of features that we are passing in, type of predictor like 'regressor' or 'classifier', mini batch size, epochs
# Train 32 different versions of the model and will get the best out of them (built-in parameters optimization!)

linear.set_hyperparameters(feature_dim = 8,
                           predictor_type = 'regressor',
                           mini_batch_size = 100,
                           epochs = 100,
                           num_models = 32,
                           loss = 'absolute_loss')

# Now we are ready to pass in the training data from S3 to train the linear learner model

linear.fit({'train': s3_train_data})

# Let's see the progress using cloudwatch logs

This took significantly longer than the previous linear learner model to process

training log 1

[10/26/2021 18:06:50 INFO 139621688817472] #throughput_metric: host=algo-1, train throughput=4583.929433778798 records/second
[10/26/2021 18:06:50 WARNING 139621688817472] wait_for_all_workers will not sync workers since the kv store is not running distributed
[10/26/2021 18:06:50 WARNING 139621688817472] wait_for_all_workers will not sync workers since the kv store is not running distributed
[2021-10-26 18:06:50.454] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 52, "duration": 0, "num_examples": 1, "num_bytes": 7600}
[2021-10-26 18:06:50.470] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 54, "duration": 14, "num_examples": 11, "num_bytes": 81320}
[10/26/2021 18:06:50 INFO 139621688817472] #train_score (algo-1) : ('absolute_loss_objective', 0.30221866892877025)
[10/26/2021 18:06:50 INFO 139621688817472] #train_score (algo-1) : ('mse', 0.2992234559816735)
[10/26/2021 18:06:50 INFO 139621688817472] #train_score (algo-1) : ('absolute_loss', 0.30221866892877025)
[10/26/2021 18:06:50 INFO 139621688817472] #quality_metric: host=algo-1, train absolute_loss_objective <loss>=0.30221866892877025
[10/26/2021 18:06:50 INFO 139621688817472] #quality_metric: host=algo-1, train mse <loss>=0.2992234559816735
[10/26/2021 18:06:50 INFO 139621688817472] #quality_metric: host=algo-1, train absolute_loss <loss>=0.30221866892877025
[10/26/2021 18:06:50 INFO 139621688817472] Best model found for hyperparameters: {"optimizer": "adam", "learning_rate": 0.1, "wd": 0.01, "l1": 0.0, "lr_scheduler_step": 100, "lr_scheduler_factor": 0.99, "lr_scheduler_minimum_lr": 1e-05}
2021-10-26 18:30:14 Uploading - Uploading generated training model
2021-10-26 18:30:14 Completed - Training job completed
Training seconds: 72
Billable seconds: 30
Managed Spot Training savings: 58.3%

Deploy The Model

# Deploying the model to perform inference

linear_regressor = linear.deploy(initial_instance_count = 1,
                                          instance_type = 'ml.m4.xlarge')

Test The Model

Select Serializers

Now we can pass in our X_test data, but first we must choose which serializer and deserializer we would like to use to transform our data. The documentation states that the incoming data must be in CSV format, so will use the CSV serializer, and then we can pick how we would like to receive the data back. Here we would like it back in JSON format.

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Content type overrides the data that will be passed to the deployed model, since the deployed model expects data in text/csv format.

# Serializer accepts a single argument, the input data, and returns a sequence of bytes in the specified content type

# Deserializer accepts two arguments, the result data and the response content type, and return a sequence of bytes in the specified content type.

linear_regressor.serializer = CSVSerializer()
linear_regressor.deserializer = JSONDeserializer()

Pass Test Data into Model

and now we can pass our test data into our deployed model

# making prediction on the test data

result = linear_regressor.predict(X_test)

and we can see that we have results, but they are scaled and in a JSON object

scaled results array

Shape Results

Let’s first convert the data in the object into an array, and then we can scale it back.

# Since the result is in json format, we access the scores by iterating through the scores in the predictions

predictions = np.array([r['score'] for r in result['predictions']])

predictions in array

and now we can return the predictions to their original scale

y_predict_orig = scaler_y.inverse_transform(predictions)
y_test_orig = scaler_y.inverse_transform(y_test)

y_test_orig

original scale y predictions

Review Metrics

Let’s review our metrics to see how our model did!

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

RMSE = float(format(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig)),'.3f'))
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 
RMSE = 6166.98 
MSE = 38031640.86666906 
MAE = 3418.82080646875 
R2 = 0.7550276878607206 
Adjusted R2 = 0.7474609755166501

Delete endpoint

And now let us shut down the endpoint so that we aren’t running an EC2 instance for nothing.

# Delete the end-point

linear_regressor.delete_endpoint()