Skip to main content

AWS SageMaker Tutorial: Part 6


In this tutorial we will forecast sales based on historical data.

The dataset contains weekly sales from 99 departments belonging to 45 different stores.

The data contains holidays and promotional markdowns offered by various stores and several departments throughout the year.

The data consists of three sheets:

  1. Stores
  2. Features
  3. Sales

This model will use a high order polynomial regression (complex)

Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile

Import Datasets

# import the csv files using pandas
features = pd.read_csv('Features_data_set.csv')
sales = pd.read_csv('sales_data_set.csv')
stores = pd.read_csv('stores_data_set.csv')

Shape Data

Using the methods we have noted in Ncoughlin: Python Data Manipulation

We have merged our three csv files into one datafield assigned the value df. And merged a new column onto the end of the data called Month which we extracted from the date column with a get_month function.

# defining a function get_month which takes a datetime as an argument
def get_month(datetime):
# and returns the second string split as an integer
return int(str(datetime).split('-')[1])

and we can see an example of that working here


And we can then apply that function to our datafield df targeting the ['Date'] column

df['Month'] = df['Date'].apply(get_month)

We have also converted the IsHoliday boolean to float32 numeric values (1/0), and also extracted the month from the date and put it in it's own column at the end. That gives us the current shape (random sample of ten rows).

26380527582011-12-23 00:00:00230041.593.5872415.4239.891438.33209.342976.85140.5297.906A20418412
12505013362012-01-13 00:00:00580.73025.613.0563627.0714887.9119.93798.484668.82130.2446.104A2196221
23457024592012-02-24 00:00:00366.67037.353.91710065.810556.72.26791.19440.12137.3418.659A2038192
23498624372012-06-04 00:00:003240.93045.254.14324195059.683082.345170.39137.7978.983A2038196
34102936822010-11-06 00:00:003847.78082.32.61500000210.2148.464A3991011
10183811322011-11-03 00:00:007377.45063.293.45900000217.4657.551A20749911
34765237462010-07-16 00:00:008194.4084.432.62300000209.8638.36C399107
38401841712010-03-12 00:00:003411.520332.71200000190.9937.508A1963213
1854301982012-04-27 00:00:0042555.6042.454.1633735.3-265.7652.8267.722959.27137.9788.15A2038194
8900610112010-06-25 00:00:0038635.2090.323.08400000126.1279.524B1265126

Next we can drop the date column because we only really want to predict sales by month and we have extracted the month already

# Drop the date
df_target = df['Weekly_Sales']
df_final = df.drop(columns = ['Weekly_Sales', 'Date'])

Lastly we still have a couple of columns that we need to break out into unique values with dummies so that we can group them later in the analysis. Namely Type, Store and Dept. Therefore we use the get_dummies() method.

df_final = pd.get_dummies(df_final, columns = ['Type', 'Store', 'Dept'], drop_first = True)

Which now gives us a very large number of columns

(421570, 137)

And we can see that every Type, Store and Dept has it's own column now.


Define X and y

We can now define our X and y and convert each respective datafield into float32 (which we can do as all strings and booleans have been removed)

X = np.array(df_final).astype('float32')
y = np.array(df_target).astype('float32')

Ensure y has a column

# reshaping the array from (421570,) to (421570, 1)
y = y.reshape(-1,1)
(421570, 1)

Scale the Data

Typically we would want to scale our data, however with the XGBoost Algorithm this does not make much of a difference, so we will leave this out.

# scaling the data before feeding the model
# from sklearn.preprocessing import StandardScaler, MinMaxScaler

# scaler_x = StandardScaler()
# X = scaler_x.fit_transform(X)

# scaler_y = StandardScaler()
# y = scaler_y.fit_transform(y)

Split Data Into Training and Testing Data

We of course need to split our data into training and testing data, which we do here with an 85/15 split, and then immediately after split the testing data into testing/validation data with a 50/50 split.

# spliting the data in to test and train sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size = 0.5)

Install XGBoost

!pip install xgboost

install xgboost

Train XGBoost Regressor Model in Local Mode

Now we can finally train our regressor model

# Train an XGBoost regressor model

import xgboost as xgb

model = xgb.XGBRegressor(objective ='reg:squarederror', learning_rate = 0.1, max_depth = 5, n_estimators = 100), y_train)

train model

Predict Score with Testing Dataset

# predict the score of the trained model using the testing dataset

result = model.score(X_test, y_test)

print("Accuracy : {}".format(result))
Accuracy : 0.8240983380813032

82% accurracy, great!

Make Predictions on Test Data

# make predictions on the test data

y_predict = model.predict(X_test)

Which gives us an array of 31,618 values in a float32 format array. These are the predicted sales figures output from the X_test data that we fed to our model.

predicted data

Calculate Metrics

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 9779.869
MSE = 95645850.0
MAE = 6435.3916
R2 = 0.8192406043997631
Adjusted R2 = 0.8184539450224686

Automated Amazon Reports

Automatically download Amazon Seller and Advertising reports to a private database. View beautiful, on demand, exportable performance reports.