by Prof Tim Dodwell
General Linear Models
Controlling ML Models - Regularisation in Practice
In this walkthrough we are going to look at regularisation, a central technique for preventing overfitting in machine learning models. You will learn how to build a linear model and apply a regularisation method (LASSO), and we will clearly see the benefits of this approach.
Adding Libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
Loading the dataset, scaling and transforming into features
With the libraries loaded, we now collect the Boston housing data set.
In this example we will consider a single input and a single output for our regression challenge.
- the input is LSTAT, a measure of the proportion of low-income families in each area
- the output is the median (local) house price over the data set
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
df = pd.read_csv(url, header = None)
data = df.values
X = data[0:50, 12].reshape(-1, 1) # Single input data - LSTAT
y = data[0:50, 13].reshape(-1, 1) # Target Variable - Median House Price
Nice and easy here: read the data in and load it into a dataframe.
We can now get the input and output data into good shape. The call reshape(-1, 1) ensures each array is a column vector, so its shape is (n, 1) rather than just (n,).
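As a small aside, a quick sketch of what reshape(-1, 1) does to the shape of an array:
v = np.array([1.0, 2.0, 3.0])
print(v.shape)                 # (3,) - a flat vector
print(v.reshape(-1, 1).shape)  # (3, 1) - a column vector, the shape sklearn expects for a single feature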
We then move on to a sklearn pipeline. Here we first apply a MinMaxScaler()
to scale the input between 0 and 1. Instead of looking at a single feature, we then expand the input representation to a polynomial feature space of order 10, i.e. [1, x, x^2, ..., x^10].
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
pipe = Pipeline(
[
("minmax", MinMaxScaler()),
("feature", PolynomialFeatures(degree=10)),
]
)
X_poly = pipe.fit_transform(X)
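As a quick sanity check (not part of the original walkthrough), the transformed array should have one column per polynomial term, the constant plus the powers x^1 to x^10:
print(X_poly.shape)   # expected (50, 11): bias column plus 10 polynomial terms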
Test / train split of the data
We then do the usual train / test split of the data set.
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.3, random_state=123)
Looking at the data
fig, ax = plt.subplots(1, 1)
ax.plot(X_train[:,1], y_train, 'ob', alpha=0.4)
ax.plot(X_test[:,1], y_test, 'og', alpha=0.4)
plt.ylabel('Output - Median House Value')
plt.xlabel('Input - LSTAT (scaled)')
plt.show()
Fitting a linear model
Ok so now it is time to fit the linear model.
lr = LinearRegression().fit(X_train, y_train)
Let us now plot the fitted model to see how we did over the range of the data set.
# X_plot spans the scaled [0, 1] input range, so only the polynomial
# transform is needed here (the MinMaxScaler step maps the raw data to this range)
X_plot = np.linspace(0.0, 1.0, 100).reshape(-1,1)
X_plot_poly = PolynomialFeatures(degree=10).fit_transform(X_plot)
y_plot = lr.predict(X_plot_poly)
fig, ax = plt.subplots(1, 1)
ax.plot(X_train[:,1], y_train, 'ob', alpha=0.4, label='Training Data')
ax.plot(X_test[:,1], y_test, 'og', alpha=0.4, label='Testing Data')
ax.plot(X_plot, y_plot, '-',color='lightcoral', label='Polynomial Fit')
plt.ylabel('Output - Median House Value - Thousand Dollars')
plt.xlabel('Input - LSTAT (scaled) ')
plt.ylim([0.0, 50.])
plt.xlim([0.0, 1.])
plt.show()
We see this doesn't look good: particularly towards the right-hand side, between input values of 0.7 and 1.0, the predicted model does not generalise well.
The model is too expressive and is clearly overfitting the data.
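One way to put a number on this (a quick check, not part of the walkthrough above) is to compare training and testing scores for the unregularised fit; a high training score alongside a much lower testing score is the usual signature of overfitting.
print('Train R^2:', lr.score(X_train, y_train))  # typically high - the model follows the training points closely
print('Test R^2:', lr.score(X_test, y_test))     # typically much lower (possibly negative) - poor generalisation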
Applying Regularisation
Let us follow what we discussed in the previous explainer on regularisation, and fit a LASSO model to the data.
We aren't sure what the best regularisation strength (the penalty weight introduced in the notes) should be. We therefore fit lots of models over a range of values between 0.001 and 0.2. Here we use a simple loop, but we could have used sklearn's grid search functions (see the sketch after the loop below).
from sklearn.linear_model import Lasso
alpha = np.linspace(0.001, 0.2, 1000)
training = []
testing = []
for a in alpha:
lass = Lasso(alpha=a).fit(X_train, y_train)
training.append(lass.score(X_train, y_train))
testing.append(lass.score(X_test, y_test))
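As mentioned above, the same sweep could be done with sklearn's grid search tools. A rough sketch, assuming GridSearchCV with 5-fold cross-validation over a similar grid of alpha values:
from sklearn.model_selection import GridSearchCV

# Cross-validated search over the regularisation strength, as an
# alternative to scoring each alpha once on the held-out test set
param_grid = {"alpha": np.linspace(0.001, 0.2, 50)}
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5)
search.fit(X_train, y_train.ravel())
print(search.best_params_)   # the alpha with the best mean cross-validated score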
We can now plot the training and testing scores against the regularisation strength.
plt.plot(alpha, training, '-b', label='Training')
plt.plot(alpha, testing, '-g', label='Testing')
plt.ylabel('R^2 Score')
plt.xlabel('Regularisation Strength')
plt.legend()
plt.show()
The function score() returns the coefficient of determination, R^2, of the prediction.
The coefficient of determination is given by
R^2 = 1 - SS_res / SS_tot
where SS_res is the residual sum of squares, ((y_true - y_pred) ** 2).sum(), and SS_tot is the total sum of squares, ((y_true - y_true.mean()) ** 2).sum().
The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.
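To make the definition concrete, here is a quick sketch (using the earlier unregularised fit) that computes the coefficient of determination by hand and checks it against score():
y_pred = lr.predict(X_test)
ss_res = ((y_test.ravel() - y_pred.ravel()) ** 2).sum()  # residual sum of squares
ss_tot = ((y_test.ravel() - y_test.mean()) ** 2).sum()   # total sum of squares
print(1.0 - ss_res / ss_tot)       # R^2 computed from its definition
print(lr.score(X_test, y_test))    # should agree with the line above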
Ok, let us pick the regularisation strength where we do best over our testing data, so a value of about 0.025.
lass = Lasso(alpha=0.025).fit(X_train, y_train)
y_plot = lass.predict(X_plot_poly)
fig, ax = plt.subplots(1, 1)
ax.plot(X_train[:,1], y_train, 'ob', alpha=0.4, label='Training Data')
ax.plot(X_test[:,1], y_test, 'og', alpha=0.4, label='Testing Data')
ax.plot(X_plot, y_plot, '-',color='lightcoral', label='Polynomial Fit')
plt.ylabel('Output - Median House Value - Thousand Dollars')
plt.xlabel('Input - LSTAT (scaled)')
plt.ylim([0.0, 50])
plt.show()
This looks a lot better. There are no oscillations that try to model the noise in the data, and the model generalises well, particularly between input values of 0.7 and 1.0 where the other model performed very badly.
We can now look at what the final coefficients of this model are
print(lass.intercept_)
print(lass.coef_)
array([ 31.27 ])
array([ 0. , -33.77058217, 0. , 17.24305027,
0. , 0. , -0. , -0. ,
-0. , -0. , -0. ])
So this gives us a nice simple model as the prediction: y ≈ 31.27 - 33.77 x + 17.24 x^3, where x is the scaled LSTAT input.
We go through this result in more detail in the explainer.
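As a small check (a sketch, not part of the original notes), the prediction can be rebuilt by hand from the few surviving terms:
# Only the x and x^3 coefficients are non-zero, so the fitted model reduces
# to roughly 31.27 - 33.77 x + 17.24 x^3, with x the scaled LSTAT input
x = X_plot.ravel()
y_sparse = np.ravel(lass.intercept_)[0] + lass.coef_[1] * x + lass.coef_[3] * x ** 3
# y_sparse should match lass.predict(X_plot_poly) up to rounding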
Now you have fitted a LASSO model, why not try fitting a Ridge regression model and compare the outputs of the functions you fit.
Hint: you need to import the following
from sklearn.linear_model import Ridge
which can be used in the same way as Lasso. A starting point is sketched below.
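A minimal sketch to get going (the alpha value here is just a placeholder, not a tuned choice):
ridge = Ridge(alpha=0.025).fit(X_train, y_train)
print(ridge.score(X_test, y_test))
print(ridge.coef_)   # Ridge shrinks coefficients but rarely sets them exactly to zero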