by Dr Andy Corbett
Support Vector Machines
15. Comparing Least-Squares with SVM Regression
Download the resources for this lesson here.
Our understanding of support vector machines has landed us with another tool to tackle regression problems: the Support Vector Regressor (SVR). In this video, we'll give a couple of examples of applying the SVR with scikit-learn.
- Become familiar with the SVR model in scikit-learn.
- Demonstrate SVR functionality on a visual dataset.
- Identify important hyperparameters and instance attributes.
- Pick out the support vectors: sparsification parameters.
- Contrast with Linear Regression and Kernel Ridge Regression (KRR) on real-world data.
Unpacking Support Vector Regression Models
Let's begin by recalling our favourite function out of retirement:
f(x) = x + 2 sin(x) + sin(2x) - 7
We took some noisy samples from this function to demonstrate the KRR model. So, to compare the two, let's test the SVR on the same problem.
Necessary imports for our exercise:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
And here is the source of all truth:
def ground_truth(x):
    return x + 2*np.sin(x) + np.sin(2*x) - 7
We'll distribute our data normally about the ground-truth curve f(x), with a standard deviation of 1.5.
np.random.seed(31)
# x-axis
x_train = np.linspace(0, 20, 101).reshape(-1, 1)
x_test = np.linspace(0, 20, 1001).reshape(-1, 1)
# Generate noisy data
y_train = np.random.normal(
    loc=ground_truth(x_train),
    scale=1.5,
).reshape(-1, 1)
What does this look like?
plt.scatter(x_train, y_train, marker='.')
plt.show()
Comparing the Least-Squares approach with SVR
This is how we call the models in scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
PARAM_C = 10.0
PARAM_EP = 1.0
lr = LinearRegression()
svr = SVR(C=PARAM_C, epsilon=PARAM_EP, kernel='linear')
lr.fit(x_train, y_train)
svr.fit(x_train, y_train)
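To pull out comparable statistics, we can query each fitted model directly. Here is a minimal sketch; scoring against the noise-free ground truth on the test grid, and the exact print format, are my assumptions rather than the lesson's code:
# R^2 scores against the noise-free curve, plus the fitted line parameters.
# An SVR with a linear kernel exposes coef_ and intercept_, just like LinearRegression.
print(f'SVR Score: {svr.score(x_test, ground_truth(x_test)):.3f}; '
      f'Linear Score: {lr.score(x_test, ground_truth(x_test)):.3f}')
print(f'SVR Coeff: {svr.coef_.ravel()[0]:.3f}; '
      f'Linear Coeff: {lr.coef_.ravel()[0]:.3f}')
print(f'SVR Intercept: {svr.intercept_.ravel()[0]:.3f}; '
      f'Linear Intercept: {lr.intercept_.ravel()[0]:.3f}')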
And the stats show they function relatively similarly:
SVR Score: 0.924; Linear Score: 0.924
SVR Coeff: 0.946; Linear Coeff: 0.946
SVR Intercept: -6.527; Linear Intercept: -6.645
But what about sparsification? We can determine the number of training samples the model needs via svr.n_support_, and running
ratio = float(svr.n_support_) / x_train.shape[0]
print(f'Percentage of dataset used in model: {ratio:.2%}')
prints the proportion used:
Percentage of dataset used in model: 67.33%
We can view this graphically to see that our linear model, whilst performing well, is still parametrically constrained.
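One way to draw that picture, before we define the lesson's plotting helper below, is a quick sketch like this (the styling choices are mine):
# Plot the training data, both linear fits, and circle the SVR's support vectors.
fig, ax = plt.subplots()
ax.scatter(x_train, y_train, marker='.', label='Data')
ax.plot(x_test, lr.predict(x_test), 'r', linewidth=2, label='LinearRegression')
ax.plot(x_test, svr.predict(x_test), 'k', alpha=0.5, linewidth=2, label='SVR (linear)')
ax.scatter(
    x_train[svr.support_], y_train[svr.support_],  # support_ holds training-set indices
    s=80, facecolors='none', edgecolors='fuchsia', label='Support Vectors',
)
ax.legend()
plt.show()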
Comparing Kernel Ridge Regression to SVR
In the video I'll talk about this useful plot function that we'll use to print our results onto an axes object.
# Useful plot function
def plot_preds(model1, model2, ax, title=None):
    r2_scores = list()
    for model in [model1, model2]:
        print('Model: ', model)
        r2_scores.append(model.score(x_test, ground_truth(x_test)))
        print(f'R^2 Score: {r2_scores[-1]}\n')
    scat = ax.scatter(x_train, y_train, marker='.')
    m1, = ax.plot(x_test, model1.predict(x_test), 'r', linewidth=2)
    m2, = ax.plot(x_test, model2.predict(x_test), 'k', alpha=0.5, linewidth=2)
    true, = ax.plot(x_test, ground_truth(x_test), '--g')
    plots = [scat, true, m1, m2]
    labels = [
        'Data',
        'Ground Truth',
        f'{type(model1).__name__}\nR^2 = {r2_scores[0]:.3f}',
        f'{type(model2).__name__}\nR^2 = {r2_scores[1]:.3f}',
    ]
    for model in [model1, model2]:
        if hasattr(model, 'support_vectors_'):
            svs = model.support_vectors_
            sv_inds = [n for n, x in list(enumerate(x_train)) if x in svs]
            sv_scatter = ax.scatter(
                model.support_vectors_[:, 0],
                y_train[sv_inds],
                s=80,
                facecolors='none',
                zorder=10,
                edgecolors='fuchsia',
            )
            plots.append(sv_scatter)
            labels.append(
                f'{int(model.n_support_)} {type(model).__name__} '
                'Support Vectors'
            )
        else:
            print('No support vectors for', type(model).__name__)
    ax.legend(plots, labels)
    if title:
        ax.set_title(title)
Let's play the same game by utilising the power of kernel SVR and comparing it against the first non-parametric regression model we came across: KRR.
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR
PARAM_C = 10
PARAM_EP = 1.2
GAMMA = 1.0
ALPHA = 1.0
krr = KernelRidge(kernel='rbf', gamma=GAMMA, alpha=ALPHA)
krr.fit(x_train, y_train)
svr = SVR(C=PARAM_C, epsilon=PARAM_EP, kernel='rbf', gamma=GAMMA)
svr.fit(x_train, y_train)
fig, ax = plt.subplots()
plot_preds(svr, krr, ax)
plt.show()
This quite wonderful plot shows us how the two models "bend" to the data. Remember that the SVR is insensitive to data points that lie within a distance epsilon of its prediction (inside the epsilon-tube). Using the same length scales on the kernels, we get different results on the scores.
Moreover, checking the support vectors (the pink outlined points), we find:
Percentage of dataset used in model: 48.51%
So the SVR model has sparsified the solution by over half. This is an excellent feature for large data problems.
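To tie that figure back to the epsilon-tube, we can check how many training points fall strictly inside the tube; such points incur no epsilon-insensitive loss and are never support vectors. A rough sketch (the counts may differ by a few boundary points):
# Training residuals of the fitted RBF SVR.
resid = np.abs(y_train.ravel() - svr.predict(x_train))

# Points strictly inside the epsilon-tube carry zero loss and are not support vectors.
inside_tube = resid < svr.epsilon
print(f'Inside the tube: {inside_tube.sum()} of {resid.size} points')
print(f'Support vectors reported by the model: {len(svr.support_)}')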
Varying kernels through the model
We can inspect different kernels and experiment with the width of the tube. In each case we observe the support vectors controlling the final SVR solution.
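For instance, a quick experiment along these lines (the kernel list and the C, epsilon and gamma values are illustrative choices of mine):
# Fit an SVR per kernel and tube width, and record how sparse each solution is.
for kernel in ['linear', 'poly', 'rbf']:
    for eps in [0.5, 1.2, 2.5]:
        model = SVR(kernel=kernel, C=10.0, epsilon=eps, gamma=1.0)
        model.fit(x_train, y_train.ravel())  # flatten the target to 1-D
        frac = len(model.support_) / x_train.shape[0]
        print(f'{kernel:>6}, epsilon={eps}: {frac:.0%} support vectors')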
Test on the 'Compressive Strength Dataset'
To understand how KRR and SVR compare pound-for-pound, let's test SVR on the compressive strength dataset. Now the most important thing is to tune the parameters for the best model. We can achieve this with a grid search as before.
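The loading and splitting of that dataset happens in the accompanying exercise; a minimal sketch of the shape of it, where the file and column names are placeholders rather than the lesson's resources, might look like:
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; swap in the lesson's resources.
df = pd.read_csv('compressive_strength.csv')
X = df.drop(columns=['compressive_strength']).values
y = df['compressive_strength'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0,
)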
The Linear SVR vs. Linear Regression
In the exercise we use the same dataset and repeat the validation as before, this time comparing the Linear Regressor with the Linear SVR. We are able to build a predictor that trims the MSE down but, given the problem is highly non-linear, the variance remains high.
However, only 34% of the training data ends up as support vectors: a huge sparsification. Domain experts may then use this information to glean insight into specific samples.
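A sketch of how that comparison might be run (the C and epsilon values here are illustrative assumptions, not the exercise's tuned settings):
from sklearn.metrics import mean_squared_error

# Fit both linear models on the compressive strength data.
lin = LinearRegression().fit(X_train, y_train)
lin_svr = SVR(kernel='linear', C=10.0, epsilon=1.0).fit(X_train, y_train)

for name, model in [('Linear Regression', lin), ('Linear SVR', lin_svr)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f'{name}: test MSE = {mse:.2f}')

# Fraction of training samples retained as support vectors.
print(f'Support fraction: {len(lin_svr.support_) / X_train.shape[0]:.0%}')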
Grid Searching with kernels
Now let's deploy the kernel SVR. We wish to search the parameter space to find a suitable (optimal) choice.
param_grid = [
    {'degree': [2, 3, 4, 5, 6],
     'C': [1e-1, 1.0, 10, 100],
     'epsilon': [1e-1, 1.0, 10, 100],
     'kernel': ['poly'],  # SVR names its polynomial kernel 'poly'
     },
    {'gamma': [1e-2, 1e-1, 1, 10],
     'C': [1e-1, 1.0, 10, 100],
     'epsilon': [1e-1, 1.0, 10, 100],
     'kernel': ['rbf'],
     },
]
Then we execute a cross-validated grid search:
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# The target is continuous, so use a repeated k-fold splitter rather than a
# stratified one, and pass it into the grid search.
svr = SVR()
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
search = GridSearchCV(
    estimator=svr, param_grid=param_grid, cv=cv,
)
search.fit(X_train, y_train)
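Once the search has run, the winning kernel and parameters can be read straight off the search object; for instance (the held-out evaluation assumes the X_test, y_test split sketched above):
# Inspect the best configuration found by cross-validation.
print('Best parameters:', search.best_params_)
print(f'Best CV R^2: {search.best_score_:.3f}')

# GridSearchCV refits the best estimator on the full training set by default.
print(f'Test R^2: {search.best_estimator_.score(X_test, y_test):.3f}')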
The results improve the R^2 score to a level as strong as the KRR. Interestingly, this algorithm also helped to select the optimal kernel: the 'rbf', or squared exponential, compared with the polynomial kernel selected by the KRR. This suggests a different treatment of the features in the model once the epsilon-tube erases the insignificant points.