
by Dr Andy Corbett

Lesson

Support Vector Machines

15. Comparing Least-Squares with SVM Regression

📂 Resources

Download the resources for this lesson here.

Our understanding of support vector machines has landed us with another tool to tackle regression problems: the Support Vector Regressor (SVR). In this video, we'll give a couple of examples of applying the SVR with scikit-learn.

📑 Learning Objectives
  • Become familiar with the SVR model in scikit-learn.
  • Demonstrate SVR functionality on a visual dataset.
  • Identify important hyperparameters and instance attributes.
  • Pick out the support vectors: sparsification parameters.
  • Contrast with Linear Regression and Kernel Ridge Regression (KRR) on real-world data.

Unpacking Support Vector Regression Models


Let's begin by recalling our favourite function out of retirement:

$f(x) = x + 2\sin(x) + \sin(2x) - 7$.

We took some noisy samples from this function to demonstrate the KRR model, so to compare the two, let's test out the SVR on the same problem.

Necessary imports for our exercise:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

And here is the source of all truth:

def ground_truth(x):
    return x + 2*np.sin(x) + np.sin(2*x) - 7

We'll distribute our data normally about the ground truth, $y \sim \mathcal{N}(f(x), \sigma^2)$, with noise scale $\sigma = 1.5$.

np.random.seed(31)

# x-axis
x_train = np.linspace(0, 20, 101).reshape(-1, 1)
x_test = np.linspace(0, 20, 1001).reshape(-1, 1)

# Generate noisy data
y_train = np.random.normal(
    loc=ground_truth(x_train),
    scale=1.5,
).reshape(-1, 1)

What does this look like?

plt.scatter(x_train, y_train, marker='.')
plt.show()
Example Data

Comparing the Least-Squares approach with SVR

This is how we call the models in scikit-learn:

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

PARAM_C = 10.0
PARAM_EP = 1.0

lr = LinearRegression()
svr = SVR(C=PARAM_C, epsilon=PARAM_EP, kernel='linear')

lr.fit(x_train, y_train)
svr.fit(x_train, y_train)

And the stats show that the two models behave very similarly.
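Something along the following lines produces that comparison (the exact print statements are a sketch; svr.coef_ is only available because we chose kernel='linear'):

# Score both models against the noise-free ground truth and compare
# the fitted slope and intercept.
print(f'SVR Score: {svr.score(x_test, ground_truth(x_test)):.3f}; '
      f'Linear Score: {lr.score(x_test, ground_truth(x_test)):.3f}')
print(f'SVR Coeff: {svr.coef_[0, 0]:.3f}; '
      f'Linear Coeff: {lr.coef_[0, 0]:.3f}')
print(f'SVR Intercept: {svr.intercept_[0]:.3f}; '
      f'Linear Intercept: {lr.intercept_[0]:.3f}')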

SVR Score: 0.924; Linear Score: 0.924
SVR Coeff: 0.946; Linear Coeff: 0.946
SVR Intercept: -6.527; Linear Intercept: -6.645

But what about sparsification? We can determine the number of training samples the model actually relies on from svr.n_support_; running

ratio = float(svr.n_support_) / x_train.shape[0]
print(f'Percentage of dataset used in model: {ratio:.2%}')

prints the proportion used:

Percentage of dataset used in model: 67.33%

We can view this graphically to see that our linear model, whilst performing well, is still parametrically constrained.
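A minimal sketch of how such a comparison figure might be put together (the styling is an assumption, not the lesson's exact plotting code):

# Data, both fitted lines, and the SVR support vectors circled in pink.
fig, ax = plt.subplots()
ax.scatter(x_train, y_train, marker='.', label='Data')
ax.plot(x_test, lr.predict(x_test), 'r', label='LinearRegression')
ax.plot(x_test, svr.predict(x_test), 'k', alpha=0.5, label='SVR')
ax.scatter(
    svr.support_vectors_[:, 0],   # x-locations of the support vectors
    y_train[svr.support_, 0],     # their noisy target values
    s=80, facecolors='none', edgecolors='fuchsia',
    label='Support Vectors',
)
ax.legend()
plt.show()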

SVM Regression Comparison

Comparing Kernel Ridge Regression to SVR

In the video I'll talk about this useful plot function that we'll use to draw our results onto an axes object.

# Useful plot function

def plot_preds(model1, model2, ax, title=None):

    # Score both models against the noise-free ground truth on the test grid.
    r2_scores = list()
    for model in [model1, model2]:
        print('Model: ', model)
        r2_scores.append(model.score(x_test, ground_truth(x_test)))
        print(f'R^2 Score: {r2_scores[-1]}\n')

    # Training data, the two model predictions, and the true function.
    scat = ax.scatter(x_train, y_train, marker='.')
    m1, = ax.plot(x_test, model1.predict(x_test), 'r', linewidth=2)
    m2, = ax.plot(x_test, model2.predict(x_test), 'k', alpha=0.5, linewidth=2)
    true, = ax.plot(x_test, ground_truth(x_test), '--g')

    plots = [scat, true, m1, m2]
    labels = [
        'Data',
        'Ground Truth',
        f'{type(model1).__name__}\nR^2 = {r2_scores[0]:.3f}',
        f'{type(model2).__name__}\nR^2 = {r2_scores[1]:.3f}',
    ]

    # If a model exposes support vectors (i.e. it is an SVR), circle them.
    for model in [model1, model2]:
        if hasattr(model, 'support_vectors_'):
            svs = model.support_vectors_
            # Indices of training points that appear among the support vectors.
            sv_inds = [n for n, x in list(enumerate(x_train)) if x in svs]
            sv_scatter = ax.scatter(
                model.support_vectors_[:, 0],
                y_train[sv_inds],
                s=80,
                facecolors='none',
                zorder=10,
                edgecolors='fuchsia',
            )
            plots.append(sv_scatter)
            labels.append(
                f'{int(model.n_support_)} {type(model).__name__} '
                'Support Vectors'
            )
        else:
            print('No support vectors for', type(model).__name__)

    ax.legend(plots, labels)

    if title:
        ax.set_title(title)

Let's play the same game by utilising the power of the kernel SVR, comparing it against the first non-parametric regression model we came across: KRR.

from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

PARAM_C = 10
PARAM_EP = 1.2
GAMMA = 1.0
ALPHA = 1.0

krr = KernelRidge(kernel='rbf', gamma=GAMMA, alpha=ALPHA)
krr.fit(x_train, y_train)

svr = SVR(C=PARAM_C, epsilon=PARAM_EP, kernel='rbf', gamma=GAMMA)
svr.fit(x_train, y_train)

fig, ax = plt.subplots()
plot_preds(svr, krr, ax)
plt.show()
RBF-Kernel SVR

This quite wonderful plot shows us how the two models "bend" to the data. Remember that the SVR is less sensitive to data points within $\varepsilon = 1.2$ of its prediction. Using the same length scales on the kernels, we get different results on the $R^2$ scores.

Moreover, let's check the support vectors (the pink outlined points) by re-running the ratio computation from earlier.
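These are the same two lines as before, now applied to the RBF-kernel SVR:

# Fraction of training points the RBF-kernel SVR keeps as support vectors.
ratio = float(svr.n_support_) / x_train.shape[0]
print(f'Percentage of dataset used in model: {ratio:.2%}')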

Percentage of dataset used in model: 48.51%

So the SVR model has sparsified the solution by over half. This is an excellent feature for large data problems.

Varying kernels through the model

We can inspect different kernels and experiment with the width $\varepsilon$ of the tube. In each case we observe the support vectors controlling the final SVR solution.
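A sketch of such an experiment might look like the following (the kernel list and the fixed C and epsilon values are assumptions for illustration):

# Fit an SVR per kernel and report fit quality and sparsity.
for kernel in ['linear', 'poly', 'rbf']:
    model = SVR(C=10.0, epsilon=1.2, kernel=kernel)
    model.fit(x_train, y_train.ravel())
    score = model.score(x_test, ground_truth(x_test))
    n_sv = model.support_vectors_.shape[0]
    print(f'{kernel}: R^2 = {score:.3f}, '
          f'support vectors = {n_sv}/{x_train.shape[0]}')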

Comparing Kernels

Test on the 'Compressive Strength Dataset'


To understand how KRR and SVR compare pound-for-pound, let's test SVR on the compressive strength dataset. The most important thing now is to tune the parameters for the best model, which we can achieve with a grid search as before.
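As a rough sketch of the setup (the file name and column layout here are assumptions; use the dataset from the lesson resources), the data might be loaded and split like so:

from sklearn.model_selection import train_test_split

# Hypothetical file name; substitute the CSV from the lesson resources.
df = pd.read_csv('compressive_strength.csv')
X = df.drop(columns=df.columns[-1]).values   # feature columns
y = df[df.columns[-1]].values                # target: compressive strength
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0,
)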

The Linear SVR vs. Linear Regression

In the exercise we use the same dataset and check validity as before, this time comparing the Linear Regressor with the Linear SVR. We are able to build a predictor that trims down the MSE, but since the problem is highly non-linear, the variance remains high.
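A minimal sketch of that comparison, assuming the train/test split above (the exercise's exact parameters and metrics may differ):

from sklearn.metrics import mean_squared_error

# Fit both linear models and compare mean squared error on the test split.
lin = LinearRegression().fit(X_train, y_train)
lin_svr = SVR(kernel='linear', C=10.0, epsilon=1.0).fit(X_train, y_train)

for name, model in [('Linear Regression', lin), ('Linear SVR', lin_svr)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f'{name}: MSE = {mse:.2f}')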

However, the percentage of training data that ends up as support vectors is just 34%: a huge sparsification. Domain experts may then use this information to glean insight on specific samples.

Linear SVR Predictor

Grid Searching with kernels

Now let's deploy the kernel SVR. We wish to search the parameter space to find a suitable (optimal) choice.

param_grid = [
    {'degree': [2, 3, 4, 5, 6],
     'C': [1e-1, 1.0, 10, 100],
     'epsilon': [1e-1, 1.0, 10, 100],
     'kernel': ['poly'],  # scikit-learn's SVR calls the polynomial kernel 'poly'
    },
    {'gamma': [1e-2, 1e-1, 1, 10],
     'C': [1e-1, 1.0, 10, 100],
     'epsilon': [1e-1, 1.0, 10, 100],
     'kernel': ['rbf'],
    },
]

Then we execute a cross-validated grid search:

from sklearn.model_selection import GridSearchCV, RepeatedKFold

svr = SVR()
# Repeated (unstratified) K-fold, since this is a regression problem.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
search = GridSearchCV(
    estimator=svr, param_grid=param_grid, cv=cv,
)
search.fit(X_train, y_train)
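Once the search has finished, the selected kernel and parameters can be inspected (a quick sketch; the held-out score assumes the X_test, y_test split from earlier):

# Best hyperparameters and the corresponding cross-validated R^2 score.
print('Best parameters:', search.best_params_)
print(f'Best CV R^2: {search.best_score_:.3f}')
print(f'Held-out R^2: {search.score(X_test, y_test):.3f}')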

The results improve the $R^2$ value to 88%, which is as strong as the KRR. Interestingly, this algorithm also helped to select the optimal kernel: the 'rbf', or squared exponential, compared with the polynomial kernel selected by the KRR. This is suggestive of a different treatment of the features in the model once the epsilon tube erases the insignificant points.