by Dr Andy Corbett

Lesson

Support Vector Machines

14. Solving Regression Problems with SVMs

We've seen the theory behind Support Vector Regression. We apply the opposite trick to the one we used for SVM classification: we minimise a margin of error that contains the data points. Now let's implement this powerful method in code with some comparative examples.

📑 Learning Objectives
  • Reformulate the SVM objective to solve regression tasks.
  • Understand the '$\varepsilon$-tube' used to account for low levels of noise.
  • Identify the parameters baked into the model construction.

Regression with a "thick" line


The core idea behind SVM regression is simple: instead of fitting your regressor tightly to all training data points, allow for a small margin of error and fit a 'thick' regression line; see Fig. 1.

SVM Regression Comparison

Figure 1. Transference from linear regression (left) to a sparse model given by SVM regression (right).
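To make Fig. 1 concrete, here is a minimal sketch, assuming scikit-learn and a synthetic noisy line rather than the lesson's own data, which fits an ordinary least-squares line and a linear-kernel SVR side by side and reports how many training points survive as support vectors.

```python
# Sketch: ordinary least squares vs. a linear support vector regressor
# on synthetic noisy 1-D data (illustrative values only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y = 0.5 * X.ravel() + 1.0 + rng.normal(scale=0.2, size=100)  # noisy line

# Ordinary linear regression: a thin line fitted through every point.
ols = LinearRegression().fit(X, y)

# SVM regression: a 'thick' line of half-width epsilon; only points
# outside the tube (the support vectors) shape the final model.
svr = SVR(kernel="linear", C=1.0, epsilon=0.3).fit(X, y)

print("OLS slope:      ", ols.coef_[0])
print("SVR slope:      ", svr.coef_[0][0])
print("Support vectors:", svr.support_.size, "of", len(X), "points")
```

With a tube wider than the noise level, most training points fall inside it, so the SVR keeps only a small fraction of the data as support vectors while recovering a very similar slope.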

Allowing a pre-determined "width" of noise, controlled by a parameter $\varepsilon > 0$, gives the SVM Regressor two handy properties:

  1. Sparsification: We discard the points within the $\varepsilon$-tube in the final model. This means, for large datasets, we only need to carry around a far smaller number of data points, the "support vectors".

  2. Robustness: Allowing for the width reduces the impact of noise on the global solution; it helps to prevent overfitting. Data fluctuations which deviate only slightly from the prediction are discarded, allowing the model to train only at relevant length scales.

Both these points become more impactful for the non-parametric cousin of this model: the kernel SVM Regressor (see below).
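As a rough illustration of the sparsification point, the following sketch (again assuming scikit-learn and synthetic data) fits an RBF-kernel SVR with increasingly wide $\varepsilon$-tubes and counts the retained support vectors; the exact numbers will vary, but the count should shrink as $\varepsilon$ grows.

```python
# Sketch: the wider the epsilon-tube, the fewer support vectors remain.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0.0, 2 * np.pi, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # noisy sine wave

for eps in (0.01, 0.1, 0.3):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps:<5} support vectors: {model.support_.size}/{len(X)}")
```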

Opposite to SVM classification

The extra width in an SVM Regressor can be viewed as the complement to the SVM Classifier. In the classification context, we seek to maximise the margin between two classes of points. In the regression context, we want to keep as many points within the margin as possible.

Of course... this becomes non-parametric with the kernel trick

The SVM regressor model is constructed as a linear model with a special loss function: one in which only data points outside the 'regression margin' are penalised. In the solution, we find that the kernel trick can again be played to generate a powerful non-parametric predictive model. We shall experiment with using the SVM Regressor and the Kernel Regressor in the next video.
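For concreteness, a toy version of that special loss, the $\varepsilon$-insensitive loss, might look as follows. This is a sketch, not the library's internal code: residuals inside the tube cost nothing, and only the excess beyond the tube is penalised.

```python
# Sketch of the epsilon-insensitive loss: zero inside the tube,
# a linear penalty on the excess outside it.
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Return 0 for residuals within the tube, |residual| - epsilon otherwise."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

residuals = np.array([0.05, -0.08, 0.25, -0.40])
print(epsilon_insensitive_loss(residuals, 0.0, epsilon=0.1))
# -> [0.   0.   0.15 0.3 ]
```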

Hyper-parameter choices: epsilon-width and regularisation parameters


The gory details: SVM Regression differs from SVM Classification in the sense that we want to keep points inside the margin, rather than outside it. This means solving the following constrained optimisation problem:

$$\frac{1}{2} \mathbf{w}^{T}\mathbf{w} + C \sum_{i=1}^{N} (\xi_{i} + \xi_{i}^{*})$$

where we have two slack variables $\xi_{i}, \xi_{i}^{*} \geq 0$ that allow points to drift over the margin via the constraints

  • $y_{i} - \mathbf{w}^{T}\phi(\mathbf{x}_{i}) - b \leq \varepsilon + \xi_{i}$
  • $\mathbf{w}^{T}\phi(\mathbf{x}_{i}) + b - y_{i} \leq \varepsilon + \xi_{i}^{*}$
  • $\xi_{i}, \xi_{i}^{*} \geq 0$

As before, the parameter $C$ is inversely proportional to the regularisation strength and is set through the argument C. On the other hand, we may also now set the variable $\varepsilon$, which is the order of aleatoric uncertainty, or 'noise', around the signal; we are selecting the amount of noise that we tolerate in the model via the argument epsilon.
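
As a sketch of how these two knobs might be tuned in practice, assuming scikit-learn's SVR (whose constructor exposes them as C and epsilon) and a purely illustrative grid of values, one could cross-validate over both jointly:

```python
# Sketch: joint grid search over C and epsilon for an RBF-kernel SVR
# on synthetic data; the grid values are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(150, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=150)  # noisy quadratic

param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.05, 0.1, 0.5]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
```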