by Dr Andy Corbett

Lesson

Support Vector Machines

8. Support Vector Machines to Maximise Decision Margins

📑 Learning Objectives
  • Understanding how a Support Vector Machine Classifier works.
  • Finding the distance from a data cluster to the decision boundary.
  • Maximising classification margins.
  • Solving the constrained optimisation problem.
  • Identifying the support vectors.

Classification is not always clear cut. In Fig. 1, when life gives you two neatly separated piles of lemons (left), the decision boundary is simple, and SVMs allow you to maximise the margin between them. When life mixes your lemons (right), we can still separate the piles while allowing for the fewest mistakes.

Figure 1. Classification Examples.

Decision boundaries and support vectors


In classification tasks, we consider ways to label, or partition, data points into distinct categories called classes. For demonstration, let us first consider a binary classification task, for which we have $N$ measurements $\mathbf{x}_n$ of training data, each corresponding to a target value $y_n = \pm 1$. The learning task is to build a linear function

$$f(\mathbf{x}) = \sum_{i=1}^{D} w_i x_i + b \tag{1}$$

where the parameters $\mathbf{w} = (w_1, \ldots, w_D)$ and bias $b$ are optimised to fit the dataset $(\mathbf{x}_n, y_n)$. This function defines the decision boundary $f(\mathbf{x}) = 0$ between the two classes $\pm 1$.
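
As a concrete illustration, here is a minimal sketch of Eq. (1) in Python with NumPy; the weights and bias below are placeholder values chosen for a two-dimensional example, not fitted parameters.

```python
import numpy as np

def f(x, w, b):
    """Linear decision function of Eq. (1): f(x) = w.x + b."""
    return np.dot(w, x) + b

# Placeholder parameters for a 2-D example (not optimised).
w = np.array([0.5, -1.0])
b = 0.25

x_new = np.array([1.0, 0.5])
score = f(x_new, w, b)
predicted_class = 1 if score > 0 else -1  # the sign of f(x) gives the class
print(score, predicted_class)
```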

We shall see shortly that the linear-model assumption in SVMs may easily be made non-linear via the kernel trick.

Our strategy shall be to maximise the margin between the two classes; that is, to maximise the perpendicular distance between $f(\mathbf{x}) = 0$ and the closest data point(s). Notice that any correctly classified point will satisfy $y_n f(\mathbf{x}_n) > 0$. The margin distance of the $n$-th point is given by $y_n f(\mathbf{x}_n) / \|\mathbf{w}\|$.
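
To make this concrete, the short sketch below evaluates $y_n f(\mathbf{x}_n) / \|\mathbf{w}\|$ on a small toy dataset; both the data and the parameters are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy 2-D data: two points per class (chosen only for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# Assumed parameters of the linear model f(x) = w.x + b.
w = np.array([1.0, 1.0])
b = 0.0

f_vals = X @ w + b
margins = y * f_vals / np.linalg.norm(w)  # perpendicular distance to f(x) = 0
print(margins)        # all positive, so every point is correctly classified
print(margins.min())  # the margin the SVM seeks to maximise
```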

Figure 2. The SVM's objective: maximise the margin width between the two classes based on the closest points to the central boundary.

This objective, to maximise the margin, may be reformulated as minimising the squared norm of the weights, $\frac{1}{2}\|\mathbf{w}\|^2$, subject to the constraint that each point lies on or outside the margin, $y_n f(\mathbf{x}_n) \geq 1$.
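
Written out in full, this gives the standard hard-margin optimisation problem, in the notation of Eq. (1):

$$\min_{\mathbf{w},\,b} \; \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{subject to} \quad y_n f(\mathbf{x}_n) \geq 1, \quad n = 1, \ldots, N.$$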

An aside on rewriting the objective

For those interested, we can write our objective as $$\arg\max_{\mathbf{w},b} \left\{ \frac{1}{\|\mathbf{w}\|} \min_{n}\big(y_n f(\mathbf{x}_{n})\big) \right\}.$$ Since we are free to rescale $\mathbf{w}$ and $b$ by a constant, we can without loss of generality assume that $y_n f(\mathbf{x}_n) \geq 1$, with the minimum landing on $1$ exactly. The problem of maximising $\|\mathbf{w}\|^{-1}$ is then equivalent to minimising $\|\mathbf{w}\|^{2}$.

Solution to the optimisation problem


The answer to this problem is quite brilliant and goes back to ideas of the mathematicians Lagrange and Euler in the 18th century. Each constraint is inserted into the objective by multiplying it by a constant $a_n$, one for each $n$, giving a Lagrangian built around the model in Eq. (1). Differentiating the Lagrangian with respect to $\mathbf{w}$ and $b$ and setting the results to zero transforms the problem into a 'dual' form, whose solution takes the form

$$f(\mathbf{x}) = \sum_{n=1}^{N} a_n y_n (\mathbf{x}^{T} \mathbf{x}_{n}) + b \tag{2}$$

where $a_n$ denotes the Lagrange multiplier corresponding to the $n$-th data point, alongside the constraints

$$a_n \geq 0; \quad y_n f(\mathbf{x}_n) - 1 \geq 0; \quad a_n \big( y_n f(\mathbf{x}_n) - 1 \big) = 0.$$
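
To connect Eq. (2) with the original objective, recall the standard Lagrangian construction in this notation:

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^{2} - \sum_{n=1}^{N} a_n \big( y_n f(\mathbf{x}_n) - 1 \big).$$

Setting the derivatives with respect to $\mathbf{w}$ and $b$ to zero gives $\mathbf{w} = \sum_{n} a_n y_n \mathbf{x}_n$ and $\sum_{n} a_n y_n = 0$; substituting the first of these back into $f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$ recovers Eq. (2).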

The final constraint implies that either $y_n f(\mathbf{x}_n) = 1$, or $a_n = 0$ and $\mathbf{x}_n$ does not contribute to the model sum in Eq. (2).

This concept of margin introduces the notion of a special set of vectors, the support vectors, that govern the final model. These are the vectors that sit on the edge of the margin, with $a_n \neq 0$. This solution is important for two reasons:

  • The solution is sparse and very lightweight. Even with a tremendously large number of data points $N$, the set of support vectors used at inference time is typically far smaller, as the sketch after this list illustrates.
  • The support vectors make the (eventually non-parametric) model interpretable: they are the data points that most delicately affect the classification and can highlight unseen structure in the data.
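
A minimal sketch of these two points using scikit-learn's SVC with a linear kernel (assuming scikit-learn and NumPy are available); the toy data and the very large value of C are choices made here to approximate the hard-margin problem above. In scikit-learn, dual_coef_ stores the products $a_n y_n$ for the support vectors only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated toy clusters; relabel targets to {-1, +1}.
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=0.8, random_state=0)
y = 2 * y - 1

# A large C approximates the hard-margin constrained problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("training points:", len(X))
print("support vectors:", len(clf.support_vectors_))  # typically far fewer

# Reconstruct f(x) from the dual solution, Eq. (2):
x_new = X[:5]
f_manual = x_new @ (clf.dual_coef_ @ clf.support_vectors_).ravel() + clf.intercept_
print(np.allclose(f_manual, clf.decision_function(x_new)))  # True
```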