by Dr Andy Corbett

Lesson

Support Vector Machines

8. Support Vector Machines to Maximise Decision Margins

📑 Learning Objectives
  • Understanding how a Support Vector Machine Classifier works.
  • Finding the distance from a data cluster to the decision boundary.
  • Maximising classification margins.
  • Solving the constrained optimisation problem.
  • Identifying the support vectors.

Classification is not always clear cut. In Fig. 1, when life gives you two neatly separated piles of lemons (left), the decision boundary is simple, and SVMs allow you to maximise the margin between them. When life mixes your lemons (right), we can still separate the piles while allowing for the fewest mistakes.

Figure 1. Classification Examples.

Decision boundaries and support vectors


In classification tasks, we consider ways to label, or partition, data points into distinct categories called classes. For demonstration, let us first consider a binary classification task, for which we have $N$ measurements $\mathbf{x}_n$ of training data, each corresponding to a target value $y_n = \pm 1$. The learning task is to build a linear function

$$f(\mathbf{x}) = \sum_{i=1}^{D} w_i x_i + b \tag{1}$$

where the parameters $\mathbf{w} = (w_1, \ldots, w_D)$ and bias $b$ are optimised to fit the dataset $(\mathbf{x}_n, y_n)$. This function defines the decision boundary $f(\mathbf{x}) = 0$ between the two classes $\pm 1$.
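
As a concrete illustration, here is a minimal sketch of Eq. (1) in Python with NumPy; the weights and bias below are placeholder values chosen for a two-dimensional example, not fitted parameters.

```python
import numpy as np

def f(x, w, b):
    """Linear decision function of Eq. (1): f(x) = w.x + b."""
    return np.dot(w, x) + b

# Placeholder parameters for a 2-D example (not optimised).
w = np.array([0.5, -1.0])
b = 0.25

x_new = np.array([1.0, 0.5])
score = f(x_new, w, b)
predicted_class = 1 if score > 0 else -1  # the sign of f(x) gives the class
print(score, predicted_class)
```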

We shall see shortly that the linear-model assumption in SVMs may easily be made non-linear via the kernel trick.

Our strategy shall be to maximise the margin between the two classes; that is, to maximise the perpendicular distance between $f(\mathbf{x}) = 0$ and the closest data point(s). Notice that any correctly classified point will satisfy $y_n f(\mathbf{x}_n) > 0$. The margin distance of the $n$-th point is given by $y_n f(\mathbf{x}_n) / \|\mathbf{w}\|$.
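
To make this concrete, the short sketch below evaluates $y_n f(\mathbf{x}_n) / \|\mathbf{w}\|$ on a small toy dataset; both the data and the parameters are assumptions chosen purely for illustration.

```python
import numpy as np

# Toy 2-D data: two points per class (chosen only for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

# Assumed parameters of the linear model f(x) = w.x + b.
w = np.array([1.0, 1.0])
b = 0.0

f_vals = X @ w + b
margins = y * f_vals / np.linalg.norm(w)  # perpendicular distance to f(x) = 0
print(margins)        # all positive, so every point is correctly classified
print(margins.min())  # the margin the SVM seeks to maximise
```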

Figure 2. The SVM's objective: maximise the margin width between the two classes based on the closest points to the central boundary.

This objective, to maximise the margin, may be reformulated as minimising the squared norm of the weights, $\frac{1}{2}\|\mathbf{w}\|^2$, subject to the constraint that each point lies on or outside the margin, $y_n f(\mathbf{x}_n) \geq 1$.
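
Written out in full, this gives the standard hard-margin optimisation problem, in the notation of Eq. (1):

$$\min_{\mathbf{w},\,b} \; \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{subject to} \quad y_n f(\mathbf{x}_n) \geq 1, \quad n = 1, \ldots, N.$$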

An aside on rewriting the objective

For those interested, we can write our objective as $$\arg\max_{\mathbf{w},b} \left\{ \frac{1}{\|\mathbf{w}\|} \min_{n}\big(y_n f(\mathbf{x}_{n})\big) \right\}.$$ Since we are free to rescale $\mathbf{w}$ and $b$ by a constant, we can without loss of generality assume that $y_n f(\mathbf{x}_n) \geq 1$, with the minimum landing on $1$ exactly. The problem of maximising $\|\mathbf{w}\|^{-1}$ is then equivalent to minimising $\|\mathbf{w}\|^{2}$.

Solution to the optimisation problem


The answer to this problem is quite brilliant and goes back to ideas of the mathematicians Lagrange and Euler in the 18th century. Each constraint is inserted into the objective by multiplying it by a constant $a_n$, one for each $n$, giving a Lagrangian built around the model in Eq. (1). Differentiating the Lagrangian with respect to $\mathbf{w}$ and $b$ and setting the results to zero transforms the problem into a 'dual' form, whose solution takes the form

$$f(\mathbf{x}) = \sum_{n=1}^{N} a_n y_n (\mathbf{x}^{T} \mathbf{x}_{n}) + b \tag{2}$$

where $a_n$ denotes the Lagrange multiplier corresponding to the $n$-th data point, alongside the constraints

$$a_n \geq 0; \quad y_n f(\mathbf{x}_n) - 1 \geq 0; \quad a_n \big( y_n f(\mathbf{x}_n) - 1 \big) = 0.$$
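
To connect Eq. (2) with the original objective, recall the standard Lagrangian construction in this notation:

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^{2} - \sum_{n=1}^{N} a_n \big( y_n f(\mathbf{x}_n) - 1 \big).$$

Setting the derivatives with respect to $\mathbf{w}$ and $b$ to zero gives $\mathbf{w} = \sum_{n} a_n y_n \mathbf{x}_n$ and $\sum_{n} a_n y_n = 0$; substituting the first of these back into $f(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$ recovers Eq. (2).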

The final constraint implies that either $y_n f(\mathbf{x}_n) = 1$, or $a_n = 0$ and $\mathbf{x}_n$ does not contribute to the model sum in Eq. (2).

This concept of margin introduces the notion of a special set of vectors, the support vectors, that govern the final model. These are the vectors that sit on the edge of the margin, with $a_n \neq 0$. This solution is important for two reasons:

  • The solution is sparse and very lightweight. Even with a tremendously large number of data points $N$, the set of support vectors used at inference time is typically far smaller, as the sketch after this list illustrates.
  • The support vectors make the (eventually non-parametric) model interpretable: they are the data points that most delicately affect the classification and can highlight unseen structure in the data.
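
A minimal sketch of these two points using scikit-learn's SVC with a linear kernel (assuming scikit-learn and NumPy are available); the toy data and the very large value of C are choices made here to approximate the hard-margin problem above. In scikit-learn, dual_coef_ stores the products $a_n y_n$ for the support vectors only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated toy clusters; relabel targets to {-1, +1}.
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=0.8, random_state=0)
y = 2 * y - 1

# A large C approximates the hard-margin constrained problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("training points:", len(X))
print("support vectors:", len(clf.support_vectors_))  # typically far fewer

# Reconstruct f(x) from the dual solution, Eq. (2):
x_new = X[:5]
f_manual = x_new @ (clf.dual_coef_ @ clf.support_vectors_).ravel() + clf.intercept_
print(np.allclose(f_manual, clf.decision_function(x_new)))  # True
```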