
by Prof Tim Dodwell


General Linear Models

11. Looking through the Right Lens - Principal Component Analysis

In this explainer we look at our first unsupervised learning algorithm, Principal Component Analysis (PCA), which is used for dimension reduction. By the end of this section you will:

  1. Understand the key principles of PCA: what it is and why it is used.
  2. Understand the key concepts of what is actually calculated when PCA is run.
  3. Be introduced to the key concept of the explained variance (ratio) for choosing the amount of model reduction.
  4. (Optional) Understand why the principal components are eigenvectors of the covariance matrix of your data set.
  5. Walk through an example of PCA on a real-world data set.

Principal Component Analysis (PCA) is a linear dimension reduction method. The method is unsupervised, and so we just have a set of data

{\bf X} := \{ {\bf x}_1, {\bf x}_2, \ldots, {\bf x}_N \}

and no labels. Each sample is a vector representation of the input, so that {\bf x}_j \in \mathbb R^D; this means each sample is represented by D numbers, or, put another way, lives in a D-dimensional input / sample space. Often input data can be very high dimensional, i.e. D is very large.

A good example of this, which we often encounter, is images. Take for example a picture made up of 640 by 480 pixels, in which each pixel is represented by 3 color values - Red, Green, Blue (RGB). This gives a total number of dimensions of 640 \times 480 \times 3 = 921,600.
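As a quick illustration of this arithmetic, here is a minimal numpy sketch, with a randomly generated array standing in for a real photo; flattening such an image into a single sample vector gives exactly this number of dimensions.

```python
import numpy as np

# A stand-in for a 640 x 480 RGB image: height x width x 3 colour channels
image = np.random.rand(480, 640, 3)

# Flatten the image into a single sample vector x in R^D
x = image.reshape(-1)
print(x.shape)  # (921600,)  i.e. D = 640 * 480 * 3
```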

High-dimensional data is often challenging to analyse, interpret, visualise and store. Whilst data might be represented in a high-dimensional way, it is often much more structured and correlated than this.

Dimension reduction methods seek to exploit this structure and correlation to give a compact, lower-dimensional representation of the data, without losing significant information.


We are interested in finding a compact representation of the data {\bf z}_j \in \mathbb R^M, where M is much less than D, the original dimension of {\bf x}_j. This dimension reduction is achieved by a matrix operation {\bf V}^T, such that

{\bf z}_j = {\bf V}^T{\bf x}_j

where {\bf V} := [ {\bf v}_1, \ldots, {\bf v}_M ] \in \mathbb R^{D \times M}, and the reconstruction is given by the matrix operation

\tilde{{\bf x}}_j = {\bf V}{\bf z}_j.

To visualise these matrix operations we have the figure below.

[Figure: the projection {\bf z}_j = {\bf V}^T{\bf x}_j and the reconstruction \tilde{{\bf x}}_j = {\bf V}{\bf z}_j shown as matrix operations]
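As a minimal numpy sketch of these two operations (using a random matrix with orthonormal columns as a stand-in for {\bf V}, since we have not yet computed the principal components):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 6, 2, 100                     # original dimension, reduced dimension, sample count

X = rng.normal(size=(N, D))             # data matrix, one sample x_j per row

# Stand-in for V: a D x M matrix with orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(D, M)))

x_j = X[0]                              # a single sample
z_j = V.T @ x_j                         # compact representation z_j = V^T x_j, shape (M,)
x_tilde_j = V @ z_j                     # reconstruction x~_j = V z_j, shape (D,)

print(z_j.shape, x_tilde_j.shape)       # (2,) (6,)
```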

PCA is a linear dimension reduction method, since the components of the new representation {\bf z} are linear combinations of the full representation {\bf x}. The question now is how we pick the columns {\bf v}_j. We make two choices:

  • Choose the directions {\bf v}_j so they capture the maximum variance of the original data set {\bf X}.
  • Choose the columns {\bf v}_i and {\bf v}_j so they are orthogonal, i.e. {\bf v}_j \cdot {\bf v}_i = 0 for i \neq j.

To visualise what this looks like on a simple data set, we see two principal components below. The first principal component points in the direction of greatest variance, and the second component is orthogonal (i.e. at 90 degrees) to the first.

[Figure: a simple 2D data set with its two orthogonal principal components]
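A small numerical check of these two properties (maximum variance and orthogonality); this is a sketch using sklearn on a synthetic correlated 2D data set, where the variable names and noise level are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Correlated 2D data: the second feature roughly follows the first
x1 = rng.normal(size=500)
x2 = 0.5 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2).fit(X)
v1, v2 = pca.components_                # rows are the principal directions

print(v1 @ v2)                          # ~0: the two components are orthogonal
print(pca.explained_variance_)          # variance captured along each direction
```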

The principal components {\bf v}_j are in fact eigenvectors of the covariance matrix (here we assume the data has been centred, so that each feature has zero mean)

{\bf S} = \frac{1}{N}\sum_{j=1}^N{\bf x}_{j}{\bf x}_j^{T},

i.e. they are solutions of the equation

{\bf S}{\bf v} = \lambda {\bf v}.

The matrix {\bf S} is symmetric, with D eigenvectors and associated eigenvalues, which we order such that

\lambda_1 > \lambda_2 > \ldots > \lambda_D.

Each eigenvalue measures the contribution its eigenvector makes towards the total variance; normalised by the total, this fraction is called the explained variance (ratio)

\alpha_i = \frac{\lambda_i}{\sum_{k=1}^D \lambda_k}.
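To make this concrete, here is a short numpy sketch on synthetic correlated data that builds {\bf S} from centred samples, solves the symmetric eigenproblem and computes the explained variance ratios \alpha_i:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 4
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # synthetic correlated data

Xc = X - X.mean(axis=0)                 # centre the data so S is the covariance matrix
S = (Xc.T @ Xc) / N                     # S = (1/N) sum_j x_j x_j^T

eigvals, eigvecs = np.linalg.eigh(S)    # eigendecomposition of the symmetric matrix S
order = np.argsort(eigvals)[::-1]       # sort so that lambda_1 > lambda_2 > ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

alpha = eigvals / eigvals.sum()         # explained variance ratios alpha_i
print(alpha)
print(np.cumsum(alpha))                 # cumulative explained variance
```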

Plotting \alpha_i against the component index shows a natural cutoff for how many principal components M should be included in the reduced-order model without losing too much information.

To show a conceptual plot of what this looks like, see below.

[Figure: explained variance per principal component, with the cumulative explained variance shown as a curve]

The plot shows the individual contributions of the principal components to the explained variance (as a percentage). The orange curve shows the cumulative effect. What we mean by this is that if you include the first three components you retain 90% of the variance of the original data set.

We will show a practical example of this in the Python walkthrough for PCA.
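As a taster, here is a minimal sklearn sketch (using the small Iris data set purely as a stand-in for a real-world data set) that reads off the explained variance ratios and picks the smallest M capturing 90% of the variance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                            # 150 samples, D = 4 features

pca = PCA().fit(X)                              # keep all components for now
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest M whose cumulative explained variance reaches 90%
M = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative, M)

Z = PCA(n_components=M).fit_transform(X)        # compact representation z_j
print(Z.shape)
```

Note that sklearn also lets you pass a float to n_components (e.g. PCA(n_components=0.90)) to select the number of components reaching that fraction of explained variance directly.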

PCA and General Linear Models


In linear models, we extended the concept away from simple linear functions, e.g. f(x) = w_0 + w_1x, to more general representations built from linear combinations of general functions, so that

f(x) = w_0 + w_1\phi_1(x) + \ldots + w_M\phi_M(x) = w_0 + \sum_{i=1}^M w_i\phi_i(x)

where we call the \phi_i(x) the basis functions.

An interesting question is how you choose the basis functions. You could engineer them from experience: for example, if you were modelling a system that depended on the tide, you could put in some sensible periodic functions.

In general you might not know what to choose, and in that case the principal components become a natural choice. This means we take

\phi_i({\bf x}) = {\bf v}_i^T{\bf x} = z_i

so this would be a sum of purely linear features, but it can be extended to higher-order polynomials by taking

\phi_{(k-1)M + i}({\bf x}) = z_i^k.
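One way to realise this in practice is a sketch like the following, an illustrative sklearn pipeline on synthetic data; note that PolynomialFeatures also adds cross terms z_i z_j, a slight generalisation of the pure powers z_i^k above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Hypothetical regression problem with correlated inputs
N, D = 300, 20
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=N)

# phi_i(x) = v_i^T x = z_i : use the leading principal components as features,
# then extend them to higher-order polynomial features of the z_i
model = make_pipeline(
    PCA(n_components=3),                            # z = V^T x
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, y)
print(model.score(X, y))                            # R^2 on the training data
```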

Concluding Remarks


PCA is an unsupervised learning algorithm. It is a parametric model, since it imposes only a linear projection / transformation of the data.

It is a very useful method which is widely used in the data preparation, input representation and exploration stages of a machine learning workflow. As an AI practitioner, it tells me how correlated the inputs / features I am going to use are.

We see that many machine learning workflows use algorithms that are neither purely unsupervised nor purely supervised, but combine the good bits of both! For example, we have shown how we can use PCA to inform the basis functions we use in a General Linear Model.

Bottom line, PCA is a very useful tool to have in your toolbox. Implementations in packages like sklearn mean that it is just a few lines of code.

Bonus Information


A Bit of Maths - not required for using PCA. There are packages that implement PCA, so you don't have to worry about how to calculate the eigenvectors and eigenvalues of {\bf S} if you don't want to. This becomes more complex when data sets or dimensions are large. For those interested in the question "Why are the principal components the eigenvectors of {\bf S}?"... here it is.

We wish to maximise the variance of the first reduced coordinate z_1 of {\bf z}, which is given by

V_1 = \frac{1}{N}\sum_{j=1}^N z_{1j}^2.

Note that z_{1j} = {\bf v}_1^T{\bf x}_{j}; this is the projection of sample {\bf x}_{j} onto the direction {\bf v}_1. So we can rewrite the variance as

V_1 = \frac{1}{N}\sum_{j=1}^N({\bf v}_1^T{\bf x}_{j})^2 = \frac{1}{N}\sum_{j=1}^N {\bf v}_1^T{\bf x}_{j}{\bf x}_j^{T}{\bf v}_1 = {\bf v}_1^T\left(\frac{1}{N}\sum_{j=1}^N{\bf x}_{j}{\bf x}_j^{T}\right){\bf v}_1 = {\bf v}_1^T{\bf S}{\bf v}_1.

We wish to find the {\bf v}_1 which maximises this variance, but note that we could make the magnitude of {\bf v}_1 arbitrarily large, so this problem is ill-posed. We therefore have to solve a constrained maximisation problem, such that

\max_{{\bf v}_1} \left[{\bf v}_1^T{\bf S}{\bf v}_1 \right]\quad \text{such that} \quad \|{\bf v}_1\|^2 = {\bf v}_1^T {\bf v}_1 = 1.

To handle the constraint we write down the Lagrangian

\mathcal L = {\bf v}_1^T{\bf S}{\bf v}_1 + \lambda_1 \left(1 - {\bf v}_1^T {\bf v}_1 \right).

We can then find stationary points of \mathcal L by differentiating with respect to {\bf v}_1 and \lambda_1, so that

\frac{\partial\mathcal L}{\partial{\bf v}_1} = 2{\bf v}_1^T{\bf S} - 2\lambda_1{\bf v}_1^T \quad \text{and} \quad \frac{\partial\mathcal L}{\partial\lambda_1} = 1 - {\bf v}_1^T {\bf v}_1.

Setting these partial derivatives to zero gives us our eigenproblem

{\bf S}{\bf v}_1 = \lambda_1 {\bf v}_1 \quad \text{and} \quad {\bf v}_1^T{\bf v}_1 = 1.

Subsequent eigenvectors then give us maximisers of \mathcal L orthogonal to {\bf v}_1. Hence we get the ordered sequence of eigenvectors (principal components).

We don't show it here, but it is possible to show that maximising the variance of the projected data is the same as minimising the reconstruction error over the data set, i.e.

\mathcal J = \frac{1}{N}\sum_{j=1}^N\|{\bf x}_j - \tilde{{\bf x}}_j\|^2 = \frac{1}{N}\sum_{j=1}^N\|{\bf x}_j - {\bf V}{\bf z}_j\|^2.
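As a quick numerical check of this connection, here is a numpy sketch on synthetic, centred data; the reconstruction error over the data set comes out equal to the sum of the discarded eigenvalues, which is exactly what is left over once the leading M directions of variance have been kept.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, M = 500, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)                 # centred data

S = (Xc.T @ Xc) / N
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

V = eigvecs[:, :M]                      # first M principal components
Z = Xc @ V                              # z_j = V^T x_j for every sample
X_tilde = Z @ V.T                       # reconstructions V z_j

J = np.mean(np.sum((Xc - X_tilde) ** 2, axis=1))
print(J, eigvals[M:].sum())             # the two values agree
```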