
by Prof Tim Dodwell


General Linear Models

11. Looking through the Right Lens - Principal Component Analysis

In this explainer we look at our first unsupervised learning algorithm, Principal Component Analysis (PCA), which is used for dimension reduction. By the end of this section you will:

  1. Understand the key principles of PCA: what it is and why it is used.
  2. Understand the key concepts of what is actually calculated when PCA is run.
  3. Be introduced to the key concept of the explained variance (ratio) for choosing the amount of model reduction.
  4. (Optional) Understand why the principal components are eigenvectors of the covariance matrix of your data set.
  5. Walk through an example of PCA on a real-world data set.

Principal Component Analysis (PCA) is a linear dimension reduction method. The method is unsupervised, and so we just have a set of data

{\bf X} := \{ {\bf x}_1, {\bf x}_2, \ldots, {\bf x}_N \}

and no labels. Each sample is a vector representation of the input, so that {\bf x}_j \in \mathbb R^D; this means each sample is represented by D numbers, or, put another way, lives in a D-dimensional input / sample space. Often input data can be very high dimensional, i.e. D is very large.

A good example of this, which we often encounter, is images. Take for example a picture made up of 640 by 480 pixels, in which each pixel is represented by 3 color values - Red, Green, Blue (RGB). This gives a total number of dimensions of 640 \times 480 \times 3 = 921,600.
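As a quick illustration of this arithmetic, here is a minimal numpy sketch, with a randomly generated array standing in for a real photo; flattening such an image into a single sample vector gives exactly this number of dimensions.

```python
import numpy as np

# A stand-in for a 640 x 480 RGB image: height x width x 3 colour channels
image = np.random.rand(480, 640, 3)

# Flatten the image into a single sample vector x in R^D
x = image.reshape(-1)
print(x.shape)  # (921600,)  i.e. D = 640 * 480 * 3
```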

High-dimensional data is often challenging to analyse, interpret, visualise and store. Whilst data might be represented in a high-dimensional way, it is often much more structured and correlated than this.

Dimension reduction methods seek to exploit this structure and correlation to give a compact, lower-dimensional representation of the data, without losing significant information.


We are interested in finding a compact representation of the data {\bf z}_j \in \mathbb R^M, where M is much less than D, the original dimension of {\bf x}_j. This dimension reduction is achieved by a matrix operation {\bf V}^T, such that

{\bf z}_j = {\bf V}^T{\bf x}_j

where {\bf V} := [ {\bf v}_1, \ldots, {\bf v}_M ] \in \mathbb R^{D \times M}, and the reconstruction is given by the matrix operation

\tilde{{\bf x}}_j = {\bf V}{\bf z}_j.

To visualise these matrix operations we have the figure below.

[Figure: the projection {\bf z}_j = {\bf V}^T{\bf x}_j and the reconstruction \tilde{{\bf x}}_j = {\bf V}{\bf z}_j shown as matrix operations]
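As a minimal numpy sketch of these two operations (using a random matrix with orthonormal columns as a stand-in for {\bf V}, since we have not yet computed the principal components):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 6, 2, 100                     # original dimension, reduced dimension, sample count

X = rng.normal(size=(N, D))             # data matrix, one sample x_j per row

# Stand-in for V: a D x M matrix with orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(D, M)))

x_j = X[0]                              # a single sample
z_j = V.T @ x_j                         # compact representation z_j = V^T x_j, shape (M,)
x_tilde_j = V @ z_j                     # reconstruction x~_j = V z_j, shape (D,)

print(z_j.shape, x_tilde_j.shape)       # (2,) (6,)
```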

PCA is a linear dimension reduction method, since the components of the new representation {\bf z} are linear combinations of the full representation {\bf x}. The question now is how we pick the columns {\bf v}_j. We make two choices:

  • Choose the directions {\bf v}_j so they capture the maximum variance of the original data set {\bf X}.
  • Choose the columns {\bf v}_i and {\bf v}_j so they are orthogonal, i.e. {\bf v}_j \cdot {\bf v}_i = 0 for i \neq j.

To visualise what this looks like on a simple data set, we see two principal components below. The first principal component points in the direction of greatest variance, and the second component is orthogonal (i.e. at 90 degrees) to the first.

[Figure: a simple 2D data set with its two orthogonal principal components]
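A small numerical check of these two properties (maximum variance and orthogonality); this is a sketch using sklearn on a synthetic correlated 2D data set, where the variable names and noise level are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Correlated 2D data: the second feature roughly follows the first
x1 = rng.normal(size=500)
x2 = 0.5 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2).fit(X)
v1, v2 = pca.components_                # rows are the principal directions

print(v1 @ v2)                          # ~0: the two components are orthogonal
print(pca.explained_variance_)          # variance captured along each direction
```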

The principal components {\bf v}_j are in fact eigenvectors of the covariance matrix (here we assume the data has been centred, so that each feature has zero mean)

{\bf S} = \frac{1}{N}\sum_{j=1}^N{\bf x}_{j}{\bf x}_j^{T},

i.e. they are solutions of the equation

{\bf S}{\bf v} = \lambda {\bf v}.

The matrix {\bf S} is symmetric, with D eigenvectors and associated eigenvalues, which we order such that

\lambda_1 > \lambda_2 > \ldots > \lambda_D.

Each eigenvalue measures the contribution its eigenvector makes towards the total variance; normalised by the total, this fraction is called the explained variance (ratio)

\alpha_i = \frac{\lambda_i}{\sum_{k=1}^D \lambda_k}.
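To make this concrete, here is a short numpy sketch on synthetic correlated data that builds {\bf S} from centred samples, solves the symmetric eigenproblem and computes the explained variance ratios \alpha_i:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 200, 4
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # synthetic correlated data

Xc = X - X.mean(axis=0)                 # centre the data so S is the covariance matrix
S = (Xc.T @ Xc) / N                     # S = (1/N) sum_j x_j x_j^T

eigvals, eigvecs = np.linalg.eigh(S)    # eigendecomposition of the symmetric matrix S
order = np.argsort(eigvals)[::-1]       # sort so that lambda_1 > lambda_2 > ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

alpha = eigvals / eigvals.sum()         # explained variance ratios alpha_i
print(alpha)
print(np.cumsum(alpha))                 # cumulative explained variance
```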

Plotting \alpha_i against the component index shows a natural cutoff for how many principal components M should be included in the reduced-order model without losing too much information.

To show a conceptual plot of what this looks like, see below.

[Figure: explained variance per principal component, with the cumulative explained variance shown as a curve]

The plot shows the individual contributions of the principal components to the explained variance (as a percentage). The orange curve shows the cumulative effect. What we mean by this is that if you include the first three components you retain 90% of the variance of the original data set.

We will show a practical example of this in the Python walkthrough for PCA.
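As a taster, here is a minimal sklearn sketch (using the small Iris data set purely as a stand-in for a real-world data set) that reads off the explained variance ratios and picks the smallest M capturing 90% of the variance:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                            # 150 samples, D = 4 features

pca = PCA().fit(X)                              # keep all components for now
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest M whose cumulative explained variance reaches 90%
M = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative, M)

Z = PCA(n_components=M).fit_transform(X)        # compact representation z_j
print(Z.shape)
```

Note that sklearn also lets you pass a float to n_components (e.g. PCA(n_components=0.90)) to select the number of components reaching that fraction of explained variance directly.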

PCA and General Linear Models


In linear models, we extended the concept away from simple linear functions, e.g. f(x) = w_0 + w_1x, to more general representations built from linear combinations of general functions, so that

f(x) = w_0 + w_1\phi_1(x) + \ldots + w_M\phi_M(x) = w_0 + \sum_{i=1}^M w_i\phi_i(x)

where we call the \phi_i(x) the basis functions.

An interesting question is how you choose the basis functions. You could engineer them from experience: for example, if you were modelling a system that depended on the tide, you could put in some sensible periodic functions.

In general you might not know what to choose, and in that case the principal components become a natural choice. This means we take

\phi_i({\bf x}) = {\bf v}_i^T{\bf x} = z_i

so this would be a sum of purely linear features, but it can be extended to higher-order polynomials by taking

\phi_{(k-1)M + i}({\bf x}) = z_i^k.
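One way to realise this in practice is a sketch like the following, an illustrative sklearn pipeline on synthetic data; note that PolynomialFeatures also adds cross terms z_i z_j, a slight generalisation of the pure powers z_i^k above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Hypothetical regression problem with correlated inputs
N, D = 300, 20
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=N)

# phi_i(x) = v_i^T x = z_i : use the leading principal components as features,
# then extend them to higher-order polynomial features of the z_i
model = make_pipeline(
    PCA(n_components=3),                            # z = V^T x
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, y)
print(model.score(X, y))                            # R^2 on the training data
```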

Concluding Remarks


PCA is an unsupervised learning algorithm. It is a parametric model, since it imposes only a linear projection / transformation of the data.

It is a very useful method which is widely used in the data preparation, input representation and exploration stages of a machine learning workflow. As an AI practitioner, it tells me how correlated the inputs / features I am going to use are.

We see that many machine learning workflows use algorithms that are neither purely unsupervised nor purely supervised, but combine the good bits of both! For example, we have shown how we can use PCA to inform the basis functions we use in a General Linear Model.

Bottom line, PCA is a very useful tool to have in your toolbox. Implementations in packages like sklearn mean that it is just a few lines of code.

Bonus Information


A Bit of Maths - not required for using PCA. There are packages that implement PCA, so you don't have to worry about how to calculate the eigenvectors and eigenvalues of {\bf S} if you don't want to. This becomes more complex when data sets or dimensions are large. For those interested in the question "Why are the principal components the eigenvectors of {\bf S}?"... here it is.

We wish to maximise the variance of the first reduced coordinate z_1 of {\bf z}, which is given by

V_1 = \frac{1}{N}\sum_{j=1}^N z_{1j}^2.

Note that z_{1j} = {\bf v}_1^T{\bf x}_{j}; this is the projection of sample {\bf x}_{j} onto the direction {\bf v}_1. So we can rewrite the variance as

V_1 = \frac{1}{N}\sum_{j=1}^N({\bf v}_1^T{\bf x}_{j})^2 = \frac{1}{N}\sum_{j=1}^N {\bf v}_1^T{\bf x}_{j}{\bf x}_j^{T}{\bf v}_1 = {\bf v}_1^T\left(\frac{1}{N}\sum_{j=1}^N{\bf x}_{j}{\bf x}_j^{T}\right){\bf v}_1 = {\bf v}_1^T{\bf S}{\bf v}_1.

We wish to find the {\bf v}_1 which maximises this variance, but note that we could make the magnitude of {\bf v}_1 arbitrarily large, so this problem is ill-posed. We therefore have to solve a constrained maximisation problem, such that

\max_{{\bf v}_1} \left[{\bf v}_1^T{\bf S}{\bf v}_1 \right]\quad \text{such that} \quad \|{\bf v}_1\|^2 = {\bf v}_1^T {\bf v}_1 = 1.

To handle the constraint we write down the Lagrangian

\mathcal L = {\bf v}_1^T{\bf S}{\bf v}_1 + \lambda_1 \left(1 - {\bf v}_1^T {\bf v}_1 \right).

We can then find stationary points of \mathcal L by differentiating with respect to {\bf v}_1 and \lambda_1, so that

\frac{\partial\mathcal L}{\partial{\bf v}_1} = 2{\bf v}_1^T{\bf S} - 2\lambda_1{\bf v}_1^T \quad \text{and} \quad \frac{\partial\mathcal L}{\partial\lambda_1} = 1 - {\bf v}_1^T {\bf v}_1.

Setting these partial derivatives to zero gives us our eigenproblem

{\bf S}{\bf v}_1 = \lambda_1 {\bf v}_1 \quad \text{and} \quad {\bf v}_1^T{\bf v}_1 = 1.

Subsequent eigenvectors then give us maximisers of \mathcal L orthogonal to {\bf v}_1. Hence we get the ordered sequence of eigenvectors (principal components).

We don't show it here, but it is possible to show that maximising the variance of the projected data is the same as minimising the reconstruction error over the data set, i.e.

\mathcal J = \frac{1}{N}\sum_{j=1}^N\|{\bf x}_j - \tilde{{\bf x}}_j\|^2 = \frac{1}{N}\sum_{j=1}^N\|{\bf x}_j - {\bf V}{\bf z}_j\|^2.
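As a quick numerical check of this connection, here is a numpy sketch on synthetic, centred data; the reconstruction error over the data set comes out equal to the sum of the discarded eigenvalues, which is exactly what is left over once the leading M directions of variance have been kept.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, M = 500, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)                 # centred data

S = (Xc.T @ Xc) / N
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

V = eigvecs[:, :M]                      # first M principal components
Z = Xc @ V                              # z_j = V^T x_j for every sample
X_tilde = Z @ V.T                       # reconstructions V z_j

J = np.mean(np.sum((Xc - X_tilde) ** 2, axis=1))
print(J, eigvals[M:].sum())             # the two values agree
```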