by Prof Tim Dodwell

12. Looking in the Right Direction - Exploring PCA

In this quick explainer we walk through our first simple example of Principal Component Analysis (PCA).

We will see how to compute the principal components, how to transform our data so it is represented in this new space, and then how to build a predictive model in the reduced feature space.
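
Before reaching for sklearn, it helps to see what PCA actually computes. The short numpy sketch below (illustrative only, on made-up random data rather than the wine data) centres a data matrix, finds the principal components via the singular value decomposition, and projects the data onto them; this mirrors what sklearn's PCA will do for us below.

# Illustrative PCA from scratch on made-up data (not the wine data)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features

X_centred = X - X.mean(axis=0)           # PCA works on centred data

# The principal components are the right singular vectors of the centred data
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

X_pc = X_centred @ Vt.T                  # project the data onto the components

explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()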

Import libraries


import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

The data we will use in this example is a red wine quality dataset: lots of different measurements for each wine, with a quality score out of 10 as the target value.

In total we have data for 1599 wines, so let's load the dataset and pull out a few random samples.

df = pd.read_csv('winequality-red.csv', sep=";")
df.sample(5)
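
If you want to confirm the size of the table we have just loaded, a quick check of its shape (not a line from the original notebook) should show 1599 rows and 12 columns: 11 input measurements plus the quality score.

print(df.shape)   # expecting (1599, 12): 11 measurements plus quality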

So now we pull out our target variable, quality, and then do the standard train/test split.

from sklearn.model_selection import train_test_split

target = 'quality'
X = df.drop(target,axis=1)
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Re-scaling your data


Again, like most data science workflows, rescaling the data is an important step. If we don't do this, variables which happen to take larger values (simply because of what they measure) will dominate the results. We therefore standardise each input feature to zero mean and unit variance, using statistics computed on the training set.

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same
# transformation to the test data so no test statistics leak in
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
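
If you want to convince yourself the transformation has done its job, a quick sanity check (not part of the original walkthrough) is to look at the mean and standard deviation of each scaled training feature; they should come out at roughly 0 and 1.

# Each scaled training feature should have roughly zero mean and unit standard deviation
print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))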

Calculate Principal Components


Calling PCA in sklearn is nice and easy.

from sklearn.decomposition import PCA

pca = PCA() # Default n_components = min(n_samples, n_features)
X_train_pc = pca.fit_transform(X_train_scaled)
X_test_pc = pca.transform(X_test_scaled)

Once we have fitted the PCA transformation, behind the scenes sklearn will have computed the explained variance ratio (explained_variance_ratio_). We can then also calculate the accumulated (cumulative) explained variance using np.cumsum().

explained_var = pca.explained_variance_ratio_

accum_var = np.cumsum(explained_var)
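
As an aside, the cumulative variance gives a simple rule of thumb for choosing how many components to keep. The 95% threshold below is just an example, not something fixed by the lesson.

# Smallest number of components explaining at least 95% of the variance
n_components_95 = np.argmax(accum_var >= 0.95) + 1
print(f"{n_components_95} components explain at least 95% of the variance")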

Once calculated, we can make a nice plot with two y-axes.

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()

ax1.plot(np.arange(0, explained_var.shape[0]), explained_var, '-ob', alpha = 0.5)

ax2.plot(np.arange(0, explained_var.shape[0]), accum_var, '-o',c='lightcoral', alpha = 0.5)

ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance', color='b')
ax2.set_ylabel('Accumulated Variance', color='lightcoral')

Looking at High-Dimensional Data :)


PCA also gives us a natural way to look at high-dimensional data: we plot the training set projected onto the first two principal components, colouring each point by its quality score.

fig, ax = plt.subplots(figsize=(6, 4))

# Scatter of the first two principal components, coloured by wine quality
sc = ax.scatter(X_train_pc[:, 0], X_train_pc[:, 1], c=y_train, alpha=0.7, cmap='coolwarm')

cbar = fig.colorbar(sc, ax=ax, label='Quality')
cbar.solids.set_edgecolor("face")

ax.set_xlabel('Principal Component 0')
ax.set_ylabel('Principal Component 1')

Build a Predictive Model


So here we are going to build a predictive model using a Gradient Boosting Classifier. These are generally among the simplest and best-performing multi-class classifiers to use out of the box; they are my go-to.

from sklearn.ensemble import GradientBoostingClassifier

perf = []
num_pca = []

# Train a classifier on an increasing number of leading principal components
for i in np.arange(1, X_train_pc.shape[1] + 1):

    clf = GradientBoostingClassifier(n_estimators=2000, learning_rate=1.0, max_depth=2, random_state=0).fit(X_train_pc[:, :i], y_train)

    num_pca.append(i)
    perf.append(clf.score(X_test_pc[:, :i], y_test))

Let's make another double y-axis plot.

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()


ax1.plot(num_pca, perf, '-ob', alpha = 0.5)

ax2.plot(num_pca, (1.0 - np.array(perf)/perf[-1])*100., '-o', c='lightcoral', alpha = 0.5)


ax1.set_xlabel('Number of Principal Components')
ax1.set_ylabel('Score', color='b')
ax2.set_ylabel('Loss in Performance (%)', color='lightcoral')

"Score" is a harsh metric for performance for the algorithm. So probably more important look at the relative metrics. We see that even with just 2 principal components we can obtain model with a score with 10% of the full input model.