by Prof Tim Dodwell
General Linear Models
12. Looking in the Right Direction - Exploring PCA
In this quick explainer we walk through our first simple example of Principal Component Analysis (PCA).
We will see how to compute the principal components, how to transform our data into this new space, and then how to build a predictive model in the new feature space.
Add libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
The data we will use in this example is a red wine quality data set: lots of different measurements for each wine, plus a quality rating out of 10 as the target value.
In total we have data for 1599 wines. So let's load the data set and pull out some random samples.
df = pd.read_csv('winequality-red.csv', sep=";")
df.sample(5)
So now we pull out our target variable quality, and then do the standard train / test split.
from sklearn.model_selection import train_test_split
target = 'quality'
X = df.drop(target,axis=1)
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Re-scaling your data
As in most data science workflows, rescaling the data is an important step. If we don't do this, variables which happen to take larger values (simply because of what they measure) will dominate the results. We therefore standardise each input variable so that it has zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only and reuse the same statistics for the test data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
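As a quick sanity check (an addition here, not part of the original walkthrough), we can confirm that each column of the scaled training data now has roughly zero mean and unit standard deviation.
# Sanity check: every column should now have mean ~0 and standard deviation ~1
print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))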
Calculate Principal Components
Calling PCA in sklearn is nice and easy.
from sklearn.decomposition import PCA
pca = PCA() # Default n_components = min(n_samples, n_features)
X_train_pc = pca.fit_transform(X_train_scaled)
X_test_pc = pca.transform(X_test_scaled)
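Under the hood, PCA is an eigendecomposition of the covariance matrix of the standardised data. The following is a minimal NumPy sketch of what pca.fit_transform is doing, included purely for intuition rather than taken from the original post; the directions it finds match sklearn's up to sign flips.
# Illustrative sketch of PCA by hand (assumes X_train_scaled from above)
X_centred = X_train_scaled - X_train_scaled.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)              # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)             # eigenpairs, returned in ascending order
order = np.argsort(eigvals)[::-1]                  # re-order by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
X_train_pc_manual = X_centred @ eigvecs            # projection onto the principal components
explained_ratio_manual = eigvals / eigvals.sum()   # matches pca.explained_variance_ratio_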
Once we have fitted the PCA transformation, behind the scenes sklearn will have computed explained_variance_ratio_. We can then also calculate the accumulated explained variance using np.cumsum().
explained_var = pca.explained_variance_ratio_
accum_var = np.cumsum(explained_var)
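A common use of the accumulated variance is to pick the smallest number of components that retain a given fraction of the variance. The 95% threshold below is an illustrative assumption, not something from the original walkthrough.
# Keep enough components to explain (say) 95% of the variance
threshold = 0.95
n_keep = np.argmax(accum_var >= threshold) + 1
print(f"{n_keep} components explain {accum_var[n_keep - 1]:.1%} of the variance")

# sklearn can also do this directly by passing the fraction of variance to retain
pca_95 = PCA(n_components=threshold).fit(X_train_scaled)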
With both calculated, we can make a nice plot with two y-axes.
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(np.arange(0, explained_var.shape[0]), explained_var, '-ob', alpha = 0.5)
ax2.plot(np.arange(0, explained_var.shape[0]), accum_var, '-o',c='lightcoral', alpha = 0.5)
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance', color='b')
ax2.set_ylabel('Accumulated Variance', color='lightcoral')
Looking at High-Dimensional Data :)
A simple first look is to project the training data onto the first two principal components and colour each point by its quality score.
fig, ax = plt.subplots(figsize=(6, 4))

# Scatter of the first two principal components, coloured by wine quality
sc = ax.scatter(X_train_pc[:, 0], X_train_pc[:, 1], c=y_train, alpha=0.7, cmap='coolwarm')
cbar = fig.colorbar(sc, ax=ax)
cbar.solids.set_edgecolor("face")
ax.set_xlabel('Principal Component 0')
ax.set_ylabel('Principal Component 1')
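It can also be useful to see which of the original measurements drive the first couple of components. The loadings live in pca.components_; this quick inspection is an illustrative addition rather than part of the original walkthrough.
# Each row of pca.components_ is a principal direction expressed in the original features
loadings = pd.DataFrame(pca.components_[:2].T, index=X.columns, columns=['PC0', 'PC1'])
print(loadings.sort_values('PC0', key=np.abs, ascending=False))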
Build a Predictive Model
So here we are going to build a predictive model using a Gradient Boosting Classifier. These are generally among the simplest and most effective multi-class classifiers to use out of the box. They are my go-to.
from sklearn.ensemble import GradientBoostingClassifier
perf = []
num_pca = []
# Train a classifier on the first i principal components, for i = 1 up to all of the components
for i in np.arange(1, X_train_pc.shape[1] + 1):
    clf = GradientBoostingClassifier(n_estimators=2000, learning_rate=1.0,
                                     max_depth=2, random_state=0).fit(X_train_pc[:, :i], y_train)
    num_pca.append(i)
    perf.append(clf.score(X_test_pc[:, :i], y_test))
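For reference (an addition, not in the original post), we can also train the same classifier on the scaled features directly, with no PCA at all, to see the "full input" baseline that the relative comparison below refers to.
# Baseline: same classifier trained on all of the original (scaled) features
clf_full = GradientBoostingClassifier(n_estimators=2000, learning_rate=1.0,
                                      max_depth=2, random_state=0).fit(X_train_scaled, y_train)
print(f"Score using all original features: {clf_full.score(X_test_scaled, y_test):.3f}")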
Let's pull out another double axis plot.
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(num_pca, perf, '-ob', alpha = 0.5)
perf = np.asarray(perf)  # convert to an array so the element-wise ratio below works
ax2.plot(num_pca, (1.0 - perf / perf[-1]) * 100., '-o', c='lightcoral', alpha=0.5)
ax1.set_xlabel('Number of Principal Components')
ax1.set_ylabel('Score', color='b')
ax2.set_ylabel('Loss in Performance (%)', color='lightcoral')
"Score" is a harsh metric for performance for the algorithm. So probably more important look at the relative metrics. We see that even with just 2 principal components we can obtain model with a score with 10% of the full input model.