by Prof Tim Dodwell
General Linear Models
8. Correlation & PPS Walkthrough
In this walkthrough we go through methods for finding linear correlations and exploring nonlinear relationships within new data sets. We look at a range of real datasets as well as synthetic examples to demonstrate the points.
Adding libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
The first data set - what is the body fat?
In this first data set we look at just over 250 measurements of males, for which we have lots of different body measurements as well as the results of a body fat measurement. Can we build a good predictive model linking the body measurements (which are easy to take at home) to the percentage of body fat (which is more complex to measure)?
As a starting point, can we understand the correlations in our data? These will guide us in building a good linear model.
Let us first look at the data
df = pd.read_csv('bodyfat.csv')
df.sample(5)
First we are going to do a quick flyby and show how to calculate the Pearson and Spearman correlations in our dataset. For this we use the well-known package scipy.stats, which makes this easy.
from scipy.stats import pearsonr, spearmanr
corr_p, pVal_p = pearsonr(x=df['Abdomen'], y=df['BodyFat'])
corr_s, pVal_s = spearmanr(df['Abdomen'], df['BodyFat'])
print(pearsonr(x=df['Abdomen'], y=df['BodyFat']))
print(spearmanr(df['Abdomen'], df['BodyFat']))
Here we get not only the correlations but also a p-value. The p-value tells us how surprising a correlation is: it is the probability that data with no underlying correlation would show a correlation at least as strong as the one we observed. Here we can see very strong correlations, and the p-values tell us we can be confident that they are inherent in the data we are observing and not simply an artefact of random sampling.
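To make the meaning of the p-value concrete, here is a minimal sketch of a permutation test. It uses made-up synthetic data (the arrays a and b are stand-ins, not the body fat data): we shuffle one variable many times to break any link between the two, and count how often chance alone gives a correlation at least as strong as the observed one.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a weak linear trend plus noise.
n = 40
a = rng.normal(size=n)
b = 0.4 * a + rng.normal(size=n)

r_obs, p_scipy = pearsonr(a, b)

# Shuffle b repeatedly to destroy any link with a, and record the
# correlations that arise purely by chance.
n_perm = 2000
r_null = np.array([np.corrcoef(a, rng.permutation(b))[0, 1] for _ in range(n_perm)])

# Two-sided p-value: fraction of shuffled datasets at least as correlated as ours.
p_perm = np.mean(np.abs(r_null) >= abs(r_obs))

print(f"observed r = {r_obs:.3f}")
print(f"scipy p-value = {p_scipy:.4f}, permutation p-value = {p_perm:.4f}")
```

The two p-values should be close to each other, since both estimate the same probability in different ways.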
Plotting a Correlation Matrix
pandas provides a really nice function to calculate all the correlations across variables. Combined with this, we can use the plotting functions in seaborn to make really nice visuals of correlation matrices.
sns.set(rc={'figure.figsize':(11.7,8.27)})
corr = df.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu', annot = True)
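As a small extension, the same corr function can also give rank (Spearman) correlations, and a single column of the matrix can be sorted to rank candidate predictors for a target. The sketch below uses a tiny synthetic frame with made-up values (the frame demo and its numbers are assumptions for illustration); in practice you would use the real DataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Small synthetic frame with made-up values (assumption, not the real data).
n = 200
abdomen = rng.normal(92.0, 10.0, size=n)
demo = pd.DataFrame({
    "Abdomen": abdomen,
    "Height": rng.normal(178.0, 7.0, size=n),  # unrelated to the target here
    "BodyFat": 0.6 * abdomen - 36.0 + rng.normal(0.0, 4.0, size=n),
})

# Pearson is the default; method="spearman" gives rank correlations instead.
corr_pearson = demo.corr()
corr_spearman = demo.corr(method="spearman")

# Rank the candidate predictors by the strength of their correlation with the target.
ranking = corr_pearson["BodyFat"].drop("BodyFat").abs().sort_values(ascending=False)
print(ranking)
```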
Use the correlations to build a Linear Model
Here our discovery of correlations can be used to help build a simple model. We see a strong correlation between the Abdomen measurement and Body Fat.
from sklearn.linear_model import LinearRegression
x = df['Abdomen'].tolist()
y = df['BodyFat'].tolist()
linearModel = LinearRegression().fit(np.array(x).reshape(-1,1), np.array(y).reshape(-1,1))
x_grid = np.linspace(np.min(x), np.max(x), 2)
plt.scatter(x, y, marker='o', c = 'g', alpha = 0.4)
plt.plot(x_grid, linearModel.coef_[0]*x_grid + linearModel.intercept_[0], '-b', alpha = 0.4)
plt.show()
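One way to see the link between correlation and a simple linear model: for a straight-line fit with one predictor, the coefficient of determination R^2 of the fit equals the square of the Pearson correlation. The sketch below checks this on synthetic data (x_demo and y_demo are made-up stand-ins, not the body fat measurements).

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-in for a (measurement, body fat) pair (assumption).
x_demo = rng.uniform(70.0, 120.0, size=200)
y_demo = 0.6 * x_demo - 35.0 + rng.normal(0.0, 4.0, size=200)

model = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)
r_squared = model.score(x_demo.reshape(-1, 1), y_demo)  # R^2 on the fitted data
r, _ = pearsonr(x_demo, y_demo)

print(f"R^2 of the fit    = {r_squared:.4f}")
print(f"Pearson r squared = {r**2:.4f}")
```

So a strong Pearson correlation directly tells us how much of the variance a one-variable linear model can explain.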
When do correlations not work?
OK, so let's look at an example which has zero correlation! Here is a model
$$y = X^2 + \epsilon,$$
where $\epsilon$ is noise. Let's look at this data.
np.random.seed(123)
N = 500
X = -1. + 2.0 * np.random.uniform(size=(N,))
y = X**2 + np.random.normal(0.0, 0.05, size=X.shape)
plt.scatter(X,y, marker='o', c='g', alpha = 0.2)
plt.show()
OK, let's test these correlations. The tests say there is nothing there, but there is clearly a relationship!
print(spearmanr(X,y))
print(pearsonr(X,y))
where the output is
SpearmanrResult(correlation=-0.008570914283657133, pvalue=0.8483882540669414)
PearsonRResult(statistic=0.018690496801745102, pvalue=0.6767350220071778)
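Both coefficients are close to zero, with large p-values: this data has no linear (or monotonic) correlation, even though the output clearly depends on the input. We have talked about this in the explainer.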
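To see why the linear correlation vanishes, a short calculation helps. Write the model as $y = X^2 + \epsilon$ and assume, as in the synthetic example, that $X$ is uniform on $[-1, 1]$ (so it is symmetric about zero) and that the noise $\epsilon$ has zero mean and is independent of $X$. Then

$$
\begin{aligned}
\operatorname{Cov}(X, y) &= \operatorname{Cov}(X, X^2) + \operatorname{Cov}(X, \epsilon) \\
&= \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] + 0 \\
&= 0 - 0 \cdot \tfrac{1}{3} = 0 .
\end{aligned}
$$

The Pearson coefficient is this covariance divided by the two standard deviations, so it is zero in expectation; the small nonzero values printed above are just sampling noise.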
Predictive Power Score - PP score
So here we apply the predictive power score, which we have talked about in the explainers. If you haven't installed it, you can do so using pip install -U ppscore.
To use PP score we need to stuff our data back in a dataframe.
import ppscore as pps
df = pd.DataFrame()
df["X"] = X
df["y"] = y
We can now simply call ppscore
pps.matrix(df)
To look at this in a nice way, we can plot the output as a heatmap. So let's do that.
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
So here, we first note the obvious thing: the diagonal entries are 1. This is because a variable, say $X$, can perfectly predict itself.
Here we see that given $y$ we cannot predict $X$. This is because for any $y$ in our data there is both a positive and a negative value of $X$ which would give that output. So in summary, the predictive power of $y$ for the variable $X$ is zero.
Alternatively, we see that $X$ has strong predictive power for $y$: here we see a PP score of 0.78.
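To get a feel for where such a score comes from, here is a rough sketch of the idea behind a predictive power score for a numeric target, written with scikit-learn rather than ppscore itself. The function pps_sketch is hypothetical and the details are assumptions based on our understanding of the ppscore documentation: fit a decision tree on the single feature with cross-validation, measure the mean absolute error, and compare it against a naive baseline that always predicts the median.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

def pps_sketch(feature, target, cv=4, seed=0):
    """Rough predictive-power-style score of `feature` for a numeric `target`."""
    feature = np.asarray(feature, dtype=float).reshape(-1, 1)
    target = np.asarray(target, dtype=float)
    # Out-of-fold predictions from a decision tree using the single feature.
    tree = DecisionTreeRegressor(random_state=seed)
    pred = cross_val_predict(tree, feature, target, cv=cv)
    mae_model = np.mean(np.abs(target - pred))
    # Naive baseline: always predict the median of the target.
    mae_naive = np.mean(np.abs(target - np.median(target)))
    # 1 means perfect prediction, 0 means no better than the baseline.
    return max(0.0, 1.0 - mae_model / mae_naive)

# Same synthetic data as in the walkthrough.
np.random.seed(123)
N = 500
X = -1. + 2.0 * np.random.uniform(size=(N,))
y = X**2 + np.random.normal(0.0, 0.05, size=X.shape)

score_x_to_y = pps_sketch(X, y)
score_y_to_x = pps_sketch(y, X)
print(f"X -> y: {score_x_to_y:.2f}")
print(f"y -> X: {score_y_to_x:.2f}")
```

The score is asymmetric by construction, which is exactly what lets it pick up that $X$ predicts $y$ well while $y$ tells us almost nothing about the sign of $X$.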
Applying it to the Body Fat dataset
So let's do that
df = pd.read_csv('bodyfat.csv')
sns.set(rc={'figure.figsize':(11.7,8.27)})
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
We see that the PP score pulls out a much simpler picture for building a model than correlations. For body fat we see that the strongest predictors are Abdomen and Chest. We also see that Density has huge predictive power, but this is misleading: you need density to calculate body fat, so they are not independent variables.