Correlation & PPS Walkthrough

📂 Resources

Download the resources for this lesson here.

In this walkthrough we go through methods for finding linear correlations and exploring nonlinear relationships within new data sets. We look at a range of real datasets as well as synthetic examples to demonstrate the points.

Adding libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

The first data set - what is the body fat?

In this first data set we look at just over 250 measurements of males, for which we have lots of different body measurements as well as the results of a body fat measurement. Can we build a good predictive model that links the measurements (which are easy to do at home) to the percentage of body fat (which is more complex to measure)?

As a starting point, can we understand the correlations in our data? These will guide us in building a good linear model.

Let us first look at the data

df = pd.read_csv('bodyfat.csv')

df.sample(5)

First we are going to do a quick flyby and show how we calculate the Pearson and Spearman correlations in our dataset. For this we use the well-known package scipy.stats, which makes this easy.

from scipy.stats import pearsonr, spearmanr

corr_p, pVal_p = pearsonr(x=df['Abdomen'], y=df['BodyFat'])

corr_s, pVal_s = spearmanr(df['Abdomen'], df['BodyFat'])

print(pearsonr(x=df['Abdomen'], y=df['BodyFat']))
print(spearmanr(df['Abdomen'], df['BodyFat']))

Here we not only get the correlations but also a p-value. A p-value tells us how surprising a correlation would be if there were no real relationship: it is the probability that uncorrelated data of the same size would show a correlation at least as strong as the one observed. Here we can see very strong correlations with very small p-values, so we are confident that they are inherent in the data we are observing and not simply an artefact of random sampling.
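To make the meaning of a p-value concrete, here is a minimal sketch of a permutation test. It uses synthetic data as a stand-in for two correlated measurements (the arrays `a` and `b` and the helper `pearson` are made up for illustration), and it is not how scipy computes its p-value, which relies on an analytical formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two correlated measurements (not the body fat data).
n = 250
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(scale=0.6, size=n)


def pearson(u, v):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]


observed = pearson(a, b)

# Shuffling b destroys any real association, so each shuffled correlation
# is a draw from the "no relationship" null distribution.
n_perm = 2000
null = np.array([pearson(a, rng.permutation(b)) for _ in range(n_perm)])

# Two-sided p-value: fraction of null correlations at least as extreme as
# the observed one (the +1 terms avoid reporting exactly zero).
p_perm = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)

print(f"observed r = {observed:.3f}, permutation p-value = {p_perm:.4f}")
```

The shuffled correlations cluster tightly around zero, so a strong observed correlation is essentially never matched by chance, which is what a very small p-value expresses.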

Plotting a Correlation Matrix

pandas provides a really nice function, DataFrame.corr, to calculate all the pairwise correlations across variables. Combined with the plotting functions in seaborn, we can make really nice visuals of correlation matrices.

sns.set(rc={'figure.figsize':(11.7,8.27)})

corr = df.corr()

sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu', annot = True)
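With many columns the heatmap can be hard to scan, so it can help to pull out just the correlations with the variable we want to predict and sort them. The sketch below does this on a small synthetic DataFrame; the column names `a`, `b`, `c` and `target` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Hypothetical columns: 'target' depends strongly on 'a', moderately on 'b',
# and not at all on 'c'.
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "c": c,
    "target": 2.0 * a + 1.0 * b + rng.normal(size=n),
})

corr_demo = demo.corr()  # Pearson by default; method="spearman" also works

# Correlation of every feature with the target, strongest (by magnitude) first.
ranked = (
    corr_demo["target"]
    .drop("target")
    .sort_values(key=np.abs, ascending=False)
)
print(ranked)
```

The same pattern should carry over to the body fat data by replacing `demo` with `df` and `"target"` with `"BodyFat"`.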

Use the correlations to build a Linear Model

Here our discovery of correlations can be used to help build a simple model. We see a strong correlation between the Abdomen measurement and Body Fat, so we fit a linear model with Abdomen as the single predictor.

from sklearn.linear_model import LinearRegression

x = df['Abdomen'].tolist()
y = df['BodyFat'].tolist()

linearModel = LinearRegression().fit(np.array(x).reshape(-1,1), np.array(y).reshape(-1,1))

x_grid = np.linspace(np.min(x), np.max(x), 2)

plt.scatter(x, y, marker='o', c = 'g', alpha = 0.4)

plt.plot(x_grid, linearModel.coef_[0]*x_grid + linearModel.intercept_[0], '-b', alpha = 0.4)

plt.show()
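The link between the correlation and the linear model can be made exact: for a straight-line fit with one predictor, the squared Pearson correlation equals the $R^2$ of the fit. A minimal sketch on synthetic data follows (the arrays `u` and `v` are made up, and `np.polyfit` is used here in place of LinearRegression to keep the example small).

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for one predictor and a response (not the body fat data).
u = rng.uniform(70, 120, size=250)
v = 0.6 * u - 35 + rng.normal(scale=4.0, size=u.size)

# Ordinary least-squares straight line through the points.
slope, intercept = np.polyfit(u, v, deg=1)
pred = slope * u + intercept

# Coefficient of determination: share of the variance in v explained by the line.
ss_res = np.sum((v - pred) ** 2)
ss_tot = np.sum((v - v.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

r = np.corrcoef(u, v)[0, 1]
print(f"r = {r:.3f}, r^2 = {r**2:.3f}, R^2 of the fit = {r_squared:.3f}")
```

The last two printed numbers agree, which is why a strong correlation is a good hint that a single-variable linear model will explain much of the variance.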

When do correlations not work?

Ok so let's look at an example which has zero correlation! Here is a model

$$y = x^2 + \epsilon$$

where $\epsilon$ is noise. Let's look at this data.

np.random.seed(123)

N = 500

X = -1. + 2.0 * np.random.uniform(size=(N,))
y = X**2 + np.random.normal(0.0, 0.05, size=X.shape)

plt.scatter(X,y, marker='o', c='g', alpha = 0.2)
plt.show()

Ok, let's test these correlations. We will see that they find nothing, even though there is clearly a relationship!

print(spearmanr(X,y))
print(pearsonr(X,y))

where the output is

SpearmanrResult(correlation=-0.008570914283657133, pvalue=0.8483882540669414)
PearsonRResult(statistic=0.018690496801745102, pvalue=0.6767350220071778)

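This data has no linear correlation: both coefficients are close to zero, and the large p-values mean they are indistinguishable from chance. Spearman does not help either, because the relationship is not monotonic. We have talked about this in the explainer.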
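One way to see why the coefficients vanish is that the parabola has a strong downward trend for negative $x$ and an equally strong upward trend for positive $x$, and the two cancel. Formally, for $x$ symmetric around zero, $\mathrm{Cov}(x, x^2) = E[x^3] - E[x]E[x^2] = 0$. The sketch below regenerates similar synthetic data (with a different random generator than the one used above, so the exact numbers will differ) and computes the correlation on each half separately.

```python
import numpy as np

rng = np.random.default_rng(123)

N = 500
X = rng.uniform(-1.0, 1.0, size=N)
y = X**2 + rng.normal(0.0, 0.05, size=N)


def pearson(u, v):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]


left = X < 0    # falling branch of the parabola
right = ~left   # rising branch of the parabola

r_all = pearson(X, y)
r_left = pearson(X[left], y[left])
r_right = pearson(X[right], y[right])

print(f"all data: r = {r_all:.3f}")
print(f"x < 0:    r = {r_left:.3f}")
print(f"x >= 0:   r = {r_right:.3f}")
```

Each half shows a strong correlation of opposite sign, while the full data set shows almost none.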

Predictive Power Score - PP score

So here we apply the predictive power score, which we have talked about in the explainers. If you haven't installed it, you can do so using pip install -U ppscore.

To use PP score we need to stuff our data back in a dataframe.

import ppscore as pps

df = pd.DataFrame()
df["X"] = X
df["y"] = y

We can now simply call ppscore

pps.matrix(df)

To look at this in a nicer way, we can plot the output as a heatmap. So let's do that.

matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

So here, we first state the obvious: the diagonal entries are 1. This is because a variable, say $x$, can perfectly predict itself.

Here we see that given $y$ we cannot predict $x$. This is because for any $y$ in our data there is both a positive and a negative value of $x$ that could have produced it. So in summary the predictive power of $y$ for the variable $x$ is zero.

In the other direction, we look at the power of $x$ to predict $y$. Here we see a PP score of 0.78.
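To build some intuition for where such a score comes from, here is a rough sketch of the idea for a numeric target: fit a decision tree on the single feature, measure its cross-validated mean absolute error (MAE), and compare it with a naive baseline that always predicts the median. The helper `pps_sketch` is hypothetical and is not part of the ppscore package; as far as we understand the package follows a similar recipe, but its exact details (sampling, handling of categorical targets, and so on) may differ.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def pps_sketch(feature, target, cv=4, seed=0):
    """Rough predictive power of `feature` for a numeric `target`.

    Hypothetical helper for illustration; not the ppscore implementation.
    """
    feature = np.asarray(feature, dtype=float).reshape(-1, 1)
    target = np.asarray(target, dtype=float)

    # Out-of-fold predictions from a decision tree that sees only `feature`.
    tree = DecisionTreeRegressor(random_state=seed)
    pred = cross_val_predict(tree, feature, target, cv=cv)
    mae_model = np.mean(np.abs(target - pred))

    # Naive baseline: always predict the median of the target.
    mae_naive = np.mean(np.abs(target - np.median(target)))

    # 1 means a perfect model, 0 means no better than the baseline.
    return max(0.0, 1.0 - mae_model / mae_naive)


rng = np.random.default_rng(123)
X = rng.uniform(-1.0, 1.0, size=500)
y = X**2 + rng.normal(0.0, 0.05, size=500)

pps_x_to_y = pps_sketch(X, y)
pps_y_to_x = pps_sketch(y, X)

print(f"x -> y: {pps_x_to_y:.2f}")
print(f"y -> x: {pps_y_to_x:.2f}")
```

Because the score is built from a prediction error in one direction, it is asymmetric: swapping feature and target gives a different number, unlike a correlation coefficient.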

Applying it to the Body Fat dataset

So let's do that

df = pd.read_csv('bodyfat.csv')


sns.set(rc={'figure.figsize':(11.7,8.27)})

matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

We see that the PP score pulls out a much simpler picture for building a model than the correlations do. For body fat we see that the strongest predictors are Abdomen and Chest. We also see that Density has a huge predictive power, but this is misleading: you need density to calculate body fat in the first place, so the two are not independent variables and Density should not be used as a predictor.
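The Density effect is an example of target leakage. As a sketch, assume that the BodyFat column was derived from Density with Siri's equation, $\text{BodyFat} = 495 / \text{Density} - 450$; this is how the dataset is commonly documented, but it is an assumption here and not something the lesson states. The density values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic body densities in a plausible range (not taken from bodyfat.csv).
density = rng.uniform(1.00, 1.10, size=250)

# Assumed relationship (Siri's equation): body fat is a deterministic
# function of density, so density "predicts" it with no error at all.
body_fat = 495.0 / density - 450.0

r_leak = np.corrcoef(density, body_fat)[0, 1]
print(f"correlation between density and derived body fat: {r_leak:.4f}")
```

Any score, whether a correlation or a PP score, will look excellent for such a variable, but only because the target was computed from it.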
