by Prof Tim Dodwell
General Linear Models
6. Exploring Simple Relationships - Correlations
Correlation is a statistical measure of the dependency, causal or not, between two variables. In this explainer we consider four key points:
- Pearson's correlation coefficient.
- Spearman's (or rank) correlation coefficient.
- How to show and interpret correlation plots in many-variable problems.
- The key limitations of expressing general correlations within real-world data.
Whilst, in general, correlation may mean any type of association between variables, it typically refers to how linearly dependent two variables are. When people talk about the correlation coefficient between variables, they usually mean Pearson's correlation coefficient (we will talk about others below though). The formula for Pearson's correlation coefficient between two random variables $X$ and $Y$ is

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \sigma_Y},$$

where $\mu_X$ and $\mu_Y$ are the means, and $\sigma_X$ and $\sigma_Y$ the standard deviations, of $X$ and $Y$.
When it comes to calculating this in practice, given two sets of samples $x = \{x_1, \dots, x_n\}$ and $y = \{y_1, \dots, y_n\}$, Pearson's correlation can be computed as

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
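As a quick sanity check, here is a minimal sketch (using small made-up samples) that evaluates the sample formula above directly with numpy and compares the result against numpy's built-in np.corrcoef:
import numpy as np

# Two small made-up samples with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Evaluate the sample formula directly
x_dev = x - x.mean()
y_dev = y - y.mean()
r = np.sum(x_dev * y_dev) / (np.sqrt(np.sum(x_dev**2)) * np.sqrt(np.sum(y_dev**2)))

print(r)                        # by-hand value
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in agrees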
The correlation coefficient ranges between $-1$ and $1$:
- The correlation coefficient is $+1$ if two variables have a perfect positive / increasing linear relationship.
- The correlation coefficient is $-1$ if two variables have a perfect negative / decreasing linear relationship.
- If the variables are independent, Pearson's correlation coefficient is $0$, but the converse is not true. This is because the correlation coefficient detects only linear dependencies between two variables. There are examples where the correlation is $0$, but there is clearly a nonlinear relationship between the variables in each case (see the engineered examples in the conclusions).
In practice, correlation coefficients will be estimated from a finite set of samples. Therefore the means $\mu_X$ and $\mu_Y$, and the expectations, are replaced by sample means, and standard deviations are computed using unbiased estimates. For more information on the latter, have a look at unbiased estimation of standard deviation on Wikipedia.
Rank Correlation Coefficients
Spearman's rank coefficient is a different measure of the dependency between two variables. It is a nonparametric measure, which means it does not assume any parameterised distribution (e.g. normal), nor does it impose a particular type of relationship between the variables.
Spearman's rank coefficient is calculated from the ranks of the values rather than the raw data. The coefficient can range from $-1$ to $1$, with $-1$ indicating a perfect negative monotonic relationship, $1$ indicating a perfect positive monotonic relationship, and $0$ indicating no monotonic relationship.
To calculate Spearman's rank coefficient, the ranks of the values in each variable are determined, and the difference between the ranks is calculated for each pair of values. The differences are then squared and summed, and the coefficient is calculated using the following formula:

$$\rho_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$

where $n$ is the sample size and $d_i = R(x_i) - R(y_i)$ is the difference between the rankings of the $i$-th components of the datasets $x$ and $y$. (This simple form of the formula assumes all ranks are distinct, i.e. there are no ties.)
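As an illustration, here is a minimal sketch (using made-up, tie-free data, as the formula above requires) that computes $\rho_s$ from the ranks and checks it against scipy.stats.spearmanr:
import numpy as np
from scipy.stats import spearmanr

# Made-up data with a monotonic but nonlinear relationship (no ties)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)  # perfectly monotonic, so rho_s should be exactly 1

n = len(x)
# argsort of argsort gives the (0-based) rank of each value
rank_x = np.argsort(np.argsort(x))
rank_y = np.argsort(np.argsort(y))

d = rank_x - rank_y
rho_s = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(rho_s)               # by-hand value: 1.0
print(spearmanr(x, y)[0])  # scipy agrees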
Spearman's rank coefficient is often used when the assumptions of Pearson's correlation coefficient, which is based on linear relationships, are not met.
Predictive Body Fat Example
For this example we use a real-world data set called the 'Body Fat Prediction Dataset'. It can be downloaded from the Kaggle website.
The Body Fat Prediction Dataset consists of measurements of 252 men's body fat, along with other measurements, e.g. neck, chest and waist circumference. It is a really good data set for illustrating multiple regression techniques, and we use it in a number of our examples.
The downloaded csv (bodyfat.csv) is then loaded into a pandas dataframe. Columns for the 252 measurements of Abdomen and BodyFat can be converted to lists x and y respectively as follows:
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data/bodyfat.csv')

# Call the scipy function for Pearson correlation
corr, pVal = pearsonr(x=df['Abdomen'], y=df['BodyFat'])

# Also make a linear fit using sklearn's LinearRegression
x = df['Abdomen'].tolist()
y = df['BodyFat'].tolist()
model = LinearRegression().fit(np.array(x).reshape(-1, 1), np.array(y).reshape(-1, 1))

# Plot a scatter & line of best fit
x_grid = np.linspace(min(x), max(x), 2)
plt.scatter(x, y, marker='o', c='g', alpha=0.3)
plt.plot(x_grid, model.coef_[0] * x_grid + model.intercept_[0], '-b', alpha=0.3, label='Linear Fit')
# Lots of plotting options can be added - not included here
plt.show()
The output of this segment of code, with a few extra plotting commands added to make it look pretty, is shown below.
For this dataset the Pearson correlation coefficient is strong and positive (the p-value is 0.0, see below), whilst the coefficient of the linear fit takes a quite different value. This makes it clear that the gradient of the best linear fit is not the same as the correlation.
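In fact the two are directly related: for a simple least-squares fit $y = \hat{\beta} x + \hat{\alpha}$, the fitted gradient satisfies

$$\hat{\beta} = r_{xy}\,\frac{s_y}{s_x},$$

where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$. The gradient therefore only coincides with the correlation coefficient when the two variables have equal spread.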
The scipy function pearsonr() returns not only a correlation but also a p-value. This means that, under the hood, the function performs a hypothesis test, with the null hypothesis that the distributions underpinning the samples are actually uncorrelated and normally distributed. The p-value gives the probability of an uncorrelated system producing datasets that have a sample correlation as "strong" (either positive or negative) as the one computed from this dataset.
The Spearman rank coefficient, and its own p-value, can be computed in exactly the same way:
from scipy.stats import spearmanr

corr, pVal = spearmanr(x, y)
Correlation Plot
To plot a correlation plot in Python, you can use the seaborn library. Here is an example of how you might create a correlation plot using the seaborn.heatmap() function:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (11.7, 8.27)})

# Calculate the correlation matrix (every column in bodyfat.csv is numeric)
corr = df.corr()

# Plot the correlation matrix as a heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu', annot=True)
plt.show()
This will create a heatmap showing the correlation between every pair of variables in the dataframe. The heatmap uses colour to indicate the strength and direction of the correlation; with the 'RdBu' colour map used here, red indicates a strong negative relationship and blue a strong positive relationship.
You can customize the appearance of the plot by adjusting the arguments passed to the heatmap function. For example, you can change the color map, adjust the font size, or add a title to the plot.
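For example, a lightly customised version of the same plot might look like the sketch below (the particular styling choices here are just illustrative):
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(11.7, 8.27))
sns.heatmap(
    corr,                   # the correlation matrix computed above
    cmap='RdBu_r',          # reversed map: red = positive, blue = negative
    vmin=-1, vmax=1,        # pin the colour scale to the full range
    annot=True, fmt='.2f',  # annotate each cell to two decimal places
    square=True,
    ax=ax,
)
ax.set_title('Correlation matrix for the body fat dataset')
plt.show()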
Conclusions: Good and Bad
Correlation is a useful tool for understanding linear dependencies in data. If variables are linearly dependent, it provides perfect information for building simple predictive models, or for performing dimension reduction (i.e. eliminating dependent variables).
Pearson's correlation coefficient is based on the assumption of a linear relationship, so it clearly falls over where there are nonlinear dependencies between two variables.
Below is a series of engineered examples which demonstrate relationships, each of which has a Pearson's correlation coefficient of zero.
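Figures of this kind can be reproduced in a few lines; here is a minimal sketch of two such engineered examples (a parabola and a circle), using made-up data:
import numpy as np

rng = np.random.default_rng(1)

# Example 1: a parabola, y = x^2 on a symmetric interval
x = np.linspace(-1, 1, 201)
y = x**2
print(np.corrcoef(x, y)[0, 1])  # ~0: positive and negative deviations cancel

# Example 2: points scattered uniformly around a circle
theta = rng.uniform(0, 2 * np.pi, 500)
x, y = np.cos(theta), np.sin(theta)
print(np.corrcoef(x, y)[0, 1])  # ~0: x and y are strongly dependent, but not linearly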
For such nonlinear relationships, other forms of nonlinear correlation metrics are required.
This leads us to a discussion of the Predictive Power Score (PPS). PPS is an asymmetric, data-type-agnostic score that can detect linear or nonlinear relationships between two variables, which we will look at in a future explainer.
Want to read more? Have a look at the article "Are Correlations any Guide to Predictive Value?". This is not a new challenge: the article was published in the Journal of the Royal Statistical Society in 1956 by Dr Robert Ferber!