by Prof Tim Dodwell
General Linear Models
6. Exploring Simple Relationships - Correlations
Correlation is a statistical measure of the dependency, causal or not, between two variables. In this explainer we consider four key points:
- Pearson's correlation coefficient.
- Spearman's (or rank) correlation coefficient.
- How to show and interpret correlation plots in many-variable problems.
- The key limitations of expressing general correlations within real-world data.
Whilst, in general, correlation may mean any type of association between variables, it typically refers to how linearly dependent two variables are. When people talk about the correlation coefficient between variables, they usually mean Pearson's correlation coefficient (we will talk about others below though). The formula for Pearson's correlation coefficient between two random variables $X$ and $Y$ is

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big]}{\sigma_X \sigma_Y},$$

where $\mu_X$ and $\mu_Y$ are the means, and $\sigma_X$ and $\sigma_Y$ the standard deviations, of $X$ and $Y$.
When it comes to calculating this in practice, given two sets of samples $x = \{x_1, \dots, x_n\}$ and $y = \{y_1, \dots, y_n\}$, Pearson's correlation can be computed as

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
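As a quick sanity check, here is a minimal sketch (using small made-up samples) that evaluates the sample formula above directly with numpy and compares the result against numpy's built-in np.corrcoef:
import numpy as np

# Two small made-up samples with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Evaluate the sample formula directly
x_dev = x - x.mean()
y_dev = y - y.mean()
r = np.sum(x_dev * y_dev) / (np.sqrt(np.sum(x_dev**2)) * np.sqrt(np.sum(y_dev**2)))

print(r)                        # by-hand value
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in agrees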
The correlation coefficient ranges between $-1$ and $1$:
- The correlation coefficient is $+1$ if two variables have a perfect positive / increasing linear relationship.
- The correlation coefficient is $-1$ if two variables have a perfect negative / decreasing linear relationship.
- If the variables are independent, Pearson's correlation coefficient is $0$, but the converse is not true. This is because the correlation coefficient detects only linear dependencies between two variables. There are examples where the correlation is $0$, but there is clearly a nonlinear relationship between the variables in each case (see the engineered examples in the conclusions).
In practice, correlation coefficients will be estimated from a finite set of samples. Therefore the means $\mu_X$ and $\mu_Y$, and the expectations, are replaced by sample means, and standard deviations are computed using unbiased estimates. For more information on the latter, have a look at unbiased estimation of standard deviation on Wikipedia.
Rank Correlation Coefficients
Spearman's rank coefficient is a different measure of the dependency between two variables. It is a nonparametric measure, which means it does not assume any parameterised distribution (e.g. normal), nor does it impose a particular type of relationship between the variables.
Spearman's rank coefficient is calculated from the ranks of the values rather than the raw data. The coefficient can range from $-1$ to $1$, with $-1$ indicating a perfect negative monotonic relationship, $1$ indicating a perfect positive monotonic relationship, and $0$ indicating no monotonic relationship.
To calculate Spearman's rank coefficient, the ranks of the values in each variable are determined, and the difference between the ranks is calculated for each pair of values. The differences are then squared and summed, and the coefficient is calculated using the following formula:

$$\rho_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$

where $n$ is the sample size and $d_i = R(x_i) - R(y_i)$ is the difference between the rankings of the $i$-th components of the datasets $x$ and $y$. (This simple form of the formula assumes all ranks are distinct, i.e. there are no ties.)
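As an illustration, here is a minimal sketch (using made-up, tie-free data, as the formula above requires) that computes $\rho_s$ from the ranks and checks it against scipy.stats.spearmanr:
import numpy as np
from scipy.stats import spearmanr

# Made-up data with a monotonic but nonlinear relationship (no ties)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)  # perfectly monotonic, so rho_s should be exactly 1

n = len(x)
# argsort of argsort gives the (0-based) rank of each value
rank_x = np.argsort(np.argsort(x))
rank_y = np.argsort(np.argsort(y))

d = rank_x - rank_y
rho_s = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(rho_s)               # by-hand value: 1.0
print(spearmanr(x, y)[0])  # scipy agrees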
Spearman's rank coefficient is often used when the assumptions of Pearson's correlation coefficient, which is based on linear relationships, are not met.
Predictive Body Fat Example
For this example we use a real-world data set called the 'Body Fat Prediction Dataset'. It can be downloaded from the Kaggle website.
The Body Fat Prediction Dataset consists of measurements of 252 men's body fat, along with other measurements, e.g. neck, chest and waist circumference. It is a really good data set for illustrating multiple regression techniques, and we use it in a number of our examples.
The downloaded csv (bodyfat.csv) is then loaded into a pandas dataframe. Columns for the 252 measurements of Abdomen and BodyFat can be converted to lists x and y respectively as follows:
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data/bodyfat.csv')

# Call the scipy function for Pearson correlation
corr, pVal = pearsonr(x=df['Abdomen'], y=df['BodyFat'])

# Also make a linear fit using sklearn's LinearRegression
x = df['Abdomen'].tolist()
y = df['BodyFat'].tolist()
model = LinearRegression().fit(np.array(x).reshape(-1, 1), np.array(y).reshape(-1, 1))

# Plot a scatter & line of best fit
x_grid = np.linspace(min(x), max(x), 2)
plt.scatter(x, y, marker='o', c='g', alpha=0.3)
plt.plot(x_grid, model.coef_[0] * x_grid + model.intercept_[0], '-b', alpha=0.3, label='Linear Fit')
# Lots of plotting options can be added - not included here
plt.show()
The output of this segment of code, with a few extra plotting commands added to make it look pretty, is shown below.
For this dataset the Pearson correlation coefficient is strong and positive (the p-value is 0.0, see below), whilst the coefficient of the linear fit takes a quite different value. This makes it clear that the gradient of the best linear fit is not the same as the correlation.
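In fact the two are directly related: for a simple least-squares fit $y = \hat{\beta} x + \hat{\alpha}$, the fitted gradient satisfies

$$\hat{\beta} = r_{xy}\,\frac{s_y}{s_x},$$

where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$. The gradient therefore only coincides with the correlation coefficient when the two variables have equal spread.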
The scipy function pearsonr() returns not only a correlation but also a p-value. This means that, under the hood, the function performs a hypothesis test, with the null hypothesis that the distributions underpinning the samples are actually uncorrelated and normally distributed. The p-value gives the probability of an uncorrelated system producing datasets that have a sample correlation as "strong" (either positive or negative) as the one computed from this dataset.
The Spearman rank coefficient, and its own p-value, can be computed in exactly the same way:
from scipy.stats import spearmanr

corr, pVal = spearmanr(x, y)
Correlation Plot
To plot a correlation plot in Python, you can use the seaborn library. Here is an example of how you might create a correlation plot using the seaborn.heatmap() function:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (11.7, 8.27)})

# Calculate the correlation matrix (every column in bodyfat.csv is numeric)
corr = df.corr()

# Plot the correlation matrix as a heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu', annot=True)
plt.show()
This will create a heatmap showing the correlation between every pair of variables in the dataframe. The heatmap uses colour to indicate the strength and direction of the correlation; with the 'RdBu' colour map used here, red indicates a strong negative relationship and blue a strong positive relationship.
You can customize the appearance of the plot by adjusting the arguments passed to the heatmap function. For example, you can change the color map, adjust the font size, or add a title to the plot.
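For example, a lightly customised version of the same plot might look like the sketch below (the particular styling choices here are just illustrative):
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(11.7, 8.27))
sns.heatmap(
    corr,                   # the correlation matrix computed above
    cmap='RdBu_r',          # reversed map: red = positive, blue = negative
    vmin=-1, vmax=1,        # pin the colour scale to the full range
    annot=True, fmt='.2f',  # annotate each cell to two decimal places
    square=True,
    ax=ax,
)
ax.set_title('Correlation matrix for the body fat dataset')
plt.show()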
Conclusions: Good and Bad
Correlation is a useful tool for understanding linear dependencies in data. If variables are linearly dependent, it provides perfect information for building simple predictive models, or for performing dimension reduction (i.e. eliminating dependent variables).
Pearson's correlation coefficient is based on the assumption of a linear relationship, so it clearly falls over where there are nonlinear dependencies between two variables.
Below is a series of engineered examples which demonstrate relationships, each of which has a Pearson's correlation coefficient of zero.
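Figures of this kind can be reproduced in a few lines; here is a minimal sketch of two such engineered examples (a parabola and a circle), using made-up data:
import numpy as np

rng = np.random.default_rng(1)

# Example 1: a parabola, y = x^2 on a symmetric interval
x = np.linspace(-1, 1, 201)
y = x**2
print(np.corrcoef(x, y)[0, 1])  # ~0: positive and negative deviations cancel

# Example 2: points scattered uniformly around a circle
theta = rng.uniform(0, 2 * np.pi, 500)
x, y = np.cos(theta), np.sin(theta)
print(np.corrcoef(x, y)[0, 1])  # ~0: x and y are strongly dependent, but not linearly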
For such nonlinear relationships, other forms of nonlinear correlation metrics are required.
This leads us to a discussion of the Predictive Power Score (PPS). PPS is an asymmetric, data-type-agnostic score that can detect linear or nonlinear relationships between two variables, which we will look at in a future explainer.
Want to read more? Have a look at the article "Are Correlations any Guide to Predictive Value?". This is not a new challenge: the article was published in the Journal of the Royal Statistical Society in 1956 by Dr Robert Ferber!