by Prof Tim Dodwell

General Linear Models

7. Finding Nonlinear Relations - Predictive Power Score

The Predictive Power Score (PPS) can be used to detect both linear and non-linear relationships between two variables. The score ranges from 0 to 1, from no predictive power to perfect predictive power. It is a powerful alternative to (linear) correlation, and overcomes some of the key issues with the linearity assumptions that underpin correlation analysis.

In this explainer, we demonstrate how PPS can be used:

  1. How PPS can be used as an early diagnostic tool for exploring the relationships between variables in data.
  2. How PPS addresses the key limitations of assuming linearity in standard correlation measures (see the correlation explainer).
  3. How it can be used in model design, to inform feature engineering or model reduction in practice.

An explainer through a toy problem


As the name suggests, the predictive power score gives a score (between 0 and 1) which indicates how informative one variable (x) is in predicting the value of another (y).

A key difference between PPS and correlation is that this measure is, in general, non-symmetric. What do we mean by this?

Well, let's consider the following relation between two parameters x and y,

y = x^2 + \epsilon.

Here \epsilon is a noise parameter defined by a parameterised Gaussian, e.g. \mathcal{N}(0, \sigma^2). The plot below shows the data.

[Figure: scatter plot of the noisy quadratic data, y = x^2 + \epsilon]

In this simple example, x has strong predictive power for y: a model can exploit the underlying quadratic relationship. However, if I give you a value of y, the associated value of x is ill-posed. What do I mean by this? For this simple quadratic, there are two underlying values of x which are equally good predictions. The variable y therefore has poor predictive power when predicting x.
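For reference, a minimal matplotlib sketch reproduces this kind of plot (the noise level \sigma = 0.05 matches the code used later in this explainer):

import numpy as np
import matplotlib.pyplot as plt

# Sample the noisy quadratic relation y = x^2 + eps
x = np.random.uniform(-1, 1, 1000)
y = x**2 + np.random.normal(0.0, 0.05, 1000)

plt.scatter(x, y, s=5, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.show()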

PPS has been implemented in an open-source Python library called ppscore, which can be found at https://github.com/8080labs/ppscore. Its use is nice and simple.

We start by installing the package:

pip install -U ppscore

From here, we can work through the toy example as follows:

import pandas as pd
import numpy as np
import ppscore as pps

# Toy data: y is a noisy quadratic function of x
df = pd.DataFrame()
df["x"] = np.random.uniform(-1, 1, 1000)
df["y"] = df["x"] * df["x"] + np.random.normal(0.0, 0.05, 1000)

# Long-format table of scores for every (feature, target) pair
pps.matrix(df)

import seaborn as sns

# Pivot the pairwise scores into a matrix and plot it as a heatmap
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

The last part is the key output: it generates our matrix plot, the heatmap of the PPS scores, which looks like this:

[Figure: PPS matrix heatmap for the toy example]

So let's go through this output. We see that the diagonal is, and always will be, 1: a variable predicts itself perfectly (no surprise here). The bottom-left entry says that if I know x then I can predict y with confidence. The score of 0.76 means the prediction is not perfect, and this is driven by the noise. Finally, as expected, the predictive power of y for predicting x is 0; this is because there are two values of x for each y. Simple as that.
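As an aside, you do not need the full matrix to check a single relationship: ppscore also exposes a score function for one (feature, target) pair, which returns a dictionary of diagnostics alongside the score. Continuing from the code above:

# Query a single direction at a time
print(pps.score(df, "x", "y")["ppscore"])   # high, roughly the 0.76 seen above
print(pps.score(df, "y", "x")["ppscore"])   # 0: y cannot recover x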

How does it work under the hood?


Under the hood, the PPS algorithm fits a machine learning model to learn the mapping from the feature variable to the target variable. The open-source ppscore implementation uses scikit-learn decision trees by default, although more powerful learners such as XGBoost (eXtreme Gradient Boosting) can be swapped in. Either way, PPS considers pair-wise predictive capability from one variable to another.

This idea of pairwise prediction is important. It means PPS misses complex dependencies which involve the interaction of multiple features in a non-linear way, as the sketch below illustrates.
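A minimal sketch of this blind spot, using a hypothetical XOR-style target that depends on two features jointly but on neither individually:

import numpy as np
import pandas as pd
import ppscore as pps

n = 1000
df_xor = pd.DataFrame({
    "x1": np.random.choice([0, 1], n),
    "x2": np.random.choice([0, 1], n),
})
# y is the XOR of x1 and x2: knowing both determines y exactly,
# but each feature on its own carries no information about y
df_xor["y"] = (df_xor["x1"] != df_xor["x2"]).astype(int)

print(pps.score(df_xor, "x1", "y")["ppscore"])  # ~0
print(pps.score(df_xor, "x2", "y")["ppscore"])  # ~0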

XGBoost is a powerful and widely used gradient boosting framework that uses decision trees as base models. Gradient boosting is an ensemble learning technique that combines multiple weak learners (in this case, decision trees) to create a strong predictor.

Tree-based models are particularly well-suited to PPS because they can handle both numerical and categorical features, cope with missing values, and capture complex relationships between variables.
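To make this concrete, here is a simplified sketch of how a PPS-style score can be computed for a numeric target: fit a tree model with cross-validation, then normalise its error against a naive baseline that always predicts the median. This is a sketch of the idea rather than the library's exact implementation, which adds sampling, task detection and classification metrics on top.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def pps_regression(df, feature, target, cv=4):
    """A PPS-style score for a numeric target (sketch, not the real library)."""
    X, y = df[[feature]], df[target]
    # Cross-validated mean absolute error of a single decision tree
    mae_model = -cross_val_score(
        DecisionTreeRegressor(), X, y,
        scoring="neg_mean_absolute_error", cv=cv,
    ).mean()
    # Naive baseline: always predict the median of the target
    mae_naive = np.abs(y - y.median()).mean()
    # Normalise so that 1 = perfect model, 0 = no better than the baseline
    return max(0.0, 1.0 - mae_model / mae_naive)

On the toy data above, pps_regression(df, "x", "y") should land in the same ballpark as the 0.76 reported by the library, while pps_regression(df, "y", "x") should be close to 0.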

Applications of the PPS and the PPS matrix


Now that we have learned about the advantages of the PPS, let's see where we can use it in the wild.

  1. Find patterns in the data: The PPS finds every relationship that the correlation finds — and more. Thus, you can use the PPS matrix as an alternative to the correlation matrix to detect and understand linear or nonlinear patterns in your data. This is possible across data types using a single score that always ranges from 0 to 1.

  2. Feature selection: In addition to your usual feature selection mechanism, you can use the predictive power score to find good predictors for your target column. You can also eliminate features that just add random noise; such features sometimes still score high in feature importance metrics. In addition, you can eliminate features that can be predicted by other features, because they do not add new information. Finally, you can identify pairs of mutually predictive features in the PPS matrix — this includes strongly correlated features but will also detect non-linear relationships. (A minimal code sketch of this ranking follows this list.)

  3. Detect information leakage: Use the PPS matrix to detect information leakage between variables — even if the information leakage is mediated via other variables.

  4. Data Normalization: Find entity structures in the data via interpreting the PPS matrix as a directed graph. This might be surprising when the data contains latent structures that were previously unknown.
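For the feature selection use case, ppscore provides a predictors helper which scores every column against a chosen target. A minimal sketch, continuing with the toy DataFrame df and its "y" target from earlier:

import ppscore as pps

# Rank every other column by its predictive power for the target "y";
# the result is a DataFrame sorted with the strongest predictors first
predictors_df = pps.predictors(df, "y")
print(predictors_df[["x", "ppscore"]])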

Back to the Body Fat Prediction Dataset


Let's go back to the example we looked at in the correlation explainer: measurements from 252 males, providing weight, age, hip, chest and abdomen measurements alongside body fat measurements. Here is the correlation heatmap we had in the "correlation" explainer:

[Figure: correlation heatmap for the body fat dataset]

As we have seen, correlation measures how well two variables are linearly dependent. Correlation is symmetric, and it gives limited information when relationships are nonlinear, and hence about how well one variable can predict another.

We now apply ppscore to this dataset:

import ppscore as pps
import seaborn as sns

# df now holds the body fat dataset; pivot the pairwise scores and plot
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)

Here is the result:

[Figure: PPS matrix heatmap for the body fat dataset]
It is clear that if I want to predict Body Fat, I should look along the row associated with body fat. We see that two easy measurements we could take (using a tape measure) are Abdomen (x_1) and Chest (x_2).

Application 1. Spotting Data Leakage

First, let's look at "Density". We see that density has a very strong predictive power, and that this is symmetric. By this we mean that BodyFat and Density are equally good predictors of each other.

On the face of it, this makes it look like a perfect candidate for an input feature when building a machine learning model. But this would actually be a dishonest model. BodyFat is a quantity derived from Density, and therefore in calculating one you have the other. This defeats the point of building a simple predictive model for BodyFat rather than taking expensive measurements.

As a general rule, inputs which have such a strong (symmetric) predictive power score indicate that "something is going on". Best practice here is to explore the relationship between these variables.

This is an example of "data leakage", where you unknowingly implant direct information about the outputs into the inputs.
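One simple automated check, sketched below, is to scan the PPS matrix for pairs that predict each other strongly in both directions. Here df is assumed to hold the body fat DataFrame, and the 0.9 threshold is illustrative:

import ppscore as pps

# Pivot the pairwise scores into a matrix: rows are targets, columns are features
m = pps.matrix(df)[["x", "y", "ppscore"]].pivot(columns="x", index="y", values="ppscore")

# Flag mutually predictive pairs - a common signature of leaked or derived columns
for a in m.columns:
    for b in m.columns:
        if a < b and m.loc[a, b] > 0.9 and m.loc[b, a] > 0.9:
            print(f"Check {a} <-> {b}: mutual PPS {m.loc[a, b]:.2f} / {m.loc[b, a]:.2f}")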

We have additional examples of this throughout the course. If you want great examples of data leakage, have a look at this Twitter feed.

Application 2. Predictive Power Score as part of your Machine Learning Workflow

Using the PPS in the example above, it is clear that there are two easily measurable features (Abdomen and Chest) which I can use to build a predictive model for Body Fat. PPS can therefore be used in an early exploration phase, where it can inform appropriate model selection.

Whilst machine learning models can work with large numbers of inputs (and outputs), the required amount of data increases with the number of input features. It is therefore prudent to select only those features which are most informative with respect to the outputs. This is where predictive power comes in.

Let's consider

\hat{y} = f({\bf x}) = w_1 x_1 + w_2 x_2 = {\bf w}^T{\bf x},

where x_1 represents the Abdomen and x_2 the Chest measurement.

Challenge for you. Can you write a bit of code to fit this Linear Model?

You can find the data set you need here . . . body fat link
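If you want to check your answer, here is one possible solution sketch using scikit-learn. The file path and column names ("Abdomen", "Chest", "BodyFat") are assumptions about the downloaded file:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("bodyfat.csv")  # placeholder path for the downloaded dataset

X = df[["Abdomen", "Chest"]]  # x1, x2
y = df["BodyFat"]

# fit_intercept=False matches y_hat = w1*x1 + w2*x2 exactly;
# in practice you would usually keep the intercept term
model = LinearRegression(fit_intercept=False).fit(X, y)
print("weights (w1, w2):", model.coef_)
print("R^2:", model.score(X, y))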