by Dr Ana Rojo-Echeburúa
Updated 28 February 2024
The Kernel Cookbook
Download the resources for this post here.
Kernels play a very important role in determining the characteristics of both the prior and posterior distributions of the Gaussian Process.
💭 But with many of them and countless variations and combinations, how do we choose the right one?
🧠 Key Learnings
In this tutorial, we'll take a step-by-step approach to understand kernels and covariance matrices.
We'll begin by laying the groundwork, explaining what kernels are and how they relate to covariance matrices. Then, we'll dive into practical applications with the Kernel Cookbook, exploring different kernel types and their uses.
We'll explore how to combine kernels through multiplication and addition, and we'll discuss stationary and non-stationary kernels when modelling different data patterns.
Finally, we'll touch on AutoML in twinLab, examining how kernels are selected automatically.
Recap from previous tutorial
In part one of the series, we introduced Gaussian Processes (GPs), a powerful framework for probabilistic modelling.
We discussed how GPs provide a non-parametric approach to regression and classification tasks. Unlike traditional parametric models, GPs offer a more flexible alternative. Rather than specifying a fixed set of parameters, GPs define a distribution over functions, enabling them to adapt to the complexity of the data.
One of the key advantages of GPs lies in their ability to model non-linear relationships without explicitly defining the functional form of the underlying process. This makes GPs particularly well-suited for tasks where the relationship between inputs and outputs is complex or unknown.
GPs provide not only point predictions but also measures of uncertainty associated with those predictions. This uncertainty quantification is crucial in decision-making processes, allowing users to make informed choices based on the reliability of the model predictions.
We also illustrated how GPs capture correlations within data.
→ Consider temperature as an example: it behaves like a spreading wave - nearby temperatures are similar because of heat flow, and they gradually become less similar as you move away.
This correlation structure can be described by the radial basis function kernel that we briefly covered in the last tutorial, which exhibits perfect correlation locally and decays with a length-scale.
Figure 1.
Sample functions drawn from the posterior distribution of a Gaussian Process using an RBF kernel.
→ Another example is tides, which occur roughly every 12 hours, leading to strong periodic correlations. The height of water in a river, for instance, correlates strongly every 12 hours, exhibiting oscillatory behaviour.
This can be described by the periodic kernel.
Figure 2.
Sample functions drawn from the posterior distribution of a Gaussian Process using a Periodic kernel.
In GP modelling, we aim to capture these correlation processes. We can use different kernel functions to model various correlation structures, such as linear, decaying, or periodic ones. These functions can be combined, added, or multiplied to create models with diverse correlation structures.
The heart of good GP modelling lies in understanding and selecting appropriate kernel functions to capture the underlying patterns in the data effectively.
That's why this "Kernel Cookbook" is important - it is a comprehensive guide to understanding different types of kernels and their applications.
This will allow you to choose the most appropriate kernel for your specific problem and to model complex relationships and patterns in the data more effectively.
The Kernel and the Covariance Matrix
A kernel, also known as a covariance function or similarity function, measures how similar pairs of input data points are to each other.
More formally, a kernel $k(x, x')$ specifies the covariance between the function values $f(x)$ and $f(x')$ at two input points $x$ and $x'$.
💭 But how do we compute the covariance matrix?
Let’s start with a simple example before we give a mathematical definition.
Let's start with two points, $x_1$ and $x_2$, and we are going to consider this kernel:
$$k(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right), \quad \text{where } \sigma = \ell = 1.$$
→ This is the Radial Basis Function kernel with its hyperparameters set to 1.
Let's calculate $k(x_1, x_1)$, $k(x_2, x_2)$, and $k(x_1, x_2)$:
$$k(x_1, x_1) = k(x_2, x_2) = \exp(0) = 1, \qquad k(x_1, x_2) = k(x_2, x_1) = \exp\left(-\frac{(x_1 - x_2)^2}{2}\right)$$
The covariance matrix in this case is constructed as follows:
$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) \\ k(x_2, x_1) & k(x_2, x_2) \end{bmatrix} = \begin{bmatrix} 1 & \exp\left(-\frac{(x_1 - x_2)^2}{2}\right) \\ \exp\left(-\frac{(x_1 - x_2)^2}{2}\right) & 1 \end{bmatrix}$$
→ This is the covariance matrix for our simple example with two data points, using a Radial Basis Function kernel where the hyperparameters are set to $\sigma = \ell = 1$.
Let’s have a look now at a formal definition:
Given a set of input data points $X = \{x_1, x_2, \ldots, x_n\}$, we can use a kernel function $k(x, x')$ to construct a covariance matrix $K$.
The covariance matrix $K$ captures the pairwise covariances between all pairs of data points in $X$; in other words, the covariance matrix shows how much each pair of data points in $X$ is related to each other.
The elements of the covariance matrix are computed using the kernel function $k(x_i, x_j)$ for each pair of data points $x_i$ and $x_j$ in $X$.
Specifically, $K_{ij} = k(x_i, x_j)$ represents the covariance between the function values at $x_i$ and $x_j$.
In matrix notation, the covariance matrix is defined as:
$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix}$$
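To make this concrete, here is a minimal NumPy sketch of the construction (the helper names `rbf_kernel` and `covariance_matrix` and the input values $x_1 = 1$, $x_2 = 2$ are purely illustrative, not taken from the tutorial's resources): it evaluates a kernel on every pair of points to fill in $K$.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """RBF kernel with both hyperparameters defaulting to 1."""
    return variance * np.exp(-((x1 - x2) ** 2) / (2 * lengthscale ** 2))

def covariance_matrix(X, kernel):
    """Build K by evaluating the kernel on every pair of points in X."""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Illustrative two-point example: x1 = 1, x2 = 2
X = np.array([1.0, 2.0])
K = covariance_matrix(X, rbf_kernel)
print(K)
# [[1.         0.60653066]
#  [0.60653066 1.        ]]
```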
Covariance Matrix Properties
The covariance matrix has several important properties:
- Symmetry: The covariance matrix is symmetric: $K_{ij} = K_{ji}$, because the kernel returns the same value whichever order you feed in the two points.
- Positive semidefinite: The covariance matrix is positive semidefinite, which means that all of its eigenvalues are non-negative. If you don’t know what eigenvalues are, don’t worry - you just need to know that they are special numbers related to matrices and that this property makes sure the covariance matrix behaves nicely and tells us valid things about how variables relate to each other.
- Inverse covariance matrix: If the covariance matrix can be inverted (meaning it's not singular or that the determinant is not zero), its inverse is called the precision matrix or inverse covariance matrix. This precision matrix is really important in GPs because it tells us about the relationships between variables and helps us make predictions.
- Determinant: The determinant of the covariance matrix gives us an idea of how spread out the data points are. A bigger determinant means the data is more spread out, like if you have a lot of variation. A smaller determinant means the data is more clustered together, like if it's more uniform or tightly grouped.
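If you want to check these properties for yourself, a quick NumPy sketch like the one below will do (the inputs and helper names are illustrative only):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    return variance * np.exp(-((x1 - x2) ** 2) / (2 * lengthscale ** 2))

X = np.array([0.0, 0.7, 1.5, 3.0])  # illustrative 1-D inputs
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                      # symmetry -> True
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # positive semidefinite (up to rounding) -> True
print(np.linalg.det(K))                         # determinant: how spread out the distribution is
precision = np.linalg.inv(K)                    # inverse covariance (precision) matrix
```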
💡The covariance matrix encodes the correlations between all pairs of data points, enabling the Gaussian Process to capture the underlying patterns and uncertainties in the data. Understanding kernels and their relationship to covariance matrices is fundamental in Gaussian Processes. Kernels define the similarity between data points, while covariance matrices encapsulate these similarities and correlations, providing the foundation for Gaussian Process modelling.
Connection between Covariance Definition and Function Space
In our previous tutorial we talked about two ways of understanding GPs - as a way of modelling correlation and distributions over functions.
Defining a covariance function allows us to sample functions with the desired correlation structure. However, many find it challenging to bridge the gap between the definition of covariance and the family of functions that have that structure.
→ This is where kernels come into play.
Kernels serve as the mathematical bridge, linking the covariance definition to the function space.
When you pick a kernel, you're not just deciding how data points are related; you're also setting up the characteristics of the functions you'll get. For example, the Radial Basis Function kernel that we covered in the previous tutorial, gives you really smooth functions. But with other kernels, you might get functions that aren't as smooth or continuous. So, when choosing a kernel, you have to think about more than just how the data points are connected—you also consider how smooth or continuous you want the functions to be.
Kernel Cookbook
Linear kernels
Linear kernels are one of the simplest yet effective types of kernels used in GPs.
Concept and properties
A linear kernel, also known as the dot product kernel, measures how similar two data points are by calculating their dot product. In other words, it checks how much they align with each other when you multiply their values together.
Mathematically, the linear kernel is defined as:
$$k(x, x') = x^\top x'$$
where $\top$ means transpose - that's how you multiply vectors!
Figure 3.
Sample functions drawn from the prior distribution of a Gaussian Process using a Linear kernel.
Example
Let's compute the covariance matrix using a linear kernel for two input vectors $x_1$ and $x_2$.
Using the formula of the covariance matrix, $K_{ij} = k(x_i, x_j) = x_i^\top x_j$:
The diagonal terms are just the squared norms of the two data inputs:
$$k(x_1, x_1) = x_1^\top x_1 = \|x_1\|^2, \qquad k(x_2, x_2) = x_2^\top x_2 = \|x_2\|^2$$
The covariance matrix is symmetric, so $k(x_2, x_1) = k(x_1, x_2) = x_1^\top x_2$, giving:
$$K = \begin{bmatrix} \|x_1\|^2 & x_1^\top x_2 \\ x_1^\top x_2 & \|x_2\|^2 \end{bmatrix}$$
💡 This is the result of computing the covariance matrix using a linear kernel for the given vectors.
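Here is a small NumPy sketch of this calculation; the two vectors $x_1 = (1, 2)$ and $x_2 = (3, 0)$ are made-up illustrative values, not the ones from the downloadable resources.

```python
import numpy as np

def linear_kernel(x1, x2):
    """Linear (dot product) kernel."""
    return np.dot(x1, x2)

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 0.0])

K = np.array([[linear_kernel(x1, x1), linear_kernel(x1, x2)],
              [linear_kernel(x2, x1), linear_kernel(x2, x2)]])
print(K)
# [[5. 3.]
#  [3. 9.]]
# The diagonal entries are the squared norms (5 and 9), and the off-diagonal
# entry x1.x2 = 3 appears twice because the matrix is symmetric.
```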
Applications
In linear regression tasks, linear kernels can be employed to capture linear dependencies between features and predict the target variable. In this case, we want to understand how input data relates to a continuous target variable - the output that we want to predict.
For example, let's say we're trying to predict house prices based on features like size and number of bedrooms. A linear kernel could help us see how each feature (size, number of bedrooms) affects the price. If the size of a house increases, does the price also go up? Linear kernels help us find these kinds of relationships.
Advantages
- Interpretability: Linear kernels produce models that are easy to interpret since the relationship between inputs and outputs is linear. This makes it straightforward to understand the contribution of each characteristic of the data to the predictions.
- Computationally efficient: Training Gaussian Processes with linear kernels tends to be computationally efficient compared to more complex kernel functions. This efficiency makes linear kernels suitable for large-scale datasets.
Limitations
- Limited expressiveness: Linear kernels can only capture linear relationships between features and outputs. They may not be suitable for datasets with complex, non-linear patterns, as they cannot model non-linear dependencies.
- Underfitting: In cases where the true relationship between inputs and outputs is non-linear, using a linear kernel may lead to underfitting - when a model has not learned the patterns in the training data well and is unable to generalise to new data - resulting in poor predictive performance.
Figure 4.
Surface representation of the Linear kernel.
💡 Linear kernels provide a simple yet effective way to model linear relationships in GPs. While they offer advantages such as interpretability and computational efficiency, they may not be suitable for capturing complex, non-linear patterns in the data.
Polynomial kernels
Polynomial kernels are a popular choice for capturing non-linear relationships in Gaussian Processes.
Concept and properties
Polynomial kernels measure how similar two data points are by raising the dot product of their input vectors to a certain power $d$, where $d$ is the degree of the polynomial.
Mathematically, the polynomial kernel between two vectors $x$ and $x'$ is defined as:
$$k(x, x') = (x^\top x' + c)^d$$
Here, $c$ is an optional coefficient, and $d$ is the degree of the polynomial.
- $d$: The degree of the polynomial determines the complexity of the decision boundary or the curvature of the function learned by the Gaussian Process. Higher degrees allow for more complex interactions between features but also increase the risk of overfitting - essentially, the model learns the training data too well, to the point where it doesn't generalise well to new, unseen data.
- $c$: The coefficient $c$ is like a volume knob for the lower-order terms - it adjusts how loud they are compared to the pure higher-order interactions. If $c = 0$, only the highest-degree interactions remain; as $c$ grows, the lower-order terms have more influence on the similarity measure.
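As a rough illustration of how $c$ and $d$ shape the similarity measure, here is a small NumPy sketch (the vectors and parameter values are invented for illustration):

```python
import numpy as np

def polynomial_kernel(x1, x2, c=1.0, d=2):
    """Polynomial kernel: (x.x' + c) ** d."""
    return (np.dot(x1, x2) + c) ** d

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 0.0])   # dot product x1.x2 = 3

print(polynomial_kernel(x1, x2, c=1.0, d=2))  # (3 + 1) ** 2 = 16.0
print(polynomial_kernel(x1, x2, c=1.0, d=3))  # (3 + 1) ** 3 = 64.0 - a higher degree amplifies interactions
print(polynomial_kernel(x1, x2, c=0.0, d=2))  # 3 ** 2 = 9.0 - with c = 0 only the pure d-th order term remains
```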
Figure 5.
Sample functions drawn from the prior distribution of a Gaussian Process using a Polynomial kernel.
Advantages
- Polynomial kernels offer flexibility in capturing non-linear relationships without the computational cost of more complex kernel functions.
- They can model a wide range of non-linear patterns in the data, making them suitable for diverse machine learning tasks.
Limitations
- Choosing the appropriate degree and coefficient for the polynomial kernel can be challenging and may require tuning through cross-validation - a technique that helps you find the best combination of parameters for your specific dataset and problem and get the best performance of your model.
- Higher-degree polynomial kernels can lead to overfitting.
Figure 6.
Surface representation of the Polynomial kernel.
💡 Polynomial kernels are effective when modelling non-linear relationships in GPs. You can tweak the degree and coefficient to make the model more or less complex, matching it better to the specifics of your data.
Radial Basis Function (RBF) Kernels
Radial Basis Function (RBF) kernels, also known as Gaussian kernels, are widely used in Gaussian Processes due to their remarkable properties.
Concept and properties
The RBF kernel computes how similar two data points are based on the Euclidean distance between them - that is, the length of the straight line joining the two points. You can think of it as the distance you would travel if you could fly directly from one point to another, without any obstacles in your way.
Mathematically, the RBF kernel between two vectors $x$ and $x'$ is defined as:
$$k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$$
Here, $\|x - x'\|$ represents the Euclidean distance between the two vectors and can be calculated by the formula
$$\|x - x'\| = \sqrt{\sum_{k=1}^{n} (x_k - x_k')^2}$$
where $x_k$ and $x_k'$ are the $k$-th coordinates of the two vectors.
Too complicated?
Let's compute the Euclidean distance for two example 2-D points, $x = (1, 2)$ and $x' = (4, 6)$:
$$\|x - x'\| = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5$$
Now back to the RBF kernel.
- $\ell$: the lengthscale, which determines the length of the 'wiggles' in your function.
- $\sigma^2$: the output variance, a hyperparameter that sets the overall vertical scale of the kernel and determines the average distance of your function away from its mean.
This kernel is characterised by its smoothness and infinite support (meaning it assigns a non-zero similarity to every pair of data points, no matter how far apart they are), making it suitable for modelling functions that have smooth and continuous behaviour.
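Here is a minimal NumPy sketch of the RBF kernel, reusing the illustrative points from the distance example above, showing how the lengthscale $\ell$ controls how quickly the similarity decays:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """RBF kernel: variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dist = np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)
    return variance * np.exp(-sq_dist / (2 * lengthscale ** 2))

x, x_prime = np.array([1.0, 2.0]), np.array([4.0, 6.0])  # Euclidean distance 5, as above
print(rbf_kernel(x, x_prime, lengthscale=1.0))  # ~3.7e-06: far apart relative to a lengthscale of 1
print(rbf_kernel(x, x_prime, lengthscale=5.0))  # ~0.61: a longer lengthscale keeps them correlated
```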
Figure 7.
Sample functions drawn from the prior distribution of a Gaussian Process using an RBF kernel.
Smooth function modelling
Due to its ability to capture smoothness, the RBF kernel is well-suited for modelling functions that exhibit smooth and continuous behaviour. It works particularly well at the following:
- Function interpolation: RBF kernels excel in tasks where the underlying function is smooth and continuous, such as function interpolation in signal processing or spatial modelling in geostatistics.
- Time-Series forecasting: In time-series forecasting, where the data exhibits temporal dependencies and smooth trends, RBF kernels can effectively model the underlying dynamics and make accurate predictions.
- Regression and classification: RBF kernels are commonly used in regression and classification tasks where the relationship between inputs and outputs is non-linear and smooth. They can capture complex patterns in the data and provide robust predictions.
Advantages
- RBF kernels offer a flexible and versatile approach to modelling non-linear relationships in Gaussian Processes.
- They can capture smooth and continuous functions, making them suitable for a wide range of applications.
Limitations
- Choosing the appropriate lengthscale $\ell$ (and output variance $\sigma^2$) for the RBF kernel can be challenging and may require careful tuning through cross-validation.
- RBF kernels may struggle with modelling discontinuous or highly oscillatory functions, as they prioritise smoothness in the predictions.
One last thought on the RBF kernel
Many people use this kernel when setting up a GP regression or classification model. It's a quick-and-easy solution that usually works well for smoothly curving functions, especially when the number of data points is large relative to the number of input dimensions.
However, if your function has sudden jumps or sharp corners or if it's not smooth in its first few derivatives, this method might not work as expected. It can lead to strange effects in the predictions, like the mean becoming zero everywhere or having strange wavy patterns. Even if your function is mostly smooth, this method might still struggle if there are small areas of non-smoothness in the data. This problem can be hard to detect, especially if you have more than two dimensions of data.
One clue that something's wrong is if the length-scale chosen by the model keeps getting smaller as you add more data. This usually means the model isn't quite right for the data you have!
Figure 8.
Surface representation of the RBF kernel.
Other types of kernels
In addition to linear, polynomial, and RBF kernels, there exist several other types of kernels that offer unique properties and applications in Gaussian Processes.
Let's briefly introduce three such kernels: the Matérn kernel, periodic kernel, and spectral mixture kernel.
Matérn kernel
The Matérn kernel is a flexible class of kernels that generalises the RBF kernel. It is characterised by two hyperparameters: the length scale $\ell$ and the smoothness parameter $\nu$.
Mathematically, the Matérn kernel is defined as:
$$k(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, d}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, d}{\ell}\right)$$
Here, $d = \|x - x'\|$ represents the distance between two data points, $\Gamma$ is the gamma function and $K_\nu$ denotes the modified Bessel function of the second kind - these are functions that describe how certain physical or mathematical systems behave, and that's all you need to know!
Something to consider is that closed-form solutions exist for this kernel when $\nu = 1/2, 3/2, 5/2, 7/2$, and so forth. Additionally, Matérn kernels with $\nu$ greater than $5/2$ are exceptionally smooth, and it is then essentially equivalent to use the Radial Basis Function (RBF) kernel. This makes $\nu = 1/2$, $3/2$ and $5/2$ the most popular choices.
The Matérn kernel is a popular choice in Gaussian Process regression due to its flexibility. For these common values of $\nu$, and writing $d = \|x - x'\|$ for the Euclidean distance between two points, the Matérn kernel takes on the following closed forms:
- For $\nu = 1/2$ (also known as the exponential kernel):
$$k(x, x') = \sigma^2 \exp\left(-\frac{d}{\ell}\right)$$
- For $\nu = 3/2$:
$$k(x, x') = \sigma^2 \left(1 + \frac{\sqrt{3}\, d}{\ell}\right) \exp\left(-\frac{\sqrt{3}\, d}{\ell}\right)$$
- For $\nu = 5/2$:
$$k(x, x') = \sigma^2 \left(1 + \frac{\sqrt{5}\, d}{\ell} + \frac{5 d^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5}\, d}{\ell}\right)$$
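A small NumPy sketch of these closed forms is shown below (the `matern_kernel` helper and the example distance are illustrative only). Note how, at the same distance, the correlation grows with $\nu$, reflecting the increasing smoothness:

```python
import numpy as np

def matern_kernel(x1, x2, nu=1.5, lengthscale=1.0, variance=1.0):
    """Matérn kernel via the closed forms for nu = 1/2, 3/2 and 5/2."""
    d = np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))  # Euclidean distance
    if nu == 0.5:
        return variance * np.exp(-d / lengthscale)
    if nu == 1.5:
        a = np.sqrt(3.0) * d / lengthscale
        return variance * (1.0 + a) * np.exp(-a)
    if nu == 2.5:
        a = np.sqrt(5.0) * d / lengthscale
        return variance * (1.0 + a + a ** 2 / 3.0) * np.exp(-a)
    raise ValueError("This sketch only covers nu = 1/2, 3/2 and 5/2")

for nu in (0.5, 1.5, 2.5):
    print(nu, matern_kernel(0.0, 1.0, nu=nu))  # correlation at distance 1 increases with nu
```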
Figure 9.
Sample functions drawn from the prior distribution of a Gaussian Process using a Matérn kernel.
Figure 10.
Surface representation of the Matérn kernel.
💡 The Matérn kernel offers a trade-off between smoothness and computational efficiency. By adjusting the smoothness parameter $\nu$, you can control the flexibility of the kernel and tailor it to the specific characteristics of the data.
Periodic kernel
The periodic kernel is designed to model periodic patterns in data, such as seasonal fluctuations in time-series data or periodic spatial patterns.
Mathematically, the periodic kernel is defined as:
$$k(x, x') = \sigma^2 \exp\left(-\frac{2 \sin^2\!\left(\pi\, |x - x'| / p\right)}{\ell^2}\right)$$
Here, $p$ represents the period of the periodic pattern (the distance between repetitions of the function), and $\ell$ is the length scale parameter, the same as in the RBF kernel.
💡 The periodic kernel works well when data repeats itself over and over again, like the seasons or the phases of the moon. Imagine you're studying temperature data over the course of a year. The periodic kernel helps you see how similar temperatures are at different times of the year, taking into account that summer temperatures are similar to other summers and winter temperatures are similar to other winters.
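The following NumPy sketch (with illustrative hyperparameter values) shows the defining behaviour: two points exactly one period apart are perfectly correlated again, while points half a period apart are the least correlated.

```python
import numpy as np

def periodic_kernel(x1, x2, period=1.0, lengthscale=1.0, variance=1.0):
    """Periodic kernel: variance * exp(-2 * sin^2(pi * |x - x'| / period) / lengthscale^2)."""
    sine = np.sin(np.pi * np.abs(x1 - x2) / period)
    return variance * np.exp(-2.0 * sine ** 2 / lengthscale ** 2)

print(periodic_kernel(0.0, 1.0, period=1.0))  # ~1.0: one full period apart, perfectly correlated again
print(periodic_kernel(0.0, 0.5, period=1.0))  # ~0.14: half a period apart, the least correlated
```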
Figure 11.
Sample functions drawn from the prior distribution of a Gaussian Process using a Periodic kernel.
Figure 12.
Surface representation of the Periodic kernel.
Spectral mixture kernel
The spectral mixture kernel is a flexible kernel that models complex functions as a weighted sum of simple sinusoidal components (so some components contribute more to the overall pattern than others). A sinusoidal function is a mathematical function that oscillates or repeats in a regular, wave-like pattern.
Mathematically, the (one-dimensional) spectral mixture kernel can be written as:
$$k(x, x') = \sum_{i=1}^{Q} w_i \exp\left(-\frac{(x - x')^2}{2 \ell_i^2}\right) \cos\!\big(2 \pi \mu_i (x - x')\big)$$
Here, $w_i$ represents the weight of the $i$-th component, $\ell_i$ denotes the length scale of the $i$-th component, $\mu_i$ is its frequency, and $Q$ is the number of components.
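Here is a minimal NumPy sketch of this one-dimensional form (the component weights, length scales and frequencies are invented for illustration):

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, weights, lengthscales, frequencies):
    """1-D spectral mixture kernel: a weighted sum of RBF-damped cosine components."""
    tau = x1 - x2
    value = 0.0
    for w, ell, mu in zip(weights, lengthscales, frequencies):
        value += w * np.exp(-tau ** 2 / (2.0 * ell ** 2)) * np.cos(2.0 * np.pi * mu * tau)
    return value

# Two components: a slow, long-range oscillation plus a faster, short-range one
print(spectral_mixture_kernel(0.0, 0.3,
                              weights=[1.0, 0.5],
                              lengthscales=[2.0, 0.5],
                              frequencies=[0.25, 2.0]))
```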
Figure 13.
Sample functions drawn from the prior distribution of a Gaussian Process using a Spectral Mixture kernel.
Figure 14.
Surface representation of the Spectral Mixture kernel.
💡The spectral mixture kernel is like a versatile tool for understanding functions that have different kinds of wavy patterns. Combining multiple sinusoidal components with different frequencies and weights, this kernel can capture a wide range of patterns and structures in the data.
Combining kernels
The kernels discussed so far are useful when dealing with homogeneous data types.
Homogeneous data types refer to data that are of the same kind or nature. In other words, all the data points in the dataset share similar characteristics or properties.
→ For example, if you have a dataset of temperatures recorded at different times, where each data point represents a temperature value, then the data is homogeneous because it's all temperature data.
💭However, what if you have multiple types of features and want to regress on all of them together?
One common approach is to combine kernels by multiplying or adding them together.
Multiplying kernels
Multiplying kernels together is a standard way to combine two kernels, especially when they are defined on different inputs to your function. Roughly speaking, multiplying two kernels can be thought of as an AND operation.
- Linear times Periodic: Multiplying a linear kernel with a periodic kernel results in periodic functions with increasing amplitude as we move away from the origin.
- Linear times Linear: Multiplying two linear kernels results in functions that are quadratic.
- Multidimensional Products: When you multiply two kernels that each depend on only one input dimension, you're essentially combining them to create a new kernel that considers variations across both dimensions. Imagine you have two separate measurements, like temperature and humidity. If you multiply kernels that capture the similarity of temperature values and humidity values separately, the resulting kernel will consider both factors together. So, for a given function value at a point $(x, y)$, it's expected to be similar to another function value at a point $(x', y')$ only if $x$ is close to $x'$ and $y$ is close to $y'$.
Mathematically, this operation is represented by multiplying the individual kernels for each dimension.
The resulting kernel function combines the effects of both dimensions to model the similarity between function values across the data points, i.e.,
$$k_{\text{product}}\big((x, y), (x', y')\big) = k_x(x, x') \times k_y(y, y')$$
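The sketch below illustrates both ideas with toy NumPy kernels (all values are illustrative): a Linear × Periodic product whose strength grows with the inputs, and a product across a temperature dimension and a humidity dimension that is only large when the points are close in both.

```python
import numpy as np

def linear_kernel(x1, x2):
    return x1 * x2

def periodic_kernel(x1, x2, period=1.0, lengthscale=1.0):
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / lengthscale ** 2)

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))

# Linear x Periodic: periodic correlation whose strength grows away from the origin
def linear_times_periodic(x1, x2):
    return linear_kernel(x1, x2) * periodic_kernel(x1, x2)

# Product across dimensions: points must be close in BOTH temperature AND humidity
def product_kernel(temp1, hum1, temp2, hum2):
    return rbf_kernel(temp1, temp2, lengthscale=5.0) * rbf_kernel(hum1, hum2, lengthscale=0.1)

print(linear_times_periodic(2.0, 3.0))         # periodic factor scaled by the dot product 2 * 3
print(product_kernel(20.0, 0.60, 21.0, 0.62))  # close in both dimensions -> high covariance (~0.96)
print(product_kernel(20.0, 0.60, 21.0, 0.95))  # far apart in humidity -> covariance collapses (~0.002)
```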
Adding kernels
Adding two kernels together can be thought of as an OR operation: the resulting kernel has a high value if either of the base kernels has a high value.
- Linear plus Periodic: Adding a linear kernel and a periodic kernel results in periodic functions with an increasing mean as we move away from the origin.
- RBF kernel plus White Noise: Combining an RBF kernel with White Noise results in a smooth function with some level of noise added. The RBF kernel captures smooth trends, while the White Noise adds random fluctuations, resulting in a function that is smooth overall but with occasional small, random variations.
- Adding across Dimensions: When you add kernels that depend only on a single input dimension, you're essentially combining them to create a new kernel that considers each dimension separately. Imagine you have two separate measurements, like temperature and humidity. If you add kernels that capture the similarity of temperature values and humidity values separately, the resulting kernel will consider both factors but treat them as independent dimensions. So, for a given function value at a point $(x, y)$, it's expressed as the sum of two separate functions: one that captures the behaviour along the $x$-axis and another that captures the behaviour along the $y$-axis. Mathematically, this operation is represented by adding the individual kernels for each dimension.
The resulting kernel function combines the effects of both dimensions to model the behaviour of the function across the entire input space, treating each dimension independently, i.e.,
$$k_{\text{sum}}\big((x, y), (x', y')\big) = k_x(x, x') + k_y(y, y')$$
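A companion sketch for addition (again with illustrative values): the summed kernel stays sizeable as long as the points are close in either dimension, mirroring the OR behaviour described above.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))

# Sum across dimensions: f(temp, hum) decomposes as f_temp(temp) + f_hum(hum),
# and the covariance stays high if the points are close in EITHER dimension
def sum_kernel(temp1, hum1, temp2, hum2):
    return rbf_kernel(temp1, temp2, lengthscale=5.0) + rbf_kernel(hum1, hum2, lengthscale=0.1)

print(sum_kernel(20.0, 0.60, 21.0, 0.62))  # close in both dimensions -> ~1.96
print(sum_kernel(20.0, 0.60, 45.0, 0.62))  # far apart in temperature, close in humidity -> still ~0.98
```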
Stationary vs non-stationary kernels
Let’s talk now about stationary and non-stationary kernels.
A stationary kernel assumes that the correlation structure remains constant across all data points. In other words, it implies that the degree of correlation between data points does not depend on their absolute positions but only on their relative distances.
In contrast, a non-stationary kernel allows the correlation structure to vary across all data points. This means that the degree of correlation between data points can change depending on their absolute positions.
→ Consider the example of a step function: in the flat regions, points are correlated over a very long distance, while over the step itself, correlation exists only over a short distance. This makes modelling step functions challenging for standard stationary Gaussian processes, as the correlation length scale changes abruptly over short distances.
| Stationary kernels | Non-stationary kernels |
|---|---|
| They are good for describing phenomena that behave consistently, but they're not so great at capturing sudden changes, like big bumps where the correlation varies significantly over small distances. | They adapt to changes in the correlation, making them more suitable for modelling localised features and phenomena with varying behaviour. |
💡 Understanding the distinction between stationary and non-stationary kernels is essential for selecting appropriate models that can effectively capture the correlation structures present in the data.
Stationary kernels
- RBF kernel: This is a classic example of a stationary kernel. It assumes that the correlation between data points decreases smoothly with increasing distance, without any dependence on their absolute positions.
- Matérn kernel: The Matérn kernel is another example of a stationary kernel. It generalises the RBF kernel and includes a parameter $\nu$ that controls the smoothness of the correlation function. As $\nu$ becomes large, the Matérn kernel becomes equivalent to the RBF kernel, retaining its stationary properties.
- Periodic kernel: The periodic kernel has no particular dependence on the absolute input value $x$; it is defined only by the relative position of one point to another (their distance). The covariance structure imposed by a periodic kernel is therefore stationary.
Non-Stationary kernels
- Linear kernel: The linear kernel is non-stationary even when used in isolation - its value depends on the absolute positions of the inputs, not just their separation (for example, the covariance with any point at $x = 0$ is always zero). When combined with other kernels, such as the periodic kernel, it can contribute to creating richer non-stationary correlation structures. For example, multiplying a linear kernel with a periodic kernel results in functions with periodic behaviour and increasing amplitude as we move away from the origin, indicating non-stationarity.
- Polynomial kernel: Similar to the linear kernel, the polynomial kernel can also contribute to non-stationary correlation structures when combined with other kernels. Multiplying a polynomial kernel with a periodic kernel, for instance, can lead to functions with periodic behaviour and quadratic or higher-order trends, indicating non-stationarity.
💡 It's important to note that the distinction between stationary and non-stationary kernels is not always clear-cut, and the categorisation may depend on the context and usage of the kernels. Additionally, some kernels may exhibit both stationary and non-stationary properties depending on the values of their hyperparameters or how they are combined with other kernels in composite models.
AutoML in twinLab
Imagine if there were a tool that could automatically select the most suitable kernel for your data, sparing you the need to manually experiment with different options.
💭 Wouldn't that be fantastic?
AutoML in twinLab does exactly that!
twinLab is our cloud-based platform for applying probabilistic Machine Learning to your simulations, experiments, or sensor data. This means it adds Uncertainty Quantification to your model outputs, so it's especially useful when you need to make predictive decisions based on limited or sparse data.
One of the key features of twinLab is its use of automated machine learning (AutoML) to determine the best kernel for your data. With twinLab, users don't need to manually select a kernel; instead, the platform employs model selection techniques to identify the kernel that best fits the data.
But more on this in part 3 of this series! Now, time for a little recap.
Summary of key concepts
In "The Kernel Cook Book," we have explored the fundamental concepts of kernels. Kernels play a crucial role in defining the similarity between data points and encoding prior knowledge about the problem domain. We have discussed various types of kernels, including linear, polynomial, RBF, Matérn, periodic, and spectral mixture kernels, as well as how to combine these kernels, each offering unique properties and applications in Gaussian Processes.
From regression and classification tasks to time-series forecasting and spatial modelling, Gaussian Processes equipped with appropriate kernels are excellent tools for modelling complex, non-linear relationships in data. They offer flexibility, robustness, and uncertainty quantification, making them suitable for diverse machine learning and statistical modelling tasks across various domains.
⏭️ What's next?
You can continue exploring different kernels with the Streamlit app in the resources section by playing around with different hyperparameters.
You can also read the written tutorial, where more examples are worked out, and we dive deeper into the mathematical formulation of some of the concepts explored in this video.
To delve deeper, you can plot combinations of kernels and explore the resulting graphs.
→ In the next tutorial of this series, we'll dive into a hands-on example in Python, utilising twinLab. We'll explore how to understand the factors and features that drive outcomes, optimise sampling, and automate kernel selection.
🤖 Resources
The complete code for the Streamlit Apps, requirements.txt
file and other resources can be downloaded from the resource panel at the top of this article.
Thank you for joining me on this journey. I'm looking forward to seeing you on our next adventure. Until then, happy coding, and goodbye. 👱🏻♀️