by Dr Mikkel Lykkegaard

Bayesian Linear Regression and Parameter Estimation

5. Linear Least Squares Breakdown and Bayesian Linear Regression

📑 Learning Objectives
  • Identify ill-conditioned systems of equations.
  • Use a ridge regressor to regularise the linear least squares solution.
  • Find the Bayesian posterior distribution for a linear model with a Gaussian prior and likelihood.

Linear Least Squares Breakdown


A Motivating Example

Consider the system of equations

{\bf Ax} = {\bf b}

with

{\bf A} = \begin{bmatrix}0.16 & 0.10 \\ 0.17 & 0.11 \\ 2.02 & 1.29\end{bmatrix} \quad \text{and} \quad {\bf b} = \begin{bmatrix}0.27 \\ 0.25 \\ 3.33\end{bmatrix}

This is a minimal example from my favourite book on inverse problems, Hansen (2010). While contrived, it illustrates a very common issue with least squares estimates.

The least squares estimator for this equation is

\hat{\bf x} = ({\bf A}^T {\bf A})^{-1} {\bf A}^T {\bf b} = \begin{bmatrix} 7.01 \\ -8.40 \end{bmatrix}.
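As a quick numerical sanity check (a minimal NumPy sketch, not part of the original example), we can reproduce this estimate directly:

```python
import numpy as np

# The system from the example above
A = np.array([[0.16, 0.10],
              [0.17, 0.11],
              [2.02, 1.29]])
b = np.array([0.27, 0.25, 3.33])

# Ordinary least squares via the normal equations
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(x_hat)  # approximately [ 7.01, -8.40]

# np.linalg.lstsq solves the same problem more robustly
x_hat_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_hat_lstsq)
```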

Now that looks fine at first glance, but what if I told you that the right-hand side {\bf b} was noisy?

Our vector of measurements {\bf b} had been perturbed by some noise \varepsilon, i.e. {\bf b} = {\bf b}_{true} + \varepsilon, with

\varepsilon = \begin{bmatrix}0.01 \\ -0.03 \\ 0.02\end{bmatrix}.

This is not a huge amount of noise, but the least squares solution to {\bf Ax} = {\bf b}_{true} is in fact \hat{\bf x}_{true} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}! We can modify the original equation a little to take the noise into account:

{\bf Ax} + \varepsilon = {\bf b}

with \varepsilon \sim \mathcal N({\bf 0}, \sigma^2 {\bf I}), that is, the signal {\bf b} includes additive white Gaussian noise.

In this example, I knew exactly how much noise was added, so I could recover the true coefficients, but in reality you will not know the exact measurement errors. If you did, they wouldn't be errors! In the best case scenario, you will know the approximate variance of the noise.

How did this precarious situation come to be? Well, as it turns out, {\bf A} is what is commonly referred to as ill-conditioned, meaning that (certain) large perturbations in the input {\bf x} do not change the output {\bf b} much. Conversely, small perturbations in the measurements can lead to large perturbations in the estimated coefficients.
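A rough way to quantify this, sketched here with NumPy, is the condition number of {\bf A}, i.e. the ratio of its largest to smallest singular value; a value far above 1 means noise in {\bf b} can be strongly amplified in the estimate:

```python
import numpy as np

A = np.array([[0.16, 0.10],
              [0.17, 0.11],
              [2.02, 1.29]])

# Condition number of A: ratio of its largest to smallest singular value.
# A large value means small perturbations in b can be greatly amplified in x.
print(np.linalg.cond(A))  # on the order of 10^3 for this matrix
```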

Regularisation

A common approach to alleviate the above problem is ridge regression, also sometimes referred to as Tikhonov regularisation. The idea here is to add a shift to the diagonal of the moment matrix {\bf A}^T {\bf A} before inverting it, leading to the ridge estimator

\hat{\bf x} = ({\bf A}^T {\bf A} + \lambda {\bf I})^{-1} {\bf A}^T {\bf b}

where \lambda is the ridge parameter, which determines the "strength" of regularisation. For example, for \lambda = 0.001, we get \hat{\bf x}_{\lambda=0.001} = \begin{bmatrix} 1.20 \\ 0.70 \end{bmatrix}, which is much closer to the true parameters than the naïve least squares estimator.
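As a minimal sketch of this computation (NumPy again, using the same \lambda as above):

```python
import numpy as np

A = np.array([[0.16, 0.10],
              [0.17, 0.11],
              [2.02, 1.29]])
b = np.array([0.27, 0.25, 3.33])
lam = 0.001  # ridge parameter

# Ridge (Tikhonov) estimator: shift the diagonal of A^T A before solving
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
print(x_ridge)  # approximately [1.20, 0.70]
```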

This is equivalent to solving the regularised least squares problem

\hat{\bf x} = \underset{\bf x}{\text{arg min}} \; \lVert{\bf b} - {\bf Ax}\rVert_2^2 + \lambda \lVert{\bf x}\rVert_2^2.

That is all well and good, but it leaves an open question: how do we choose \lambda? In practice, this is often done heuristically, according to certain well-established criteria, which one can study in much more detail in e.g. Hansen (2010). However, these methods still produce a single estimate, rather than a distribution of possible estimates. This is why we will now give this problem the Bayesian treatment.

A Bayesian Approach

To keep this section simple, we are going to make a few relatively strong assumptions. First, we will assume that the noise comes from a zero-mean Gaussian with known variance \sigma_\varepsilon^2, i.e. \varepsilon \sim \mathcal N({\bf 0}, \sigma_\varepsilon^2 {\bf I}). That allows us to write the likelihood as p({\bf b}|{\bf x}) = \mathcal N({\bf b} \mid {\bf Ax}, \sigma_\varepsilon^2 {\bf I}). Furthermore, we assume that the prior distribution of the model parameters is also Gaussian, p({\bf x}) = \mathcal N({\bf \mu}_0, {\bf \Sigma}_0). That allows us to write the posterior distribution as (you guessed it!) a Gaussian:

p({\bf x} \mid {\bf b}) = \mathcal N({\bf \mu}_t, {\bf \Sigma}_t)

with

{\bf \mu}_t = {\bf \Sigma}_t \left({\bf \Sigma}_0^{-1} {\bf \mu}_0 + \sigma_\varepsilon^{-2} {\bf A}^T {\bf b}\right) \quad \text{and} \quad {\bf \Sigma}_t^{-1} = {\bf \Sigma}_0^{-1} + \sigma_\varepsilon^{-2} {\bf A}^T{\bf A}

Please refer to e.g. Bishop (2006) for more details. For example, if, for the problem outlined above, we choose p({\bf x}) = \mathcal N({\bf 0}, {\bf I}) and \sigma_\varepsilon^2 = 0.01, we get

{\bf \mu}_t = \begin{bmatrix}1.17 \\ 0.74 \end{bmatrix} \quad \text{and} \quad {\bf \Sigma}_t = \begin{bmatrix}0.29 & -0.45 \\ -0.45 & 0.71\end{bmatrix}.
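A minimal NumPy sketch of this conjugate Gaussian update, assuming the prior and noise level chosen above, reproduces these numbers:

```python
import numpy as np

A = np.array([[0.16, 0.10],
              [0.17, 0.11],
              [2.02, 1.29]])
b = np.array([0.27, 0.25, 3.33])

# Gaussian prior N(mu_0, Sigma_0) and known noise variance sigma_eps^2
mu_0 = np.zeros(2)
Sigma_0 = np.eye(2)
sigma_eps2 = 0.01

# Posterior precision, covariance and mean (conjugate Gaussian update)
Sigma_t_inv = np.linalg.inv(Sigma_0) + A.T @ A / sigma_eps2
Sigma_t = np.linalg.inv(Sigma_t_inv)
mu_t = Sigma_t @ (np.linalg.inv(Sigma_0) @ mu_0 + A.T @ b / sigma_eps2)

print(mu_t)     # approximately [1.17, 0.74]
print(Sigma_t)  # approximately [[0.29, -0.45], [-0.45, 0.71]]
```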

References

Hansen, Per Christian. Discrete Inverse Problems: Insight and Algorithms. Fundamentals of Algorithms. Philadelphia: Society for Industrial and Applied Mathematics, 2010.

Bishop, Christopher M. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer, 2006.