by Dr Mikkel Lykkegaard
Bayesian Linear Regression and Parameter Estimation
5. Linear Least Squares Breakdown and Bayesian Linear Regression
- Identify ill-conditioned systems of equations.
- Use a ridge regressor to regularise the linear least squares solution.
- Find the Bayesian posterior distribution for a linear model with a Gaussian prior and likelihood.
Linear Least Squares Breakdown
A Motivating Example
Consider the system of equations

$$A \mathbf{x} = \bar{\mathbf{b}}$$

with

$$A = \begin{pmatrix} 0.16 & 0.10 \\ 0.17 & 0.11 \\ 2.02 & 1.29 \end{pmatrix}, \quad \bar{\mathbf{b}} = \begin{pmatrix} 0.26 \\ 0.28 \\ 3.31 \end{pmatrix}.$$

This is a minimal example from my favourite book on inverse problems, Hansen (2010). While this is a contrived example, it illustrates a very common issue with least squares estimates.

The least squares estimator for this equation is

$$\hat{\mathbf{x}} = (A^T A)^{-1} A^T \bar{\mathbf{b}} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
Now that looks fine at first glance, but what if I told you that the right hand side was noisy?
Our vector of measurements $\mathbf{b}$ had been perturbed by some noise $\mathbf{e}$, i.e. $\mathbf{b} = \bar{\mathbf{b}} + \mathbf{e}$, with

$$\mathbf{e} = \begin{pmatrix} 0.01 \\ -0.03 \\ 0.02 \end{pmatrix}.$$

This is not a huge amount of noise, but the least squares solution to $A \mathbf{x} = \mathbf{b}$ is in fact $\hat{\mathbf{x}} \approx (7.01, -8.40)^T$! We can modify the original equation a little to take the noise into account:

$$\mathbf{b} = A \mathbf{x} + \mathbf{e}$$

with $\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$, that is, the signal includes additive white Gaussian noise.
In this example, I knew exactly how much noise was added, so I could recover the true coefficients, but in reality you will not know the exact measurement errors. If you did, they wouldn't be errors! In the best case scenario, you will know the approximate variance of the noise.
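If you want to see this breakdown for yourself, here is a minimal NumPy sketch of the example above (the variable names are just illustrative):

```python
import numpy as np

# the system from the example above
A = np.array([[0.16, 0.10],
              [0.17, 0.11],
              [2.02, 1.29]])
b_exact = np.array([0.26, 0.28, 3.31])  # consistent with the true coefficients (1, 1)
e = np.array([0.01, -0.03, 0.02])       # small perturbation of the measurements
b_noisy = b_exact + e

# ordinary least squares solutions for the exact and the perturbed right-hand side
x_exact, *_ = np.linalg.lstsq(A, b_exact, rcond=None)
x_noisy, *_ = np.linalg.lstsq(A, b_noisy, rcond=None)

print("least squares, exact data:", x_exact)  # close to (1, 1)
print("least squares, noisy data:", x_noisy)  # roughly (7.01, -8.40)
```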
How did this precarious situation come to be? Well, as it turns out, $A$ is what is commonly referred to as ill-conditioned, meaning that (certain) large perturbations in the input do not change the output much. And conversely, small perturbations in the measurements can lead to large perturbations in the estimated coefficients.
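A quick way to diagnose this is the condition number of $A$, the ratio of its largest to its smallest singular value, which measures how sensitive the solution is to perturbations in the data. Continuing the sketch above:

```python
# the condition number quantifies how much relative errors in the data
# can be amplified when solving the system
print("condition number of A:", np.linalg.cond(A))  # a large value signals an ill-conditioned system
```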
Regularisation
A common approach to alleviate the above problem is ridge regression, also sometimes referred to as Tikhonov regularisation. The idea here is to add a shift to the diagonal of the moment matrix before inverting it, leading to the ridge estimator

$$\hat{\mathbf{x}}_{\lambda} = (A^T A + \lambda I)^{-1} A^T \mathbf{b},$$

where $\lambda$ is the ridge parameter, which determines the "strength" of the regularisation. With a suitably chosen $\lambda$, the ridge estimate lands much closer to the true parameters than the naïve least squares estimator.
This is equivalent to solving the regularised least squares problem

$$\hat{\mathbf{x}}_{\lambda} = \underset{\mathbf{x}}{\arg\min} \left\{ \| A \mathbf{x} - \mathbf{b} \|_2^2 + \lambda \| \mathbf{x} \|_2^2 \right\}.$$
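In NumPy, the ridge estimator for the noisy system above is a one-liner; the value of $\lambda$ used here is an arbitrary illustrative choice, not a recommendation.

```python
# ridge / Tikhonov regularised estimate of the coefficients;
# lam is an illustrative choice of the ridge parameter
lam = 1.0
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ b_noisy)
print("ridge estimate:", x_ridge)  # much closer to (1, 1) than the naive least squares fit
```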
That is all well and good, but it leaves open the question of how to choose $\lambda$. In practice, this is often done heuristically, according to certain well-established criteria, which you can study in much more detail in e.g. Hansen (2010). However, these methods still produce a single estimate, rather than a distribution of possible estimates. This is why we will now give this problem the Bayesian treatment.
A Bayesian Approach
To keep this section simple, we are going to make a few relatively strong assumptions. First, we will assume that the noise comes from a zero-mean Gaussian with known variance $\sigma^2$, i.e. $\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$. That allows us to write the likelihood as $p(\mathbf{b} \mid \mathbf{x}) = \mathcal{N}(A \mathbf{x}, \sigma^2 I)$. Furthermore, we assume that the prior distribution of the model parameters is also Gaussian, $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}_0, \Sigma_0)$. That allows us to write the posterior distribution as (you guessed it!) a Gaussian,

$$p(\mathbf{x} \mid \mathbf{b}) = \mathcal{N}(\mathbf{x}_*, \Sigma_*),$$

with

$$\Sigma_* = \left( \Sigma_0^{-1} + \sigma^{-2} A^T A \right)^{-1}, \qquad \mathbf{x}_* = \Sigma_* \left( \Sigma_0^{-1} \mathbf{x}_0 + \sigma^{-2} A^T \mathbf{b} \right).$$

Please refer to e.g. Bishop (2006) for more details. Note that for a zero-mean prior with $\Sigma_0 = \tau^2 I$, the posterior mean $\mathbf{x}_*$ is exactly the ridge estimate with $\lambda = \sigma^2 / \tau^2$, so the Bayesian approach contains ridge regression as a special case. The difference is that for the problem outlined above, we now obtain not just a point estimate but a full posterior distribution over the coefficients, whose covariance tells us how strongly the data actually constrain them.
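Here is a short NumPy sketch of these formulas, continuing the example from before (it reuses `A` and `b_noisy` from the first sketch); the noise standard deviation and the prior below are illustrative assumptions, not values prescribed by the example.

```python
# Gaussian posterior for the linear model, using the formulas above;
# sigma and the prior are illustrative assumptions
sigma = 0.02          # assumed noise standard deviation
x_0 = np.zeros(2)     # prior mean
Sigma_0 = np.eye(2)   # prior covariance

Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_0) + (A.T @ A) / sigma**2)
x_post = Sigma_post @ (np.linalg.inv(Sigma_0) @ x_0 + (A.T @ b_noisy) / sigma**2)

print("posterior mean:", x_post)
print("posterior covariance:\n", Sigma_post)
```

Playing with `sigma` and `Sigma_0` is a nice way to see how the balance between the prior and the data shifts the posterior.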
References
Hansen, Per Christian. Discrete Inverse Problems: Insight and Algorithms. Fundamentals of Algorithms. Philadelphia: Society for Industrial and Applied Mathematics, 2010.
Bishop, Christopher M. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer, 2006.