by Dr Mikkel Lykkegaard


Bayesian Linear Regression and Parameter Estimation

7. Bayesian Parameter Estimation

📑 Learning Objectives
  • Formulate the general data-generating model for Bayesian parameter estimation.
  • Choose a prior and likelihood function for your Bayesian parameter estimation problem.
  • Identify algorithms for Bayesian inference, namely Markov Chain Monte Carlo (MCMC) and Variational Inference (VI).

Bayesian Parameter Estimation


The data-generating model

Let's first revisit the model from the previous lesson, namely

$$\mathbf{A}\mathbf{x} + \varepsilon = \mathbf{b}$$

This is written in the usual ordinary linear least squares notation, but in the Bayesian formalism it usually takes the form

$$d_{obs} = \mathcal{F}(\theta) + \varepsilon$$

where $d_{obs}$ is a vector of observations ($\mathbf{b}$ in the previous notation), $\theta$ is a vector of model parameters ($\mathbf{x}$ in the previous notation) for which we seek the posterior distribution, and $\mathcal{F}$ is the so-called forward operator, which describes how the model maps parameters $\theta$ to noise-free observations $d_{obs} - \varepsilon$. This operator can be anything, including a linear map as before, e.g. $\mathcal{F}(\theta) = \mathbf{A}\theta$.
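
As a concrete illustration, here is a minimal NumPy sketch of this data-generating model with a linear forward operator. The matrix, the "true" parameters and the noise level are all invented for the example.

```python
import numpy as np

# Toy data-generating model: d_obs = F(theta) + eps, with a linear forward
# operator F(theta) = A @ theta. All numbers here are made up for illustration.
rng = np.random.default_rng(42)

A = rng.normal(size=(20, 2))          # forward/design matrix: 20 observations, 2 parameters
theta_true = np.array([1.5, -0.7])    # "true" parameters we will try to recover
sigma_eps = 0.1                       # standard deviation of the Gaussian noise

def forward(theta):
    """Linear forward operator F(theta) = A theta."""
    return A @ theta

# Noise-free observations plus additive Gaussian noise.
d_obs = forward(theta_true) + rng.normal(scale=sigma_eps, size=A.shape[0])
```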

Another look at the likelihood and the prior

When the noise $\varepsilon$ is Gaussian, we can write the likelihood as $p(d_{obs}|\theta) = \mathcal{N}(d_{obs} \mid \mathcal{F}(\theta), \sigma_\varepsilon^2 \mathbf{I})$, or, more generally, $p(d_{obs}|\theta) = \mathcal{N}(d_{obs} \mid \mathcal{F}(\theta), \Sigma_\varepsilon)$, which is equivalent to

$$p(d_{obs}|\theta) = (2\pi)^{-d/2}\, |\Sigma_{\varepsilon}|^{-1/2} \exp\left( -\frac{1}{2}\,(d_{obs} - \mathcal{F}(\theta))^\intercal\, \Sigma_{\varepsilon}^{-1}\, (d_{obs} - \mathcal{F}(\theta)) \right)$$

where $\Sigma_\varepsilon$ is the covariance matrix of the Gaussian noise. However, the noise is not always Gaussian. For example, if we are modelling something countable (the number of cars passing a certain location, the number of animals in a nature reserve, etc.), we may want to use e.g. a Poisson likelihood instead.
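
Continuing the toy example above (the names forward, d_obs and sigma_eps come from that sketch), the Gaussian log-likelihood could be evaluated with SciPy as follows; we work in log-space to avoid numerical underflow.

```python
from scipy.stats import multivariate_normal

# Gaussian likelihood p(d_obs | theta) = N(d_obs | F(theta), Sigma_eps),
# here with the simple choice Sigma_eps = sigma_eps^2 * I.
Sigma_eps = sigma_eps**2 * np.eye(A.shape[0])

def log_likelihood(theta):
    """Log of the multivariate normal density with mean F(theta), evaluated at d_obs."""
    return multivariate_normal.logpdf(d_obs, mean=forward(theta), cov=Sigma_eps)
```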

After correctly identifying the noise model and hence the likelihood function, the task is now to choose a prior for the model parameters $\theta$, i.e. $p(\theta)$, to complete Bayes' Theorem:

$$p(\theta|d_{obs}) = \frac{p(\theta)\, p(d_{obs}|\theta)}{p(d_{obs})}$$

As we saw in the previous session, when the likelihood is Gaussian, the conjugate prior is also Gaussian. However, the conjugate prior may not always be the most adequate representation of our prior knowledge. Instead of choosing the conjugate prior for convenience, we have to put on our modelling glasses and make an informed choice for our prior.
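
For instance, we might express our prior knowledge as independent zero-mean Gaussians on the parameters. The sketch below (a continuation of the toy example, with an arbitrary prior standard deviation) combines this prior with the likelihood into an unnormalised log-posterior, which is all that MCMC needs.

```python
from scipy.stats import norm

# One possible modelling choice: independent zero-mean Gaussian priors on each
# parameter. The prior standard deviation is an arbitrary choice for this example.
prior_std = 2.0

def log_prior(theta):
    return norm.logpdf(theta, loc=0.0, scale=prior_std).sum()

def log_posterior(theta):
    # Unnormalised log-posterior: log p(theta) + log p(d_obs | theta).
    # The evidence p(d_obs) is a constant in theta and can be dropped for MCMC.
    return log_prior(theta) + log_likelihood(theta)
```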

Bringing out the big guns

The Bayesian formalism allows for using any prior you like with any noise model you like, and even allows for hierarchical models with hyperpriors on the prior parameters. However, the convenient closed-form solution that we encountered in the previous session is unfortunately a rarity. If we want to exploit the full flexibility of the Bayesian formalism, we have to introduce some bigger guns: inference algorithms.

As mentioned in the very first lesson, there are two principal methods for Bayesian parameter estimation, namely Markov Chain Monte Carlo (MCMC) and Variational Inference (VI). We had a brief look at MCMC in the previous walkthrough, and we will expand on that in the next session, where we will model COVID cases. MCMC is asymptotically exact and easy to implement, and for these reasons we will stick to MCMC in this course. VI is usually much faster than MCMC, but it requires a little more background to understand. Moreover, it only provides an approximation to the posterior, which is biased by the choice of variational distribution.
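
To make the idea of MCMC concrete, here is a bare-bones random-walk Metropolis sampler targeting the unnormalised log-posterior from the sketches above. It is a toy implementation with a fixed, hand-picked step size, not the sampler used in the walkthroughs.

```python
def metropolis(log_post, theta0, n_samples=5000, step=0.05):
    """Random-walk Metropolis: propose a Gaussian perturbation, accept or reject."""
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    samples = np.empty((n_samples, theta.size))
    for i in range(n_samples):
        proposal = theta + rng.normal(scale=step, size=theta.size)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio), computed in log-space.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples[i] = theta
    return samples

samples = metropolis(log_posterior, theta0=np.zeros(2))
posterior_mean = samples[1000:].mean(axis=0)  # discard burn-in, then summarise
```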