by Dr Mikkel Lykkegaard
Bayesian Linear Regression and Parameter Estimation
7. Bayesian Parameter Estimation
- Formulate the general data-generating model for Bayesian parameter estimation.
- Choose a prior and likelihood function for your Bayesian parameter estimation problem.
- Identify algorithms for Bayesian inference, namely Markov Chain Monte Carlo (MCMC) and Variational Inference (VI).
Bayesian Parameter Estimation
The data-generating model
Let's first revisit the model from the previous lesson, namely

$$\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$

This is written in the usual ordinary linear least squares notation, but in the Bayesian formalism it usually takes the form

$$\mathbf{d} = \mathcal{F}(\boldsymbol{\theta}) + \boldsymbol{\varepsilon},$$

where $\mathbf{d}$ is a vector of observations ($\mathbf{y}$ in the previous notation), $\boldsymbol{\theta}$ is a vector of model parameters ($\boldsymbol{\beta}$ in the previous notation) for which we seek the posterior distribution, and $\mathcal{F}$ is the so-called forward operator, which describes how the model maps parameters $\boldsymbol{\theta}$ to noise-free observations $\mathcal{F}(\boldsymbol{\theta})$. This operator can be anything, including a linear map as before, e.g. $\mathcal{F}(\boldsymbol{\theta}) = X \boldsymbol{\theta}$.
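To make this concrete, here is a minimal sketch of the data-generating model in Python, assuming a linear forward map and synthetic data (the design matrix, parameter values, and noise level are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Design matrix X and "true" parameters theta (illustrative values)
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
theta_true = np.array([1.0, 2.5])

def forward(theta):
    # Linear forward operator F(theta) = X @ theta; in general this
    # could be any map from parameters to noise-free observations.
    return X @ theta

# Observations d = F(theta) + Gaussian noise
noise_sigma = 0.3
d = forward(theta_true) + rng.normal(0.0, noise_sigma, size=X.shape[0])
```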
Another look at the likelihood and the prior
When the noise is Gaussian, we can write the likelihood as $p(\mathbf{d} \mid \boldsymbol{\theta}) = \mathcal{N}(\mathcal{F}(\boldsymbol{\theta}), \sigma^2 I)$, or, more generally, $p(\mathbf{d} \mid \boldsymbol{\theta}) = \mathcal{N}(\mathcal{F}(\boldsymbol{\theta}), \Sigma)$, which is equivalent to

$$p(\mathbf{d} \mid \boldsymbol{\theta}) = \frac{1}{\sqrt{(2\pi)^n \, \lvert \Sigma \rvert}} \exp\left( -\frac{1}{2} \left( \mathbf{d} - \mathcal{F}(\boldsymbol{\theta}) \right)^\top \Sigma^{-1} \left( \mathbf{d} - \mathcal{F}(\boldsymbol{\theta}) \right) \right),$$
where $\Sigma$ is the covariance matrix of the Gaussian noise and $n$ is the number of observations. However, the noise is not always Gaussian. For example, if we are trying to model something countable (the number of cars passing a certain location, the number of animals in a nature reserve, etc.), we may want to use e.g. a Poisson likelihood instead.
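Continuing the sketch above, the two noise models translate directly into log-likelihood functions (a minimal illustration, not the only way to set this up):

```python
from scipy import stats

def gaussian_loglike(theta, d, sigma=noise_sigma):
    # log N(d | F(theta), sigma^2 I); with a full covariance Sigma one
    # would instead use stats.multivariate_normal(mean=forward(theta),
    # cov=Sigma).logpdf(d).
    return stats.norm.logpdf(d, loc=forward(theta), scale=sigma).sum()

def poisson_loglike(theta, counts):
    # For countable data: log Poisson(counts | rate = F(theta));
    # note the forward operator must then return positive rates.
    rate = forward(theta)
    return stats.poisson.logpmf(counts, mu=rate).sum()
```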
After correctly identifying the noise model and hence the likelihood function, the task is now to choose a prior for the model parameters $\boldsymbol{\theta}$, i.e. to complete Bayes' Theorem:

$$p(\boldsymbol{\theta} \mid \mathbf{d}) = \frac{p(\mathbf{d} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathbf{d})}.$$
As we saw in the previous session, when the likelihood is Gaussian, the conjugate prior is also Gaussian. However, the conjugate prior may not always be the most adequate representation of our prior knowledge. Instead of choosing the conjugate prior for convenience, we have to put on our modelling glasses and make an informed choice for our prior.
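Once a prior has been chosen, the unnormalized log-posterior is simply the sum of the log-likelihood and the log-prior. A minimal sketch, continuing the code above and assuming an independent Gaussian prior picked purely for illustration:

```python
def log_prior(theta):
    # Example prior choice: independent N(0, 10^2) on each parameter
    # (a modelling decision, not dictated by conjugacy).
    return stats.norm.logpdf(theta, loc=0.0, scale=10.0).sum()

def log_posterior(theta, d):
    # log p(theta | d) = log p(d | theta) + log p(theta) + const;
    # the evidence p(d) is constant in theta and can be ignored
    # by sampling algorithms such as MCMC.
    return gaussian_loglike(theta, d) + log_prior(theta)
```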
Bringing out the big guns
The Bayesian formalism allows for using any prior you like with any noise model you like, and even allows for hierarchical models with hyperpriors on the prior parameters. However, the convenient closed-form solution that we encountered in the previous session is unfortunately a rarity. If we want to exploit the full flexibility of the Bayesian formalism, we have to bring out some bigger guns: inference algorithms.
As mentioned in the very first lesson, there are two principal methods for Bayesian parameter estimation, namely Markov Chain Monte Carlo (MCMC) and Variational Inference (VI). We had a brief look at MCMC in the previous walkthrough, and we will expand on it in the next session, where we will model COVID cases. MCMC is asymptotically exact and easy to implement, and for these reasons we will stick to it in this course. VI is usually much faster than MCMC, but it requires a little more background to understand. Moreover, it only provides an approximation to the posterior, one that is biased by the choice of variational distribution.
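To make the MCMC idea concrete before the next session, here is a minimal random-walk Metropolis sketch reusing the log_posterior defined above (not necessarily the sampler used in the walkthrough; the step size and chain length are illustrative):

```python
def metropolis(log_post, theta0, n_steps=5000, step=0.1):
    # Random-walk Metropolis: propose theta' ~ N(theta, step^2 I) and
    # accept with probability min(1, p(theta' | d) / p(theta | d)).
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + step * rng.normal(size=theta.size)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        chain[i] = theta
    return chain

samples = metropolis(lambda t: log_posterior(t, d), theta0=np.zeros(2))
# Discard burn-in before summarizing the posterior.
posterior_mean = samples[1000:].mean(axis=0)
```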