
by Dr Mikkel Lykkegaard


Intro to Bayes

3. Unpacking Bayes' Theorem: Prior, Likelihood and Posterior

📑 Learning Objectives
  • Identify the various Bayesian schools of thought: objective, subjective, and strongly subjective.
  • Understand the different roles of the prior.
  • Define the likelihood function and understand how it differs from a probability distribution.
  • Appreciate the challenge of estimating the evidence / marginal likelihood.

Unpacking Bayes' Theorem


We will now move on to the Bayesian interpretation of Bayes' Theorem. As mentioned in the previous session, Thomas Bayes himself was not a Bayesian, but simply saw Bayes' Theorem as a way to calculate conditional probabilities. So what does it mean to be Bayesian? The short answer is that it involves interpreting probability as an expression of belief, but the long answer is that it varies.

The Different Flavours of Bayesian

The different schools of Bayesian statistics can broadly be separated into the objective Bayesians, the subjective Bayesians and the strongly subjective Bayesians. The differences between the schools boil down to their view of the prior (we'll get more into that later).

  • Objective Bayesians: Believe that the prior should be non-informative, and designed in such a way that it avoids biasing the result. A lot of effort has been invested in this branch of Bayesian statistics, since unbiasedness is attractive to most statisticians. However, it is a rather complicated and theoretical topic, and we are not going to cover it in detail in this course. It is often achieved through the application of the Jeffreys Prior, which we will look into in the next session.
  • Subjective Bayesians: Believe that priors should be constructed on the basis of expert elicitation. That is, the prior should incorporate the current body of knowledge about the parameter, as described by experts in the field. This allows us to encode prior beliefs into the problem and to exploit previous literature on the topic. A good example is the probability of the sun coming up tomorrow, as explained in this XKCD comic. We are mainly going to take this approach in this course.
  • Strongly Subjective Bayesians: This school of thought takes the idea of probability as belief to the extreme. Since we can only ever express probability as subjective belief, anything goes, as long as you are explicit about which prior you are using, and you can qualify and defend your choice of prior. Your audience can then decide for themselves whether they buy into your analysis, since they are fully informed about your prior beliefs.

To put this into perspective, let's briefly revisit Bayes' Theorem for events, and then move on to expanding to continuous random variables.

For events

As explained in the previous session, for events $A$ and $B$, we can write Bayes' Theorem as:

$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$

where

  • $P(A)$ is the prior probability of $A$.
  • $P(B|A)$ is the likelihood of $B$ given $A$.
  • $P(B)$ is the marginal probability or evidence, and,
  • $P(A|B)$ is the posterior probability of $A$ given $B$.

The marginal probability can be expanded by summing over all possible events $A_i \in \mathcal{A}$ (assuming the $A_i$ are mutually exclusive and exhaustive):

$$P(B) = \sum_i P(A_i)P(B|A_i)$$
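As a quick illustration, here is a minimal sketch of this computation in Python, assuming a made-up two-event example (a rare disease and a reasonably accurate test; all numbers are arbitrary):

```python
# Minimal sketch of Bayes' Theorem for events.
# A_i: the two disease states; B: a positive test result. Numbers are made up.
priors = {"disease": 0.01, "healthy": 0.99}        # P(A_i)
likelihoods = {"disease": 0.95, "healthy": 0.05}   # P(B | A_i)

# Evidence: P(B) = sum_i P(A_i) P(B | A_i)
evidence = sum(priors[a] * likelihoods[a] for a in priors)

# Posterior: P(A_i | B) = P(A_i) P(B | A_i) / P(B)
posteriors = {a: priors[a] * likelihoods[a] / evidence for a in priors}
print(posteriors)  # {'disease': 0.161..., 'healthy': 0.838...}
```

Note how the evidence is just the sum of the numerators, which is what makes the posterior probabilities sum to 1.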

However, Bayes' Theorem can also be applied to continuous random variables. Let's have a look at that.

For continuous random variables

Suppose you have some sensor data $d$, and you want to perform Bayesian calibration of some model parameters $\theta$ using those sensor data. We can then find the posterior density of the model parameters conditional on the sensor data:

$$p(\theta|d) = \frac{p(\theta)p(d|\theta)}{p(d)}$$

The definitions are the same as above, but since the parameters $\theta$ are now continuous, the marginal likelihood becomes an integral:

$$p(d) = \int p(\theta)\, p(d|\theta) \: d\theta$$

This integral rarely has an analytical solution and is often intractable to compute. Workarounds for this challenge exist, including Markov Chain Monte Carlo (MCMC) and Variational Inference (VI), but those are both highly advanced methods that each warrant multiple lectures of their own. A more straightforward workaround is the use of conjugate priors, which we will have a brief look at in the next session.
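To see what the integral involves, here is a minimal sketch in one dimension, where numerical quadrature is still feasible. The model, data and noise level are all made up for illustration; the point is that this brute-force approach does not scale to high-dimensional $\theta$:

```python
# Sketch: computing the evidence p(d) by 1D quadrature for a toy model,
# where the data d are noisy observations of a single parameter theta.
import numpy as np
from scipy import stats
from scipy.integrate import quad

d = np.array([1.2, 0.9, 1.1])  # made-up "sensor data"

def prior(theta):
    # p(theta): a broad Gaussian prior (an assumption for this toy example)
    return stats.norm.pdf(theta, loc=0.0, scale=2.0)

def likelihood(theta):
    # p(d | theta): independent Gaussian noise around theta (also an assumption)
    return np.prod(stats.norm.pdf(d, loc=theta, scale=0.5))

# p(d) = integral of p(theta) p(d | theta) d(theta)
evidence, _ = quad(lambda t: prior(t) * likelihood(t), -np.inf, np.inf)

def posterior(theta):
    # p(theta | d), now properly normalised by the evidence
    return prior(theta) * likelihood(theta) / evidence
```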

The meaning of it all


Prior

As mentioned in the introduction to this session, the prior means different things to different types of Bayesian. However, the most common perspective is that it expresses a prior belief about the parameters. There are different ways to construct such a prior.

  • First, it could be extracted from the previous body of literature on the topic. For example, if you wanted to test whether vaccines actually work, it would be sensible to take into account the immense body of historical evidence that they do, in fact, work. This (fairly conclusive) historical evidence could then be encoded as a prior, so that it would require a lot of new data to change that conclusion.
  • Second, priors can be distilled through expert elicitation. In this case, domain experts are interrogated about their beliefs about the model parameters, and their knowledge is then converted into a prior distribution. If we wanted to calculate the probability that the sun is going to rise tomorrow, we might ask an astrophysicist for their opinion. They would look at their models, maybe check a table of the positions of heavenly bodies, and provide their prior belief as to whether the sun will rise tomorrow.
  • Third, the prior could express physical constraints on the model parameters. Maybe we are trying to infer the posterior distribution of the number of cyclists passing our house each day. We know that such a quantity must be a non-negative integer, and we can encode that knowledge into our prior.

In any case, the prior is necessary for Bayes' Theorem to do its job, and it is not always obvious how to define it. In the next session I will show you how the choice of prior affects the outcome, and give some practical guidance on the choice of prior.
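To make this concrete, here is a rough sketch of how each of the three constructions above might be encoded with scipy.stats. Every distribution choice and number here is hypothetical, purely to show the idea:

```python
# Hypothetical examples of the three prior constructions above.
from scipy import stats

# 1. Literature-based: a Beta prior on vaccine efficacy, concentrated near
#    high efficacy to reflect a large body of existing evidence (made-up numbers).
literature_prior = stats.beta(a=90, b=10)

# 2. Expert-elicited: an expert believes a parameter is "around 5, give or
#    take 1", which we might encode as a Gaussian.
expert_prior = stats.norm(loc=5.0, scale=1.0)

# 3. Constraint-based: a count of cyclists per day must be a non-negative
#    integer, so a Poisson prior respects that support.
count_prior = stats.poisson(mu=20)
```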

Likelihood

The likelihood is something that, broadly, measures the misfit between the model and the observations, as a function of the model parameters. If we ignore the prior and simply maximise the likelihood function, we get the Maximum Likelihood Estimate, which is equivalent to the frequentist parameter estimate.

Importantly, the likelihood looks a lot like a probability distribution, but it is not a probability distribution, since it does not integrate to 1. Let's see why.

Suppose you are flipping a coin a bunch of times and counting the heads. That process is called a Bernoulli process, and the number of heads is modelled using a binomial distribution:

$$k \sim \mathcal{B}(n, h)$$

here, the random variable $k$, the number of heads, follows a binomial distribution with parameters $n$, the number of trials, and $h$, the probability of getting a heads. This distribution has support $k \in \{0, \dots, n\}$.
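If you want a feel for this process, here is a minimal simulation sketch (the values of $n$ and $h$ are arbitrary):

```python
# Simulating the Bernoulli process: n coin flips with heads-probability h.
import numpy as np

rng = np.random.default_rng(seed=0)
n, h = 10, 0.5            # arbitrary trial count and heads-probability
k = rng.binomial(n, h)    # one realisation of k ~ B(n, h)
print(k)
```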

Suppose now we have completed an experiment and have a measurement of $k$ heads after $n$ trials. We would now like to compute the posterior probability distribution of $h$, i.e. the probability of getting a heads. If we fix $n$ (the number of trials), we get:

$$p(h|k) = \frac{p(h)\, \mathcal{B}(k|h)}{p(k)}$$

Since $\mathcal{B}(k|h)$ is now seen as a function of $h$, it is no longer a valid probability distribution. The binomial distribution is constructed such that the probability mass over all possible outcomes $k$ sums to 1. When we flip it around and make it a function of $h$, it is not even a discrete function anymore, but a continuous function with domain $h \in [0, 1]$.
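We can check this claim numerically. The sketch below, with arbitrary $n$ and $k$, shows that the binomial pmf sums to 1 over $k$, but integrates to $1/(n+1)$ (not 1) over $h$:

```python
# Checking that the binomial pmf is a distribution in k but not in h.
from scipy import stats
from scipy.integrate import quad

n, k = 10, 7  # arbitrary numbers of trials and heads

# As a function of k (fixed h), the pmf sums to 1: a valid distribution.
print(sum(stats.binom.pmf(kk, n, 0.5) for kk in range(n + 1)))  # 1.0

# As a function of h (fixed k), it does not integrate to 1: a likelihood.
area, _ = quad(lambda h: stats.binom.pmf(k, n, h), 0.0, 1.0)
print(area)  # 1/(n+1) = 0.0909..., not 1
```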

This is a bit of an intricate detail, and it surprises many people at first. It certainly surprised me. However, getting an intuition for what a likelihood function does and how it works is very important for the remainder of this course. I will give an example in the next session, where we will also look deeper into the coin flipping experiment.

Evidence

The evidence is, broadly, a normalisation constant which makes sure that the posterior is a valid probability distribution. Its deeper meaning is not very relevant for this course, but it is important to appreciate that it can be hard to calculate in practice, since it is an integral over all model parameters:

$$p(d) = \int p(\theta)\, p(d|\theta) \: d\theta$$

This makes Bayesian Inference particularly challenging in high dimensions. Different workarounds to this challenge exist, namely:

  • Conjugate priors: Using appropriate priors, the posterior will have a closed-form solution, and we can avoid calculating the evidence altogether (see the sketch after this list). This is fast and exact, and works for a variety of problems. However, most problems cannot be solved this way.
  • Variational Inference: Here we use a variational distribution that does have a closed-form solution to approximate the posterior. This is a very flexible approach, but it is only approximate and introduces some bias in the posterior.
  • Markov chain Monte Carlo: This is an exact method for drawing samples from the posterior. However, it can be extremely computationally costly, and should only be used when no other method is adequate.
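As a preview of the conjugate-prior workaround, here is a minimal sketch for the coin-flip model: a Beta prior on $h$ is conjugate to the binomial likelihood, so the posterior is available in closed form and the evidence never needs to be computed (the hyperparameters and data here are arbitrary):

```python
# Conjugate Beta-Binomial update: Beta(a, b) prior + k heads in n flips
# gives a Beta(a + k, b + n - k) posterior, with no evidence integral.
from scipy import stats

a, b = 2.0, 2.0   # Beta prior hyperparameters (an assumption)
n, k = 10, 7      # observed: k heads in n flips (made-up data)

posterior = stats.beta(a + k, b + (n - k))
print(posterior.mean())  # posterior mean of h: 9/14 = 0.642...
```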

Posterior

The posterior is the actual target of all these shenanigans. It expresses our beliefs about the model parameters, given that we have observed some data, combined with our prior beliefs about those parameters.

I am not going to dwell further on this. Let's instead have a look at an example.