by Dr Mikkel Lykkegaard

Bayesian Linear Regression and Parameter Estimation

6. Bayesian Linear Regression: A Coding Walkthrough

📂 Resources

Download the resources for this lesson here.

📑 Learning Objectives
  • Use ridge regression to regularise an ill-conditioned system of equations.
  • Compute the Bayesian posterior distribution analytically for a linear model with a Gaussian prior and likelihood.
  • Code a simple Markov Chain Monte Carlo (MCMC) algorithm and draw samples from the posterior distribution.

Bayesian Linear Regression in Practice: A Coding Walkthrough


Let's define an ill-conditioned linear least-squares problem and find an estimate of the regression coefficients using ordinary least squares.
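To fix notation (matching the arrays A and b and the coefficient vector beta in the code below), ordinary least squares picks the coefficients that minimise the squared residual norm:

$$\hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta} \; \lVert A\beta - b \rVert_2^2$$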

# import NumPy for the linear algebra and Matplotlib for plotting
import numpy as np
import matplotlib.pyplot as plt

# ill-conditioned least squares problem, see Hansen (2010)
A = np.array([[0.16, 0.10],
              [0.17, 0.11],
              [2.02, 1.29]])
b = np.array([0.27, 0.25, 3.33])

# find the ordinary least-squares solution
beta = np.linalg.lstsq(A, b, rcond=None)[0]

# print the solution
print(beta)
[ 7.00888731 -8.39566299]

This looks fine. It is a solution, but it turns out to be an outlier in the Bayesian posterior. The first thing we will do to try to find a better solution is to regularise the least-squares problem using ridge regression.
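For reference, ridge regression adds a penalty on the size of the coefficients, controlled by a regularisation parameter $\lambda$, and the resulting estimate has a closed form, which is what the loop below implements:

$$\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta} \; \lVert A\beta - b \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 = (A^\intercal A + \lambda I)^{-1} A^\intercal b$$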

# set the range of regularisation parameters that we test.
lambdas = np.logspace(-4,-3,101)

# set up a list to hold all the solutions.
betas = []

# iterate through the regularisation parameters.
for lamb in lambdas:
    # get the ridge estimate for the current regularisation parameters.
    beta = np.linalg.multi_dot((np.linalg.inv(np.dot(A.T, A) + lamb*np.eye(2)), A.T, b))
    # append it to the list of estimates.
    betas.append(beta)

# convert the list of estimates to a NumPy array.
betas = np.array(betas)

# plot the regularisation parameter against the norm of the ridge estimates.
plt.plot(lambdas, np.linalg.norm(betas, axis=1))
plt.show()
[Figure: norm of the ridge estimates plotted against the regularisation parameter]

Here we see the norm of the regression coefficients decaying as we increase the strength of the regularisation. Let's print out the regularised solution corresponding to $\lambda = 0.001$.

# print the beta corresponding to lambda = 0.001
print(betas[-1])
[1.19815118 0.70322534]

This is much closer to the "true" solution of $x = [1, 1]^\intercal$, but it is still a point estimate. Let's give the problem the Bayesian treatment.
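Concretely, we will place a Gaussian prior $\mathcal{N}(\mu_0, \Sigma_0)$ on the coefficients (mu_prior and cov_prior in the code) and assume Gaussian observation noise with covariance $\sigma^2 I$ (built from sigma_data). The posterior is then also Gaussian, and its covariance and mean follow from the standard conjugate update, which is exactly what the code below computes:

$$\Sigma_{\mathrm{post}} = \left(\Sigma_0^{-1} + \sigma^{-2} A^\intercal A\right)^{-1}, \qquad \mu_{\mathrm{post}} = \Sigma_{\mathrm{post}}\left(\Sigma_0^{-1}\mu_0 + \sigma^{-2} A^\intercal b\right)$$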

# import the multivariate normal for the MCMC
from scipy.stats import multivariate_normal

# set the prior mean
mu_prior = np.zeros(2)
# set the prior covariance
sigma_prior = 1
cov_prior = sigma_prior**2*np.eye(2)

# define the prior.
prior = multivariate_normal(mean=mu_prior,
                            cov=cov_prior)

# set the standard deviation of the observation noise.
sigma_data = 0.1
# set the covariance of the observation noise
cov_data = sigma_data**2*np.eye(3)

# define the likelihood
# NOTE: this is not strictly correct since the mean should be the model output
# and the support should be the observations, and here they are swapped. it is
# just slightly more convenient like this and it doesn't matter for the Gaussian
# since that is symmetric with respect to the mean and the support.
likelihood = multivariate_normal(mean=b,
                                 cov=cov_data)

# compute the posterior mean and covariance
# see Bishop's Pattern Recognition and Machine Learning
cov_post = np.linalg.inv(np.linalg.inv(cov_prior) + sigma_data**(-2)*np.dot(A.T, A))
mu_post = np.dot(cov_post, np.dot(np.linalg.inv(cov_prior), mu_prior) + sigma_data**(-2)*np.dot(A.T, b))

# define the posterior as a multivariate normal distribution
posterior = multivariate_normal(mean=mu_post, cov=cov_post)

# print the posterior mean
print(mu_post)
[1.17108637 0.74162625]

# print the posterior covariance
print(cov_post)
[[ 0.29074825 -0.45261217]
 [-0.45261217  0.71048368]]

Clearly, the posterior mean is very close to the ridge estimate. This is not a coincidence; there is a close connection between the two. However, we also get an estimate of the posterior covariance, which was missing from the ridge estimate: we get the entire distribution of plausible model parameters rather than a point estimate.
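As a quick sanity check of that connection: with a zero-mean Gaussian prior, the posterior mean coincides with the ridge estimate for the particular choice $\lambda = \sigma_{\mathrm{data}}^2 / \sigma_{\mathrm{prior}}^2$. The snippet below (the names lamb_map and beta_ridge are only introduced for this check) reuses A, b, sigma_data, sigma_prior and mu_post from above, and the difference should be zero up to numerical precision.

# ridge regularisation parameter implied by the prior and noise scales
lamb_map = sigma_data**2/sigma_prior**2

# ridge estimate with that regularisation parameter
beta_ridge = np.linalg.solve(np.dot(A.T, A) + lamb_map*np.eye(2), np.dot(A.T, b))

# compare with the analytical posterior mean; the difference should be ~0
print(beta_ridge - mu_post)

Let's plot the posterior and see what it looks like.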

# set up a grid to evaluate the posterior distribution.
x = np.linspace(-1, 4,1000)
y = np.linspace(-2, 3,1000)
X, Y = np.meshgrid(x,y)
grid = np.c_[X.reshape(-1,1), Y.reshape(-1,1)]

# evaluate the posterior on the grid
posterior_test = posterior.pdf(grid).reshape(X.shape)

# plot the posterior
plt.figure(figsize=(6,6))
plt.imshow(posterior_test, origin='lower', extent=(x.min(), x.max(), y.min(), y.max()))
plt.show()
[Figure: density of the analytical posterior distribution over the regression coefficients]

Here we can see that the "true" solution $x = [1, 1]^\intercal$ is covered by the posterior distribution and even has a fairly high density. The probable solutions are, however, distributed along a line, and there are clearly many parameter combinations that can reproduce the data. This is because the moment matrix $A^\intercal A$ is near-singular. Nevertheless, using the Bayesian approach we were able to identify this entire family of plausible solutions with very little effort.
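To put a number on "near-singular", we can look at the condition number of A, i.e. the ratio of its largest to its smallest singular value. A large value confirms that the columns of A are nearly linearly dependent, so the least-squares solution is very sensitive to noise in b:

# condition number of A (ratio of largest to smallest singular value).
# a large value confirms that the least-squares problem is ill-conditioned.
print(np.linalg.cond(A))

Let's finally set up an MCMC sampler and draw some samples from the posterior.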

# set up a Markov Chain Monte Carlo sampler to sample the posterior.

# set the preconditioned Crank-Nicolson (pCN) scaling
beta = 0.1
# set the number of samples
n_samples = 12000

# draw the initial sample
samples = [prior.rvs()]

# iterate through the number of samples
for i in range(n_samples):

    # generate a pCN proposal
    proposal = np.sqrt(1 - beta**2)*samples[-1] + beta*prior.rvs()

    # compute the acceptance probability. the pCN proposal preserves the prior,
    # so the prior terms cancel and only the likelihood ratio appears here.
    alpha = np.exp(likelihood.logpdf(np.dot(A, proposal)) - \
                   likelihood.logpdf(np.dot(A, samples[-1])))

    # draw a random number and accept/reject
    if np.random.uniform() < alpha:
        samples.append(proposal)
    else:
        samples.append(samples[-1])

# convert the samples (discarding a burn-in of 2000 samples) to a NumPy array
samples = np.array(samples[2000:])

# plot the posterior samples
plt.figure(figsize=(6,6))
plt.scatter(samples[:,0], samples[:,1], alpha=0.01, c='k')
plt.xlim(x.min(), x.max())
plt.ylim(y.min(), y.max())
plt.show()
[Figure: scatter plot of the MCMC samples from the posterior]

These samples indeed look like they come from the same distribution as the analytical solution, even though they were generated in a completely different way. In fact, they are exact samples from that distribution. But why would we use MCMC when we can just compute the exact posterior? Most often we are not so fortunate that an analytical posterior exists, but MCMC works regardless. We will have a closer look at that in the next lesson. Here are some good resources on MCMC for the curious among you:

References

Brooks, S. (Ed.). (2011). Handbook of Markov chain Monte Carlo. Taylor & Francis.

Liang, F., Liu, C., & Carroll, R. J. (2010). Advanced Markov chain Monte Carlo methods: Learning from past samples. Wiley.

Liu, J. S. (2008). Monte Carlo strategies in scientific computing (2nd ed.). Springer.