6.2 Gaussian Processes
Created Date: 2025-05-18
Gaussian processes (GPs) are ubiquitous. You have already encountered many examples of GPs without realizing it. Any model that is linear in its parameters with a Gaussian distribution over those parameters is a Gaussian process. This class spans discrete models, such as random walks and autoregressive processes, as well as continuous models, including Bayesian linear regression, polynomials, Fourier series, radial basis functions, and even neural networks with an infinite number of hidden units. There is a running joke that "everything is a special case of a Gaussian process".
Learning about Gaussian processes is important for three reasons:
1. They provide a function-space perspective on modelling, which makes understanding a variety of model classes, including deep neural networks, much more approachable;
2. They have an extraordinary range of applications where they are state-of-the-art, including active learning, hyperparameter learning, AutoML, and spatiotemporal regression;
3. Over the last few years, algorithmic advances have made Gaussian processes increasingly scalable and relevant, harmonizing with deep learning through frameworks such as GPyTorch (Gardner et al., 2018).
Indeed, GPs and deep neural networks are not competing approaches, but highly complementary, and can be combined to great effect. These algorithmic advances are not just relevant to Gaussian processes, but provide a foundation in numerical methods that is broadly useful in deep learning.
In this chapter, we introduce Gaussian processes. In the introductory notebook, we start by reasoning intuitively about what Gaussian processes are and how they directly model functions. In the priors notebook, we focus on how to specify Gaussian process priors. We directly connect the traditional weight-space approach to modelling to the function-space view, which will help us reason about constructing and understanding machine learning models, including deep neural networks.
We then introduce popular covariance functions, also known as kernels, which control the generalization properties of a Gaussian process. A GP with a given kernel defines a prior over functions. In the inference notebook, we will show how to use data to infer a posterior, in order to make predictions. This notebook contains from-scratch code for making predictions with a Gaussian process, as well as an introduction to GPyTorch.
In upcoming notebooks, we will introduce the numerics behind Gaussian processes, which are useful both for scaling Gaussian processes and as a powerful general foundation for deep learning, as well as advanced use cases such as hyperparameter tuning in deep learning. Our examples will make use of GPyTorch, which makes Gaussian processes scale and is closely integrated with deep learning functionality and PyTorch.
6.2.1 Introduction to GP
In many cases, machine learning amounts to estimating parameters from data. These parameters are often numerous and relatively uninterpretable — such as the weights of a neural network. Gaussian processes, by contrast, provide a mechanism for directly reasoning about the high-level properties of functions that could fit our data.
For example, we may have a sense of whether these functions are quickly varying, periodic, involve conditional independencies, or translation invariance. Gaussian processes enable us to easily incorporate these properties into our model, by directly specifying a Gaussian distribution over the function values that could fit our data.
Let’s get a feel for how Gaussian processes operate, by starting with some examples.
Suppose we observe the following dataset, of regression targets (outputs) \(y\), indexed by inputs \(x\). As an example, the targets could be changes in carbon dioxide concentrations, and the inputs could be the times at which these targets have been recorded.
What are some features of the data? How quickly does it seem to be varying? Are the data points collected at regular intervals, or are there missing inputs? How would you imagine filling in the missing regions, or forecasting up until \(x = 25\)?

Figure 1 - Observed Data
In order to fit the data with a Gaussian process, we start by specifying a prior distribution over what types of functions we might believe to be reasonable. Here we show several sample functions from a Gaussian process. Does this prior look reasonable?
Note that here we are not looking for functions that fit our dataset, but instead specifying reasonable high-level properties of the solutions, such as how quickly they vary with the inputs. We will see code for reproducing all of the plots in this notebook in the next notebooks on priors and inference.

Figure 2 - Sample Prior Functions
Once we condition on data, we can use this prior to infer a posterior distribution over functions that could fit the data. Here we show sample posterior functions.

Figure 3 - Sample Posterior Functions
We see that each of these functions is entirely consistent with our data, running perfectly through each observation. In order to use these posterior samples to make predictions, we can average the values of every possible sample function from the posterior, creating the curve below, in thick blue. Note that we do not actually have to take an infinite number of samples to compute this expectation; as we will see later, we can compute the expectation in closed form.

Figure 4 - Posterior Samples
We may also want a representation of uncertainty, so we know how confident we should be in our predictions. Intuitively, we should have more uncertainty where there is more variability in the sample posterior functions, as this tells us there are many more possible values the true function could take. This type of uncertainty is called epistemic uncertainty, which is the reducible uncertainty associated with lack of information. As we acquire more data, this type of uncertainty disappears, as there will be increasingly fewer solutions consistent with what we observe. As with the posterior mean, we can compute the posterior variance (the variability of these functions in the posterior) in closed form. The shaded region shows two times the posterior standard deviation on either side of the mean, creating a credible interval that has a 95% probability of containing the true value of the function for any input \(x\).

Figure 5 - Posterior Samples, Including 95% Credible Set
The plot looks somewhat cleaner if we remove the posterior samples, simply visualizing the data, posterior mean, and 95% credible set. Notice how the uncertainty grows away from the data, a property of epistemic uncertainty.

Figure 6 - Point Predictions, and Credible Set
The properties of the Gaussian process that we used to fit the data are strongly controlled by what is called a covariance function, also known as a kernel. The covariance function we used is called the RBF (Radial Basis Function) kernel, which has the form
\[k(x, x') = a^2 \exp\left(-\frac{1}{2\ell^2}\|x - x'\|^2\right)\]
The hyperparameters of this kernel are interpretable. The amplitude parameter \(a\) controls the vertical scale over which the function is varying, and the length-scale parameter \(\ell\) controls the rate of variation (the wiggliness) of the function. Larger \(a\) means larger function values, and larger \(\ell\) means more slowly varying functions. Let’s see what happens to our sample prior and posterior functions as we vary \(a\) and \(\ell\).
The length-scale has a particularly pronounced effect on the predictions and uncertainty of a GP. At \(\|x - x'\| = \ell\), the covariance between a pair of function values is \(a^2 \exp(-0.5)\). At distances larger than \(\ell\), the function values become nearly uncorrelated. This means that if we want to make a prediction at a point \(x_*\), then function values at inputs \(x\) such that \(\|x - x_*\| > \ell\) will not have a strong effect on our predictions.
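To make this concrete, here is a minimal NumPy sketch of the RBF kernel above; the function name rbf_kernel and the example inputs are ours, chosen to show how the covariance decays with distance relative to the length-scale.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, lengthscale=2.0):
    # k(x, x') = a^2 * exp(-||x - x'||^2 / (2 * ell^2)) for 1-D inputs
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / lengthscale**2)

x = np.array([0.0, 2.0, 10.0])
K = rbf_kernel(x, x)   # 3 x 3 covariance matrix of the function values at x
print(K[0, 1])         # inputs one length-scale apart: a^2 * exp(-0.5), about 0.61
print(K[0, 2])         # inputs five length-scales apart: ~3.7e-6, nearly uncorrelated
```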
Let’s see how changing the length-scale affects sample prior and posterior functions, and credible sets. The above fits use a length-scale of 2. Let’s now consider \(\ell = 0.1, 0.5, 2, 5, 10\). A length-scale of 0.1 is very small relative to the range of the input domain we are considering, 25.
For example, the values of the function at \(x = 5\) and \(x = 10\) will have essentially no correlation at such a length-scale. On the other hand, for a length-scale of 10, the function values at these inputs will be highly correlated. Note that the vertical scale changes in the following figures.


Figure 7 - Lengthscale \(\ell = 0.100\) Amplitude \(a = 1\)


Figure 8 - Lengthscale \(\ell = 2.000\) Amplitude \(a = 1\)
Notice that as the length-scale increases, the 'wiggliness' of the functions decreases, and so does our uncertainty. If the length-scale is small, the uncertainty quickly increases as we move away from the data, as the data points become less informative about the function values.
Now, let’s vary the amplitude parameter, holding the length-scale fixed at 2. Note the vertical scale is held fixed for the prior samples, and varies for the posterior samples, so you can clearly see both the increasing scale of the function, and the fits to the data.


Figure 9 - Lengthscale \(\ell = 2\) Amplitude \(a = 2.000\)


Figure 10 - Lengthscale \(\ell = 2\) Amplitude \(a = 8.000\)
We see the amplitude parameter affects the scale of the function, but not the rate of variation. At this point, we also have the sense that the generalization performance of our procedure will depend on having reasonable values for these hyperparameters.
Values of \(\ell = 2\) and \(a = 1\) appeared to provide reasonable fits, while some of the other values did not. Fortunately, there is a robust and automatic way to specify these hyperparameters, using what is called the marginal likelihood, which we will return to in the notebook on inference.
So what is a GP, really? As we said at the start, a GP simply says that any collection of function values \(f(x_1), \dots, f(x_n)\), indexed by any collection of inputs \(x_1, \dots, x_n\), has a joint multivariate Gaussian distribution. The mean vector \(\mu\) of this distribution is given by a mean function, which is typically taken to be a constant or zero. The covariance matrix of this distribution is given by the kernel evaluated at all pairs of the inputs \(x\):
\[f(x_1), \dots, f(x_n) \sim \mathcal{N}(\mu, K), \quad \text{with } K_{ij} = k(x_i, x_j).\]
The equation above specifies a GP prior. We can compute the conditional distribution of \(f(x)\) for any \(x\) given \(f(x_1), \dots, f(x_n)\), the function values we have observed. This conditional distribution is called the posterior, and it is what we use to make predictions.
In particular,
\[f(x) \mid f(x_1), \dots, f(x_n) \sim \mathcal{N}(m, s^2),\]
where
\[m = k(x,x_{1:n}) k(x_{1:n},x_{1:n})^{-1} f(x_{1:n})\]
\[s^2 = k(x,x) - k(x,x_{1:n})k(x_{1:n},x_{1:n})^{-1}k(x_{1:n},x),\]
where \(k(x,x_{1:n})\) is a \(1 \times n\) vector formed by evaluating \(k(x,x_{i})\) for \(i=1,\dots,n\), and \(k(x_{1:n},x_{1:n})\) is an \(n \times n\) matrix formed by evaluating \(k(x_i,x_j)\) for \(i,j=1,\dots,n\). Here \(m\) is what we can use as a point predictor for any \(x\), and \(s^2\) is what we use for uncertainty: if we want an interval with a 95% probability that \(f(x)\) is contained in it, we would use \(m \pm 2s\).
The predictive means and uncertainties for all the above figures were created using these equations. The observed data points provided \(f(x_1), \dots, f(x_n)\), and we chose a fine-grained set of \(x\) points at which to make predictions.
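As a sketch of how these equations translate into code, the snippet below implements the posterior mean and variance with NumPy, assuming noise-free observations, a zero mean function, and the RBF kernel; the training data here are illustrative, not the dataset shown in the figures.

```python
import numpy as np

def rbf_kernel(x1, x2, a=1.0, ell=2.0):
    return a**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

# Illustrative noise-free observations (not the data in the figures above).
train_x = np.array([1.0, 4.0, 7.0, 12.0])
train_y = np.sin(0.5 * train_x)
test_x = np.linspace(0, 25, 200)

K = rbf_kernel(train_x, train_x) + 1e-8 * np.eye(len(train_x))  # k(x_{1:n}, x_{1:n}), with jitter
K_star = rbf_kernel(test_x, train_x)                            # k(x, x_{1:n}) for every test x

mean = K_star @ np.linalg.solve(K, train_y)                     # m = k(x, x_{1:n}) K^{-1} f(x_{1:n})
var = rbf_kernel(test_x, test_x).diagonal() \
      - (K_star @ np.linalg.solve(K, K_star.T)).diagonal()      # s^2 for every test x
std = np.sqrt(np.clip(var, 0.0, None))

lower, upper = mean - 2 * std, mean + 2 * std                   # 95% credible interval
```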
Let’s suppose we observe a single datapoint, \(f(x_1)\), and we want to determine the value of \(f(x)\) at some \(x\). Because \(f(x)\) is described by a Gaussian process, we know the joint distribution over \((f(x), f(x_1))\) is Gaussian:
\[\begin{bmatrix} f(x) \\ f(x_1) \end{bmatrix} \sim \mathcal{N}\left(\mu, \begin{bmatrix} k(x,x) & k(x, x_1) \\ k(x_1,x) & k(x_1,x_1) \end{bmatrix} \right)\]
The off-diagonal expression \(k(x,x_1) = k(x_1,x)\) tells us how correlated the function values will be, and therefore how strongly \(f(x)\) is determined by \(f(x_1)\). We have already seen that if we use a large length-scale relative to the distance between \(x\) and \(x_1\), \(||x-x_1||\), then the function values will be highly correlated.
We can visualize the process of determining \(f(x)\) from \(f(x_1)\) both in the space of functions, and in the joint distribution over \(f(x_1), f(x)\).
Let's initially consider an \(x\) such that \(k(x, x_1) = 0.9\) and \(k(x, x) = 1\), meaning that the value of \(f(x)\) is strongly correlated with the value of \(f(x_1)\). In the joint distribution, the contours of constant probability will be relatively narrow ellipses.
Suppose we observe \(f(x_1) = 1.2\). To condition on this value of \(f(x_1)\), we can draw a horizontal line at 1.2 on our plot of the density, and see that the value of \(f(x)\) is mostly constrained to \([0.64, 1.52]\). We have also drawn this plot in function space, showing the observed point \(f(x_1)\) in orange, and 1 standard deviation of the Gaussian process predictive distribution for \(f(x)\) in blue, about the mean value of 1.08.
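The numbers in this example can be checked with the standard formulas for conditioning a bivariate Gaussian; the short sketch below does so, assuming \(k(x_1, x_1) = 1\) and a zero prior mean.

```python
import numpy as np

# Joint covariance of (f(x), f(x_1)): k(x, x) = 1, k(x, x_1) = 0.9, k(x_1, x_1) = 1 (assumed).
k_xx, k_xx1, k_x1x1 = 1.0, 0.9, 1.0
f_x1 = 1.2                                    # observed value of f(x_1)

cond_mean = (k_xx1 / k_x1x1) * f_x1           # 0.9 * 1.2 = 1.08
cond_var = k_xx - k_xx1**2 / k_x1x1           # 1 - 0.81 = 0.19
cond_std = np.sqrt(cond_var)                  # ~0.44

print(cond_mean - cond_std, cond_mean + cond_std)   # ~(0.64, 1.52), matching the text
```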
In typical machine learning, we specify a function with some free parameters (such as a neural network and its weights), and we focus on estimating those parameters, which may not be interpretable. With a Gaussian process, we instead reason about distributions over functions directly, which enables us to reason about the high-level properties of the solutions. These properties are controlled by a covariance function (kernel), which often has a few highly interpretable hyperparameters. These include the length-scale, which controls how rapidly the functions vary (how wiggly they are), and the amplitude, which controls the vertical scale over which the functions vary. Representing many different functions that can fit the data, and combining them all together into a predictive distribution, is a distinctive feature of Bayesian methods. Because there is greater variability between possible solutions far away from the data, our uncertainty intuitively grows as we move away from the data.
A Gaussian process represents a distribution over functions by specifying a multivariate normal (Gaussian) distribution over all possible collections of function values. Gaussian distributions are easy to manipulate, so we can find the distribution of one function value conditioned on the values of any set of other function values. In other words, if we observe a set of points, then we can condition on these points and infer a distribution over what the value of the function might look like at any other input. How we model the correlations between these points is determined by the covariance function, and is what defines the generalization properties of the Gaussian process. While it takes time to get used to Gaussian processes, they are easy to work with, have many applications, and help us understand and develop other model classes, like neural networks.
6.2.2 GP Priors
Understanding Gaussian processes (GPs) is important for reasoning about model construction and generalization, and for achieving state-of-the-art performance in a variety of applications, including active learning, and hyperparameter tuning in deep learning. GPs are everywhere, and it is in our interests to know what they are and how we can use them.
In this section, we introduce Gaussian process priors over functions. In the next notebook, we show how to use these priors to do posterior inference and make predictions. The next section can be viewed as “GPs in a nutshell”, quickly giving what you need to apply Gaussian processes in practice.
6.2.2.1 Definition
A Gaussian process is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. If a function \(f(x)\) is a Gaussian process with mean function \(m(x)\) and covariance function, or kernel, \(k(x, x')\), written \(f(x) \sim \mathcal{GP}(m, k)\), then any collection of function values queried at any collection of input points \(x\) (times, spatial locations, image pixels, etc.) has a joint multivariate Gaussian distribution with mean vector \(\mu\) and covariance matrix \(K\):
\[f(x_1), \dots, f(x_n) \sim \mathcal{N}(\mu, K), \quad \mu_i = m(x_i), \quad K_{ij} = k(x_i, x_j).\]
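As a minimal sketch of this definition, the code below draws sample functions from a zero-mean GP prior with the RBF kernel by sampling the corresponding multivariate normal; the input grid and hyperparameter values are illustrative.

```python
import numpy as np

def rbf_kernel(x1, x2, a=1.0, ell=2.0):
    return a**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

x = np.linspace(0, 25, 100)                      # inputs at which we query the function
mu = np.zeros(len(x))                            # zero mean function
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))     # kernel matrix, with jitter for stability

# Any finite collection of function values is jointly Gaussian, so sample
# functions are simply draws from the corresponding multivariate normal.
prior_samples = np.random.multivariate_normal(mu, K, size=5)   # 5 sample prior functions
```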
6.2.2.2 A Simple Gaussian Process
Suppose \(f(x) = w_0 + w_1 x\), and \(w_0, w_1 \sim \mathcal{N}(0,1)\), with \(w_0, w_1, x\) all in one dimension. We can equivalently write this function as the inner product \(f(x) = (w_0, w_1)(1, x)^{\top}\).
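A quick way to see that this model is a Gaussian process is to note that, since the weights are independent with zero mean and unit variance, \(\text{cov}(f(x), f(x')) = \mathbb{E}[w_0^2] + \mathbb{E}[w_1^2]\,x x' = 1 + x x'\). The sketch below, with illustrative inputs, samples \(w_0, w_1\) and compares the empirical covariance of the function values to this induced kernel.

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])                   # a few illustrative inputs
n_samples = 200_000

w = np.random.randn(n_samples, 2)                # w_0, w_1 ~ N(0, 1), independent
f = w[:, [0]] + w[:, [1]] * x                    # f(x) = w_0 + w_1 x, shape (n_samples, 3)

empirical_cov = np.cov(f, rowvar=False)          # covariance of function values across draws
analytic_cov = 1.0 + np.outer(x, x)              # the induced kernel k(x, x') = 1 + x x'
print(np.round(empirical_cov, 2))
print(analytic_cov)                              # the two should agree closely
```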
6.2.2.3 From Weight Space to Function Space
6.2.2.4 The Radial Basis Function (RBF) Kernel
6.2.2.5 The Neural Network Kernel
6.2.3 GP Inference
6.2.3.1 Posterior Inference for Regression
6.2.3.2 Equations for Making Predictions
6.2.3.3 Interpreting Equations for Learning and Predictions
6.2.3.4 Worked Example from Scratch
6.2.3.5 Making Life Easy with GPyTorch
As we have seen, it is actually pretty easy to implement basic Gaussian process regression entirely from scratch. However, as soon as we want to explore a variety of kernel choices, consider approximate inference (which is needed even for classification), combine GPs with neural networks, or even have a dataset larger than about 10,000 points, then an implementation from scratch becomes unwieldy and cumbersome. Some of the most effective methods for scalable GP inference, such as SKI (also known as KISS-GP), can require hundreds of lines of code implementing advanced numerical linear algebra routines.
In these cases, the GPyTorch library will make our lives a lot easier. We'll be discussing GPyTorch more in future notebooks on Gaussian process numerics and advanced methods. The GPyTorch library contains many examples. To get a feel for the package, we will walk through the simple regression example, showing how it can be adapted to reproduce our above results using GPyTorch. This may seem like a lot of code to simply reproduce the basic regression above, and in a sense, it is. But we can immediately use a variety of kernels, scalable inference techniques, and approximate inference by changing only a few lines of the code below, instead of writing potentially thousands of lines of new code.
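To give a flavor of what that looks like, here is a sketch modeled on GPyTorch's simple exact-GP regression example; the synthetic data, training settings, and class name are ours for illustration, not the dataset or code used in the figures above.

```python
import torch
import gpytorch

# Illustrative training data: noisy observations of a smooth function.
train_x = torch.linspace(0, 1, 30)
train_y = torch.sin(train_x * 6.0) + 0.1 * torch.randn(train_x.size(0))

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Fit the kernel hyperparameters (length-scale, amplitude) and the noise
# by maximizing the marginal likelihood.
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Posterior predictions, with a roughly 95% (two standard deviation) region.
model.eval(); likelihood.eval()
test_x = torch.linspace(0, 1, 100)
with torch.no_grad():
    pred = likelihood(model(test_x))
    mean = pred.mean
    lower, upper = pred.confidence_region()
```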