Gaussian Processes for Regression

Priors over functions, kernels, posterior conditioning, hyperparameters, and acquisition.

A Gaussian process is the infinite-dimensional version of a multivariate normal. Pick any finite set of inputs $x_1,\ldots,x_n$ and the corresponding random function values are jointly Gaussian:

$$ f(x_{1:n}) \sim \mathcal{N}(m(x_{1:n}), K), \qquad K_{ij}=k(x_i,x_j). $$

The mean function says where functions live before data. The kernel says which inputs move together. Regression is just conditioning that joint Gaussian on observed values.

One sentence: a GP prior is a distribution over functions; after observations, the posterior mean passes through the data (or close to it when the observations are noisy), while the posterior variance pinches near observations and expands away from them.

1. Kernel zoo

The kernel encodes smoothness, length-scale, signal variance, periodicity, and stationarity. The heatmap in each mini-panel is the covariance matrix over fixed input locations; the curves are prior samples from that covariance.
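To make the heatmap-and-curves picture concrete, here is a minimal NumPy sketch that builds covariance matrices on a fixed grid and draws prior samples from them. The function names, the grid, the `period` default, the jitter value, and the choice of $\nu = 1/2$ for the rough Matérn kernel are all assumptions made for illustration, not details taken from the figure.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=0.22, signal_std=1.0):
    """Squared-exponential (RBF) covariance between two 1-D input arrays."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def periodic_kernel(xa, xb, length_scale=0.22, signal_std=1.0, period=0.5):
    """Exp-sine-squared covariance; the period default is a hypothetical value."""
    dist = np.abs(xa[:, None] - xb[None, :])
    return signal_std**2 * np.exp(
        -2.0 * np.sin(np.pi * dist / period) ** 2 / length_scale**2
    )

def matern12_kernel(xa, xb, length_scale=0.22, signal_std=1.0):
    """Matern kernel with nu = 1/2 (exponential), which yields rough samples."""
    dist = np.abs(xa[:, None] - xb[None, :])
    return signal_std**2 * np.exp(-dist / length_scale)

def sample_prior(kernel, x, n_samples=3, jitter=1e-6, seed=0):
    """Draw zero-mean prior samples f ~ N(0, K) through a Cholesky factor."""
    rng = np.random.default_rng(seed)
    # A small diagonal jitter keeps the factorization numerically stable.
    K = kernel(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(x), n_samples))).T

x_grid = np.linspace(0.0, 1.0, 50)          # fixed input locations
K_rbf = rbf_kernel(x_grid, x_grid)          # the matrix a heatmap would show
samples = sample_prior(rbf_kernel, x_grid)  # one row per sampled function
```

Swapping `rbf_kernel` for `matern12_kernel` or `periodic_kernel` in the last line changes only the covariance, which is the whole point: the sampling code never changes, the assumptions about the function do.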

Figure 1 · Kernel matrices and prior sample functions

Panels: prior sample functions; kernel covariance heatmap. Controls: length-scale ℓ = 0.22, signal σ_f = 1.

2. Prior to posterior by conditioning

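For noisy observations $y=f(X)+\epsilon$, $\epsilon\sim\mathcal{N}(0,\sigma_n^2I)$, and a zero prior mean, write $K=k(X,X)$ for the training covariance and $k_*=k(X,x_*)$ for the vector of covariances between the training inputs and a test point $x_*$. The posterior of $f(x_*)$ is then Gaussian with mean and variance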

$$ \mu_*(x_*) = k_*^T(K+\sigma_n^2I)^{-1}y,\qquad \sigma_*^2(x_*) = k(x_*,x_*) - k_*^T(K+\sigma_n^2I)^{-1}k_*. $$
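The two formulas above translate almost line for line into NumPy. The sketch below assumes a zero prior mean and an RBF kernel; the training data, the function names, and the use of a Cholesky factor instead of an explicit inverse are illustrative choices, not part of the original figure.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=0.22, signal_std=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test,
                 length_scale=0.22, signal_std=1.0, noise_std=0.12):
    """Posterior mean and variance of f at x_test under a zero-mean GP prior."""
    n = len(x_train)
    # K + sigma_n^2 I: covariance of the noisy observations.
    K_y = rbf_kernel(x_train, x_train, length_scale, signal_std) \
        + noise_std**2 * np.eye(n)
    # Column j holds k_* for test point j.
    K_s = rbf_kernel(x_train, x_test, length_scale, signal_std)

    L = np.linalg.cholesky(K_y)
    # alpha = (K + sigma_n^2 I)^{-1} y via two triangular-system solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha

    # k_*^T (K + sigma_n^2 I)^{-1} k_* equals the squared norm of L^{-1} k_*.
    v = np.linalg.solve(L, K_s)
    var = signal_std**2 - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

# Hypothetical observations, standing in for clicked points.
x_train = np.array([0.1, 0.4, 0.75])
y_train = np.array([0.5, -0.3, 0.8])
x_test = np.linspace(0.0, 1.0, 101)

mean, var = gp_posterior(x_train, y_train, x_test)
lower = mean - 1.96 * np.sqrt(var)  # approximate 95% band
upper = mean + 1.96 * np.sqrt(var)
```

Note that `var` is the variance of the latent function $f(x_*)$; a predictive band for a new noisy observation would add $\sigma_n^2$ to it.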

Click the plot to add an observation. Drag existing observations. The shaded band marks an approximate 95% posterior credible interval around the mean.

Figure 2 · Click-to-condition Gaussian process regression

Legend: posterior mean, 95% band, observations. Controls: length-scale ℓ = 0.22, noise σ_n = 0.12.

Figure 2b · Gaussian conditioning as a slice through a joint ellipse

Legend: joint prior over two function values, observed coordinate, conditional distribution. Controls: $x_1$ = 0.30, $x_2$ = 0.62, observed $f(x_1)$ = 1.0.
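For just two function values, the slice through the ellipse can be written out by hand. As a worked special case, assume a zero prior mean and a noise-free observation $f(x_1)=a$ (both are simplifying assumptions for this illustration), and write $k_{ij}=k(x_i,x_j)$:

$$ f(x_2)\mid f(x_1)=a \;\sim\; \mathcal{N}\!\left(\frac{k_{12}}{k_{11}}\,a,\;\; k_{22}-\frac{k_{12}^2}{k_{11}}\right). $$

The conditional mean scales the observed value by how strongly the two inputs co-vary, and the conditional variance shrinks by the share of $f(x_2)$ that $f(x_1)$ explains. When $k_{12}=0$ the observation tells us nothing and the prior $\mathcal{N}(0,k_{22})$ is returned unchanged.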

3. Length-scale as model complexity

Short length-scales let nearby observations vary independently; long length-scales force the function to move as a broad sheet. The same data can look overfit, reasonable, or underfit depending on $\ell$.
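A quick way to see this is to look at the prior correlation between two function values a fixed distance apart. The sketch below assumes an RBF kernel; the distance of 0.2 and the three length-scales are illustrative values, not the ones used in the figure.

```python
import numpy as np

def rbf_correlation(distance, length_scale):
    """Prior correlation between f(x) and f(x + distance) under an RBF kernel."""
    return np.exp(-0.5 * (distance / length_scale) ** 2)

distance = 0.2                     # hypothetical gap between two inputs
length_scales = [0.05, 0.22, 1.0]  # short, medium, long (illustrative)

correlations = {
    ell: float(rbf_correlation(distance, ell)) for ell in length_scales
}
```

With the short length-scale the two values are essentially uncorrelated, so the fit is free to chase each observation; with the long one they are almost perfectly correlated, so the function can only move as a whole.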

Figure 3 · Same observations under three length-scales

4. Marginal likelihood landscape

The log marginal likelihood scores hyperparameters by integrating out the latent function:

$$ \log p(y\mid X,\theta)= -\frac12 y^T(K_\theta+\sigma_n^2I)^{-1}y -\frac12\log|K_\theta+\sigma_n^2I| -\frac n2\log(2\pi). $$

Click the landscape to update the posterior below it. The quadratic term rewards fitting the data, the log-determinant term penalizes flexible (complex) models, and the optimum trades the two off.
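The landscape itself is just this formula evaluated on a grid of hyperparameters. The following sketch assumes an RBF kernel with fixed signal scale and scans length-scale and noise; the synthetic data, the grid ranges, and the function names are assumptions for illustration. A gradient-based optimizer would normally replace the grid search.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale, signal_std):
    """Squared-exponential covariance between two 1-D input arrays."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def log_marginal_likelihood(x, y, length_scale, signal_std, noise_std):
    """log p(y | X, theta) for a zero-mean GP with Gaussian noise."""
    n = len(x)
    K_y = rbf_kernel(x, x, length_scale, signal_std) + noise_std**2 * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    data_fit = -0.5 * y @ alpha
    # log|K_y| = 2 * sum(log diag(L)), so this equals -0.5 * log|K_y|.
    complexity = -np.sum(np.log(np.diag(L)))
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant

# Synthetic observations, used only to have something to score.
x_obs = np.array([0.05, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95])
y_obs = np.sin(2.0 * np.pi * x_obs)

length_grid = np.geomspace(0.03, 2.0, 25)
noise_grid = np.geomspace(0.01, 1.0, 25)

# Rows index noise, columns index length-scale.
landscape = np.array([
    [log_marginal_likelihood(x_obs, y_obs, ell, 1.0, s) for ell in length_grid]
    for s in noise_grid
])

i_best, j_best = np.unravel_index(np.argmax(landscape), landscape.shape)
best_noise, best_length = noise_grid[i_best], length_grid[j_best]
```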

Figure 4 · Hyperparameter learning by marginal likelihood

Figure 4b · Same data, different kernels, different assumptions

Panels: RBF, periodic, rough Matérn.

5. Two-dimensional regression toy

The same conditioning formula works over any input space. In two dimensions, the posterior mean becomes a surface. The heatmap below draws that mean; opacity fades where posterior uncertainty is high. Drag training points to reshape the surface.
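In code, moving to two dimensions only changes how distances between inputs are computed. The sketch below assumes an isotropic RBF kernel and a zero prior mean; the training points, the grid resolution, and in particular the variance-to-opacity mapping are hypothetical choices, since the figure's actual mapping is not specified.

```python
import numpy as np

def rbf_kernel_nd(Xa, Xb, length_scale=0.22, signal_std=1.0):
    """Isotropic RBF covariance for inputs stored as rows of (n, d) arrays."""
    sq_dist = np.sum((Xa[:, None, :] - Xb[None, :, :]) ** 2, axis=-1)
    return signal_std**2 * np.exp(-0.5 * sq_dist / length_scale**2)

# Hypothetical training points in the unit square.
X_train = np.array([[0.2, 0.3], [0.7, 0.6], [0.5, 0.9]])
y_train = np.array([1.0, -0.5, 0.3])
signal_std, noise_std = 1.0, 0.12

# Regular grid of test locations, flattened to shape (625, 2).
g = np.linspace(0.0, 1.0, 25)
X_grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

K_y = rbf_kernel_nd(X_train, X_train) + noise_std**2 * np.eye(len(X_train))
K_s = rbf_kernel_nd(X_train, X_grid)  # shape (3, 625)

# Same posterior formulas as in 1-D, applied to every grid point at once.
mean_surface = (K_s.T @ np.linalg.solve(K_y, y_train)).reshape(25, 25)
var_surface = (
    signal_std**2 - np.sum(K_s * np.linalg.solve(K_y, K_s), axis=0)
).reshape(25, 25)

# Assumed mapping: fully opaque at zero variance, transparent at prior variance.
opacity = 1.0 - np.clip(var_surface / signal_std**2, 0.0, 1.0)
```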

Figure 5 · 2-D posterior mean with uncertainty transparency

6. Acquisition teaser

Bayesian optimization uses the GP posterior to choose where to evaluate next. Expected improvement is high where the mean is promising, the uncertainty is large, or both.
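Expected improvement has a closed form under a Gaussian posterior. The sketch below uses the common maximization form, $\mathrm{EI} = (\mu - f^+)\,\Phi(z) + \sigma\,\phi(z)$ with $z = (\mu - f^+)/\sigma$, where $f^+$ is the best value observed so far. The maximization convention, the absence of an exploration offset, and the example numbers are assumptions for illustration.

```python
import math
import numpy as np

# Vectorized error function, so the normal CDF needs no SciPy dependency.
_erf = np.vectorize(math.erf, otypes=[float])

def expected_improvement(mean, std, best_observed):
    """Closed-form EI for maximization, given posterior mean and std."""
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    std = np.atleast_1d(np.asarray(std, dtype=float))
    improvement = mean - best_observed

    # With zero uncertainty, EI reduces to the plain (clipped) improvement.
    ei = np.maximum(improvement, 0.0)

    positive = std > 0
    z = improvement[positive] / std[positive]
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    ei[positive] = improvement[positive] * cdf + std[positive] * pdf
    return ei

# Three hypothetical candidates: poor and certain, promising and certain,
# mediocre but very uncertain.
candidate_mean = np.array([0.2, 0.9, 0.5])
candidate_std = np.array([0.05, 0.05, 0.6])
best_so_far = 0.8

ei = expected_improvement(candidate_mean, candidate_std, best_so_far)
next_query = int(np.argmax(ei))  # index of the candidate to evaluate next
```

In this example the uncertain candidate scores slightly higher than the one with the best mean, which illustrates the "or both" in the sentence above: large uncertainty alone can make a point worth querying.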

Figure 6 · Expected improvement under the current posterior

Legend: posterior mean, expected improvement, next query.

What to remember

GP prior: Every finite set of function values is jointly Gaussian.

Kernel: The covariance rule that encodes smoothness, scale, and structure.

Conditioning: Posterior mean and variance come from Gaussian conditioning.

Marginal likelihood: A data-driven score for kernel hyperparameters.

Limitations: Exact GP regression costs $O(n^3)$, and the kernel assumption matters.