Bayesian Neural Networks

Weight uncertainty, predictive distributions, Laplace approximation, and evidence in the smallest neural net that still shows the ideas.

A Bayesian neural network puts a prior on weights and conditions that prior on data. Even a tiny nonlinear model can produce a non-Gaussian posterior in weight space, while the predictive distribution is a distribution over functions. The widgets below keep the parameter space two-dimensional so the posterior can be drawn directly.

1. Posterior over a tiny neural network

The model is $f(x)=w_1\tanh(w_2x)$. The energy $M(w)=\beta E_D+\alpha E_W$ combines data error and weight decay. Larger $\alpha$ favors smaller weights; larger $\beta$ trusts the observations more. Minimizing $M$ on its own is ordinary training (squared-error loss with weight decay), and its minimizer is the MAP weight $w^*$; everything Bayesian on this page is built on that same $w^*$. Toggle the heatmap to see the two terms separately: $\beta E_D$ is the data-fit landscape (independent of $\alpha$); $\alpha E_W$ is the prior-decay landscape (concentric circles around the origin); their sum is the full $M$. On the right panel: click to add a data point, drag to move one, option/alt-click to remove.

Figure 1 · Weight posterior and predictive fan
high posterior mass MAP function posterior predictive draws

2. Laplace approximation, and what it misses

Laplace replaces the posterior with a Gaussian centred at the MAP, using the curvature of $M$ at the MAP as the inverse covariance. That curvature decomposes the same way as the energy: $A = \nabla\nabla M\big|_{w^*} = \alpha I + \beta H$, where $H = \nabla\nabla E_D$ is the data-only Hessian. The prior contributes the same $\alpha$ along every direction (the isotropic $\alpha I$ term); the data contributes more along directions where it pins the weights more tightly (large eigenvalues of $\beta H$). The dashed lines through the MAP are the eigenvectors of $A$, the ellipse's principal axes, with semi-axis $1/\sqrt{\mu_i}$ where $\mu_i = \alpha + \lambda_i$ is the corresponding eigenvalue of $A$ and $\lambda_i$ is the corresponding eigenvalue of $\beta H$.

Laplace is exact when the posterior really is Gaussian, but the model here has a sign symmetry $f(x; w_1, w_2) = f(x; -w_1, -w_2)$, so the true posterior is bimodal. Laplace sits on top of one mode and ignores the other. When a second mode is present, a faded ghost ellipse marks where Laplace would land at the reflected MAP; on the right panel, the dashed green curve shows $f(x;-w_1^*,-w_2^*)$. It coincides exactly with the solid blue MAP curve, which is why both modes give the same predictions and why predictive uncertainty is unaffected by which mode Laplace picks.

The Laplace predictive variance at input $x$ is $\sigma^2(x) = 1/\beta + g(x)^\top A^{-1} g(x)$, where $g(x) = \nabla_w f(x; w^*)$. Linearizing $f$ around $w^*$ leaves the predictive mean exactly at $f(x; w^*)$. The Bayesian treatment corrects only the spread, not the central prediction. The first term is observation noise (constant in $x$); the second is parameter uncertainty (small near where the data pins $f$, large in extrapolation). The toggle on the right panel switches between the two components and their quadrature sum.

The dashed teal ellipse is the mean-field variational fit — a Gaussian forced to be axis-aligned, $q(w) = q(w_1)\,q(w_2)$ with no covariance term. Toggle the KL direction below the panel. Reverse KL $\mathrm{KL}(q\,\|\,p)$, the one the ELBO actually minimizes, is mode-seeking: it sits on one mode and shrinks inside the true spread (its per-axis precision is the diagonal of $A$, which ignores the correlation). Forward KL $\mathrm{KL}(p\,\|\,q)$ is mass-covering: on the bimodal symmetric-data preset it stretches to straddle both modes and badly over-covers. Neither can tilt to follow the posterior's correlation. That is the lecture's point about factorized-Gaussian VI underfitting the BNN posterior.

Drag the black weight probe anywhere in the left panel to read off the function it produces, drawn in black on the right. Move it onto $+w^*$ and then $-w^*$ to watch the two curves land exactly on top of each other; place it inside versus outside the Laplace ellipse to see what one standard deviation of posterior uncertainty looks like as functions.

Figure 2 · Weight-space (left) and predictive (right): Laplace vs the truth
true posterior contour Laplace ellipse + eigen-axes MAP function f(x; w*) reflected MAP f(x; −w*) true predictive band Laplace predictive band — switch component below mean-field VI ellipse (axis-aligned) weight probe and its function

Figure 2 showed one symmetry of this tiny model, the sign flip $f(x;w_1,w_2) = f(x;-w_1,-w_2)$. A network with more than one hidden unit carries a far larger one: the hidden units can be permuted. The figure below uses a two-unit net $f(x) = w_a\tanh(w_c x) + w_b\tanh(w_d x)$. Swapping the two units leaves $f$ untouched (a sum does not care about order) but moves the weight vector to a different point. Every posterior mode therefore has $K!$ identical twins, and Laplace sits on exactly one of them. Drag a unit in the weight plane; press Swap and watch the output curve hold perfectly still.

Figure 2b · Permutation symmetry: one function, K! weight vectors
hidden unit 1 hidden unit 2 network output f(x)

3. Evidence and Occam's Hill

The log evidence $\log p(\mathcal D \mid \alpha, \beta)$ is what model comparison maximises. In MacKay's framing it is a ratio of three partition functions, $p(\mathcal D \mid \alpha, \beta) = Z_M(\alpha,\beta) / \big[Z_W(\alpha)\, Z_D(\beta)\big]$, where $Z_W(\alpha) = \int e^{-\alpha E_W} dw = (2\pi/\alpha)^{d/2}$ is the prior normaliser, $Z_D(\beta) = (2\pi/\beta)^{N/2}$ is the noise-model normaliser, and $Z_M(\alpha,\beta) = \int e^{-M(w)} dw$ is the posterior partition function. Under Laplace, $Z_M \approx e^{-M(w^*)} (2\pi)^{d/2}/\sqrt{\det A}$, so $\log p(\mathcal D \mid \alpha, \beta) \approx -M(w^*) - \tfrac12\log\det A + \tfrac d2\log\alpha + \tfrac N2\log\beta + \text{const}$.

The figure below is exactly a grid integral of $\exp(-M(w))$ over the same $(w_1, w_2)$ as Figure 1, repeated for many values of the prior precision $\alpha$. Too small $\alpha$ wastes prior mass on regions far from the data ($Z_W$ huge, $Z_M$ barely larger); too large $\alpha$ pulls the prior away from where the MAP sits ($Z_W$ tiny but $Z_M$ tiny faster). The peak is the Occam-optimal regulariser. The three insets show the weight-space posterior at small, peak, and large $\alpha$. At small $\alpha$ the posterior is broad and most of its mass sits far from the data minimum; at large $\alpha$ the posterior contracts onto the origin and away from the data minimum.

The dashed green curve overlays held-out test error (drawn on its own scale; the minimum's location is what matters). The evidence peak and the test-error minimum sit at essentially the same $\alpha$: the evidence picks the generalizing model with no validation set. They decouple when the prior or model is mismatched (try the sparse preset, or drag an outlier into Figure 1's data). Low evidence alongside still-decent test error is the lecture's signature of model–data mismatch.

Figure 3 · log evidence vs α with weight-space snapshots
log evidence fit at MAP prior + complexity penalty current α (Figure 1) held-out test error (dashed, own scale) weight-space insets at three α

4. Estimating α and β from the data

Sliding $\alpha$ until Figure 3's curve peaks is the manual version of model selection. The evidence gradient gives a closed-form iterative update: at the current MAP, compute the eigenvalues $\lambda_i$ of $\beta H$ (so the eigenvalues of $A$ are $\mu_i = \alpha + \lambda_i$), define the effective number of parameters $\gamma = \sum_i \lambda_i / (\alpha + \lambda_i) = d - \alpha\,\mathrm{tr}(A^{-1})$, then set $\alpha \leftarrow \gamma / \lVert w^* \rVert^2$ and $\beta^{-1} \leftarrow (N - \gamma)^{-1} \sum_n (y_n - f(x_n; w^*))^2$. Refit the MAP, recompute, repeat. The fixed point is the evidence-optimal $(\alpha, \beta)$.

$\gamma$ counts how many directions in weight space are determined by the data: a direction with $\lambda_i \gg \alpha$ contributes ≈ 1 (data-determined); a direction with $\lambda_i \ll \alpha$ contributes ≈ 0 (prior-determined). For this 2-parameter model $\gamma \in [0, 2]$. The $\alpha$ update can then be read as "the prior precision should be just tight enough to absorb the $\gamma$ data-determined dimensions of $w$"; the $\beta$ update is the unbiased noise estimator using $N - \gamma$ residual degrees of freedom.

Figure 4 · MacKay evidence-maximisation: α, β, γ trajectories
posterior contour at current α, β prior 1-σ circle (radius 1/√α) MAP trajectory α(t) β(t) γ(t)

5. Prior mismatch: when one precision isn't enough

The previous section assumes a single prior precision $\alpha$ covers all weights. But weights in different layers measure different things, so a single $\alpha$ implicitly couples them, and the coupling is not unit-consistent. Rescale inputs $x \to cx$: the function $f(x;w_1,w_2) = w_1\tanh(w_2 x)$ is unchanged if we also send $w_2 \to w_2/c$. A sensible prior should be invariant under this joint rescaling. But the isotropic prior $p(w) \propto \exp\big(-\tfrac\alpha 2 (w_1^2 + w_2^2)\big)$ treats $w_1$ and $w_2$ on equal footing: rescaling inputs changes the effective prior on $w_2$ but not $w_1$, which means the prior carries an arbitrary choice of input units. The cure is one precision per layer ($\alpha_{\text{out}}$ on $w_1$ and $\alpha_{\text{in}}$ on $w_2$), chosen by the same MacKay updates from §4 applied separately to each group. This is the basis of automatic relevance determination.

Before the prior story: what does each weight do on its own? The two panels below show $f(x)=w_1\tanh(w_2x)$ with one weight swept and the other fixed. $w_1$ scales the output amplitude; $w_2$ scales the input, which changes how steep the tanh is around $x=0$. A prior that constrains $w_1$ therefore constrains amplitude; a prior that constrains $w_2$ constrains how nonlinear the function can be.

w₁ swept (w₂ = 1.5): amplitude
w₂ swept (w₁ = 1.2): steepness

Per-layer prior precision now decouples these: $\alpha_{\text{in}}$ regulates $w_2$ (slope inside the tanh; how steep the nonlinearity can be); $\alpha_{\text{out}}$ regulates $w_1$ (output amplitude). The top panel draws prior-predictive functions (before seeing data); the bottom panel draws posterior-predictive functions conditioned on the same eight data points. When the prior is mis-scaled, the prior band can't even bracket the data and the posterior is pushed into a corner of weight space. Click to add a data point, drag a point, option/alt-click to remove. The top (prior) band ignores the data, so only the bottom band moves.

Figure 5 · Prior-predictive (top) vs posterior-predictive (bottom)
prior-predictive band posterior-predictive band data

6. Where Laplace stops working

Everything on this page assumes the Gaussian Laplace approximation around a single mode is a reasonable picture of the posterior. That assumption fails when the posterior is genuinely multimodal in ways that do change the predictive distribution, when the network is deep enough that the central-limit argument no longer justifies Gaussianity, or when the dataset is too big to compute the full Hessian. The alternatives: sample the posterior, or optimize a tractable approximation to it.

Hamiltonian Monte Carlo (HMC). Augments $w$ with a momentum vector and integrates Hamiltonian dynamics on $-\log p(w \mid \mathcal D) + \tfrac12 \lVert p\rVert^2$ using leapfrog steps, then accepts/rejects via Metropolis. Gradient-informed trajectories let it move long distances per accepted step and explore high-dimensional posteriors that random-walk MH cannot. Covered with an interactive on the sampling-methods page, including a "BNN posterior target" preset that uses the same energy surface as Figure 1 here.

Stochastic Gradient Langevin Dynamics (SGLD). The minibatch-scale relative of HMC. Each step is an SGD update with carefully scaled Gaussian noise injected: $w_{t+1} = w_t - \tfrac{\eta_t}2 \nabla \tilde M(w_t) + \sqrt{\eta_t}\,\xi_t$ with $\xi_t \sim \mathcal N(0, I)$, where $\nabla \tilde M$ is the stochastic-gradient estimator on a minibatch. In the small-step-size limit the iterates form a Markov chain whose stationary distribution is the posterior $p(w \mid \mathcal D)$. Unlike HMC it does not require a full-batch gradient or a Metropolis correction, which is why it scales to real-world neural nets. SGLD belongs alongside HMC on the sampling-methods page; it is not implemented there yet.

Two non-sampling routes round out the picture. Variational inference fits a tractable $q(w)$ by optimization rather than sampling, though a fully factorized Gaussian $q$ tends to underfit the weight posterior's correlations and multimodality. MC-Dropout reuses dropout at test time: each forward pass drops a random set of units, and the spread across passes is a cheap posterior approximation. Monte Carlo noise injection is its close cousin: instead of dropping units it perturbs the weights with noise on each pass, reading uncertainty off the same forward-pass spread.

The figure runs MC-Dropout on a 14-unit network fitted to the red points. Each test-time forward pass keeps each hidden unit only with probability $1-p$ (kept units rescaled by $1/(1-p)$), so every pass is a different thinned sub-network. The spread of those sub-networks is the predictive uncertainty, wide in extrapolation and narrow where the data pins the fit. Slide $p$ and watch the band inflate with no new data: unlike the Laplace band, whose width comes from the data through the Hessian, the MC-Dropout band's width is set by the free knob $p$. Switch the perturbation to MC noise injection and the same picture emerges from a different mechanism: Gaussian noise added to every weight on each pass instead of dropped units.

Figure 6 · MC-Dropout and noise injection: a band from random perturbations
training data full network (unperturbed) perturbed sub-networks + predictive band

What next

Bayesian neural nets share computation and function-space intuition with other Bayesian pages.