Free Energy & Variational Inference

How an intractable Bayesian posterior turns into an optimization problem.

Bayesian inference asks: given data $y$ and a model $p(y,\theta) = p(y\mid\theta)\,p(\theta)$, what is the posterior $p(\theta\mid y)$? In principle you just apply Bayes: $p(\theta\mid y) = p(y,\theta)/p(y)$. In practice the marginal evidence $p(y) = \int p(y,\theta)\,d\theta$ is a high-dimensional integral that is almost never available in closed form. Variational inference sidesteps the integral by replacing "find the posterior" with "find the closest tractable distribution to the posterior", turning inference into optimization.

The posterior is the prior, tilted. Bayes' rule has a one-line measure-theoretic reading: the posterior measure is absolutely continuous with respect to the prior, with Radon–Nikodym derivative $$ \frac{dP_{\theta\mid y}}{dP_\theta}(\theta) \;\propto\; p(y\mid\theta). $$ Updating doesn't create probability out of nothing; it reweights the prior by the likelihood and renormalizes. Variational inference is for the case where the normalizer of this tilt, $p(y)$, is intractable, so we minimize a KL gap to an approximating $q$ instead. Conjugate cases (see named distributions) are the ones where the tilt stays inside a finite-dimensional exponential family and updates have closed form.

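The closeness measure is the Kullback-Leibler divergence, and the quantity we actually optimize is the variational free energy $F(q,y)$. On this page $F$ is signed so that it is maximized, which makes it the same object as the evidence lower bound (ELBO); the statistical-physics convention flips the sign and minimizes. Together they sit inside the single identity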

$$ \ln p(y) \;=\; \underbrace{\mathrm{KL}\!\bigl[\,q(\theta)\,\Vert\,p(\theta\mid y)\,\bigr]}_{\geq 0} \;+\; F(q,y). $$

This page uses KL as a building block. For the standalone intuition behind KL, including categorical examples and forward-vs-reverse behavior, see KL Divergence. The interactive figures below show how the same directed gap becomes an inference algorithm.

1. KL inside variational inference

For two densities $q$ and $p$ on the same space, the KL divergence is

$$ \mathrm{KL}[q\,\Vert\,p] \;=\; \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta)}\,d\theta \;=\; \mathbb{E}_q\!\left[\ln\frac{q(\theta)}{p(\theta)}\right]. $$

It is non-negative, zero only when $q=p$ almost everywhere, and directed: $\mathrm{KL}[q\Vert p] \neq \mathrm{KL}[p\Vert q]$ in general. Variational inference uses $\mathrm{KL}[q\Vert p(\theta\mid y)]$, the reverse or mode-seeking direction, because that is the one that drops out of the algebra below.
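For two Gaussians the integral has a well-known closed form, which makes the asymmetry easy to see. The following is a minimal sketch, assuming NumPy; the function names `kl_gauss` and `kl_numeric` and the parameter values are illustrative choices, not part of the page's figures.

```python
import numpy as np

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL[q || p] for q = N(mu_q, sig_q^2), p = N(mu_p, sig_p^2)."""
    return (np.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2.0 * sig_p**2)
            - 0.5)

def kl_numeric(mu_q, sig_q, mu_p, sig_p, lo=-30.0, hi=30.0, n=200_001):
    """Riemann-sum approximation of the integral of q * ln(q / p)."""
    theta = np.linspace(lo, hi, n)
    log_q = -0.5 * np.log(2 * np.pi * sig_q**2) - (theta - mu_q)**2 / (2 * sig_q**2)
    log_p = -0.5 * np.log(2 * np.pi * sig_p**2) - (theta - mu_p)**2 / (2 * sig_p**2)
    integrand = np.exp(log_q) * (log_q - log_p)   # the log-ratio weighted by q
    return float(np.sum(integrand) * (theta[1] - theta[0]))

# Illustrative parameters (assumed): q = N(0, 1), p = N(1, 2^2).
kl_qp = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL[q || p]
kl_pq = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL[p || q], the other direction

print(f"KL[q||p] = {kl_qp:.4f}, KL[p||q] = {kl_pq:.4f}")
```

Swapping the arguments changes both the weighting density and the log-ratio, so the two directions generally give different numbers.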

Drag $q$ and $p$. The shaded gap between the curves on the bottom plot is the integrand $q(\theta)\ln\bigl(q(\theta)/p(\theta)\bigr)$, that is, the log-ratio $\ln(q/p)$ weighted by $q$. The point for VI is that changing $q$ changes both where the approximation puts mass and where the mismatch is measured.

Figure 1 · $\mathrm{KL}[q\Vert p]$ between two Gaussians. Legend: $q(\theta)$, $p(\theta)$, integrand of KL.
Figure 1a · Reverse KL mode-seeking on a bimodal target. Legend: target posterior $p(\theta\mid y)$, single-Gaussian $q(\theta)$, objective landscape.
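The mode-seeking behavior can be reproduced with a brute-force search. This is a sketch under stated assumptions: the bimodal target (an equal mixture of $\mathcal{N}(-2, 0.5^2)$ and $\mathcal{N}(2, 0.5^2)$), the grids, and all names are hypothetical and are not the settings used by the figure. It minimizes the reverse KL over single Gaussians and, for contrast, computes the moment-matched Gaussian, which is the forward-KL optimum within the Gaussian family.

```python
import numpy as np

theta = np.linspace(-16.0, 16.0, 6401)
d = theta[1] - theta[0]

def log_normal(x, mu, sig):
    return -0.5 * np.log(2 * np.pi * sig**2) - (x - mu)**2 / (2 * sig**2)

# Hypothetical bimodal target: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2).
log_p = np.logaddexp(log_normal(theta, -2.0, 0.5),
                     log_normal(theta, 2.0, 0.5)) + np.log(0.5)

def reverse_kl(mu, sig):
    """KL[q || p] for q = N(mu, sig^2), by quadrature on the theta grid."""
    log_q = log_normal(theta, mu, sig)
    return float(np.sum(np.exp(log_q) * (log_q - log_p)) * d)

# Grid search over the variational parameters (mu_q, sigma_q).
mus = np.linspace(-4.0, 4.0, 81)
sigs = np.linspace(0.2, 3.0, 57)
kl = np.array([[reverse_kl(m, s) for s in sigs] for m in mus])
i, j = np.unravel_index(np.argmin(kl), kl.shape)
mu_best, sig_best = mus[i], sigs[j]

# Moment matching covers both modes instead of choosing one.
p = np.exp(log_p)
mean_p = np.sum(theta * p) * d
std_p = np.sqrt(np.sum((theta - mean_p)**2 * p) * d)

print(f"reverse-KL optimum: mu={mu_best:.2f}, sigma={sig_best:.2f}")
print(f"moment-matched:     mu={mean_p:.2f}, sigma={std_p:.2f}")
```

The reverse-KL optimum sits on one of the two modes with roughly that mode's width, because placing $q$ mass in the valley between the modes, where $p$ is tiny, is penalized heavily. Which mode the grid search reports is a tie-break between two equally good solutions.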

2. The variational identity

The log-evidence is a log-partition function. Define the energy $E(\theta) = -\ln p(y,\theta)$. Then the marginal evidence is literally a partition function over parameters: $$ p(y) \;=\; \int p(y,\theta)\,d\theta \;=\; \int e^{-E(\theta)}\,d\theta \;=\; Z, $$ and the posterior is the Gibbs distribution at temperature $1$: $p(\theta\mid y) = e^{-E(\theta)}/Z$. The Legendre-duality identity $\ln Z = \sup_q\!\bigl(\mathbb{E}_q[-E] + \mathrm{H}[q]\bigr)$ is then exactly the ELBO statement: the bracket is $F(q,y)$, and the supremum is attained at $q = p(\theta\mid y)$. So variational inference is statistical mechanics on probability distributions: maximizing $F$ is the same as minimizing the physicist's free energy $\mathbb{E}_q[E] - \mathrm{H}[q] = -F$, with the posterior as the equilibrium and the KL gap as the excess free energy.
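The duality is easy to test on a toy problem. The sketch below assumes a discrete parameter with five states, so the integral becomes a sum; the energies are random placeholders and the names `bound`, `gibbs`, and `competitors` are illustrative. It checks that $\mathbb{E}_q[-E] + \mathrm{H}[q]$ equals $\ln Z$ at the Gibbs distribution and stays below it for randomly drawn alternatives.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=5)              # hypothetical energies E_i = -ln p(y, theta_i)

# ln Z via a numerically stable log-sum-exp.
m = np.max(-E)
log_Z = m + np.log(np.sum(np.exp(-E - m)))

def bound(q):
    """E_q[-E] + H(q) for a strictly positive probability vector q."""
    return float(np.sum(q * (-E)) - np.sum(q * np.log(q)))

gibbs = np.exp(-E - log_Z)          # the discrete "posterior" e^{-E} / Z

# Random competitors drawn from a flat Dirichlet should never exceed ln Z.
competitors = rng.dirichlet(np.ones(5), size=1000)
best_other = max(bound(q) for q in competitors)

print(f"ln Z             = {log_Z:.6f}")
print(f"bound at Gibbs   = {bound(gibbs):.6f}")
print(f"best random q    = {best_other:.6f}")
```

The gap between `log_Z` and `bound(q)` for any competitor is the KL divergence from that `q` to the Gibbs distribution.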

Start from Bayes' rule, $p(y,\theta) = p(\theta\mid y)\,p(y)$, take logs, and play the classic multiply-and-divide-by-$q(\theta)$ trick:

$\displaystyle \ln p(y) \;=\; \ln\frac{p(y,\theta)}{p(\theta\mid y)}$
Bayes' rule, rearranged

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\frac{p(y,\theta)}{p(\theta\mid y)}\,d\theta$
Multiply by $q(\theta)$ and integrate; $\ln p(y)$ is constant in $\theta$, $\int q = 1$

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\!\left[\frac{p(y,\theta)}{p(\theta\mid y)}\cdot\frac{q(\theta)}{q(\theta)}\right]d\theta$
Multiply and divide by $q(\theta)$ inside the log

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta\mid y)}\,d\theta \;+\; \int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta$
Split the log of a product

$\displaystyle \phantom{\ln p(y)} \;=\; \underbrace{\mathrm{KL}\!\bigl[q\,\Vert\,p(\cdot\mid y)\bigr]}_{\color{#b8412a}\text{divergence}\geq 0} \;+\; \underbrace{F(q,y)}_{\color{#1f4a8c}\text{free energy}}$
Read off the two pieces
Figure 1b · Step through the variational identity. Legend: $q(\theta)$, true posterior, KL, $F$.

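Two consequences follow. First, because the KL term is non-negative, $F(q,y) \le \ln p(y)$ for every $q$, with equality exactly when $q = p(\cdot\mid y)$ almost everywhere; this is why $F$ is called the evidence lower bound. Second, because $\ln p(y)$ does not depend on $q$, maximizing $F$ over $q$ is the same as minimizing $\mathrm{KL}[q\Vert p(\cdot\mid y)]$, and evaluating $F$ needs only the joint $p(y,\theta)$, never the intractable $p(y)$.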

This is the picture from the slides: log-evidence is a fixed ceiling, $F$ rises toward it as we optimize, and the leftover gap is exactly the KL.

3. Visualizing the decomposition

Take a Bayesian inference problem with a closed-form posterior, so we have a ground truth to compare against. Model:

$$ \theta \sim \mathcal{N}(\mu_0, \sigma_0^2),\qquad y \mid \theta \sim \mathcal{N}(\theta, \sigma^2_{\!\text{lik}}). $$

With one observation $y$, the true posterior is $p(\theta\mid y) = \mathcal{N}(\mu^\star, \sigma^{\star 2})$ with $\sigma^{\star 2}=(1/\sigma_0^2+1/\sigma^2_{\!\text{lik}})^{-1}$ and $\mu^\star = \sigma^{\star 2}(\mu_0/\sigma_0^2 + y/\sigma^2_{\!\text{lik}})$. We pick a variational family $q(\theta)=\mathcal{N}(\mu_q,\sigma_q^2)$ and watch the identity $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$ hold for every choice of $(\mu_q,\sigma_q)$, even bad ones.
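The identity can be checked numerically for this model. The sketch below assumes NumPy and illustrative parameter values ($\mu_0 = 0$, $\sigma_0 = 2$, $\sigma_{\text{lik}} = 1$, $y = 1.5$) that are not taken from the figure. It computes the KL and $F$ separately by quadrature and compares their sum with the closed-form log-evidence, using the fact that marginally $y \sim \mathcal{N}(\mu_0, \sigma_0^2 + \sigma^2_{\text{lik}})$.

```python
import numpy as np

# Illustrative values (assumptions, not taken from the figure).
mu0, sig0 = 0.0, 2.0        # prior N(mu0, sig0^2)
sig_lik = 1.0               # likelihood noise
y = 1.5                     # the single observation

# Closed-form posterior and evidence for the Gaussian-Gaussian model.
post_var = 1.0 / (1.0 / sig0**2 + 1.0 / sig_lik**2)
post_mu = post_var * (mu0 / sig0**2 + y / sig_lik**2)
ev_var = sig0**2 + sig_lik**2
log_evidence = -0.5 * np.log(2 * np.pi * ev_var) - (y - mu0)**2 / (2 * ev_var)

def log_normal(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu)**2 / (2 * var)

def decompose(mu_q, sig_q, n=400_001):
    """Return (KL[q || posterior], F(q, y)), each by quadrature on a grid."""
    theta = np.linspace(mu_q - 12 * sig_q, mu_q + 12 * sig_q, n)
    d = theta[1] - theta[0]
    log_q = log_normal(theta, mu_q, sig_q**2)
    q = np.exp(log_q)
    log_joint = log_normal(y, theta, sig_lik**2) + log_normal(theta, mu0, sig0**2)
    log_post = log_normal(theta, post_mu, post_var)
    kl = np.sum(q * (log_q - log_post)) * d
    F = np.sum(q * (log_joint - log_q)) * d
    return float(kl), float(F)

# The exact posterior, then two deliberately poor choices of q.
for mu_q, sig_q in [(post_mu, np.sqrt(post_var)), (-3.0, 0.3), (4.0, 5.0)]:
    kl, F = decompose(mu_q, sig_q)
    print(f"mu_q={mu_q:+.2f} sig_q={sig_q:.2f}  KL={kl:.4f}  F={F:.4f}  KL+F={kl + F:.4f}")
print(f"ln p(y) = {log_evidence:.4f}")
```

Every row should report the same `KL+F`, equal to `ln p(y)` up to quadrature error; only the split between the two terms changes with $q$.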

The bar on the right of the figure shows the decomposition. The ceiling $\ln p(y)$ is constant (it depends on the data and model, not on $q$). As you move $q$ closer to the true posterior, the red KL band shrinks and the blue free-energy band rises to meet the ceiling. The "Optimize" button runs gradient ascent on $F(q,y)$; you'll see $q$ settle onto the posterior.
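A gradient ascent of this kind can be sketched in a few lines, because $F$ has a closed form for the Gaussian-Gaussian model. This is not the page's implementation: the parameter values, the step size, and the choice to optimize $\ln\sigma_q$ rather than $\sigma_q$ (which keeps the scale positive) are all assumptions made for the example.

```python
import numpy as np

mu0, sig0, sig_lik, y = 0.0, 2.0, 1.0, 1.5   # illustrative values (assumed)

prec = 1.0 / sig0**2 + 1.0 / sig_lik**2      # posterior precision
post_mu = (mu0 / sig0**2 + y / sig_lik**2) / prec
post_sig = prec ** -0.5

def free_energy(mu_q, log_sig_q):
    """F(q, y) in closed form for q = N(mu_q, exp(log_sig_q)^2)."""
    var_q = np.exp(2.0 * log_sig_q)
    fit = -0.5 * np.log(2 * np.pi * sig_lik**2) - ((y - mu_q)**2 + var_q) / (2 * sig_lik**2)
    prior = -0.5 * np.log(2 * np.pi * sig0**2) - ((mu_q - mu0)**2 + var_q) / (2 * sig0**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * var_q)
    return fit + prior + entropy

def grad(mu_q, log_sig_q):
    """Analytic gradient of F with respect to (mu_q, log sigma_q)."""
    var_q = np.exp(2.0 * log_sig_q)
    d_mu = (y - mu_q) / sig_lik**2 - (mu_q - mu0) / sig0**2
    d_log_sig = 1.0 - var_q * prec
    return d_mu, d_log_sig

mu_q, log_sig_q = -3.0, np.log(3.0)          # a deliberately bad starting q
step = 0.1
history = []
for _ in range(500):
    history.append(free_energy(mu_q, log_sig_q))
    g_mu, g_ls = grad(mu_q, log_sig_q)
    mu_q += step * g_mu                      # ascent: move along the gradient
    log_sig_q += step * g_ls

print(f"fitted q:  mu={mu_q:.4f}, sigma={np.exp(log_sig_q):.4f}")
print(f"posterior: mu={post_mu:.4f}, sigma={post_sig:.4f}")
```

Because the Gaussian family contains the true posterior here, the ascent ends with zero KL gap; with a richer model the same loop would stop at the best member of the family instead.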

Figure 2 · $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$. Legend: prior $p(\theta)$, likelihood $p(y\mid\theta)$, true posterior $p(\theta\mid y)$, variational $q(\theta)$.

4. Two ways to read the free energy

The free energy admits a second decomposition that is often more useful for computation. Starting from its definition,

$$ F(q,y) = \int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta = \underbrace{\mathbb{E}_q[\ln p(y,\theta)]}_{\text{expected log-joint}} + \underbrace{\mathrm{H}[q]}_{\text{entropy of }q}. $$

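Maximizing $F$ trades two pressures. The expected log-joint rewards putting mass where $p(y,\theta)$ is large, which on its own would collapse $q$ onto the mode of the joint. The entropy rewards spreading mass out, which on its own would make $q$ as diffuse as possible.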

Equivalently, and this is the form people optimize in practice, $F(q,y) = \mathbb{E}_q[\ln p(y\mid\theta)] - \mathrm{KL}[q\Vert p(\theta)]$: fit the data, but stay close to the prior.
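The second form follows by splitting $\ln p(y,\theta) = \ln p(y\mid\theta) + \ln p(\theta)$ and grouping the prior term with the entropy. One common way to use it, sketched below under assumed parameter values, is to estimate the fit term by Monte Carlo with samples $\theta = \mu_q + \sigma_q\varepsilon$, $\varepsilon\sim\mathcal{N}(0,1)$, while keeping the KL to the prior in closed form. The example compares that estimate with the exact entropy form for the Gaussian-Gaussian model; the sample size and seed are arbitrary.

```python
import numpy as np

mu0, sig0, sig_lik, y = 0.0, 2.0, 1.0, 1.5   # illustrative values (assumed)
mu_q, sig_q = 0.8, 0.7                        # an arbitrary q = N(mu_q, sig_q^2)

def kl_to_prior(mu_q, sig_q):
    """Closed-form KL[q || p(theta)] between two Gaussians."""
    return np.log(sig0 / sig_q) + (sig_q**2 + (mu_q - mu0)**2) / (2 * sig0**2) - 0.5

def log_lik(theta):
    return -0.5 * np.log(2 * np.pi * sig_lik**2) - (y - theta)**2 / (2 * sig_lik**2)

# Prior-KL form, with the fit term estimated from samples of q.
rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)
theta = mu_q + sig_q * eps
F_mc = float(np.mean(log_lik(theta)) - kl_to_prior(mu_q, sig_q))

# Entropy form, in closed form: E_q[ln p(y, theta)] + H[q].
exp_log_lik = -0.5 * np.log(2 * np.pi * sig_lik**2) - ((y - mu_q)**2 + sig_q**2) / (2 * sig_lik**2)
exp_log_prior = -0.5 * np.log(2 * np.pi * sig0**2) - ((mu_q - mu0)**2 + sig_q**2) / (2 * sig0**2)
entropy = 0.5 * np.log(2 * np.pi * np.e * sig_q**2)
F_exact = float(exp_log_lik + exp_log_prior + entropy)

print(f"F (Monte Carlo, prior-KL form) = {F_mc:.4f}")
print(f"F (exact, entropy form)        = {F_exact:.4f}")
```

The two numbers agree up to Monte Carlo noise, which shrinks as the number of samples grows.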

The two contributions appear side by side as you change $q$. Watch the trade-off: shrinking $\sigma_q$ raises the fit term (if $\mu_q$ is in the right place) but lowers the entropy. The optimum balances them.

Figure 3 · $F = \mathbb{E}_q[\ln p(y,\theta)] + \mathrm{H}[q]$
Figure 4 · Mean-field VI underestimates correlated posterior variance. Legend: true correlated posterior, axis-aligned mean-field $q_1q_2$, marginal variance kept by reverse KL.
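The variance claim in the Figure 4 caption can be checked in closed form for a Gaussian target. The sketch below assumes a zero-mean bivariate Gaussian posterior with unit marginal variances and correlation $\rho = 0.9$ (values chosen for illustration). For such a target, the reverse-KL optimum over axis-aligned Gaussians has variances $1/\Lambda_{ii}$, where $\Lambda$ is the precision matrix; the code confirms this with a brute-force grid and compares against the true marginal variances.

```python
import numpy as np

rho = 0.9                                   # illustrative correlation (assumed)
Sigma = np.array([[1.0, rho], [rho, 1.0]])  # true posterior covariance
Lam = np.linalg.inv(Sigma)                  # precision matrix

def reverse_kl(var1, var2):
    """KL[q || p] for q = N(0, diag(var1, var2)) and p = N(0, Sigma)."""
    D = np.diag([var1, var2])
    return 0.5 * (np.trace(Lam @ D) - 2.0
                  + np.log(np.linalg.det(Sigma)) - np.log(var1 * var2))

# Stationary point of the reverse KL over diagonal covariances.
mf_var = 1.0 / np.diag(Lam)

# Brute-force check on a grid of candidate variances (step 0.01).
grid = np.linspace(0.02, 1.5, 149)
kl_grid = np.array([[reverse_kl(a, b) for b in grid] for a in grid])
i, j = np.unravel_index(np.argmin(kl_grid), kl_grid.shape)

print("true marginal variances:", np.diag(Sigma))
print("mean-field variances:   ", mf_var)
print("grid-search optimum:    ", grid[i], grid[j])
```

With $\rho = 0.9$ the mean-field variance is $1-\rho^2 = 0.19$ against a true marginal variance of $1$: the factorized $q$ matches the conditional width of the posterior, not its marginal width.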

The heatmap below shows $F(q,y)$ as a function of $(\mu_q, \sigma_q)$ for the same Gaussian-Gaussian model. Click anywhere on the landscape to place $q$ there; the optimum's $(\mu^\star,\sigma^\star)$ is marked with a crosshair, an arrow shows the gradient direction at your current $q$, and the inset on the right plots that $q$ against the true posterior on the $\theta$ axis. The readout reports $F$, the constant ceiling $\ln p(y)$, and their gap, which is exactly $\mathrm{KL}[q\Vert p(\cdot\mid y)]$. As you slide $y$ or the likelihood noise, the whole landscape shifts.

Figure 5 · ELBO landscape over variational parameters. Legend: higher ELBO, posterior marker, optimizer trajectory, current $q$, $\nabla F$ direction.

Where this goes next

$\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$. Everything from the mean-field updates of a topic model to the loss function of a VAE is a tactic for making one side of that equation easy to compute.

What next

Variational inference sits between measure-theoretic identities and sampling-based computation.