Free Energy & Variational Inference
Bayesian inference asks: given data $y$ and a model $p(y,\theta) = p(y\mid\theta)\,p(\theta)$, what is the posterior $p(\theta\mid y)$? In principle you just apply Bayes: $p(\theta\mid y) = p(y,\theta)/p(y)$. In practice the marginal evidence $p(y) = \int p(y,\theta)\,d\theta$ is a high-dimensional integral that is almost never available in closed form. Variational inference sidesteps the integral by replacing "find the posterior" with "find the closest tractable distribution to the posterior", turning inference into optimization.
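The closeness measure is the Kullback-Leibler divergence, and the quantity we actually optimize is the variational free energy $F(q,y)$. On this page $F$ is signed so that it coincides with the evidence lower bound (ELBO) and is maximized; the physics-style free energy that is minimized is $-F$. Together they sit inside the single identity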
$$ \ln p(y) \;=\; \underbrace{\mathrm{KL}\!\bigl[\,q(\theta)\,\Vert\,p(\theta\mid y)\,\bigr]}_{\geq 0} \;+\; F(q,y). $$
This page uses KL as a building block. For the standalone intuition behind KL, including categorical examples and forward-vs-reverse behavior, see KL Divergence. Three interactive figures below show how the same directed gap becomes an inference algorithm.
1. KL inside variational inference
For two densities $q$ and $p$ on the same space, the KL divergence is
$$ \mathrm{KL}[q\,\Vert\,p] \;=\; \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta)}\,d\theta \;=\; \mathbb{E}_q\!\left[\ln\frac{q(\theta)}{p(\theta)}\right]. $$
It is non-negative, zero only when $q=p$ almost everywhere, and directed: $\mathrm{KL}[q\Vert p] \neq \mathrm{KL}[p\Vert q]$ in general. Variational inference uses $\mathrm{KL}[q\Vert p(\theta\mid y)]$, the reverse or mode-seeking direction, because that is the one that drops out of the algebra below.
Drag $q$ and $p$. The shaded gap between the curves on the bottom plot is the integrand $q(\theta)\ln\bigl(q(\theta)/p(\theta)\bigr)$, that is, the log-ratio $\ln(q/p)$ weighted by $q$. The point for VI is that changing $q$ changes both where the approximation puts mass and where the mismatch is measured.
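As a rough numerical companion to the definition above, the following sketch evaluates $\mathrm{KL}[q\Vert p]$ for two univariate Gaussians in two ways: the standard closed form, and a plain Riemann sum of the integrand $q\ln(q/p)$. The function names and the example values ($q=\mathcal{N}(0,1)$, $p=\mathcal{N}(1,2^2)$) are illustrative assumptions, not taken from the page's figure.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def kl_closed_form(mu_q, sigma_q, mu_p, sigma_p):
    """KL[q || p] for univariate Gaussians q and p (standard closed form)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def kl_numeric(mu_q, sigma_q, mu_p, sigma_p, n=20001, width=12.0):
    """Riemann-sum approximation of the integral of q * ln(q / p).

    The grid covers mu_q +/- width * sigma_q, where q holds essentially
    all of its mass; outside that window the integrand is negligible.
    """
    lo = mu_q - width * sigma_q
    step = 2.0 * width * sigma_q / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        log_q = normal_logpdf(theta, mu_q, sigma_q)
        log_p = normal_logpdf(theta, mu_p, sigma_p)
        # Working with log-densities avoids log(0) when p underflows.
        total += math.exp(log_q) * (log_q - log_p) * step
    return total

# Hypothetical example values: q = N(0, 1), p = N(1, 2^2).
kl_qp = kl_closed_form(0.0, 1.0, 1.0, 2.0)
kl_pq = kl_closed_form(1.0, 2.0, 0.0, 1.0)
kl_qp_num = kl_numeric(0.0, 1.0, 1.0, 2.0)
print(f"KL[q||p]: closed form {kl_qp:.4f}, numeric {kl_qp_num:.4f}")
print(f"KL[p||q]: closed form {kl_pq:.4f}")
```

Swapping the arguments gives a different number, which is the directedness described above.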
2. The variational identity
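Start from Bayes' rule, $p(y,\theta) = p(\theta\mid y)\,p(y)$, take logs, average over $q$ (which leaves the constant $\ln p(y)$ unchanged), and play the classic multiply-and-divide-by-$q(\theta)$ trick:
$$ \begin{aligned} \ln p(y) &= \int q(\theta)\,\ln\frac{p(y,\theta)}{p(\theta\mid y)}\,d\theta \\ &= \int q(\theta)\,\ln\!\left[\frac{p(y,\theta)}{q(\theta)}\cdot\frac{q(\theta)}{p(\theta\mid y)}\right]d\theta \\ &= \underbrace{\int q(\theta)\,\ln\frac{q(\theta)}{p(\theta\mid y)}\,d\theta}_{\mathrm{KL}[q\Vert p(\cdot\mid y)]} \;+\; \underbrace{\int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta}_{F(q,y)}. \end{aligned} $$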
Two consequences follow:
- Because KL is non-negative, the identity gives $F(q,y) \le \ln p(y)$, with equality exactly when $q$ is the posterior. The free energy is a lower bound on the log evidence, hence the name Evidence Lower BOund (ELBO).
- Because $\ln p(y)$ is constant in $q$, maximizing $F(q,y)$ is equivalent to minimizing $\mathrm{KL}[q\Vert p(\cdot\mid y)]$. We have turned an intractable integral into a tractable optimization.
This is the picture from the slides: log-evidence is a fixed ceiling, $F$ rises toward it as we optimize, and the leftover gap is exactly the KL.
3. Visualizing the decomposition
Take a Bayesian inference problem with a closed-form posterior, so we have a ground truth to compare against. Model:
$$ \theta \sim \mathcal{N}(\mu_0, \sigma_0^2),\qquad y \mid \theta \sim \mathcal{N}(\theta, \sigma^2_{\!\text{lik}}). $$
With one observation $y$, the true posterior is $p(\theta\mid y) = \mathcal{N}(\mu^\star, \sigma^{\star 2})$ with $\sigma^{\star 2}=(1/\sigma_0^2+1/\sigma^2_{\!\text{lik}})^{-1}$ and $\mu^\star = \sigma^{\star 2}(\mu_0/\sigma_0^2 + y/\sigma^2_{\!\text{lik}})$. We pick a variational family $q(\theta)=\mathcal{N}(\mu_q,\sigma_q^2)$ and watch the identity $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$ hold for every choice of $(\mu_q,\sigma_q)$, even bad ones.
The bar on the right of the figure shows the decomposition. The ceiling $\ln p(y)$ is constant (it depends on the data and model, not on $q$). As you move $q$ closer to the true posterior, the red KL band shrinks and the blue free-energy band grows to fill the space up to the ceiling. The "Optimize" button runs gradient ascent on $F(q,y)$; you'll see $q$ settle onto the posterior.
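The decomposition can also be checked numerically without the figure. The sketch below uses the closed forms available for this Gaussian-Gaussian model: the marginal $y\sim\mathcal{N}(\mu_0,\sigma_0^2+\sigma^2_{\text{lik}})$ for the ceiling, Gaussian expectations for $F$, and the Gaussian KL formula for the gap. The helper names and the numbers ($y=2$, $\mu_0=0$, unit variances) are assumptions chosen for illustration.

```python
import math

def log_normal(x, mu, var):
    """Log-density of N(mu, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def posterior(y, mu0, var0, var_lik):
    """Exact posterior mean and variance for one observation y."""
    var_star = 1.0 / (1.0 / var0 + 1.0 / var_lik)
    mu_star = var_star * (mu0 / var0 + y / var_lik)
    return mu_star, var_star

def log_evidence(y, mu0, var0, var_lik):
    """ln p(y): marginally, y ~ N(mu0, var0 + var_lik)."""
    return log_normal(y, mu0, var0 + var_lik)

def free_energy(mu_q, var_q, y, mu0, var0, var_lik):
    """F(q, y) = E_q[ln p(y|theta)] + E_q[ln p(theta)] + H[q] for Gaussian q."""
    exp_log_lik = (-0.5 * math.log(2.0 * math.pi * var_lik)
                   - ((y - mu_q) ** 2 + var_q) / (2.0 * var_lik))
    exp_log_prior = (-0.5 * math.log(2.0 * math.pi * var0)
                     - ((mu_q - mu0) ** 2 + var_q) / (2.0 * var0))
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * var_q)
    return exp_log_lik + exp_log_prior + entropy

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)]."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Hypothetical settings, not taken from the figure.
y, mu0, var0, var_lik = 2.0, 0.0, 1.0, 1.0
mu_star, var_star = posterior(y, mu0, var0, var_lik)
ceiling = log_evidence(y, mu0, var0, var_lik)

residuals = []
free_energies = []
candidates = [(mu_star, math.sqrt(var_star)), (0.0, 1.0), (-3.0, 0.2), (5.0, 4.0)]
for mu_q, sigma_q in candidates:
    f = free_energy(mu_q, sigma_q ** 2, y, mu0, var0, var_lik)
    kl = kl_gauss(mu_q, sigma_q ** 2, mu_star, var_star)
    free_energies.append(f)
    residuals.append(ceiling - (kl + f))
    print(f"mu_q={mu_q:+.2f} sigma_q={sigma_q:.2f}  F={f:+.4f}  KL={kl:.4f}  "
          f"ln p(y)={ceiling:+.4f}")
```

Every row should reproduce the same ceiling, including the deliberately poor choices of $q$ at the end of the list.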
4. Two ways to read the free energy
The free energy admits a second decomposition that is often more useful for computation. Starting from its definition,
$$ F(q,y) = \int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta = \underbrace{\mathbb{E}_q[\ln p(y,\theta)]}_{\text{expected log-joint}} + \underbrace{\mathrm{H}[q]}_{\text{entropy of }q}. $$
Maximizing $F$ trades two pressures:
- The expected log-joint $\mathbb{E}_q[\ln p(y,\theta)]$ pulls $q$ toward regions where the model thinks the data are likely, where high prior meets high likelihood. It is a "fit" term.
- The entropy $\mathrm{H}[q] = -\int q(\theta)\ln q(\theta)\,d\theta$ pushes $q$ to spread out. It is a "don't be overconfident" term.
Equivalently, and this is the form people optimize in practice, $F(q,y) = \mathbb{E}_q[\ln p(y\mid\theta)] - \mathrm{KL}[q\Vert p(\theta)]$: fit the data, but stay close to the prior.
The two contributions appear side by side as you change $q$. Watch the trade-off: shrinking $\sigma_q$ raises the fit term (if $\mu_q$ is in the right place) but lowers the entropy. The optimum balances them.
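To make the trade-off concrete, here is a small sketch that computes both readings of $F$ for the Gaussian-Gaussian model of section 3 and sweeps $\sigma_q$ with $\mu_q$ held at the posterior mean. The constants and helper names are assumptions for illustration; the expectations use the fact that $\mathbb{E}_q[(\theta-a)^2] = (\mu_q-a)^2+\sigma_q^2$ for Gaussian $q$.

```python
import math

# Hypothetical model settings: y = 2, prior N(0, 1), likelihood variance 1.
Y, MU0, VAR0, VAR_LIK = 2.0, 0.0, 1.0, 1.0

def expected_log_gauss(a, var, mu_q, var_q):
    """E_q[ln N(a; theta, var)] for q = N(mu_q, var_q); symmetric in a and theta."""
    return (-0.5 * math.log(2.0 * math.pi * var)
            - ((a - mu_q) ** 2 + var_q) / (2.0 * var))

def decompose(mu_q, sigma_q):
    """Return the terms of both decompositions of F(q, y)."""
    var_q = sigma_q ** 2
    exp_log_lik = expected_log_gauss(Y, VAR_LIK, mu_q, var_q)
    exp_log_prior = expected_log_gauss(MU0, VAR0, mu_q, var_q)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * var_q)
    kl_prior = 0.5 * (math.log(VAR0 / var_q)
                      + (var_q + (mu_q - MU0) ** 2) / VAR0 - 1.0)
    return {
        "fit": exp_log_lik + exp_log_prior,            # expected log-joint
        "entropy": entropy,
        "F_joint_plus_entropy": exp_log_lik + exp_log_prior + entropy,
        "F_lik_minus_kl_prior": exp_log_lik - kl_prior,
    }

mu_post = 1.0  # posterior mean for the settings above
sweep = [decompose(mu_post, s) for s in (1.5, 1.0, 0.7, 0.4)]
for s, terms in zip((1.5, 1.0, 0.7, 0.4), sweep):
    print(f"sigma_q={s:.1f}  fit={terms['fit']:+.3f}  "
          f"entropy={terms['entropy']:+.3f}  F={terms['F_joint_plus_entropy']:+.3f}")
```

In this sweep the fit term rises and the entropy falls as $\sigma_q$ shrinks, and $F$ is largest at the grid value closest to the posterior standard deviation $\sqrt{0.5}\approx 0.707$.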
The heatmap below shows $F(q,y)$ as a function of $(\mu_q, \sigma_q)$ for the same Gaussian-Gaussian model. Click anywhere on the landscape to place $q$ there; the optimum's $(\mu^\star,\sigma^\star)$ is marked with a crosshair, an arrow shows the gradient direction at your current $q$, and the inset on the right plots that $q$ against the true posterior on the $\theta$ axis. The readout reports $F$, the constant ceiling $\ln p(y)$, and their gap, which is exactly $\mathrm{KL}[q\Vert p(\cdot\mid y)]$. As you slide $y$ or the likelihood noise, the whole landscape shifts.
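The gradient arrow on the landscape can be imitated with a few lines of plain gradient ascent. This is a minimal sketch, not the page's actual "Optimize" routine: it assumes the parameterization $(\mu_q, \rho)$ with $\rho=\ln\sigma_q$ (so $\sigma_q$ stays positive), a fixed step size, and the same illustrative model settings as before. The gradients follow from the closed form of $F$ for this model: $\partial F/\partial\mu_q = (y-\mu_q)/\sigma^2_{\text{lik}} - (\mu_q-\mu_0)/\sigma_0^2$ and $\partial F/\partial\rho = 1 - \sigma_q^2\,(1/\sigma^2_{\text{lik}} + 1/\sigma_0^2)$.

```python
import math

def grad_free_energy(mu_q, rho, y, mu0, var0, var_lik):
    """Gradient of F(q, y) with respect to (mu_q, rho), where sigma_q = exp(rho)."""
    var_q = math.exp(2.0 * rho)
    d_mu = (y - mu_q) / var_lik - (mu_q - mu0) / var0
    d_rho = 1.0 - var_q * (1.0 / var_lik + 1.0 / var0)
    return d_mu, d_rho

def optimize(y, mu0, var0, var_lik, mu_q=-3.0, sigma_q=2.0, lr=0.1, steps=500):
    """Fixed-step gradient ascent on F; returns the final (mu_q, sigma_q)."""
    rho = math.log(sigma_q)
    for _ in range(steps):
        d_mu, d_rho = grad_free_energy(mu_q, rho, y, mu0, var0, var_lik)
        mu_q += lr * d_mu    # ascent: step along the gradient
        rho += lr * d_rho
    return mu_q, math.exp(rho)

# Hypothetical settings: y = 2, prior N(0, 1), likelihood variance 1.
y, mu0, var0, var_lik = 2.0, 0.0, 1.0, 1.0
var_star = 1.0 / (1.0 / var0 + 1.0 / var_lik)
mu_star = var_star * (mu0 / var0 + y / var_lik)

mu_hat, sigma_hat = optimize(y, mu0, var0, var_lik)
print(f"optimized q: mu={mu_hat:.6f}, sigma={sigma_hat:.6f}")
print(f"posterior:   mu={mu_star:.6f}, sigma={math.sqrt(var_star):.6f}")
```

Because the variational family contains the true posterior here, the optimizer should land on $(\mu^\star,\sigma^\star)$; with a richer model the same loop would stop at the closest member of the family instead.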
Where this goes next
- Mean-field variational inference. If $q(\theta) = \prod_i q_i(\theta_i)$ factorizes, coordinate ascent on $F$ has a closed-form update for each $q_i$. This is VI for graphical models, topic models, latent Dirichlet allocation, and more.
- Amortized inference / VAE. Replace $q(\theta)$ with $q_\phi(\theta\mid y)$, a neural network mapping data to variational parameters. Maximize the same $F$, now jointly over $\phi$ and the parameters of the generative model. That is the variational autoencoder.
- Free-energy principle. The same identity, applied at every level of a hierarchical generative model of sensory input, gives Karl Friston's account of perception, action, and learning as minimizing free energy (his free energy is $-F$ in the notation of this page, so minimizing it is the same as maximizing $F$).
- EM as a special case. The Expectation–Maximization algorithm alternates E-steps (set $q$ to the exact posterior, KL = 0) and M-steps (maximize $F$ over model parameters) on the same identity.
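Outside conjugate toy models, $F$ rarely has a closed form, and methods such as the VAE work with sampled estimates of it. As a hedged illustration, the sketch below estimates $F(q,y) = \mathbb{E}_q[\ln p(y,\theta) - \ln q(\theta)]$ by Monte Carlo, drawing $\theta = \mu_q + \sigma_q\,\varepsilon$ with $\varepsilon\sim\mathcal{N}(0,1)$ (the reparameterized form commonly used in VAEs), and compares it with the exact value for the Gaussian-Gaussian model of section 3. The sample size, seed, and all numeric settings are assumptions.

```python
import math
import random

def log_normal(x, mu, var):
    """Log-density of N(mu, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def free_energy_mc(mu_q, sigma_q, y, mu0, var0, var_lik, n=100_000, seed=0):
    """Monte Carlo estimate of F(q, y) using reparameterized draws from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        theta = mu_q + sigma_q * eps  # a draw from q, written as a function of eps
        log_joint = log_normal(y, theta, var_lik) + log_normal(theta, mu0, var0)
        log_q = log_normal(theta, mu_q, sigma_q ** 2)
        total += log_joint - log_q
    return total / n

def free_energy_exact(mu_q, sigma_q, y, mu0, var0, var_lik):
    """Closed-form F(q, y) for Gaussian q in the Gaussian-Gaussian model."""
    var_q = sigma_q ** 2
    exp_log_lik = (-0.5 * math.log(2.0 * math.pi * var_lik)
                   - ((y - mu_q) ** 2 + var_q) / (2.0 * var_lik))
    exp_log_prior = (-0.5 * math.log(2.0 * math.pi * var0)
                     - ((mu_q - mu0) ** 2 + var_q) / (2.0 * var0))
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * var_q)
    return exp_log_lik + exp_log_prior + entropy

# Hypothetical settings: y = 2, prior N(0, 1), likelihood variance 1, q = N(0.5, 1).
y, mu0, var0, var_lik = 2.0, 0.0, 1.0, 1.0
f_mc = free_energy_mc(0.5, 1.0, y, mu0, var0, var_lik)
f_exact = free_energy_exact(0.5, 1.0, y, mu0, var0, var_lik)
ceiling = log_normal(y, mu0, var0 + var_lik)  # ln p(y)
print(f"F (Monte Carlo) = {f_mc:.3f}, F (exact) = {f_exact:.3f}, ln p(y) = {ceiling:.3f}")
```

The sampled estimate is noisy but unbiased for $F$; in an amortized setting the same estimator would be differentiated with respect to the parameters that produce $\mu_q$ and $\sigma_q$.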
$\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$. Everything from the mean-field updates of a topic model to the loss function of a VAE is a tactic for making one side of that equation easy to compute.
What next
Variational inference sits between measure-theoretic identities and sampling-based computation.