Monte Carlo & Bayesian Sampling Methods
Bayesian computation centres on two integrals: the posterior normalizer $p(y) = \int p(y\mid\theta)p(\theta)\,d\theta$, and a posterior expectation $\mathbb{E}_{p(\theta\mid y)}[f(\theta)] = \int f(\theta)\,p(\theta\mid y)\,d\theta$. Neither has a closed form for most models of interest. Monte Carlo methods replace the integrals with averages over samples, either drawn directly (i.i.d. samplers), or drawn from a Markov chain whose stationary distribution is the target (MCMC), or drawn sequentially over time (SMC).
1. The big picture
Every method below evaluates an expectation against some target $\pi(x)$, typically a Bayesian posterior $\pi(x) = p(\theta\mid y)$, possibly known only up to a constant. They differ in three axes:
- How samples are produced. Direct i.i.d. draws (inverse CDF, rejection), reweighted draws from a tractable proposal (importance sampling), draws from a Markov chain converging to $\pi$ (Metropolis-Hastings, Gibbs, RJMCMC), or sequentially-updated weighted particles (SMC / particle filters).
- What they require of $\pi$. Inverse CDF needs the CDF; rejection needs a global envelope; importance sampling needs only pointwise $\pi(x)$; MCMC needs only ratios $\pi(x')/\pi(x)$; Gibbs needs tractable full conditionals; Kalman needs linearity and Gaussianity.
- What state space they're built for. Static $\theta$ (most posteriors), variable-dimension $(k,\theta_k)$ (RJMCMC for model selection), or a time-evolving latent state $x_{0:T}$ (Kalman, particle filters).
The "What to use" table at the bottom of this page collapses the choice into one view.
| Structure | Sampling idea | Use when |
|---|---|---|
| Static target | Direct / rejection | Low-dimensional target with an envelope or inverse transform. |
| Weighted target | Importance sampling | A tractable proposal covers the target and the goal is an expectation. |
| Markov target | MCMC | Only unnormalized target ratios are available, often in high dimension. |
| State over time | Sequential Monte Carlo | Latent state evolves and observations arrive one time step at a time. |
2. I.i.d. methods
exacti.i.d. Inverse-CDF sampling
The cleanest case. If $X$ has CDF $F$, then $F(X)\sim\mathcal{U}[0,1]$, so $X = F^{-1}(U)$ for $U\sim\mathcal{U}[0,1]$ recovers exact i.i.d. samples.
$$ X = F^{-1}(U),\qquad U\sim\mathcal{U}[0,1]. $$It is exact and free of rejection but useless beyond 1-D and beyond targets whose CDF has an invertible closed form. It is the baseline that motivates everything else.
exacti.i.d. Rejection sampling
Pick a tractable proposal $q$ and a constant $M$ with $M\,q(x)\geq \pi(x)$ for all $x$. Draw $x\sim q$, $u\sim\mathcal{U}[0,1]$, and accept if $u \lt \pi(x)/(M\,q(x))$. Accepted samples are exact draws from $\pi$.
- Draw $x \sim q$.
- Draw $u \sim \mathcal{U}[0,1]$.
- If $u \lt \pi(x) / (M\,q(x))$, accept $x$; else reject and repeat.
Acceptance rate is $1/M$. In high dimensions $M$ blows up exponentially, because the gap between any plausible $q$ and a peaked $\pi$ widens fast in volume, so rejection is unusable past a handful of dimensions.
i.i.d. Importance sampling (IS)
Never reject. Reweight every sample from $q$ by the ratio $w(x) = \pi(x)/q(x)$:
$$ \mathbb{E}_\pi[f(X)] \approx \frac{1}{N}\sum_{i=1}^N w(x^{(i)})\,f(x^{(i)}),\qquad x^{(i)}\sim q. $$When $\pi$ is unnormalized (Bayesian posteriors typically are), use self-normalized IS:
$$ \tilde w^{(i)} = \frac{w(x^{(i)})}{\sum_j w(x^{(j)})},\qquad \hat{\mathbb{E}}_\pi[f] = \sum_i \tilde w^{(i)} f(x^{(i)}). $$The optimal proposal minimizing variance of $\hat{\mathbb{E}}_\pi[f]$ is $q^*(x)\propto |f(x)|\pi(x)$, which is exactly the thing you can't sample from. The practical advice is: choose $q$ heavier-tailed than $\pi$ so that weights don't blow up, and monitor the effective sample size $\mathrm{ESS} = (\sum_i \tilde w^{(i)})^2 / \sum_i (\tilde w^{(i)})^2$. If it collapses to $\ll N$, your weights are degenerate and the estimate is dominated by one or two samples.
Figure 2 lets you compare rejection and importance sampling on the same problem. Pick a target $\pi$ and a proposal $q$ (family, center, scale, and for Student-$t$ the degrees of freedom $\nu$). The top row plots each method's draws; the ESS bar shows how many useful samples per $N$ draws each method extracts; the empirical-fit strip overlays the resulting histograms on $\pi$ so you can see which method actually reconstructs the target. The preset row sweeps through the regimes where the two methods agree, disagree, or both fail.
3. MCMC
Once $\pi$ is multivariate and peaked, both rejection (acceptance rate $\to 0$) and importance sampling (weight variance $\to \infty$) collapse. MCMC trades i.i.d. samples for correlated samples from a Markov chain $X_0, X_1, X_2, \dots$ whose stationary distribution is $\pi$. The chain is constructed to satisfy detailed balance:
$$ \pi(x)\,T(x\to x') \;=\; \pi(x')\,T(x'\to x), $$which is sufficient for $\pi$ to be invariant under $T$. Different MCMC algorithms are different ways to engineer such a $T$.
The MCMC story is a sequence of relaxations. Each step removes one limitation, but it adds a new requirement or a new way to fail.
| Step | What changes | Requirement | What can fail |
|---|---|---|---|
| Metropolis | Replace i.i.d. draws with a Markov chain and accept/reject local proposals. | Symmetric proposal $q(x'\mid x)=q(x\mid x')$. | Proposal scale can make the chain crawl or reject almost everything. |
| Metropolis-Hastings | Allow asymmetric or independence proposals. | Evaluate the proposal correction $q(x\mid x')/q(x'\mid x)$. | A bad proposal still gives sticky chains or mode blindness. |
| Gibbs | Replace accept/reject proposals with exact coordinate conditional draws. | Closed-form full conditionals, or MH-within-Gibbs substeps. | Strong posterior correlation makes axis moves mix slowly. |
| RJMCMC | Let the chain jump between parameter spaces with different dimensions. | Dimension-matched reversible moves and Jacobian terms. | Birth/death move design is problem-specific and can mix poorly. |
Figure 3 puts the MCMC variants on the same toy state space. The tabs keep the Markov-chain skeleton fixed where possible, then add the extra mechanism that each variant needs: rejection self-loops, Hastings proposal correction, coordinate-wise conditionals, birth/death dimension moves, or gradient-guided proposals. The RJMCMC tab shares the 1-D state space with the others for visual parallelism — the states there stand in for model orders $k$, not for the within-model parameters. The genuine two-space picture (within-model MH on each $k$, plus dimension-jumping moves between them) is in Figure 13.
mcmc Metropolis MCMC (1953)
Metropolis MCMC is the symmetric-proposal member of the Metropolis-Hastings family. Use a proposal with $q(x'\mid x) = q(x\mid x')$, propose $x'\sim q(\cdot\mid x)$, and accept with probability
$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')}{\pi(x)}\right\}. $$The symmetric $q$ cancels, so only the target-density ratio remains. Typical choice: random walk $x' = x + \epsilon$, $\epsilon\sim\mathcal{N}(0,\Sigma_\text{prop})$. The step size $\Sigma_\text{prop}$ is the one tuning knob and it matters: too small and the chain crawls; too large and nearly every proposal is rejected.
mcmc Metropolis-Hastings (1970)
Generalizes Metropolis to asymmetric $q$ by adding a correction:
$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')\,q(x\mid x')}{\pi(x)\,q(x'\mid x)}\right\}. $$- Propose $x' \sim q(\cdot \mid x)$.
- Compute $\alpha = \min\bigl\{1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\bigr\}$.
- Draw $u \sim \mathcal{U}[0,1]$. If $u \lt \alpha$, set $x \leftarrow x'$; else keep $x$.
Variants worth naming:
- Random-walk MH: $q$ symmetric; ratio collapses to $\pi(x')/\pi(x)$. Default in most papers.
- Independent MH: $q(x'\mid x) = q(x')$. Equivalent to importance sampling with hard accept/reject; works only if $q$ is genuinely a good global approximation of $\pi$.
- Langevin / HMC: $q$ uses gradient information $\nabla\log\pi$ to bias proposals toward high-density regions. Much faster in high dimensions when gradients are available.
Why that $\alpha$? Because it is exactly the factor that forces detailed balance. Write the transition as $T(x\to x') = q(x'\mid x)\,\alpha(x,x')$. The proposal alone is unbalanced: the two flows $\pi(x)\,q(x'\mid x)$ and $\pi(x')\,q(x\mid x')$ generally differ. One direction is over-full, and its $\alpha$ is the ratio that scales it down while the reverse direction's $\alpha$ is $1$. Either way both realized flows land on $\min\{\pi(x)\,q(x'\mid x),\;\pi(x')\,q(x\mid x')\}$, so $\pi(x)\,T(x\to x') = \pi(x')\,T(x'\to x)$ — detailed balance. Figure 4 lets you push the target and proposal ratios around: the proposal flow tilts, $\alpha$ compensates, and the realized flow stays balanced no matter what.
Figure 5 runs a random-walk MH on a 2-D banana-shaped target and lets you watch how step size trades off acceptance rate against mixing.
The BNN posterior target swaps the banana for the two-parameter $f(x; w_1, w_2) = w_1\tanh(w_2 x)$ posterior used on the Bayesian neural networks page. The model has a sign symmetry, so the energy surface has two basins of equal depth. Watch HMC commit to one basin while random-walk MH stays trapped in whichever it started in — switching basins requires crossing a high-energy ridge that local proposals rarely traverse.
One caveat the per-step view hides: cost. An HMC step here spends $L$ gradient evaluations on its leapfrog trajectory (the leapfrog steps slider) where random-walk MH spends none. Per accepted step HMC wins easily; per gradient evaluation the gap narrows, and the honest comparison is mixing per unit of compute — which is why the strong-but-expensive proposal still has to earn its keep against the cheap-but-slow one.
For a live MALA sampler on this same banana target, see chi-feng's MCMC demo. MALA — the Metropolis-adjusted Langevin algorithm — is HMC with a single leapfrog step: one gradient-guided Langevin proposal, then an accept/reject correction. It sits between random-walk MH and full HMC.
Figure 8 makes MALA's anatomy explicit. Every proposed step on the banana target $x' = x + d + s$ splits into a deterministic drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi(x)$ that climbs the gradient and a random diffusion kick $s = \epsilon\,z$ with $z\sim\mathcal{N}(0, I)$. Because the proposal is asymmetric, the accept ratio carries both forward and reverse Langevin densities. Watch the acceptance gauge: MALA mixes best near a rate of $0.574$, so the gauge doubles as a step-size tuning instrument.
MALA is the corrected discretization of a continuous process. The Langevin diffusion $d\theta = \tfrac12\nabla\log\pi(\theta)\,dt + dB$ has $\pi$ as its exact stationary distribution. Take a finite Euler step of size $\epsilon$ and you get the unadjusted Langevin algorithm (ULA): fast, but the discretization biases the stationary distribution away from $\pi$. MALA removes that bias with the Metropolis accept/reject step. Toggle the correction off to run ULA. The histogram strip tracks the chain's $x$-marginal against the true $\mathcal{N}(0,4)$: at large $\epsilon$ the uncorrected chain visibly settles to the wrong distribution while MALA stays on target.
Figure 9 shows the same target as a surface. Drag to rotate it. The ridge shape makes the random-walk tuning problem visible: local proposals can slide along the ridge, but large proposals jump off the high-density mass.
mcmc Markov topology and stationary mass
The Markov chain is not just a path. It is a transition graph over the state space: each state sends probability mass to nearby proposals, then the accept/reject rule changes how much of that mass actually moves. A good chain has local moves that still let empirical occupation mass converge to the target mass. The mode buttons switch the state-space layout: a 1-D line (axis-aligned random walk on a bimodal target), a 2-D grid (Moore-neighborhood random walk on a two-bump target), and a ring with a barrier between two clusters of mass — each exposes a different mixing failure mode.
mcmc Gibbs sampling
A special case of MH where the proposal is the full conditional of one coordinate at a time. With this proposal, the acceptance probability is identically 1.
$$ x_i^{(t+1)} \sim p\!\left(x_i \,\middle|\, x_1^{(t+1)},\ldots,x_{i-1}^{(t+1)},\,x_{i+1}^{(t)},\ldots,x_d^{(t)}\right). $$Gibbs is the workhorse of Bayesian hierarchical modeling because conjugate priors give closed-form full conditionals, so every update is a draw from a named distribution. It mixes poorly along directions of high posterior correlation (because it can only move along axes); a common cure is to update blocks of variables jointly, or to reparametrize. If a conditional is intractable, replace that step with an MH sub-step. This is Metropolis-within-Gibbs.
Figure 11 shows why Gibbs feels different from a random walk: it moves by exact conditional draws along coordinate axes. That makes every proposal accepted, but strong posterior correlation can still make the chain advance slowly.
mcmc Reversible-jump MCMC (RJMCMC)
What if the model order itself is unknown: number of mixture components, number of change-points, number of hidden layers? Then the state space is a disjoint union $\bigsqcup_k \{k\}\times\mathbb{R}^{n_k}$ of spaces of different dimension, and standard MH doesn't apply because $\pi(x')/\pi(x)$ is comparing densities in different-dimensional spaces.
Green's RJMCMC fixes this with a dimension-matching auxiliary variable. To move from $(k,\theta_k)$ to $(k',\theta_{k'})$ with $n_{k'}>n_k$, draw auxiliary $u\sim g$ and apply a diffeomorphism $\theta_{k'} = T(\theta_k, u)$. The acceptance ratio carries a Jacobian:
$$ \alpha = \min\!\left\{1,\; \frac{\pi(k',\theta_{k'})\,g'(u')\,r(k'\to k)}{\pi(k,\theta_k)\,g(u)\,r(k\to k')}\left|\frac{\partial(\theta_{k'},u')}{\partial(\theta_k,u)}\right|\right\}. $$The classic example is a mixture model with a birth/death pair: birth splits one component into two ($\theta_1=\theta+u,\theta_2=\theta-u$, Jacobian $=2$); death merges two into one. Recent work uses RJMCMC inside simulated annealing to choose neural network architectures jointly with weights.
Figure 13 puts the split/merge map inside a Markov chain. The left panel runs within-model Metropolis-Hastings on the $k=1$ parameter $\mu$; the right panel runs MH on the $k=2$ parameters $(\mu_1, \mu_2)$. With probability $p_\text{rj}$ the chain proposes a between-model move instead: a birth $\mu \mapsto (\mu+u, \mu-u)$ with $u\sim g$, or a death that merges $(\mu_1, \mu_2)$ to their midpoint. The acceptance ratio for each between-model move carries the Jacobian factor $2$ from the diffeomorphism. The model-order bar below the panels shows the chain's marginal occupation of $\{k=1, k=2\}$.
4. Closed-form filtering: the Kalman family
A different family of problems involves a hidden state that evolves over time. The state-space model is
$$ x_k = f(x_{k-1}) + w_k,\qquad y_k = h(x_k) + v_k, $$where $w_k$ and $v_k$ are process and observation noise. The job is to compute $p(x_k \mid y_{1:k})$ sequentially, filtering the noisy observations.
The filtering story is the same kind of tradeoff ladder: start with an exact Gaussian posterior, then relax linearity, then relax the local-linearization requirement, then finally relax the Gaussian-belief restriction.
| Step | What changes | Requirement | What can fail |
|---|---|---|---|
| Kalman | Compute the filtering posterior exactly as mean and covariance. | Linear dynamics, linear observations, Gaussian noise. | Nonlinear or non-Gaussian systems violate the model. |
| EKF | Allow nonlinear dynamics/observations by linearizing locally. | Jacobians and a posterior narrow enough for local linearization. | Curvature over the uncertainty region can bias or destabilize the filter. |
| UKF | Push sigma points through the nonlinear functions instead of using Jacobians. | A unimodal Gaussian belief is still a good summary. | Multimodal or heavy-tailed posteriors get collapsed into one Gaussian. |
| Particle filter | Represent the posterior with weighted samples instead of one Gaussian. | Enough particles, likelihood evaluations, and resampling control. | Weight collapse, sample impoverishment, and poor scaling with state dimension. |
closed-form Kalman filter (linear-Gaussian)
If $f$ and $h$ are linear ($x_k = Fx_{k-1} + w_k,\ y_k = Hx_k + v_k$) and the noises are Gaussian, the filtering distribution stays Gaussian forever and is given exactly by:
$$ \begin{aligned} \hat x_k^- &= F\hat x_{k-1}, \quad P_k^- = F P_{k-1} F^\top + Q & \text{(predict)} \\ K_k &= P_k^- H^\top (H P_k^- H^\top + R)^{-1} & \text{(gain)} \\ \hat x_k &= \hat x_k^- + K_k(y_k - H\hat x_k^-) \\ P_k &= (I - K_k H)\,P_k^- & \text{(update)} \end{aligned} $$This is not a Monte Carlo method; it is the closed-form posterior for the linear-Gaussian special case. Use it when you can.
closed-form Extended Kalman filter (EKF)
For mildly nonlinear $f,h$, linearize at the current mean: $F_k = \partial f/\partial x\big|_{\hat x_{k-1}}$, $H_k = \partial h/\partial x\big|_{\hat x_k^-}$, then apply the standard Kalman equations. Cheap and widely deployed, but a poor approximation when the local linearization disagrees with the true function over the spread of the prior.
closed-form Unscented Kalman filter (UKF)
Instead of linearizing the function, propagate a deterministic set of $2n+1$ sigma points through the true nonlinear $f,h$ and refit a Gaussian to their image. Second-order accurate, requires no Jacobians, same cost as EKF. Still fundamentally Gaussian, and fails for multimodal or heavy-tailed posteriors.
Figure 14 compares the filtering approximations on one nonlinear update. Kalman-style methods keep only a Gaussian belief; the difference is how they push that Gaussian through nonlinear dynamics and observations.
5. Sequential Monte Carlo
When the state-space model is genuinely nonlinear or non-Gaussian, no closed form exists and Gaussian approximations break. SMC represents $p(x_{0:k}\mid y_{1:k})$ by a weighted cloud of particles $\{(x^{(i)}_{0:k}, w^{(i)}_k)\}_{i=1}^N$ and updates the cloud as each new observation arrives.
smc Sequential importance sampling (SIS)
Apply IS step by step. With proposal $\pi(x_k\mid x_{0:k-1}, y_{1:k})$, the weights update recursively:
$$ w_k^{(i)} \propto w_{k-1}^{(i)} \cdot \frac{p(y_k\mid x_k^{(i)})\,p(x_k^{(i)}\mid x_{k-1}^{(i)})}{\pi(x_k^{(i)}\mid x_{0:k-1}^{(i)}, y_{1:k})}. $$The catastrophe: after a few steps almost all weight ends up on one particle. This is weight degeneracy, and it is the reason SIS by itself is essentially never used.
smc Particle filter / SIR (bootstrap)
Add a resampling step: whenever the effective sample size drops below a threshold, draw $N$ new particles with replacement from the weighted cloud, duplicating heavy particles and killing light ones, and reset all weights to $1/N$. The simplest variant, the bootstrap filter, uses the prior $p(x_k\mid x_{k-1}^{(i)})$ as the proposal, which collapses the weight update to just the likelihood:
$$ w_k^{(i)} \propto p(y_k\mid x_k^{(i)}). $$- For each particle $i$: draw $x_k^{(i)} \sim p(x_k \mid x_{k-1}^{(i)})$ from the dynamics.
- Reweight: $w_k^{(i)} \propto p(y_k \mid x_k^{(i)})$.
- Normalize weights; compute ESS.
- If ESS < threshold: resample $N$ particles $\propto w_k^{(i)}$ and reset weights to $1/N$.
Variants improve on the bootstrap proposal: the optimal proposal $\pi(x_k\mid x_{k-1}^{(i)}, y_k)$ minimizes weight variance but is tractable only for special models; local-linearization proposals use an EKF or UKF on each particle to construct a locally near-optimal Gaussian proposal. Resampling itself has variants (multinomial, stratified, systematic, residual) that differ in variance.
Resampling cures weight degeneracy but introduces sample impoverishment: heavy particles get duplicated and the cloud loses diversity. Roughening (jittering particles by a small noise) and MCMC moves between resampling steps are the standard remedies.
Figure 15 runs SIS and a bootstrap particle filter on the same nonlinear tracking problem, so the weight-collapse failure and the resampling fix are visible side by side.
6. What to use where
The table below collapses the previous five sections into a one-glance reference. Tags echo the section colors: exact means returns-true-samples-from-$\pi$, i.i.d. means draws are independent, mcmc means correlated chain samples, closed-form means no sampling at all, smc means weighted particles over time.
| Method | Use when | Strengths | Pitfalls |
|---|---|---|---|
| Inverse CDF exacti.i.d. | Univariate target with invertible CDF (exponential, Cauchy, custom 1-D). | Exact, zero waste. | Useless beyond 1-D or without closed-form $F^{-1}$. |
| Rejection exacti.i.d. | Low-dimensional target with a known envelope $M\,q\geq\pi$. | Exact i.i.d. samples; simple to implement. | Acceptance rate $1/M$ collapses exponentially in dimension. |
| Importance sampling i.i.d. | Computing expectations under an unnormalized target; building block inside SMC. | Never rejects; self-normalized version handles unnormalized $\pi$; flexible $q$. | Weight degeneracy if $q$ poorly covers $\pi$ or in high $d$; monitor ESS. |
| Metropolis mcmc | Static posterior where a symmetric random-walk proposal is adequate. | Simpler acceptance ratio; needs only pointwise unnormalized $\pi$. | A special case of MH; proposal scale still controls mixing and rejection. |
| Metropolis-Hastings mcmc | General-purpose static posterior with no special structure; only ratios $\pi(x')/\pi(x)$ available. | Asymptotically exact; needs only pointwise unnormalized $\pi$. | Tuning the proposal scale is fiddly; mixes poorly in high $d$ or multimodal $\pi$. |
| Gibbs sampling mcmc | Hierarchical/conjugate models where every full conditional has closed form. | Zero tuning; acceptance rate 1; natural for graphical models. | Slow when variables are correlated; needs tractable conditionals (else MH-within-Gibbs). |
| RJMCMC mcmc | Joint inference over model order and parameters (mixture components, change-points, NN architecture). | Asymptotically exact across model space; principled Bayesian model selection. | Move design and Jacobians are problem-specific; mixing across dimensions is hard. |
| Kalman filter closed-form | Linear-Gaussian state-space model. Filtering / smoothing of $x_{0:T}$ given $y_{1:T}$. | Exact; closed-form $O(d^3)$ per step. | Restricted to linear dynamics and Gaussian noise. |
| EKF closed-form | Mildly nonlinear state-space; tracking, robotics, navigation. | Drop-in replacement for KF; cheap. | First-order linearization error; can diverge under strong nonlinearity. |
| UKF closed-form | More nonlinear state-space than EKF can handle but still unimodal Gaussian. | Second-order accurate; no Jacobians required. | Still a Gaussian approximation; fails for multimodal / heavy-tailed posteriors. |
| Particle filter (SIR) smc | Nonlinear, non-Gaussian state-space; tracking, robotics, gene networks, anything multimodal over time. | Asymptotically exact for fully general SSMs; handles multimodal posteriors. | Sample impoverishment after resampling; scales poorly with state dimension. |
- Linear dynamics and Gaussian noise? → Kalman filter. Mildly nonlinear? → EKF/UKF. Genuinely nonlinear/non-Gaussian/multimodal? → particle filter.
- Static target (no time dimension), low-D, envelope available? → rejection.
- Static target, want an expectation, have a reasonable proposal $q$? → importance sampling.
- Static target, high-D, only ratios of $\pi$? → symmetric proposal: Metropolis; asymmetric proposal: Metropolis-Hastings. Conjugate conditionals available? → Gibbs. Gradients available? → HMC/Langevin.
- Model order itself is unknown? → RJMCMC.
What next
Sampling connects most directly to change of measure, approximate inference, and dependence testing.