Monte Carlo & Bayesian Sampling Methods

Rejection, importance sampling, Metropolis-Hastings, Gibbs, RJMCMC, Kalman filters, and particle filters: what they share, where they differ, which to reach for, and what their probability mass is doing.

Bayesian computation centres on two integrals: the posterior normalizer $p(y) = \int p(y\mid\theta)p(\theta)\,d\theta$, and a posterior expectation $\mathbb{E}_{p(\theta\mid y)}[f(\theta)] = \int f(\theta)\,p(\theta\mid y)\,d\theta$. Neither has a closed form for most models of interest. Monte Carlo methods replace the integrals with averages over samples, either drawn directly (i.i.d. samplers), or drawn from a Markov chain whose stationary distribution is the target (MCMC), or drawn sequentially over time (SMC).

1. The big picture

Every method below evaluates an expectation against some target $\pi(x)$, typically a Bayesian posterior $\pi(x) = p(\theta\mid y)$, possibly known only up to a constant. They differ in three axes:

How samples are produced. Direct i.i.d. draws (inverse CDF, rejection), reweighted draws from a tractable proposal (importance sampling), draws from a Markov chain converging to $\pi$ (Metropolis-Hastings, Gibbs, RJMCMC), or sequentially-updated weighted particles (SMC / particle filters).
What they require of $\pi$. Inverse CDF needs the CDF; rejection needs a global envelope; importance sampling needs only pointwise $\pi(x)$; MCMC needs only ratios $\pi(x')/\pi(x)$; Gibbs needs tractable full conditionals; Kalman needs linearity and Gaussianity.
What state space they're built for. Static $\theta$ (most posteriors), variable-dimension $(k,\theta_k)$ (RJMCMC for model selection), or a time-evolving latent state $x_{0:T}$ (Kalman, particle filters).

The "What to use" table at the bottom of this page collapses the choice into one view.

Figure 1 · How the sampler creates probability mass

Structure	Sampling idea	Use when
Static target	Direct / rejection	Low-dimensional target with an envelope or inverse transform.
Weighted target	Importance sampling	A tractable proposal covers the target and the goal is an expectation.
Markov target	MCMC	Only unnormalized target ratios are available, often in high dimension.
State over time	Sequential Monte Carlo	Latent state evolves and observations arrive one time step at a time.

target $\pi$ proposal / transition weighted particles

2. I.i.d. methods

exacti.i.d. Inverse-CDF sampling

The cleanest case. If $X$ has CDF $F$, then $F(X)\sim\mathcal{U}[0,1]$, so $X = F^{-1}(U)$ for $U\sim\mathcal{U}[0,1]$ recovers exact i.i.d. samples.

$$ X = F^{-1}(U),\qquad U\sim\mathcal{U}[0,1]. $$

It is exact and free of rejection but useless beyond 1-D and beyond targets whose CDF has an invertible closed form. It is the baseline that motivates everything else.

exacti.i.d. Rejection sampling

Pick a tractable proposal $q$ and a constant $M$ with $M\,q(x)\geq \pi(x)$ for all $x$. Draw $x\sim q$, $u\sim\mathcal{U}[0,1]$, and accept if $u \lt \pi(x)/(M\,q(x))$. Accepted samples are exact draws from $\pi$.

Rejection sampling

Draw $x \sim q$.
Draw $u \sim \mathcal{U}[0,1]$.
If $u \lt \pi(x) / (M\,q(x))$, accept $x$; else reject and repeat.

Acceptance rate is $1/M$. In high dimensions $M$ blows up exponentially, because the gap between any plausible $q$ and a peaked $\pi$ widens fast in volume, so rejection is unusable past a handful of dimensions.

i.i.d. Importance sampling (IS)

Never reject. Reweight every sample from $q$ by the ratio $w(x) = \pi(x)/q(x)$:

$$ \mathbb{E}_\pi[f(X)] \approx \frac{1}{N}\sum_{i=1}^N w(x^{(i)})\,f(x^{(i)}),\qquad x^{(i)}\sim q. $$

When $\pi$ is unnormalized (Bayesian posteriors typically are), use self-normalized IS:

$$ \tilde w^{(i)} = \frac{w(x^{(i)})}{\sum_j w(x^{(j)})},\qquad \hat{\mathbb{E}}_\pi[f] = \sum_i \tilde w^{(i)} f(x^{(i)}). $$

The optimal proposal minimizing variance of $\hat{\mathbb{E}}_\pi[f]$ is $q^*(x)\propto |f(x)|\pi(x)$, which is exactly the thing you can't sample from. The practical advice is: choose $q$ heavier-tailed than $\pi$ so that weights don't blow up, and monitor the effective sample size $\mathrm{ESS} = (\sum_i \tilde w^{(i)})^2 / \sum_i (\tilde w^{(i)})^2$. If it collapses to $\ll N$, your weights are degenerate and the estimate is dominated by one or two samples.

Figure 2 lets you compare rejection and importance sampling on the same problem. Pick a target $\pi$ and a proposal $q$ (family, center, scale, and for Student-$t$ the degrees of freedom $\nu$). The top row plots each method's draws; the ESS bar shows how many useful samples per $N$ draws each method extracts; the empirical-fit strip overlays the resulting histograms on $\pi$ so you can see which method actually reconstructs the target. The preset row sweeps through the regimes where the two methods agree, disagree, or both fail.

Figure 2 · Rejection sampling vs. importance sampling: choose $p$ and $q$

target $\pi$ proposal $M\!\cdot\!q$ accepted / rejection ESS importance weights / IS ESS rejected / wasted draws

target $p$ Bimodal mixture

proposal $q$ family Gaussian

proposal center $\mu_q$ 0

proposal scale $\sigma_q$ 2.0

Student-t df $\nu$ 3

# samples 300

3. MCMC

Once $\pi$ is multivariate and peaked, both rejection (acceptance rate $\to 0$) and importance sampling (weight variance $\to \infty$) collapse. MCMC trades i.i.d. samples for correlated samples from a Markov chain $X_0, X_1, X_2, \dots$ whose stationary distribution is $\pi$. The chain is constructed to satisfy detailed balance:

$$ \pi(x)\,T(x\to x') \;=\; \pi(x')\,T(x'\to x), $$

which is sufficient for $\pi$ to be invariant under $T$. Different MCMC algorithms are different ways to engineer such a $T$.

The MCMC story is a sequence of relaxations. Each step removes one limitation, but it adds a new requirement or a new way to fail.

Step	What changes	Requirement	What can fail
Metropolis	Replace i.i.d. draws with a Markov chain and accept/reject local proposals.	Symmetric proposal $q(x'\mid x)=q(x\mid x')$.	Proposal scale can make the chain crawl or reject almost everything.
Metropolis-Hastings	Allow asymmetric or independence proposals.	Evaluate the proposal correction $q(x\mid x')/q(x'\mid x)$.	A bad proposal still gives sticky chains or mode blindness.
Gibbs	Replace accept/reject proposals with exact coordinate conditional draws.	Closed-form full conditionals, or MH-within-Gibbs substeps.	Strong posterior correlation makes axis moves mix slowly.
RJMCMC	Let the chain jump between parameter spaces with different dimensions.	Dimension-matched reversible moves and Jacobian terms.	Birth/death move design is problem-specific and can mix poorly.

Figure 3 puts the MCMC variants on the same toy state space. The tabs keep the Markov-chain skeleton fixed where possible, then add the extra mechanism that each variant needs: rejection self-loops, Hastings proposal correction, coordinate-wise conditionals, birth/death dimension moves, or gradient-guided proposals. The RJMCMC tab shares the 1-D state space with the others for visual parallelism — the states there stand in for model orders $k$, not for the within-model parameters. The genuine two-space picture (within-model MH on each $k$, plus dimension-jumping moves between them) is in Figure 13.

Figure 3 · MCMC variant lab: what changes in the transition kernel

target mass proposal / transition accepted move rejection / self-loop

variant knob 0.55

speed (moves/sec) 2.5

mcmc Metropolis MCMC (1953)

Metropolis MCMC is the symmetric-proposal member of the Metropolis-Hastings family. Use a proposal with $q(x'\mid x) = q(x\mid x')$, propose $x'\sim q(\cdot\mid x)$, and accept with probability

$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')}{\pi(x)}\right\}. $$

The symmetric $q$ cancels, so only the target-density ratio remains. Typical choice: random walk $x' = x + \epsilon$, $\epsilon\sim\mathcal{N}(0,\Sigma_\text{prop})$. The step size $\Sigma_\text{prop}$ is the one tuning knob and it matters: too small and the chain crawls; too large and nearly every proposal is rejected.

mcmc Metropolis-Hastings (1970)

Generalizes Metropolis to asymmetric $q$ by adding a correction:

$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')\,q(x\mid x')}{\pi(x)\,q(x'\mid x)}\right\}. $$

Metropolis-Hastings

Propose $x' \sim q(\cdot \mid x)$.
Compute $\alpha = \min\bigl\{1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\bigr\}$.
Draw $u \sim \mathcal{U}[0,1]$. If $u \lt \alpha$, set $x \leftarrow x'$; else keep $x$.

Variants worth naming:

Random-walk MH: $q$ symmetric; ratio collapses to $\pi(x')/\pi(x)$. Default in most papers.
Independent MH: $q(x'\mid x) = q(x')$. Equivalent to importance sampling with hard accept/reject; works only if $q$ is genuinely a good global approximation of $\pi$.
Langevin / HMC: $q$ uses gradient information $\nabla\log\pi$ to bias proposals toward high-density regions. Much faster in high dimensions when gradients are available.

Why that $\alpha$? Because it is exactly the factor that forces detailed balance. Write the transition as $T(x\to x') = q(x'\mid x)\,\alpha(x,x')$. The proposal alone is unbalanced: the two flows $\pi(x)\,q(x'\mid x)$ and $\pi(x')\,q(x\mid x')$ generally differ. One direction is over-full, and its $\alpha$ is the ratio that scales it down while the reverse direction's $\alpha$ is $1$. Either way both realized flows land on $\min\{\pi(x)\,q(x'\mid x),\;\pi(x')\,q(x\mid x')\}$, so $\pi(x)\,T(x\to x') = \pi(x')\,T(x'\to x)$ — detailed balance. Figure 4 lets you push the target and proposal ratios around: the proposal flow tilts, $\alpha$ compensates, and the realized flow stays balanced no matter what.

Figure 4 · The acceptance ratio forces detailed balance

proposal flow $\pi\,q$ over-full direction, throttled by $\alpha$ realized flow $\pi\,T$

target ratio $\pi(x')/\pi(x)$ 2

proposal ratio $q(x\mid x')/q(x'\mid x)$ 1

Figure 5 runs a random-walk MH on a 2-D banana-shaped target and lets you watch how step size trades off acceptance rate against mixing.

Figure 5 · Random-walk Metropolis-Hastings on a banana target

step size $\sigma$ 0.5

speed (steps/frame) 5

Figure 6 · Trace, rolling autocorrelation, and effective sample size

trace rolling ACF ESS estimate

step size $\sigma$ 0.6

Figure 7 · Random-walk MH versus Hamiltonian proposals

random-walk MH leapfrog trajectory accepted HMC move

leapfrog step $\epsilon$ 0.08

leapfrog steps 16

speed (steps/sec) 2

The BNN posterior target swaps the banana for the two-parameter $f(x; w_1, w_2) = w_1\tanh(w_2 x)$ posterior used on the Bayesian neural networks page. The model has a sign symmetry, so the energy surface has two basins of equal depth. Watch HMC commit to one basin while random-walk MH stays trapped in whichever it started in — switching basins requires crossing a high-energy ridge that local proposals rarely traverse.

One caveat the per-step view hides: cost. An HMC step here spends $L$ gradient evaluations on its leapfrog trajectory (the leapfrog steps slider) where random-walk MH spends none. Per accepted step HMC wins easily; per gradient evaluation the gap narrows, and the honest comparison is mixing per unit of compute — which is why the strong-but-expensive proposal still has to earn its keep against the cheap-but-slow one.

For a live MALA sampler on this same banana target, see chi-feng's MCMC demo. MALA — the Metropolis-adjusted Langevin algorithm — is HMC with a single leapfrog step: one gradient-guided Langevin proposal, then an accept/reject correction. It sits between random-walk MH and full HMC.

Figure 8 makes MALA's anatomy explicit. Every proposed step on the banana target $x' = x + d + s$ splits into a deterministic drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi(x)$ that climbs the gradient and a random diffusion kick $s = \epsilon\,z$ with $z\sim\mathcal{N}(0, I)$. Because the proposal is asymmetric, the accept ratio carries both forward and reverse Langevin densities. Watch the acceptance gauge: MALA mixes best near a rate of $0.574$, so the gauge doubles as a step-size tuning instrument.

MALA is the corrected discretization of a continuous process. The Langevin diffusion $d\theta = \tfrac12\nabla\log\pi(\theta)\,dt + dB$ has $\pi$ as its exact stationary distribution. Take a finite Euler step of size $\epsilon$ and you get the unadjusted Langevin algorithm (ULA): fast, but the discretization biases the stationary distribution away from $\pi$. MALA removes that bias with the Metropolis accept/reject step. Toggle the correction off to run ULA. The histogram strip tracks the chain's $x$-marginal against the true $\mathcal{N}(0,4)$: at large $\epsilon$ the uncorrected chain visibly settles to the wrong distribution while MALA stays on target.

Figure 8 · Langevin diffusion, ULA, and MALA

drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi$ diffusion $s = \epsilon z$ rejected proposal

step size $\epsilon$ 0.45

speed (steps/sec) 3

Figure 9 shows the same target as a surface. Drag to rotate it. The ridge shape makes the random-walk tuning problem visible: local proposals can slide along the ridge, but large proposals jump off the high-density mass.

Figure 9 · WebGL density surface for the banana target

high target density low target density current MH state

mcmc Markov topology and stationary mass

The Markov chain is not just a path. It is a transition graph over the state space: each state sends probability mass to nearby proposals, then the accept/reject rule changes how much of that mass actually moves. A good chain has local moves that still let empirical occupation mass converge to the target mass. The mode buttons switch the state-space layout: a 1-D line (axis-aligned random walk on a bimodal target), a 2-D grid (Moore-neighborhood random walk on a two-bump target), and a ring with a barrier between two clusters of mass — each exposes a different mixing failure mode.

Figure 10 · Discrete MH transition topology and occupation mass

target mass transition edges empirical mass

proposal radius 2

temperature 1.0

speed (moves/sec) 2

background samples 25

mcmc Gibbs sampling

A special case of MH where the proposal is the full conditional of one coordinate at a time. With this proposal, the acceptance probability is identically 1.

$$ x_i^{(t+1)} \sim p\!\left(x_i \,\middle|\, x_1^{(t+1)},\ldots,x_{i-1}^{(t+1)},\,x_{i+1}^{(t)},\ldots,x_d^{(t)}\right). $$

Gibbs is the workhorse of Bayesian hierarchical modeling because conjugate priors give closed-form full conditionals, so every update is a draw from a named distribution. It mixes poorly along directions of high posterior correlation (because it can only move along axes); a common cure is to update blocks of variables jointly, or to reparametrize. If a conditional is intractable, replace that step with an MH sub-step. This is Metropolis-within-Gibbs.

Figure 11 shows why Gibbs feels different from a random walk: it moves by exact conditional draws along coordinate axes. That makes every proposal accepted, but strong posterior correlation can still make the chain advance slowly.

Figure 11 · Gibbs coordinate updates versus random-walk MH

target contours Gibbs path MH path

correlation $\rho$ 0.85

MH step size 0.9

mcmc Reversible-jump MCMC (RJMCMC)

What if the model order itself is unknown: number of mixture components, number of change-points, number of hidden layers? Then the state space is a disjoint union $\bigsqcup_k \{k\}\times\mathbb{R}^{n_k}$ of spaces of different dimension, and standard MH doesn't apply because $\pi(x')/\pi(x)$ is comparing densities in different-dimensional spaces.

Green's RJMCMC fixes this with a dimension-matching auxiliary variable. To move from $(k,\theta_k)$ to $(k',\theta_{k'})$ with $n_{k'}>n_k$, draw auxiliary $u\sim g$ and apply a diffeomorphism $\theta_{k'} = T(\theta_k, u)$. The acceptance ratio carries a Jacobian:

$$ \alpha = \min\!\left\{1,\; \frac{\pi(k',\theta_{k'})\,g'(u')\,r(k'\to k)}{\pi(k,\theta_k)\,g(u)\,r(k\to k')}\left|\frac{\partial(\theta_{k'},u')}{\partial(\theta_k,u)}\right|\right\}. $$

The classic example is a mixture model with a birth/death pair: birth splits one component into two ($\theta_1=\theta+u,\theta_2=\theta-u$, Jacobian $=2$); death merges two into one. Recent work uses RJMCMC inside simulated annealing to choose neural network architectures jointly with weights.

Figure 12 · Dimension-matching map: split / merge

$k=1$ parameter $\theta = \mu$ $k=2$ parameters $(\mu_1, \mu_2)$ auxiliary $u$

mean $\mu$ (in $k=1$) 0.5

auxiliary $u$ 1.2

Figure 13 puts the split/merge map inside a Markov chain. The left panel runs within-model Metropolis-Hastings on the $k=1$ parameter $\mu$; the right panel runs MH on the $k=2$ parameters $(\mu_1, \mu_2)$. With probability $p_\text{rj}$ the chain proposes a between-model move instead: a birth $\mu \mapsto (\mu+u, \mu-u)$ with $u\sim g$, or a death that merges $(\mu_1, \mu_2)$ to their midpoint. The acceptance ratio for each between-model move carries the Jacobian factor $2$ from the diffeomorphism. The model-order bar below the panels shows the chain's marginal occupation of $\{k=1, k=2\}$.

Figure 13 · RJMCMC as two coupled chains

target density current state recent trail RJ proposal

between-model proposal probability $p_{rj}$ 0.30

speed (steps/sec) 12

4. Closed-form filtering: the Kalman family

A different family of problems involves a hidden state that evolves over time. The state-space model is

$$ x_k = f(x_{k-1}) + w_k,\qquad y_k = h(x_k) + v_k, $$

where $w_k$ and $v_k$ are process and observation noise. The job is to compute $p(x_k \mid y_{1:k})$ sequentially, filtering the noisy observations.

The filtering story is the same kind of tradeoff ladder: start with an exact Gaussian posterior, then relax linearity, then relax the local-linearization requirement, then finally relax the Gaussian-belief restriction.

Step	What changes	Requirement	What can fail
Kalman	Compute the filtering posterior exactly as mean and covariance.	Linear dynamics, linear observations, Gaussian noise.	Nonlinear or non-Gaussian systems violate the model.
EKF	Allow nonlinear dynamics/observations by linearizing locally.	Jacobians and a posterior narrow enough for local linearization.	Curvature over the uncertainty region can bias or destabilize the filter.
UKF	Push sigma points through the nonlinear functions instead of using Jacobians.	A unimodal Gaussian belief is still a good summary.	Multimodal or heavy-tailed posteriors get collapsed into one Gaussian.
Particle filter	Represent the posterior with weighted samples instead of one Gaussian.	Enough particles, likelihood evaluations, and resampling control.	Weight collapse, sample impoverishment, and poor scaling with state dimension.

closed-form Kalman filter (linear-Gaussian)

If $f$ and $h$ are linear ($x_k = Fx_{k-1} + w_k,\ y_k = Hx_k + v_k$) and the noises are Gaussian, the filtering distribution stays Gaussian forever and is given exactly by:

$$ \begin{aligned} \hat x_k^- &= F\hat x_{k-1}, \quad P_k^- = F P_{k-1} F^\top + Q & \text{(predict)} \\ K_k &= P_k^- H^\top (H P_k^- H^\top + R)^{-1} & \text{(gain)} \\ \hat x_k &= \hat x_k^- + K_k(y_k - H\hat x_k^-) \\ P_k &= (I - K_k H)\,P_k^- & \text{(update)} \end{aligned} $$

This is not a Monte Carlo method; it is the closed-form posterior for the linear-Gaussian special case. Use it when you can.

closed-form Extended Kalman filter (EKF)

For mildly nonlinear $f,h$, linearize at the current mean: $F_k = \partial f/\partial x\big|_{\hat x_{k-1}}$, $H_k = \partial h/\partial x\big|_{\hat x_k^-}$, then apply the standard Kalman equations. Cheap and widely deployed, but a poor approximation when the local linearization disagrees with the true function over the spread of the prior.

closed-form Unscented Kalman filter (UKF)

Instead of linearizing the function, propagate a deterministic set of $2n+1$ sigma points through the true nonlinear $f,h$ and refit a Gaussian to their image. Second-order accurate, requires no Jacobians, same cost as EKF. Still fundamentally Gaussian, and fails for multimodal or heavy-tailed posteriors.

Figure 14 compares the filtering approximations on one nonlinear update. Kalman-style methods keep only a Gaussian belief; the difference is how they push that Gaussian through nonlinear dynamics and observations.

Figure 14 · Kalman, EKF, and UKF through a nonlinear observation

true posterior shape EKF Gaussian UKF Gaussian

observation $y$ 1.2

observation noise 0.3

prior spread 1.0

5. Sequential Monte Carlo

When the state-space model is genuinely nonlinear or non-Gaussian, no closed form exists and Gaussian approximations break. SMC represents $p(x_{0:k}\mid y_{1:k})$ by a weighted cloud of particles $\{(x^{(i)}_{0:k}, w^{(i)}_k)\}_{i=1}^N$ and updates the cloud as each new observation arrives.

smc Sequential importance sampling (SIS)

Apply IS step by step. With proposal $\pi(x_k\mid x_{0:k-1}, y_{1:k})$, the weights update recursively:

$$ w_k^{(i)} \propto w_{k-1}^{(i)} \cdot \frac{p(y_k\mid x_k^{(i)})\,p(x_k^{(i)}\mid x_{k-1}^{(i)})}{\pi(x_k^{(i)}\mid x_{0:k-1}^{(i)}, y_{1:k})}. $$

The catastrophe: after a few steps almost all weight ends up on one particle. This is weight degeneracy, and it is the reason SIS by itself is essentially never used.

smc Particle filter / SIR (bootstrap)

Add a resampling step: whenever the effective sample size drops below a threshold, draw $N$ new particles with replacement from the weighted cloud, duplicating heavy particles and killing light ones, and reset all weights to $1/N$. The simplest variant, the bootstrap filter, uses the prior $p(x_k\mid x_{k-1}^{(i)})$ as the proposal, which collapses the weight update to just the likelihood:

$$ w_k^{(i)} \propto p(y_k\mid x_k^{(i)}). $$

Bootstrap particle filter

For each particle $i$: draw $x_k^{(i)} \sim p(x_k \mid x_{k-1}^{(i)})$ from the dynamics.
Reweight: $w_k^{(i)} \propto p(y_k \mid x_k^{(i)})$.
Normalize weights; compute ESS.
If ESS < threshold: resample $N$ particles $\propto w_k^{(i)}$ and reset weights to $1/N$.

Variants improve on the bootstrap proposal: the optimal proposal $\pi(x_k\mid x_{k-1}^{(i)}, y_k)$ minimizes weight variance but is tractable only for special models; local-linearization proposals use an EKF or UKF on each particle to construct a locally near-optimal Gaussian proposal. Resampling itself has variants (multinomial, stratified, systematic, residual) that differ in variance.

Resampling cures weight degeneracy but introduces sample impoverishment: heavy particles get duplicated and the cloud loses diversity. Roughening (jittering particles by a small noise) and MCMC moves between resampling steps are the standard remedies.

Figure 15 runs SIS and a bootstrap particle filter on the same nonlinear tracking problem, so the weight-collapse failure and the resampling fix are visible side by side.

Figure 15 · Bootstrap particle filter on a nonlinear scalar SSM

true state observation $y_k$ particle cloud posterior mean

# particles 200

obs noise $\sigma_v$ 1.0

resample threshold (ESS/N) 0.5

Figure 16 · Particle genealogy after repeated resampling

surviving lineages extinct lineages

resampling pressure 0.65

6. What to use where

The table below collapses the previous five sections into a one-glance reference. Tags echo the section colors: exact means returns-true-samples-from-$\pi$, i.i.d. means draws are independent, mcmc means correlated chain samples, closed-form means no sampling at all, smc means weighted particles over time.

Method	Use when	Strengths	Pitfalls
Inverse CDF exacti.i.d.	Univariate target with invertible CDF (exponential, Cauchy, custom 1-D).	Exact, zero waste.	Useless beyond 1-D or without closed-form $F^{-1}$.
Rejection exacti.i.d.	Low-dimensional target with a known envelope $M\,q\geq\pi$.	Exact i.i.d. samples; simple to implement.	Acceptance rate $1/M$ collapses exponentially in dimension.
Importance sampling i.i.d.	Computing expectations under an unnormalized target; building block inside SMC.	Never rejects; self-normalized version handles unnormalized $\pi$; flexible $q$.	Weight degeneracy if $q$ poorly covers $\pi$ or in high $d$; monitor ESS.
Metropolis mcmc	Static posterior where a symmetric random-walk proposal is adequate.	Simpler acceptance ratio; needs only pointwise unnormalized $\pi$.	A special case of MH; proposal scale still controls mixing and rejection.
Metropolis-Hastings mcmc	General-purpose static posterior with no special structure; only ratios $\pi(x')/\pi(x)$ available.	Asymptotically exact; needs only pointwise unnormalized $\pi$.	Tuning the proposal scale is fiddly; mixes poorly in high $d$ or multimodal $\pi$.
Gibbs sampling mcmc	Hierarchical/conjugate models where every full conditional has closed form.	Zero tuning; acceptance rate 1; natural for graphical models.	Slow when variables are correlated; needs tractable conditionals (else MH-within-Gibbs).
RJMCMC mcmc	Joint inference over model order and parameters (mixture components, change-points, NN architecture).	Asymptotically exact across model space; principled Bayesian model selection.	Move design and Jacobians are problem-specific; mixing across dimensions is hard.
Kalman filter closed-form	Linear-Gaussian state-space model. Filtering / smoothing of $x_{0:T}$ given $y_{1:T}$.	Exact; closed-form $O(d^3)$ per step.	Restricted to linear dynamics and Gaussian noise.
EKF closed-form	Mildly nonlinear state-space; tracking, robotics, navigation.	Drop-in replacement for KF; cheap.	First-order linearization error; can diverge under strong nonlinearity.
UKF closed-form	More nonlinear state-space than EKF can handle but still unimodal Gaussian.	Second-order accurate; no Jacobians required.	Still a Gaussian approximation; fails for multimodal / heavy-tailed posteriors.
Particle filter (SIR) smc	Nonlinear, non-Gaussian state-space; tracking, robotics, gene networks, anything multimodal over time.	Asymptotically exact for fully general SSMs; handles multimodal posteriors.	Sample impoverishment after resampling; scales poorly with state dimension.

Quick decision tree.

Linear dynamics and Gaussian noise? → Kalman filter. Mildly nonlinear? → EKF/UKF. Genuinely nonlinear/non-Gaussian/multimodal? → particle filter.
Static target (no time dimension), low-D, envelope available? → rejection.
Static target, want an expectation, have a reasonable proposal $q$? → importance sampling.
Static target, high-D, only ratios of $\pi$? → symmetric proposal: Metropolis; asymmetric proposal: Metropolis-Hastings. Conjugate conditionals available? → Gibbs. Gradients available? → HMC/Langevin.
Model order itself is unknown? → RJMCMC.

What next

Sampling connects most directly to change of measure, approximate inference, and dependence testing.

Foundations

Measure Theory & Random Variables

Importance sampling is the Radon–Nikodym derivative identity turned into an estimator.

Approximation

Free Energy & Variational Inference

A complementary route to Bayesian computation: optimize a tractable approximation instead of drawing samples.

Likelihood

Uninformative Priors, Fisher Information & MLE

See how likelihood curvature, Jeffreys priors, and MLE shape the targets these samplers explore.

Processes

Markov Chains

Separate the general theory of transition matrices, mixing, recurrence, and CTMCs from their MCMC use here.

Testing

Distance Correlation

See another place resampling appears: permutation tests for nonlinear dependence.