Sufficient Statistics

When a statistic preserves all information about the parameter under a model.

A sufficient statistic $T(X)$ for a parameter $\theta$ is a function of the sample that preserves all information about $\theta$ under the model. Once you know $T(X)$, the remaining variation in the data does not depend on $\theta$.

$T(X)$ is sufficient for $\theta$ if the conditional distribution $p(X \mid T(X), \theta)$ doesn't depend on $\theta$. The parameter has finished doing its job by the time you reach $T$.
There is also a Bayesian version: $T$ is sufficient iff $p(\theta \mid X) = p(\theta \mid T(X))$ for every prior. Posterior inference from the raw sample equals posterior inference from the summary. Conditional independence ($X \perp \theta \mid T$), Fisher–Neyman factorization, and posterior equivalence all pick out the same maps $T$.

1. The compression hook

Twenty samples from $\mathcal{N}(\theta, 1)$ on the left, just the sample mean $\bar x$ on the right. The two pictures look very different, yet the likelihood curves they produce (the only thing inference about $\theta$ depends on) are identical up to a multiplicative constant that does not involve $\theta$.

Figure 1 · The same likelihood from very different data
Legend: raw samples · sample mean $\bar x$ (sufficient summary) · likelihood from raw data · likelihood from $\bar x$ alone

The blue and dashed-red curves don't drift apart as you move the slider, because the full sample and the sample mean carry the same information about $\theta$. You can throw away $n-1$ residual dimensions and the likelihood doesn't notice. The rest of the page is about why that works, and when it doesn't.
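
The claim can be checked numerically. The sketch below is a minimal illustration, not the code behind Figure 1: the helper names `loglik_raw` and `loglik_mean`, the seed, and the true value $\theta = 1.5$ are all assumptions made for the example. It evaluates the log-likelihood from the full sample and from $\bar x$ alone on a few values of $\theta$ and measures how much their difference varies.

```python
import math
import random

def loglik_raw(xs, theta):
    """Log-likelihood of theta under N(theta, 1), computed from the full sample."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - theta) ** 2 for x in xs)

def loglik_mean(xbar, n, theta):
    """Theta-dependent part of the log-likelihood, computed from the sample mean only."""
    return -0.5 * n * (xbar - theta) ** 2

rng = random.Random(0)                          # fixed seed, chosen arbitrarily
xs = [rng.gauss(1.5, 1.0) for _ in range(20)]   # true theta = 1.5 is hypothetical
n = len(xs)
xbar = sum(xs) / n

thetas = [-1.0, 0.0, 1.0, 2.0, 3.0]
# If the two curves differ only by a theta-free constant, these gaps are all equal.
gaps = [loglik_raw(xs, t) - loglik_mean(xbar, n, t) for t in thetas]
spread = max(gaps) - min(gaps)
print(spread)   # zero up to floating-point rounding
```

The gap between the two curves is the same at every $\theta$, so any inference that depends only on the shape of the likelihood cannot tell the full sample from its mean.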

2. The Fisher–Neyman factorization criterion

The operational test for sufficiency: $T$ is sufficient for $\theta$ if and only if the joint density factors as $$p(x_{1:n} \mid \theta) \;=\; g\bigl(T(x); \theta\bigr) \cdot h(x).$$ The $\theta$-dependent factor $g$ talks to the data only through $T(x)$; the residual $h(x)$ talks about the data but knows nothing about $\theta$.
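
As a worked instance, assume $n$ i.i.d. draws from $\mathcal{N}(\theta, 1)$ as in §1. The identity $\sum_i (x_i - \theta)^2 = \sum_i (x_i - \bar x)^2 + n(\bar x - \theta)^2$ gives the split directly:

$$p(x_{1:n} \mid \theta) \;=\; \underbrace{\exp\!\Bigl[-\tfrac{n}{2}(\bar x - \theta)^2\Bigr]}_{g(\bar x;\,\theta)} \cdot \underbrace{(2\pi)^{-n/2}\exp\!\Bigl[-\tfrac{1}{2}\sum_{i}(x_i - \bar x)^2\Bigr]}_{h(x)}.$$

Here $T(x) = \bar x$: the first factor sees the data only through the mean, and the second contains no $\theta$.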

Figure 2 · Color-coded factorization for several families
Legend: $g(T(x); \theta)$, depends on $\theta$ · $h(x)$, $\theta$-free residual · $\theta$-dependence that won't factor through any $T$

For most named families, $T$ pops out cleanly. For Cauchy$(\theta, 1)$ (a location family with heavy tails) no $g$/$h$ split goes through a statistic of fixed dimension. The order statistic is sufficient, but it has the same dimension as the data: no compression. That is not bad luck. §7 explains why.

3. The fiber picture

Sufficiency has geometric content. Each value of $T$ defines a fiber in sample space: the set of samples that share that $T$-value. Sufficiency says: along each fiber, the conditional density has the same shape regardless of $\theta$. Moving $\theta$ slides the cloud of probability mass from fiber to fiber, but never reshapes any one fiber.

The discrete version is easy to see: Bernoulli samples $(1,0,1,1,0)$ and $(0,1,1,0,1)$ both have $T = \sum x_i = 3$. Their likelihoods are identical ($\theta^3(1-\theta)^2$ either way), so for inference about $\theta$ they're indistinguishable. Together with $\binom{5}{3} - 2 = 8$ other length-5 sequences with three 1s, they form one fiber: the equivalence class of all samples that produce the same likelihood curve. The statistic $T$ is the label on that class. The figure below shows the continuous analogue.
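
The fiber in this example is small enough to enumerate. The following sketch is illustrative only: the function name `likelihood` and the value $\theta = 3/10$ are assumptions. It lists every length-5 binary sequence with three 1s and confirms, in exact rational arithmetic, that they all share one likelihood value.

```python
from fractions import Fraction
from itertools import product
from math import comb

n, t = 5, 3

# The fiber over T = 3: every binary sequence of length 5 with three 1s.
fiber = [x for x in product((0, 1), repeat=n) if sum(x) == t]

def likelihood(x, theta):
    """Exact Bernoulli likelihood of theta for the sequence x."""
    value = Fraction(1)
    for xi in x:
        value *= theta if xi == 1 else 1 - theta
    return value

theta = Fraction(3, 10)   # arbitrary illustrative value
values = {likelihood(x, theta) for x in fiber}

print(len(fiber), comb(n, t))   # fiber size next to the binomial coefficient
print(values)                   # one distinct value: theta**3 * (1 - theta)**2
```

Changing `theta` changes the shared value but never splits the set: the ten sequences stay indistinguishable at every parameter value.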

Figure 3 · Two samples, level sets of T = x₁ + x₂, and the θ-free conditional
Legend: samples (x₁, x₂) · fibers: x₁ + x₂ = const · highlighted fiber · conditional density of (x₁−x₂)/√2 (θ-free)

Move $\theta$: the blue cloud slides along the diagonal $x_1 = x_2$, raising the probability mass on some fibers and lowering it on others. Move the highlighted fiber to a different value of $T$: you pick a different slice through the cloud. But the conditional density of the residual $(x_1 - x_2)/\sqrt 2$ along any fiber is $\mathcal{N}(0, 1)$, no matter what $\theta$ is. The fibers are the orbits of an action that absorbs every trace of $\theta$.
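
The $\theta$-free conditional can be derived in two lines, assuming $x_1, x_2$ are i.i.d. $\mathcal{N}(\theta, 1)$. Rotate to $u = (x_1 + x_2)/\sqrt 2$ and $v = (x_1 - x_2)/\sqrt 2$ (symbols introduced here for the derivation). Since $(x_1 - \theta)^2 + (x_2 - \theta)^2 = (u - \sqrt 2\,\theta)^2 + v^2$ and the rotation has unit Jacobian,

$$p(u, v \mid \theta) \;=\; \frac{1}{\sqrt{2\pi}}\exp\!\Bigl[-\tfrac{1}{2}\bigl(u - \sqrt 2\,\theta\bigr)^2\Bigr] \cdot \frac{1}{\sqrt{2\pi}}\exp\!\Bigl[-\tfrac{1}{2}v^2\Bigr].$$

The coordinate $u$ is a rescaling of $T = x_1 + x_2$ and carries all the $\theta$-dependence. The residual $v$ is independent of $u$ and is $\mathcal{N}(0, 1)$ on every fiber.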

4. The operational test: does the conditional move?

The defining property is that $p(X \mid T(X), \theta)$ doesn't depend on $\theta$. One way to check: generate many samples, condition on $T(X) = t$, and see whether the resulting empirical distribution shifts when we change $\theta$. For a sufficient statistic it doesn't. For an insufficient one, it does. The visible drift is the information leak.

Figure 4 · Conditional distribution of the residual under two candidates
Legend: conditional p(x₁ | T(x) = t, θ) for θ low · same conditional for θ high · aggregated samples after conditioning

For $T = \sum x_i$ the blue and red bars sit on top of each other no matter how far apart you push the two $\theta$ values: the conditional distribution of the residual is purely combinatorial (uniform over bit configurations with the given sum) and has no $\theta$ left. For $T = x_1$ or $T = \max(x_i)$, the conditional shifts as $\theta$ moves: those statistics are insufficient.
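
For small $n$ the same test can be run exactly instead of by simulation. This is a sketch under assumptions: Bernoulli samples of length 4, two hypothetical parameter values $1/5$ and $4/5$, and helper names `joint` and `conditional` invented for the example. It computes $p(x \mid T(x) = t, \theta)$ by enumeration and compares the result across the two values of $\theta$.

```python
from fractions import Fraction
from itertools import product

def joint(x, theta):
    """Exact probability of the binary sequence x under i.i.d. Bernoulli(theta)."""
    p = Fraction(1)
    for xi in x:
        p *= theta if xi else 1 - theta
    return p

def conditional(n, stat, t, theta):
    """Exact p(x | stat(x) = t, theta) over all binary samples of length n."""
    fiber = [x for x in product((0, 1), repeat=n) if stat(x) == t]
    total = sum(joint(x, theta) for x in fiber)
    return {x: joint(x, theta) / total for x in fiber}

n = 4
lo, hi = Fraction(1, 5), Fraction(4, 5)   # two hypothetical parameter values

def first(x):
    """Candidate statistic that keeps only the first observation."""
    return x[0]

# Candidate 1: T = sum of the sample, conditioned on T = 2.
cond_sum_lo = conditional(n, sum, 2, lo)
cond_sum_hi = conditional(n, sum, 2, hi)

# Candidate 2: T = first observation, conditioned on T = 1.
cond_first_lo = conditional(n, first, 1, lo)
cond_first_hi = conditional(n, first, 1, hi)

print(cond_sum_lo == cond_sum_hi)       # True: the conditional ignores theta
print(cond_first_lo == cond_first_hi)   # False: the conditional moves with theta
```

Passing `max` as the statistic gives the same verdict as `first`: the conditional changes with $\theta$.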

5. Compression ratio and minimal sufficiency

Sufficient statistics can be wasteful: the entire sample is trivially sufficient for $\theta$. The interesting question is minimal sufficiency: the coarsest summary that still preserves the likelihood. The figure below shows the compression $n \to \dim(T)$ for several named families.

Figure 5 · How much each family compresses n samples
Legend: raw sample dimension $n$ · dimension of minimal sufficient $T$ · no finite reduction

Pitman–Koopman–Darmois. Under regularity conditions (the support doesn't depend on $\theta$, and the family is suitably smooth), a sufficient statistic whose dimension stays bounded as $n$ grows exists only for exponential families. So the Cauchy row above is a structural impossibility, not an oversight. The Uniform$(0,\theta)$ row is the opposite kind of edge case: it does compress to $\max(x_i)$, but the support depends on $\theta$, which violates the theorem's regularity conditions. PKD applies cleanly only to the exponential-family rows. §7 makes the converse direction visible.
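
The Uniform$(0,\theta)$ case can be written out with the factorization criterion of §2, assuming $n$ i.i.d. draws and writing $\mathbf{1}[\cdot]$ for an indicator (notation introduced here):

$$p(x_{1:n} \mid \theta) \;=\; \prod_{i=1}^{n} \theta^{-1}\,\mathbf{1}[0 \le x_i \le \theta] \;=\; \underbrace{\theta^{-n}\,\mathbf{1}\bigl[\max_i x_i \le \theta\bigr]}_{g(\max_i x_i;\,\theta)} \cdot \underbrace{\mathbf{1}\bigl[\min_i x_i \ge 0\bigr]}_{h(x)}.$$

The $\theta$-dependence enters through the support constraint rather than through a smooth exponent, which is why this family compresses without being an exponential family.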

6. Rao–Blackwell: variance collapse

Rao–Blackwell says: take any unbiased estimator $\hat\theta(X)$, replace it with its conditional expectation given a sufficient statistic, and the result is unbiased and has lower (or equal) variance. The variance decomposition $$\operatorname{Var}(\hat\theta) \;=\; \mathbb{E}[\operatorname{Var}(\hat\theta \mid T)] + \operatorname{Var}(\mathbb{E}[\hat\theta \mid T])$$ splits the variance into what conditioning on $T$ discards (the first term) and what it keeps (the second term). The first term is removed at no cost in bias.

Figure 6 · Bad estimator → Rao–Blackwellized estimator
Legend: samples of $\hat p = x_1$ · samples of $\mathbb{E}[x_1 \mid \sum x_i] = \bar x$ · true $p$

Both estimators are unbiased for $p$; their histograms sit over the same true value. But the bad estimator (which discards $n - 1$ samples) has variance $p(1-p)$ regardless of $n$, while the Rao–Blackwellized one has variance $p(1-p)/n$. The factor $n$ is the variance you got back by conditioning on a sufficient statistic.
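
These variances can be reproduced exactly by enumeration rather than by sampling. The sketch below is one way to do it, not the code behind Figure 6: the helper names `joint`, `moments`, and `rao_blackwell`, the sample size $n = 6$, and the value $p = 1/3$ are all assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

def joint(x, p):
    """Exact probability of the binary sequence x under i.i.d. Bernoulli(p)."""
    prob = Fraction(1)
    for xi in x:
        prob *= p if xi else 1 - p
    return prob

def moments(estimator, n, p):
    """Exact (mean, variance) of estimator(X) for X ~ Bernoulli(p)^n."""
    samples = list(product((0, 1), repeat=n))
    mean = sum(joint(x, p) * estimator(x) for x in samples)
    second = sum(joint(x, p) * estimator(x) ** 2 for x in samples)
    return mean, second - mean ** 2

def rao_blackwell(estimator, n, p):
    """Return x -> E[estimator(X) | sum(X) = sum(x)], computed fiber by fiber."""
    table = {}
    for t in range(n + 1):
        fiber = [x for x in product((0, 1), repeat=n) if sum(x) == t]
        weights = [joint(x, p) for x in fiber]
        numerator = sum(w * estimator(x) for w, x in zip(weights, fiber))
        table[t] = numerator / sum(weights)
    return lambda x: table[sum(x)]

n, p = 6, Fraction(1, 3)              # hypothetical sample size and parameter

def crude(x):
    """The bad estimator: keep the first observation, discard the rest."""
    return Fraction(x[0])

improved = rao_blackwell(crude, n, p)

print(moments(crude, n, p))      # mean p, variance p(1-p)
print(moments(improved, n, p))   # mean p, variance p(1-p)/n
```

The conditional expectation is computed with $p$ as an input, yet the resulting table is $t/n$ for every $p$ in $(0, 1)$. That independence from the parameter is sufficiency at work, and it is what makes the improved estimator computable from data alone.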

7. Exponential families: where sufficiency comes from

Every fixed-support family in §5 with finite-dimensional $T$ is an exponential family: it can be written as $$p(x \mid \theta) \;=\; h(x)\,\exp\!\Bigl[\eta(\theta)^\top T(x) - A(\theta)\Bigr].$$ The factorization theorem is built in: $T(x)$ is staring at you in the exponent. Pitman–Koopman–Darmois is the converse: under regularity, this is the only way finite-dimensional sufficiency happens. (Uniform$(0,\theta)$ is the exception flagged in §5: its support moves with $\theta$, so it sits outside both the template and the theorem.)

The boldfaced part of each formula is the inner product $\eta(\theta)^\top T(x)$, the only place where the parameter and the data ever meet. That algebraic narrowness is what lets a finite-dimensional summary absorb all the parameter dependence. Cauchy doesn't fit this template at all.
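
As a concrete instance of the template, take a single Bernoulli$(\theta)$ observation with $0 < \theta < 1$ (a standard example, written out here for illustration):

$$p(x \mid \theta) \;=\; \theta^{x}(1-\theta)^{1-x} \;=\; \exp\!\Bigl[x \log\frac{\theta}{1-\theta} + \log(1-\theta)\Bigr],$$

so $h(x) = 1$, $\eta(\theta) = \log\frac{\theta}{1-\theta}$, $T(x) = x$, and $A(\theta) = -\log(1-\theta)$. For $n$ i.i.d. observations the exponents add, giving

$$p(x_{1:n} \mid \theta) \;=\; \exp\!\Bigl[\eta(\theta)\sum_{i} x_i - n\,A(\theta)\Bigr],$$

and the sufficient statistic $\sum_i x_i$ keeps dimension one however large $n$ gets.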

8. The flow diagram: one picture, three currencies

The generative model is always a Markov chain $\theta \to X \to T(X)$: $\theta$ generates the data, and $T$ is a deterministic function of the data. The data-processing inequality then says $I(\theta;T) \le I(\theta;X)$ no matter what $T$ you pick. A summary can never carry more information about the parameter than the raw sample.

Sufficiency is exactly the case where that inequality is tight for every prior on $\theta$. Equivalently: the reverse chain $\theta \to T \to X$ also holds, because $X \perp \theta \mid T$. Both chains running at once force $I(\theta;T) = I(\theta;X)$. Mutual information is the strongest of the three currencies: Fisher information is its local ($\Delta\theta \to 0$) limit, and the likelihood and variance statements both follow from it.
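
Mutual information needs a prior on $\theta$, so a numerical check has to pick one. The sketch below assumes a hypothetical two-point prior on $\{0.2, 0.7\}$ and Bernoulli samples of length 5; the helper names `mutual_information` and `push_forward` are likewise invented for the example. It computes $I(\theta; X)$, $I(\theta; \sum x_i)$, and $I(\theta; x_1)$ in nats.

```python
import math
from itertools import product

def mutual_information(prior, channel):
    """I(theta; Y) in nats for a finite prior {theta: weight} and channel(theta) -> {y: prob}."""
    marginal = {}
    for theta, w in prior.items():
        for y, q in channel(theta).items():
            marginal[y] = marginal.get(y, 0.0) + w * q
    info = 0.0
    for theta, w in prior.items():
        for y, q in channel(theta).items():
            if q > 0:
                info += w * q * math.log(q / marginal[y])
    return info

def push_forward(n, stat):
    """Channel theta -> distribution of stat(X) for X ~ Bernoulli(theta)^n."""
    def channel(theta):
        dist = {}
        for x in product((0, 1), repeat=n):
            prob = math.prod(theta if xi else 1 - theta for xi in x)
            y = stat(x)
            dist[y] = dist.get(y, 0.0) + prob
        return dist
    return channel

n = 5
prior = {0.2: 0.5, 0.7: 0.5}   # hypothetical two-point prior on theta

i_full = mutual_information(prior, push_forward(n, lambda x: x))       # I(theta; X)
i_sum = mutual_information(prior, push_forward(n, sum))                # I(theta; sum)
i_first = mutual_information(prior, push_forward(n, lambda x: x[0]))   # I(theta; x_1)

print(i_full, i_sum, i_first)
```

Under these assumptions the first two numbers agree to rounding error and the third is strictly smaller: the sum saturates the data-processing inequality, while the first observation alone leaks information.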

Factorization, Rao–Blackwell, and the data-processing inequality are three readings of that one tightness condition. The sparklines on each edge of the diagram tell you which currency you're spending.

Figure 7 · X → T(X) → inference, in three flavors
Legend: full data path · via $T$ path · $\theta$-free residual · edge content (sparkline)

Likelihood: the two paths $X \to L$ and $X \to T \to L$ produce the same curve. Variance: the path through $T$ keeps $\operatorname{Var}(\mathbb{E}[\hat\theta \mid T])$ and drops $\mathbb{E}[\operatorname{Var}(\hat\theta \mid T)]$, the within-fiber spread that carries no information about $\theta$; that is Rao–Blackwell. Fisher information: $I_T(\theta) \le I_X(\theta)$, with equality iff $T$ is sufficient (under the usual regularity conditions). This is the data-processing inequality in its local form, and sufficiency is the case where it's tight. One commuting square; whether it commutes is the theorem.
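
For the Bernoulli model of §4 and §6 the Fisher-information statement can be made explicit (a worked example under the assumption of $n$ i.i.d. Bernoulli$(\theta)$ draws with $0 < \theta < 1$). One observation has score $\frac{x}{\theta} - \frac{1-x}{1-\theta}$, whose variance is $\frac{1}{\theta(1-\theta)}$, so

$$I_X(\theta) = \frac{n}{\theta(1-\theta)}, \qquad I_{\sum x_i}(\theta) = \frac{n}{\theta(1-\theta)}, \qquad I_{x_1}(\theta) = \frac{1}{\theta(1-\theta)}.$$

The sufficient statistic $\sum_i x_i$ attains the bound. The insufficient statistic $x_1$ falls short by a factor of $n$, the same factor that separates the two variances in §6.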

What next