Sufficient Statistics
A sufficient statistic $T(X)$ for a parameter $\theta$ is a function of the sample that preserves all information about $\theta$ under the model. Once you know $T(X)$, the remaining variation in the data does not depend on $\theta$.
1. The compression hook
Twenty samples from $\mathcal{N}(\theta, 1)$ on the left, just the sample mean $\bar x$ on the right. Two very different-looking pictures. The likelihood curves they produce (the only thing inference about $\theta$ depends on) are identical up to a multiplicative constant that does not depend on $\theta$.
The blue and dashed-red curves don't drift apart as you move the slider, because the full sample and the sample mean carry the same information about $\theta$. You can throw away $n-1$ residual dimensions and the likelihood doesn't notice. The rest of the page is about why that works, and when it doesn't.
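The claim can be checked numerically. The sketch below is a minimal illustration, not the code behind the figure: the seed, the true value of $\theta$, and the grid are arbitrary choices, and the function names are made up for this example. It evaluates the log-likelihood from the full sample and from $\bar x \sim \mathcal{N}(\theta, 1/n)$ alone, then checks that their difference is flat in $\theta$.

```python
import math
import random

def loglik_full(xs, theta):
    """Log-likelihood of theta under N(theta, 1) from the whole sample."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - theta) ** 2 for x in xs)

def loglik_mean(xbar, n, theta):
    """Log-likelihood of theta from the sample mean alone: xbar ~ N(theta, 1/n)."""
    return 0.5 * math.log(n / (2 * math.pi)) - 0.5 * n * (xbar - theta) ** 2

rng = random.Random(0)            # fixed seed, illustrative only
true_theta, n = 1.5, 20           # assumed values, not taken from the figure
xs = [rng.gauss(true_theta, 1.0) for _ in range(n)]
xbar = sum(xs) / n

thetas = [-2.0 + 0.5 * k for k in range(13)]   # grid from -2 to 4
gaps = [loglik_full(xs, t) - loglik_mean(xbar, n, t) for t in thetas]
spread = max(gaps) - min(gaps)    # zero up to floating-point error
print(f"gap between the curves: {gaps[0]:.6f}, spread over theta: {spread:.2e}")
```

The gap itself is not zero, since it depends on the residuals $x_i - \bar x$, but it is the same number at every $\theta$. That is what "identical up to a constant" means on the log scale.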
2. The Fisher–Neyman factorization criterion
The operational test for sufficiency: $T$ is sufficient for $\theta$ if and only if the joint density factors as $$p(x_{1:n} \mid \theta) \;=\; g\bigl(T(x_{1:n}); \theta\bigr) \cdot h(x_{1:n})$$ for some nonnegative functions $g$ and $h$. The $\theta$-dependent factor $g$ sees the data only through $T$; the residual factor $h$ depends on the data but knows nothing about $\theta$.
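As a worked instance, take the $\mathcal{N}(\theta, 1)$ model from §1 with $T = \bar x$. Using the identity $\sum_i (x_i - \theta)^2 = \sum_i (x_i - \bar x)^2 + n(\bar x - \theta)^2$, the joint density splits as $$p(x_{1:n} \mid \theta) \;=\; (2\pi)^{-n/2} \exp\!\Bigl[-\tfrac12 \sum_i (x_i - \theta)^2\Bigr] \;=\; \underbrace{\exp\!\Bigl[-\tfrac n2 (\bar x - \theta)^2\Bigr]}_{g(\bar x;\, \theta)} \cdot \underbrace{(2\pi)^{-n/2} \exp\!\Bigl[-\tfrac12 \sum_i (x_i - \bar x)^2\Bigr]}_{h(x_{1:n})}.$$ All of the $\theta$-dependence sits in the first factor, which touches the data only through $\bar x$.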
For most named families, $T$ pops out cleanly. For Cauchy($\theta$, 1) (a location family with heavy tails) no $g$/$h$ split yields a $T$ of lower dimension than the sample. The order statistics are sufficient, but they have the same dimension as the data: no compression. That is not bad luck. §7 explains why.
3. The fiber picture
Sufficiency has a geometric content. Each value of $T$ defines a fiber in sample space: the set of samples that share that $T$-value. Sufficiency says: along each fiber, the conditional density is the same shape regardless of $\theta$. Moving $\theta$ slides the cloud of probability mass from fiber to fiber, but never reshapes any one fiber.
The discrete version is easy to see: Bernoulli samples $(1,0,1,1,0)$ and $(0,1,1,0,1)$ both have $T = \sum x_i = 3$. Their likelihoods are identical ($\theta^3(1-\theta)^2$ either way), so for inference about $\theta$ they're indistinguishable. Together with $\binom{5}{3} - 2 = 8$ other length-5 sequences with three 1s, they form one fiber: the equivalence class of all samples that produce the same likelihood curve. The statistic $T$ is the label on that class. The figure below shows the continuous analogue.
Move $\theta$: the blue cloud slides along the diagonal $x_1 = x_2$, raising the probability mass on some fibers and lowering it on others. Move the highlighted fiber to a different value of $T$: you pick a different slice through the cloud. But the conditional density of the residual $(x_1 - x_2)/\sqrt 2$ along any fiber is $\mathcal{N}(0, 1)$, no matter what $\theta$ is. The fibers are the orbits of the shift $(x_1, x_2) \mapsto (x_1 + c,\, x_2 - c)$, which leaves $T$ fixed and along which no trace of $\theta$ remains.
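The discrete fiber from the Bernoulli example is small enough to enumerate. The sketch below lists every length-5 bit sequence with $\sum x_i = 3$ and checks that they all give the same likelihood at a few arbitrarily chosen values of $\theta$. The helper name `likelihood` is made up for this illustration.

```python
from itertools import product

def likelihood(xs, theta):
    """Bernoulli likelihood theta^(#ones) * (1 - theta)^(#zeros)."""
    ones = sum(xs)
    return theta ** ones * (1 - theta) ** (len(xs) - ones)

n, t = 5, 3
# The fiber of T = sum over the value t: all samples sharing that T-value.
fiber = [xs for xs in product((0, 1), repeat=n) if sum(xs) == t]
print(len(fiber))  # 10

# Every sequence on the fiber yields the same likelihood at each theta tried.
for theta in (0.2, 0.5, 0.9):
    values = {likelihood(xs, theta) for xs in fiber}
    assert len(values) == 1
```

The two sequences named in the text, $(1,0,1,1,0)$ and $(0,1,1,0,1)$, are both in `fiber`, along with the eight others.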
4. The operational test: does the conditional move?
The defining property is that $p(X \mid T(X), \theta)$ doesn't depend on $\theta$. One way to check: generate many samples, condition on $T(X) = t$, and see whether the resulting empirical distribution shifts when we change $\theta$. For a sufficient statistic it doesn't. For an insufficient one, it does. The visible drift is the information leak.
For $T = \sum x_i$ the blue and red bars sit on top of each other no matter how far apart you push the two $\theta$ values: the conditional distribution of the residual is purely combinatorial (uniform over bit configurations with the given sum) and has no $\theta$ left. For $T = x_1$ or $T = \max(x_i)$, the conditional shifts as $\theta$ moves: those statistics are insufficient.
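For Bernoulli data the same test can be run exactly rather than by simulation. The sketch below is an illustration under assumed settings ($n = 4$ and two arbitrary $\theta$ values), with a hypothetical helper `conditional_given`. It computes $p(x \mid T(x) = t, \theta)$ by brute-force enumeration and compares the result across the two $\theta$ values.

```python
from fractions import Fraction
from itertools import product

def conditional_given(stat, t, theta, n=4):
    """Exact P(X = x | stat(X) = t) for n i.i.d. Bernoulli(theta) draws."""
    joint = {}
    for xs in product((0, 1), repeat=n):
        if stat(xs) == t:
            k = sum(xs)
            joint[xs] = theta ** k * (1 - theta) ** (n - k)
    total = sum(joint.values())
    return {xs: p / total for xs, p in joint.items()}

def first(xs):
    """The insufficient statistic T = x_1."""
    return xs[0]

theta_a, theta_b = Fraction(1, 5), Fraction(7, 10)   # arbitrary choices

# T = sum: the conditional is uniform on the fiber for both theta values.
sum_a = conditional_given(sum, 2, theta_a)
sum_b = conditional_given(sum, 2, theta_b)
print(sum_a == sum_b)    # True

# T = x_1 and T = max: the conditional still depends on theta.
first_a = conditional_given(first, 1, theta_a)
first_b = conditional_given(first, 1, theta_b)
max_a = conditional_given(max, 1, theta_a)
max_b = conditional_given(max, 1, theta_b)
print(first_a == first_b, max_a == max_b)    # False False
```

Exact rational arithmetic via `Fraction` means the equality for $T = \sum x_i$ is a true identity here, not a match within sampling noise.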
5. Compression ratio and minimal sufficiency
Sufficient statistics can be wasteful: the entire sample is trivially sufficient. The interesting question is minimal sufficiency: the coarsest summary that still preserves the likelihood. Below, the compression ratio $n \to \dim(T)$ for several named families.
6. Rao–Blackwell: variance collapse
Rao–Blackwell says: take any unbiased estimator $\hat\theta(X)$, replace it with its conditional expectation given a sufficient statistic, and the result is unbiased and has lower (or equal) variance. The variance decomposition $$\operatorname{Var}(\hat\theta) \;=\; \mathbb{E}[\operatorname{Var}(\hat\theta \mid T)] + \operatorname{Var}(\mathbb{E}[\hat\theta \mid T])$$ splits the variance into what conditioning on $T$ discards and what it keeps. The second piece is exactly the variance of the Rao–Blackwellized estimator $\mathbb{E}[\hat\theta \mid T]$; the first piece is gone for free.
Both estimators are unbiased for $p$; their histograms sit over the same true value. But the bad estimator (which discards $n - 1$ samples) has variance $p(1-p)$ regardless of $n$, while the Rao–Blackwellized one has variance $p(1-p)/n$. The factor of $n$ is the variance reduction you get by conditioning on a sufficient statistic.
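The two variances can be confirmed exactly for a small case. The sketch below assumes the bad estimator is $\hat p = x_1$ (the text only says it discards $n - 1$ samples) and that $T = \sum x_i$, so that $\mathbb{E}[x_1 \mid T] = T/n$ by symmetry. The values $p = 3/10$ and $n = 6$ are arbitrary, and the function names are made up for this example.

```python
from fractions import Fraction
from itertools import product

def moments(estimator, p, n):
    """Exact mean and variance of estimator(xs) over n i.i.d. Bernoulli(p) draws."""
    mean = second = Fraction(0)
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        prob = p ** k * (1 - p) ** (n - k)
        value = estimator(xs)
        mean += prob * value
        second += prob * value ** 2
    return mean, second - mean ** 2

def crude(xs):
    """Assumed 'bad' estimator: keep x_1 and discard the other n - 1 draws."""
    return Fraction(xs[0])

def rao_blackwellized(xs):
    """E[x_1 | sum(xs)] = sum(xs) / n, by symmetry of the fiber."""
    return Fraction(sum(xs), len(xs))

p, n = Fraction(3, 10), 6
crude_mean, crude_var = moments(crude, p, n)
rb_mean, rb_var = moments(rao_blackwellized, p, n)
print(crude_mean, crude_var)   # 3/10 21/100
print(rb_mean, rb_var)         # 3/10 7/200
```

Here $21/100 = p(1-p)$ and $7/200 = p(1-p)/6$, matching the two variances quoted above.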
7. Exponential families: where sufficiency comes from
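Every family in §5 with finite-dimensional $T$ is an exponential family: it can be written as $$p(x \mid \theta) \;=\; h(x)\,\exp\!\Bigl[\eta(\theta)^\top T(x) - A(\theta)\Bigr].$$ The factorization theorem is built in: $T(x)$ is staring at you in the exponent, and for $n$ i.i.d. draws the exponents add, so $\sum_i T(x_i)$ is sufficient for the whole sample. Pitman–Koopman–Darmois is the converse: under regularity (in particular, a support that does not depend on $\theta$), this is the only way a sufficient statistic can keep a fixed finite dimension as $n$ grows.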
The boldfaced part of each formula is the inner product $\eta(\theta)^\top T(x)$, the only place where the parameter and the data ever meet. That algebraic narrowness is what lets a finite-dimensional summary absorb all the parameter dependence. Cauchy doesn't fit this template at all.
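To see the template on the simplest case, take a single Bernoulli($\theta$) draw: $$p(x \mid \theta) \;=\; \theta^x (1-\theta)^{1-x} \;=\; \exp\!\Bigl[x \log\frac{\theta}{1-\theta} + \log(1-\theta)\Bigr].$$ Matching terms gives $h(x) = 1$, $\eta(\theta) = \log\frac{\theta}{1-\theta}$, $T(x) = x$, and $A(\theta) = -\log(1-\theta)$. The parameter and the data meet only in the product $\eta(\theta)\,x$, which is why $\sum_i x_i$ is the sufficient statistic used for Bernoulli samples in §3 and §4.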
8. The flow diagram: one picture, three currencies
The generative model is always a Markov chain $\theta \to X \to T(X)$: $\theta$ generates the data, and $T$ is a deterministic function of the data. Put any prior on $\theta$ so that mutual information is defined; the data-processing inequality then says $I(\theta;T) \le I(\theta;X)$ no matter what $T$ you pick. A summary can never carry more information about the parameter than the raw sample.
Sufficiency is exactly the case where that inequality is tight. Equivalently: the reverse chain $\theta \to T \to X$ also holds, because $X \perp \theta \mid T$. Both chains running at once force $I(\theta;T) = I(\theta;X)$. Mutual information is the strongest of the three currencies: Fisher information is its local ($\Delta\theta \to 0$) limit, and the likelihood and variance statements both follow from the same conditional independence.
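Tightness can be checked on a toy case. The sketch below is an illustration under stated assumptions: Bernoulli data with $n = 5$, a hypothetical two-point prior on $\theta$, and a made-up helper `mutual_information`. It computes the mutual information in nats by enumeration for three choices of $T$.

```python
import math
from itertools import product

def mutual_information(prior, stat, n):
    """I(theta; stat(X)) in nats, for n i.i.d. Bernoulli(theta) draws and a discrete prior."""
    joint = {}  # (theta, stat value) -> joint probability
    for theta, weight in prior.items():
        for xs in product((0, 1), repeat=n):
            k = sum(xs)
            key = (theta, stat(xs))
            joint[key] = joint.get(key, 0.0) + weight * theta ** k * (1 - theta) ** (n - k)
    marginal = {}  # stat value -> marginal probability
    for (_, s), pr in joint.items():
        marginal[s] = marginal.get(s, 0.0) + pr
    return sum(pr * math.log(pr / (prior[theta] * marginal[s]))
               for (theta, s), pr in joint.items() if pr > 0)

prior = {0.2: 0.5, 0.7: 0.5}   # hypothetical two-point prior on theta
n = 5
i_full = mutual_information(prior, lambda xs: xs, n)      # T = whole sample
i_sum = mutual_information(prior, sum, n)                 # T = sum, sufficient
i_first = mutual_information(prior, lambda xs: xs[0], n)  # T = x_1, insufficient
print(f"I(theta;X)={i_full:.6f}  I(theta;sum)={i_sum:.6f}  I(theta;x1)={i_first:.6f}")
```

The first two numbers agree up to floating-point error, while the third is strictly smaller: the sum loses nothing, and $x_1$ alone leaks information.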
Factorization, Rao–Blackwell, and the data-processing inequality are three readings of that one tightness condition. The sparklines on each edge of the diagram tell you which currency you're spending.
Likelihood: the two paths $X \to L$ and $X \to T \to L$ produce the same curve, up to a factor free of $\theta$. Variance: the path through $T$ keeps $\operatorname{Var}(\mathbb{E}[\hat\theta \mid T])$ and drops the within-fiber part $\mathbb{E}[\operatorname{Var}(\hat\theta \mid T)]$, which carries no information about $\theta$; that is Rao–Blackwell. Fisher information: $I_T(\theta) \le I_X(\theta)$, with equality (under the usual regularity conditions) iff $T$ is sufficient. This is the Fisher-information form of the data-processing inequality, and sufficiency is the case where it's tight. One commuting square; whether it commutes is the theorem.
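The Fisher-information reading can also be verified exactly for Bernoulli data. The sketch below uses a hypothetical helper `fisher_information` with arbitrary values $p = 3/10$ and $n = 4$. It builds the distribution of $T(X)$ and its derivative in $p$ by enumeration, then applies $I(p) = \sum_t \bigl(\partial_p P(T = t)\bigr)^2 / P(T = t)$.

```python
from fractions import Fraction
from itertools import product

def fisher_information(stat, p, n):
    """Exact Fisher information about p carried by stat(X), X = n Bernoulli(p) draws."""
    prob, dprob = {}, {}   # pmf of stat(X) and its derivative with respect to p
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        mass = p ** k * (1 - p) ** (n - k)
        # d/dp of p^k (1-p)^(n-k) equals mass * (k/p - (n-k)/(1-p))
        slope = mass * (Fraction(k) / p - Fraction(n - k) / (1 - p))
        s = stat(xs)
        prob[s] = prob.get(s, 0) + mass
        dprob[s] = dprob.get(s, 0) + slope
    return sum(dprob[s] ** 2 / prob[s] for s in prob)

p, n = Fraction(3, 10), 4
info_full = fisher_information(lambda xs: xs, p, n)      # whole sample
info_sum = fisher_information(sum, p, n)                 # sufficient
info_first = fisher_information(lambda xs: xs[0], p, n)  # insufficient
print(info_full, info_sum, info_first)   # 400/21 400/21 100/21
```

The full sample and the sum both carry $n / \bigl(p(1-p)\bigr)$, while $x_1$ alone carries $1 / \bigl(p(1-p)\bigr)$: equality for the sufficient statistic, a strict gap otherwise.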