Hypothesis Testing

Overlapping null and alternative distributions, significance, power, and the cost of a decision threshold.

Classical hypothesis testing compares a statistic against a threshold. Under the null hypothesis $H_0$, crossing the threshold is a false positive with probability $\alpha$. Under the alternative $H_1$, failing to cross is a false negative with probability $\beta$, and the power is $1-\beta$. Below the surface, every panel on this page is a different view of the same pair of distributions $(P_0, P_1)$: the laws of the data under $H_0$ and $H_1$.

1. The α/β overlap

Figure 1 · The α/β overlap diagram. Legend: null $H_0$, alternative $H_1$, type-I error $\alpha$, type-II error $\beta$. Controls: effect size 1.4; decision threshold 1.64.

Moving the threshold right reduces $\alpha$ but increases $\beta$. Increasing the effect size separates the curves and improves power without changing the threshold. This is the core visual tradeoff behind p-values, power analysis, and sample-size planning.
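As a quick numerical check of this tradeoff, the sketch below computes $\alpha$, $\beta$, and power at the values shown for Figure 1 (effect size 1.4, threshold 1.64). It assumes, as Section 2 does, two unit-variance normals with means 0 and the effect size, and a right-tailed test. The function name `error_rates` is illustrative and not taken from the page.

```python
from statistics import NormalDist

def error_rates(effect_size, threshold):
    """Return (alpha, beta, power) for a right-tailed test.

    Assumption: the statistic is N(0, 1) under H0 and
    N(effect_size, 1) under H1; reject H0 when it exceeds `threshold`.
    """
    null = NormalDist(mu=0.0, sigma=1.0)
    alt = NormalDist(mu=effect_size, sigma=1.0)
    alpha = 1.0 - null.cdf(threshold)   # P0(reject): false positive
    beta = alt.cdf(threshold)           # P1(fail to reject): false negative
    return alpha, beta, 1.0 - beta

alpha, beta, power = error_rates(effect_size=1.4, threshold=1.64)
print(f"alpha={alpha:.4f}  beta={beta:.4f}  power={power:.4f}")
```

Under these assumptions $\alpha$ comes out near 0.05 and power near 0.41. Calling `error_rates(1.4, 2.0)` shows the tradeoff directly: $\alpha$ falls while $\beta$ rises.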

2. The likelihood-ratio view

Underneath the overlap diagram is the likelihood ratio $\Lambda(x) = p_1(x)/p_0(x)$. The Neyman–Pearson lemma says that for testing simple $H_0$ vs. simple $H_1$, the most powerful test at level $\alpha$ rejects when $\Lambda(x) > k$, with $k$ set by the size constraint $P_0(\Lambda(X) > k) = \alpha$. For two unit-variance normals with means $0$ and $\mu$, $\log\Lambda(x) = \mu x - \mu^2/2$, a straight line in $x$. For $\mu > 0$ the line is increasing, so moving the threshold $\log k$ on the log-ratio plot is equivalent to moving a threshold on $x$: the rejection region $\{x : \Lambda(x) > k\}$ is recovered by inversion as $\{x : x > \log k/\mu + \mu/2\}$. This view makes the measure-theoretic story explicit: the rule depends on the data only through $\Lambda$, which is the Radon–Nikodym derivative $dP_1/dP_0$.
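The inversion between a cut on $\log\Lambda$ and a cut on $x$ can be sketched in a few lines. This assumes the unit-variance normal pair described above with $\mu > 0$; the helper names `log_lr` and `x_threshold` are hypothetical.

```python
def log_lr(x, mu):
    """log Lambda(x) = mu*x - mu^2/2 for N(0,1) vs N(mu,1)."""
    return mu * x - 0.5 * mu * mu

def x_threshold(log_k, mu):
    """Cut on x equivalent to the cut log Lambda(x) > log_k (needs mu > 0)."""
    if mu <= 0:
        raise ValueError("inversion as written assumes mu > 0")
    return log_k / mu + 0.5 * mu

mu = 1.4
t = x_threshold(log_k=0.18, mu=mu)   # values shown for Figure 2
print(f"reject when x > {t:.4f}")
print(f"log-LR at that x: {log_lr(t, mu):.4f}")   # recovers log_k
```

Because the map is linear, the two thresholds carry the same information. The log-ratio scale is the one that generalizes beyond the Gaussian case.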

Figure 2 · Likelihood ratio Λ(x). Legend: $\log\Lambda(x)$, threshold $\log k$, $\alpha$ (over $P_0$), $\beta$ (over $P_1$). Controls: effect size μ 1.4; threshold log k 0.18.

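Two asymptotic facts about $\log\Lambda$ are worth naming, because they connect this view to other pages without needing their own figures. Under $H_0$ the mean of $\log\Lambda$ is $-D(P_0\|P_1)$ and under $H_1$ it is $+D(P_1\|P_0)$, so accumulating $\log\Lambda$ across $n$ independent samples gives Stein's lemma: holding $\alpha$ fixed, the optimal type-II error decays as $\beta_n \approx e^{-n\,D(P_0\|P_1)}$ to first order in the exponent, so KL is the rate at which power approaches 1 (see KL Divergence). Wilks' theorem gives the null distribution of the test statistic in the composite case: for nested models under the usual regularity conditions, with the generalized ratio $\Lambda_n = \sup_{\theta\in\Theta} L_n(\theta) / \sup_{\theta\in\Theta_0} L_n(\theta)$ written alternative-over-null as above, $2\log\Lambda_n \xrightarrow{d} \chi^2_d$ under $H_0$, where $d$ is the number of parameters the null fixes. (Texts that put the null in the numerator write the same statistic as $-2\log\Lambda_n$.) This is why $\chi^2$ tables turn up everywhere in practice (see Modes of Convergence).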
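Stein's lemma can be checked numerically in the Gaussian case, where everything is available in closed form. The sketch assumes $n$ i.i.d. draws from $N(0,1)$ under $H_0$ and $N(\mu,1)$ under $H_1$, for which $D(P_0\|P_1) = \mu^2/2$ and the most powerful level-$\alpha$ test rejects when $\sqrt{n}\,\bar{x}$ exceeds the $1-\alpha$ normal quantile. The function names are illustrative.

```python
import math
from statistics import NormalDist

def normal_cdf(z):
    """Standard normal CDF via erfc, which stays accurate deep in the left tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def type2_error(n, mu, alpha=0.05):
    """Exact beta_n of the most powerful level-alpha test from n samples.

    Assumption: N(0,1) vs N(mu,1) with mu > 0; the test rejects when
    sqrt(n) * sample_mean > z_{1-alpha}.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    return normal_cdf(z - math.sqrt(n) * mu)

mu = 1.4
kl = 0.5 * mu * mu                      # D(P0 || P1) for this pair
for n in (25, 100, 400):
    rate = -math.log(type2_error(n, mu)) / n
    print(f"n={n:4d}  -log(beta_n)/n = {rate:.3f}   (KL = {kl:.3f})")
```

The empirical exponent climbs toward $\mu^2/2$ only slowly, with a gap shrinking on the order of $1/\sqrt{n}$. That is why the lemma is a statement about the exponent and not about $\beta_n$ itself.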

3. The ROC curve

Sweeping the threshold traces out the ROC curve: every operating point is a pair $(\alpha,\,1-\beta)$, the false-positive rate against the power. The curve is a measure-theoretic invariant of the pair $(P_0, P_1)$: any strictly increasing reparametrization of the test statistic gives the same curve. The diagonal is the no-information baseline, a test that ignores the data and rejects by a random coin flip. The area under the curve, $\mathrm{AUC} = P(\Lambda(X_1) > \Lambda(X_0))$ for independent draws $X_0 \sim P_0$ and $X_1 \sim P_1$, is a single number summarizing separability. For two unit-variance Gaussians with mean gap $\mu$, $\mathrm{AUC} = \Phi(\mu/\sqrt{2})$.
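A minimal sketch of the sweep, assuming the same unit-variance Gaussian pair and a right-tailed cut on $x$: it collects the operating points, integrates them by the trapezoid rule, and compares the result with the closed form $\Phi(\mu/\sqrt{2})$. The function names and the grid limits are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

def roc_points(mu, num=2001, lo=-8.0, hi=10.0):
    """Sweep a threshold t and return (alpha, power) pairs, alpha increasing.

    Assumption: N(0,1) under H0 and N(mu,1) under H1, reject when x > t.
    """
    null, alt = NormalDist(0.0, 1.0), NormalDist(mu, 1.0)
    step = (hi - lo) / (num - 1)
    # Sweep t from high to low so that alpha runs from ~0 up to ~1.
    ts = [hi - i * step for i in range(num)]
    return [(1.0 - null.cdf(t), 1.0 - alt.cdf(t)) for t in ts]

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (a0, p0), (a1, p1) in zip(points, points[1:]):
        area += (a1 - a0) * (p0 + p1) / 2.0
    return area

mu = 1.4
numeric = auc_trapezoid(roc_points(mu))
closed_form = NormalDist().cdf(mu / math.sqrt(2.0))
print(f"trapezoid AUC = {numeric:.4f}, Phi(mu/sqrt(2)) = {closed_form:.4f}")
```

With $\mu = 0$ the two distributions coincide, the curve collapses onto the diagonal, and the same routine returns an area of about 0.5.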

Figure 3 · ROC curve. Legend: ROC curve, chance diagonal, operating point. Controls: effect size μ 1.4; threshold 1.64.

4. Bayes risk and the optimal threshold

Hand the test a prior $\pi_1$ on $H_1$ (so $\pi_0 = 1 - \pi_1$) and a loss ratio $c_I/c_{II}$ (the cost of a false alarm relative to a missed detection), and the right thing to minimize is the Bayes risk $R(t) = \pi_0\,c_I\,\alpha(t) + \pi_1\,c_{II}\,\beta(t)$. For a right-tailed cut at $t$, $\alpha'(t) = -p_0(t)$ and $\beta'(t) = p_1(t)$, so setting $dR/dt = 0$ gives the optimum $\Lambda(t^\ast) = (\pi_0 c_I)/(\pi_1 c_{II})$: the optimal Bayes threshold is a likelihood-ratio cut at the prior-weighted loss ratio. Frequentist Neyman–Pearson and Bayesian decision theory pick out the same family of rules, with the threshold set either by $\alpha$ or by prior and loss.
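For the Gaussian pair, the optimality condition can be solved for the threshold in closed form and compared against the risk at any other cut. The sketch below assumes $N(0,1)$ vs. $N(\mu,1)$ with $\mu > 0$ and uses the values shown for Figure 4. Only the ratio $c_I/c_{II}$ matters for the optimum, so setting both costs to 1 is an arbitrary choice, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def bayes_risk(t, mu, prior1, c1, c2):
    """R(t) = pi0*c_I*alpha(t) + pi1*c_II*beta(t) for a right-tailed cut at t.

    Assumption: N(0,1) under H0 and N(mu,1) under H1.
    """
    alpha = 1.0 - NormalDist(0.0, 1.0).cdf(t)
    beta = NormalDist(mu, 1.0).cdf(t)
    return (1.0 - prior1) * c1 * alpha + prior1 * c2 * beta

def bayes_threshold(mu, prior1, c1, c2):
    """Solve log Lambda(t) = log(pi0*c_I / (pi1*c_II)) for t, with mu > 0."""
    log_k = math.log((1.0 - prior1) * c1 / (prior1 * c2))
    return log_k / mu + 0.5 * mu

mu, prior1, c1, c2 = 1.4, 0.5, 1.0, 1.0   # values shown for Figure 4
t_star = bayes_threshold(mu, prior1, c1, c2)
print(f"t* = {t_star:.3f}, R(t*) = {bayes_risk(t_star, mu, prior1, c1, c2):.4f}")
print(f"R(1.64) = {bayes_risk(1.64, mu, prior1, c1, c2):.4f}")
```

With equal priors and equal costs the optimum sits midway between the two means, at $\mu/2 = 0.7$. The conventional cut at 1.64 pays for its small $\alpha$ with a higher total risk under these assumptions.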

Figure 4 · Bayes risk vs. threshold. Legend: $R(t)$, current threshold, Bayes optimum $t^\ast$. Controls: effect size μ 1.4; prior π₁ 0.5; loss ratio c_I / c_II 1; threshold 1.64.

What next

Likelihood · Sufficient Statistics. Neyman–Pearson tests depend on the data only through the likelihood ratio, itself a sufficient statistic for {H₀, H₁}.

Divergence · KL Divergence. Stein's lemma: the optimal type-II error decays as exp(−n·D(P₀‖P₁)). KL is the exponent of power.

Foundations · Measure Theory. The likelihood ratio is a Radon–Nikodym derivative; the rejection region is a measurable set under both hypotheses.

Asymptotics · Modes of Convergence. Wilks' theorem and the asymptotic null distributions of LR / Wald / score tests are convergence-in-distribution statements.