Hypothesis Testing

Overlapping null and alternative distributions, significance, power, and the cost of a decision threshold.

Classical hypothesis testing compares a statistic against a threshold. Under the null hypothesis $H_0$, crossing the threshold is a false positive with probability $\alpha$. Under the alternative $H_1$, failing to cross is a false negative with probability $\beta$, and the power is $1-\beta$. Below the surface, every panel on this page is a different view of the same pair of distributions $(P_0, P_1)$.

1. The α/β overlap

Figure 1 · The α/β overlap diagram
Legend: null $H_0$; alternative $H_1$; type-I error $\alpha$; type-II error $\beta$.

Moving the threshold right reduces $\alpha$ but increases $\beta$. Increasing the effect size separates the curves and improves power without changing the threshold. This is the core visual tradeoff behind p-values, power analysis, and sample-size planning.
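
The tradeoff described above can be checked numerically. The following is a minimal sketch, assuming a one-sided test that rejects when the statistic exceeds the threshold, with $H_0$: $N(0,1)$ and $H_1$: $N(\text{effect size}, 1)$. The function name `error_rates` and the example values are illustrative, not taken from the page.

```python
from statistics import NormalDist

def error_rates(threshold, effect_size):
    """Return (alpha, beta) for a one-sided test that rejects when x > threshold.

    Assumption: H0 is N(0, 1) and H1 is N(effect_size, 1).
    """
    null = NormalDist(mu=0.0, sigma=1.0)
    alternative = NormalDist(mu=effect_size, sigma=1.0)
    alpha = 1.0 - null.cdf(threshold)   # H0 mass to the right of the cut
    beta = alternative.cdf(threshold)   # H1 mass to the left of the cut
    return alpha, beta

# Slide the threshold to the right at a fixed effect size.
for threshold in (1.0, 1.645, 2.5):
    alpha, beta = error_rates(threshold, effect_size=2.0)
    print(f"t={threshold:5.3f}  alpha={alpha:.4f}  beta={beta:.4f}  power={1 - beta:.4f}")

# Increase the effect size at a fixed threshold.
for effect_size in (1.0, 2.0, 3.0):
    alpha, beta = error_rates(1.645, effect_size)
    print(f"mu={effect_size:3.1f}  alpha={alpha:.4f}  power={1 - beta:.4f}")
```

In the first loop $\alpha$ falls while $\beta$ rises. In the second, $\alpha$ stays put because it depends only on the null distribution and the threshold.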

2. The likelihood-ratio view

Underneath the overlap diagram is the likelihood ratio $\Lambda(x) = p_1(x)/p_0(x)$. The Neyman–Pearson lemma says that for testing a simple $H_0$ against a simple $H_1$, the most powerful test at level $\alpha$ rejects when $\Lambda(x) > k$, with $k$ chosen so that the false-positive probability equals $\alpha$. For two unit-variance normals with means $0$ and $\mu$, $\log\Lambda(x) = \mu x - \mu^2/2$, a straight line in $x$. For $\mu > 0$, moving the threshold $\log k$ on the log-ratio plot is therefore equivalent to moving a threshold on $x$: the rejection region $\{x : \Lambda(x) > k\}$ is recovered by inversion as $x > (\log k + \mu^2/2)/\mu$. This view makes the measure-theoretic story explicit: the rule depends on the data only through $\Lambda$, which is the Radon–Nikodym derivative $dP_1/dP_0$.

Figure 2 · Likelihood ratio Λ(x)
Legend: $\log\Lambda(x)$; threshold $\log k$; $\alpha$ (over $P_0$); $\beta$ (over $P_1$).
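
The straight-line form of $\log\Lambda$ for two unit-variance normals, and the inversion from a likelihood-ratio threshold $k$ to a threshold on $x$, can be sketched as follows. The helper names `log_likelihood_ratio` and `x_threshold` and the example values are hypothetical, and the inversion assumes $\mu > 0$.

```python
import math
from statistics import NormalDist

def log_likelihood_ratio(x, mu):
    """log(p1(x) / p0(x)) for p0 = N(0, 1) and p1 = N(mu, 1), from the densities."""
    p0 = NormalDist(0.0, 1.0).pdf(x)
    p1 = NormalDist(mu, 1.0).pdf(x)
    return math.log(p1) - math.log(p0)

def x_threshold(k, mu):
    """Invert Lambda(x) > k into x > cut. Assumes mu > 0, so Lambda increases in x."""
    if mu <= 0:
        raise ValueError("this inversion assumes mu > 0")
    return (math.log(k) + mu ** 2 / 2) / mu

mu, k = 1.5, 2.0
cut = x_threshold(k, mu)

# The density-based log ratio should match the closed form mu*x - mu^2/2.
for x in (-2.0, 0.0, 0.7, 3.0):
    print(x, log_likelihood_ratio(x, mu), mu * x - mu ** 2 / 2)

# At the cut, log Lambda should equal log k.
print(cut, log_likelihood_ratio(cut, mu), math.log(k))
```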

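Two asymptotic facts about $\log\Lambda$ are worth naming, because they connect this view to other pages without needing their own figures. Under $H_0$ the mean of $\log\Lambda$ is $-D(P_0\|P_1)$ and under $H_1$ it is $+D(P_1\|P_0)$, so accumulating $\log\Lambda$ across $n$ independent samples gives Stein's lemma: holding $\alpha$ fixed, the optimal type-II error decays as $\beta_n \approx e^{-n\,D(P_0\|P_1)}$ to first order in the exponent, so KL divergence is the exponential rate at which power approaches 1 (see KL divergence). Wilks' theorem gives the null distribution of a related statistic, the generalized likelihood ratio for nested composite hypotheses. With $\Lambda_n$ there taken as the maximized likelihood under $H_0$ divided by the maximized likelihood under the full model (the reciprocal of the convention above), and under regularity conditions, $-2\log\Lambda_n \xrightarrow{d} \chi^2_d$ under $H_0$, where $d$ is the number of parameters the null constrains. This is why $\chi^2$ tables turn up everywhere in practice (see modes of convergence).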
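
The decay rate in Stein's lemma can be illustrated in the Gaussian case, where everything has a closed form. This is a rough sketch under stated assumptions: $n$ i.i.d. samples from $N(0,1)$ under $H_0$ and $N(\mu,1)$ under $H_1$, for which the likelihood-ratio test reduces to a one-sided test on the sample mean and $D(P_0\|P_1) = \mu^2/2$. The function names are illustrative. The normal CDF is written with `math.erfc` so that very small tail probabilities keep their precision.

```python
import math
from statistics import NormalDist

def normal_cdf(z):
    """Standard normal CDF via erfc, accurate deep in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def type2_error(n, mu, alpha):
    """beta_n for the level-alpha one-sided test on the mean of n samples.

    Assumption: H0 is N(0, 1), H1 is N(mu, 1) with mu > 0.
    The test rejects when sqrt(n) * sample_mean > z_{1 - alpha}.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return normal_cdf(z_alpha - mu * math.sqrt(n))

def error_exponent(n, mu, alpha):
    """Empirical decay rate -log(beta_n) / n."""
    return -math.log(type2_error(n, mu, alpha)) / n

mu, alpha = 1.0, 0.05
kl = mu ** 2 / 2  # D(P0 || P1) for unit-variance normals with mean gap mu

for n in (10, 100, 1000):
    print(f"n={n:5d}  -log(beta_n)/n = {error_exponent(n, mu, alpha):.4f}  KL = {kl:.4f}")
```

The rate approaches the KL divergence from below and only slowly, because the fixed $\alpha$ contributes a correction of order $1/\sqrt{n}$ to the exponent.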

3. The ROC curve

Sweeping the threshold traces out the ROC curve: every operating point is a pair $(\alpha,\,1-\beta)$, the false-positive rate against the power. The curve is a measure-theoretic invariant of the pair $(P_0, P_1)$: any strictly increasing reparametrization of the test statistic gives the same curve. The diagonal is the no-information baseline (a random coin flip). The area under the curve, $\mathrm{AUC} = P(\Lambda(X_1) > \Lambda(X_0))$ for independent $X_0 \sim P_0$ and $X_1 \sim P_1$, is a single number summarizing separability. For two unit-variance Gaussians with mean gap $\mu$, $\mathrm{AUC} = \Phi(\mu/\sqrt{2})$, where $\Phi$ is the standard normal CDF.

Figure 3 · ROC curve
Legend: ROC curve; chance diagonal; operating point.
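
As a sanity check on the closed-form AUC, the threshold sweep can be carried out numerically. The sketch below assumes the same unit-variance Gaussian pair, sweeps a threshold on $x$ (equivalent to sweeping $\Lambda$ when $\mu \ge 0$), and integrates the resulting curve with the trapezoid rule. The function names, grid range, and grid size are illustrative choices.

```python
from statistics import NormalDist

def roc_points(mu, num=2001, lo=-8.0, hi=8.0):
    """ROC points (alpha, 1 - beta) for H0 = N(0, 1) vs H1 = N(mu, 1).

    The threshold is swept from high to low so the false-positive rate
    increases along the returned list.
    """
    null = NormalDist(0.0, 1.0)
    alternative = NormalDist(mu, 1.0)
    top = hi + mu
    points = []
    for i in range(num):
        t = top - (top - lo) * i / (num - 1)
        false_positive_rate = 1.0 - null.cdf(t)
        true_positive_rate = 1.0 - alternative.cdf(t)
        points.append((false_positive_rate, true_positive_rate))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

mu = 1.0
auc_numeric = auc_trapezoid(roc_points(mu))
auc_closed = NormalDist().cdf(mu / 2 ** 0.5)  # Phi(mu / sqrt(2))
print(f"numeric AUC = {auc_numeric:.5f}, closed form = {auc_closed:.5f}")
```

With `mu = 0` the two distributions coincide, the curve lies on the diagonal, and the area is 0.5.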

4. Bayes risk and the optimal threshold

Hand the test a prior $\pi_1$ on $H_1$ (so $\pi_0 = 1 - \pi_1$) and costs $c_I$ for a false alarm and $c_{II}$ for a missed detection, and the right thing to minimize is the Bayes risk $R(t) = \pi_0\,c_I\,\alpha(t) + \pi_1\,c_{II}\,\beta(t)$. Since $d\alpha/dt = -p_0(t)$ and $d\beta/dt = p_1(t)$, setting $dR/dt = 0$ gives the optimum $t^\ast$: $\Lambda(t^\ast) = (\pi_0 c_I)/(\pi_1 c_{II})$. The optimal Bayes threshold is a likelihood-ratio cut at the prior-weighted loss ratio. Frequentist Neyman–Pearson and Bayesian decision theory pick out the same family of likelihood-ratio rules; the threshold is set either by $\alpha$ or by the prior and losses.

Figure 4 · Bayes risk vs. threshold
Legend: $R(t)$; current threshold; Bayes optimum $t^\ast$.
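
The optimality condition can be verified against a brute-force search. This is a minimal sketch, assuming $H_0$: $N(0,1)$, $H_1$: $N(\mu,1)$ with $\mu > 0$, and a test that rejects when $x > t$. The parameter names (`prior1`, `cost_false_alarm`, `cost_miss`) and the example values are hypothetical.

```python
import math
from statistics import NormalDist

def bayes_risk(t, mu, prior1, cost_false_alarm, cost_miss):
    """R(t) = pi0 * c_I * alpha(t) + pi1 * c_II * beta(t) for a cut at x = t."""
    prior0 = 1.0 - prior1
    alpha = 1.0 - NormalDist(0.0, 1.0).cdf(t)
    beta = NormalDist(mu, 1.0).cdf(t)
    return prior0 * cost_false_alarm * alpha + prior1 * cost_miss * beta

def bayes_threshold(mu, prior1, cost_false_alarm, cost_miss):
    """Solve Lambda(t) = (pi0 * c_I) / (pi1 * c_II) using log Lambda(t) = mu*t - mu^2/2."""
    k = ((1.0 - prior1) * cost_false_alarm) / (prior1 * cost_miss)
    return (math.log(k) + mu ** 2 / 2) / mu

mu, prior1, cost_false_alarm, cost_miss = 1.5, 0.2, 1.0, 3.0
t_star = bayes_threshold(mu, prior1, cost_false_alarm, cost_miss)

# Brute-force check: minimize R(t) over a fine grid of thresholds.
grid = [i / 1000 for i in range(-5000, 8001)]
t_grid = min(grid, key=lambda t: bayes_risk(t, mu, prior1, cost_false_alarm, cost_miss))

print(f"closed-form t* = {t_star:.4f}, grid minimizer = {t_grid:.4f}")
```

Raising the cost of a miss or the prior on $H_1$ lowers $k$ and moves $t^\ast$ to the left, trading more false alarms for fewer misses.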

What next