Entropy & Mutual Information
Entropy measures uncertainty before you observe an outcome. Mutual information measures how much uncertainty about one variable disappears after you observe another. Both are measured in bits when we use log base 2.
1. Entropy: average surprise
A rare event is more surprising than a common one. Information theory writes the surprise of an event with probability $p$ as $-\log_2 p$ bits. Entropy is the probability-weighted average surprise:
$$H(X) = -\sum_x p(x)\,\log_2 p(x)$$
For a yes/no variable with $P(\text{yes}) = p$, entropy is zero when the answer is certain ($p = 0$ or $p = 1$) and reaches its maximum of one bit when both answers are equally likely ($p = 1/2$).
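The yes/no case can be checked numerically. The sketch below is a minimal illustration, not a reference implementation; the function names `entropy_bits` and `binary_entropy` are hypothetical helpers introduced here, and outcomes with zero probability are assumed to contribute zero surprise.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution (list of probabilities)."""
    # Each outcome contributes p * log2(1/p); zero-probability outcomes are skipped
    # because p * log2(1/p) tends to 0 as p tends to 0.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def binary_entropy(p):
    """Entropy in bits of a yes/no variable with P(yes) = p."""
    return entropy_bits([p, 1.0 - p])

# Tabulate the curve at a few points: it is zero at the ends and peaks at p = 0.5.
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.4f} bits")
```

The printed values are symmetric around $p = 1/2$, which reflects that relabeling "yes" and "no" does not change the uncertainty.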
Entropy is also the long-run average surprise over many draws. Sample repeatedly from the distribution and the running mean of $-\log_2 p(x)$ approaches $H(X)$, as the law of large numbers guarantees. Changing $p$ moves both the running mean and its target $H(X)$.
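The convergence of the running mean can be simulated in a few lines. This is a rough sketch under stated assumptions: the variable is a biased coin with $P(\text{yes}) = p$ strictly between 0 and 1, the seed and number of draws are arbitrary choices, and `running_mean_surprise` is a hypothetical helper name.

```python
import math
import random

def running_mean_surprise(p, n_draws, seed=0):
    """Running mean of -log2 p(x) over n_draws samples of a coin with P(yes) = p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    means = []
    for i in range(1, n_draws + 1):
        is_yes = rng.random() < p               # True with probability p
        prob_of_outcome = p if is_yes else 1.0 - p
        total += -math.log2(prob_of_outcome)    # surprise of this draw, in bits
        means.append(total / i)
    return means

p = 0.2
true_entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
means = running_mean_surprise(p, n_draws=20000)

for n in (10, 100, 1000, 20000):
    print(f"after {n:>5} draws: running mean = {means[n - 1]:.4f} bits")
print(f"H(X) = {true_entropy:.4f} bits")
```

Early values of the running mean can be far from $H(X)$ because a single rare outcome carries a large surprise; the gap shrinks as draws accumulate.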
2. Joint and conditional entropy
With two variables, $H(X,Y)$ measures uncertainty about the pair. Conditional entropy $H(Y\mid X)$ measures what remains uncertain about $Y$ once $X$ is known.
The next figure uses a simple channel: $X$ is a bit, and $Y$ is a noisy copy of $X$. At zero noise, knowing $X$ tells you $Y$ exactly. At 50% noise, $Y$ is just a fresh coin flip.
In the binary symmetric channel, each input bit passes through unchanged (the two straight arrows) with probability $1-\epsilon$ and is flipped (the two crossover arrows) with probability $\epsilon$.
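The channel's joint and conditional entropies can be computed directly from its four-cell joint table. The sketch below assumes the chain rule $H(X,Y) = H(X) + H(Y\mid X)$, a standard identity that is not stated above, and uses hypothetical helper names (`bsc_joint`, `conditional_entropy_y_given_x`); `p_one` denotes the assumed input probability $P(X=1)$.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def bsc_joint(p_one, eps):
    """Joint table P(X=x, Y=y) for input P(X=1) = p_one and flip probability eps."""
    p_x = {0: 1.0 - p_one, 1: p_one}
    return {(x, y): p_x[x] * (eps if x != y else 1.0 - eps)
            for x in (0, 1) for y in (0, 1)}

def conditional_entropy_y_given_x(joint):
    """H(Y|X) = H(X,Y) - H(X), using the chain rule."""
    p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    return entropy_bits(joint.values()) - entropy_bits(p_x.values())

for eps in (0.0, 0.1, 0.25, 0.5):
    joint = bsc_joint(p_one=0.5, eps=eps)
    print(f"eps = {eps:.2f}  H(X,Y) = {entropy_bits(joint.values()):.4f}  "
          f"H(Y|X) = {conditional_entropy_y_given_x(joint):.4f}")
```

At zero noise the remaining uncertainty about $Y$ is zero, and at 50% noise it is a full bit, matching the two extremes described above.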
3. Mutual information
Mutual information is the overlap between the uncertainty in $X$ and the uncertainty in $Y$. It is the amount you learn about one variable by observing the other:
$$I(X;Y) = H(X) - H(X\mid Y) = H(Y) - H(Y\mid X)$$
Mutual information is always non-negative and symmetric: $I(X;Y) = I(Y;X) \ge 0$. It is zero exactly when the variables are independent. For the channel above, it is at most one bit, reached when the input is uniform and the channel is noiseless: one clean bit goes in, one clean bit comes out.
The joint law $P_{XY}$ and the product of marginals $P_XP_Y$ live on the same four-cell outcome space. Mutual information is the KL divergence between them: $I(X;Y)=\mathrm{KL}(P_{XY}\Vert P_XP_Y)$.
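The KL form translates almost word for word into code. This is a minimal sketch assuming a finite joint table stored as a dictionary from `(x, y)` pairs to probabilities; `mutual_information_bits` is a hypothetical name, and the two example tables are illustrative choices.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) = KL(P_XY || P_X P_Y) in bits; joint maps (x, y) -> probability."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p   # marginal of X
        p_y[y] = p_y.get(y, 0.0) + p   # marginal of Y
    # Sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over cells with positive mass.
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Uniform input through a binary symmetric channel with flip probability eps.
eps = 0.1
joint = {(0, 0): 0.5 * (1 - eps), (0, 1): 0.5 * eps,
         (1, 0): 0.5 * eps,       (1, 1): 0.5 * (1 - eps)}

# Independent bits: the joint table equals the product of its marginals.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

print(f"noisy copy:  I(X;Y) = {mutual_information_bits(joint):.4f} bits")
print(f"independent: I(X;Y) = {mutual_information_bits(independent):.4f} bits")
```

Because the function only needs a joint table, swapping the roles of $X$ and $Y$ in the keys leaves the result unchanged, which is the symmetry property in numerical form.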
The $H(X)$ circle and the $H(Y)$ circle overlap by exactly $I(X;Y)$. The crescents are the conditional entropies $H(X\mid Y)$ and $H(Y\mid X)$; the union is the joint entropy $H(X,Y)$.
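As a worked example of the diagram, take the channel above with a uniform input and flip probability $\epsilon = 0.1$ (an illustrative choice, not a value fixed by the text). Values are rounded to three decimals, and $H_2(0.1)$ denotes the entropy of a yes/no variable with probability 0.1.

$$
\begin{aligned}
H(X) &= H(Y) = 1 \\
H(Y\mid X) &= H(X\mid Y) = H_2(0.1) \approx 0.469 \\
I(X;Y) &= H(Y) - H(Y\mid X) \approx 1 - 0.469 = 0.531 \\
H(X,Y) &= H(X\mid Y) + I(X;Y) + H(Y\mid X) \approx 0.469 + 0.531 + 0.469 = 1.469
\end{aligned}
$$

The last line agrees with the chain rule $H(X,Y) = H(X) + H(Y\mid X) \approx 1 + 0.469$.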
The output is a noisy copy, so observing it removes part of the input uncertainty but not all of it.
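For a binary symmetric channel with flip probability $\epsilon$, capacity is $1-H_2(\epsilon)$ bits per use, where $H_2(\epsilon) = -\epsilon\log_2\epsilon - (1-\epsilon)\log_2(1-\epsilon)$ is the binary entropy. A uniform input achieves it. A biased input leaves some capacity unused whenever $\epsilon \ne 1/2$; at $\epsilon = 1/2$ the capacity is zero and no input helps.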
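The effect of input bias can be tabulated with a short script. The sketch assumes two standard facts for this channel that are not derived above: $I(X;Y) = H(Y) - H(Y\mid X)$ and $H(Y\mid X) = H_2(\epsilon)$ regardless of the input distribution. The names `h2`, `bsc_mutual_information`, and `bsc_capacity` are hypothetical helpers, and `p_one` is the assumed input probability $P(X=1)$.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def bsc_mutual_information(p_one, eps):
    """I(X;Y) = H(Y) - H(Y|X) for a binary symmetric channel with P(X=1) = p_one."""
    p_y_one = p_one * (1.0 - eps) + (1.0 - p_one) * eps  # P(Y=1)
    return h2(p_y_one) - h2(eps)

def bsc_capacity(eps):
    """Capacity 1 - H_2(eps), reached by the uniform input."""
    return 1.0 - h2(eps)

eps = 0.1
print(f"capacity at eps = {eps}: {bsc_capacity(eps):.4f} bits")
for p_one in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"P(X=1) = {p_one:.1f}  I(X;Y) = {bsc_mutual_information(p_one, eps):.4f} bits")
```

The table peaks at $P(X=1) = 0.5$, where the mutual information equals the capacity, and falls off symmetrically as the input becomes more biased.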
4. Entropy is the convex dual of log-sum-exp
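Entropy does more than measure uncertainty. It has a precise mathematical role as the convex conjugate of the log-partition function. In this section the logarithm is natural, so $H(q) = -\int q(x)\log q(x)\,dx$ is measured in nats, and the supremum runs over probability densities $q$. For any function $f(x)$ on a measurable space with $\int e^{f(x)}\,dx$ finite, the variational identity says:
$$\log \int e^{f(x)}\,dx = \sup_{q}\,\bigl\{\mathbb{E}_q[f(x)] + H(q)\bigr\}$$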
with the supremum attained by the Gibbs distribution $q^\ast(x) \propto e^{f(x)}$. Read left to right: the log of a normalizing integral equals the best achievable tradeoff between "tracking $f$" and "keeping $q$ spread out." Read right to left: maximizing $\mathbb{E}_q[f] + H(q)$ forces an exponential answer.
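A finite version of this statement is easy to check numerically. The sketch below is an illustration under assumptions: the space is a four-point set, so the integral becomes a sum (log-sum-exp); entropy is in nats; the values of `f`, the seed, and the helper names (`log_sum_exp`, `objective`, `gibbs`) are arbitrary choices made here.

```python
import math
import random

def log_sum_exp(f):
    """log(sum_i exp(f_i)), computed stably by subtracting the maximum."""
    m = max(f)
    return m + math.log(sum(math.exp(v - m) for v in f))

def objective(q, f):
    """E_q[f] + H(q), with the entropy H(q) measured in nats."""
    expected_f = sum(qi * fi for qi, fi in zip(q, f))
    entropy_nats = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return expected_f + entropy_nats

def gibbs(f):
    """Gibbs distribution q*_i proportional to exp(f_i)."""
    lse = log_sum_exp(f)
    return [math.exp(v - lse) for v in f]

f = [0.5, -1.0, 2.0, 0.0]
q_star = gibbs(f)

# Try many random distributions: none should exceed the log-sum-exp value.
rng = random.Random(0)
best_random = float("-inf")
for _ in range(1000):
    weights = [rng.random() for _ in f]
    total = sum(weights)
    q = [w / total for w in weights]
    best_random = max(best_random, objective(q, f))

print(f"log-sum-exp(f)        = {log_sum_exp(f):.6f}")
print(f"objective at Gibbs q* = {objective(q_star, f):.6f}")
print(f"best of 1000 random q = {best_random:.6f}")
```

The shortfall of any $q$ relative to the log-sum-exp value is the KL divergence from $q$ to $q^\ast$, which is why random candidates fall short and only the Gibbs distribution closes the gap.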
This identity does real work elsewhere on the site. Three examples:
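- Exponential families. Set $f(x) = \eta^\top T(x)$. The left side becomes the log-partition function $A(\eta)$ from the Fisher information page, and the supremum is achieved by the exponential-family member with natural parameter $\eta$. Convexity of $A$ is the same statement as positive semidefiniteness of the Fisher information, because $\nabla^2 A(\eta) = \mathrm{Cov}_\eta[T(x)]$ is the Fisher information in the natural parameter.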
- Maximum-entropy priors. Treat the supremum as a primal problem with moment constraints $\mathbb{E}_q[T(x)] = \mu$; the Lagrange multipliers on those constraints are the natural parameters $\eta$. This is the route to the max-entropy section.
- Variational inference. Set $f(\theta) = \log p(y,\theta)$. The left side is $\log p(y)$, so the log evidence is a log-partition function. The objective inside the supremum is the ELBO; it is maximized, with value $\log p(y)$, when $q$ equals the posterior.
5. What to remember
Entropy is uncertainty in a variable. Conditional entropy is uncertainty that remains after another variable is known. Mutual information is the uncertainty removed by that knowledge.
Correlation asks whether two numbers move together linearly. Mutual information asks whether observing one variable changes the distribution you should assign to the other. Because mutual information is zero only under independence, it detects dependence that correlation can miss.
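A small example makes the contrast concrete. The setup is an assumption chosen for illustration: $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so $Y$ is fully determined by $X$ yet the two are uncorrelated. The helper names `covariance` and `mutual_information_bits` are hypothetical.

```python
import math

# Joint table for X uniform on {-1, 0, 1} and Y = X**2 (illustrative choice).
joint = {(x, x * x): 1.0 / 3.0 for x in (-1, 0, 1)}

def covariance(joint):
    """Cov(X, Y) = E[XY] - E[X] E[Y] for a finite joint table."""
    e_x = sum(p * x for (x, y), p in joint.items())
    e_y = sum(p * y for (x, y), p in joint.items())
    e_xy = sum(p * x * y for (x, y), p in joint.items())
    return e_xy - e_x * e_y

def mutual_information_bits(joint):
    """I(X;Y) in bits as the KL divergence between joint and product of marginals."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

print(f"covariance         = {covariance(joint):.4f}")
print(f"mutual information = {mutual_information_bits(joint):.4f} bits")
```

The covariance vanishes because positive and negative values of $X$ cancel, while the mutual information equals $H(Y)$, since knowing $X$ leaves no uncertainty about $Y$.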
What next
These pages pick up the same ideas from nearby angles.