Entropy & Mutual Information

Entropy measures uncertainty; mutual information measures uncertainty removed by another variable.

Entropy measures uncertainty before you observe an outcome. Mutual information measures how much uncertainty about one variable disappears after you observe another. Both are measured in bits when we use log base 2.

1. Entropy: average surprise

A rare event is more surprising than a common one. Information theory writes the surprise of an event with probability $p$ as $-\log_2 p$ bits. Entropy is the probability-weighted average surprise:

$$H(X) = - \sum_x p(x)\log_2 p(x)$$

For a yes/no variable, entropy is zero when the answer is certain and one bit when both answers are equally likely. Move the slider to see the curve.
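The two definitions above are short enough to check by hand. The following is a minimal sketch in plain Python; the function names `surprise`, `entropy`, and `binary_entropy` are illustrative choices, not part of the original page.

```python
import math

def surprise(p: float) -> float:
    """Surprise, in bits, of an event with probability p (0 < p <= 1)."""
    return -math.log2(p)

def entropy(probs) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * surprise(p) for p in probs if p > 0)

def binary_entropy(p: float) -> float:
    """Entropy of a yes/no variable that equals 1 with probability p."""
    return entropy([p, 1.0 - p])

# Sweep p to trace the curve: zero at the ends, one bit in the middle.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f} bits")
```

The curve is symmetric in $p$ and $1-p$, which is why the values at 0.1 and 0.9 agree.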

[Interactive figure: Binary Entropy. Entropy peaks when neither outcome is predictable; probability and surprise move in opposite directions. Readouts at $p = 0.50$: $H(X) = 1.000$ bits per observation; surprise if 0 occurs, 1.000 bits; surprise if 1 occurs, 1.000 bits.]

Entropy is the average surprise over many draws. Sample from the current distribution and watch the running mean of $-\log_2 p(x)$ approach $H(X)$. Adjust the slider above to change $p$; both lines on the trace will adjust live.
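The convergence described here can be reproduced offline. This is a rough simulation sketch, assuming a Bernoulli source with $P(X=1)=p$ and a fixed seed for repeatability; `running_mean_surprise` is a hypothetical helper name.

```python
import math
import random

def binary_entropy(p: float) -> float:
    """H(X) in bits for a variable that equals 1 with probability p."""
    return sum(-q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def running_mean_surprise(p: float, n_draws: int, seed: int = 0) -> float:
    """Mean of -log2 p(x) over n_draws samples; requires 0 < p < 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = 1 if rng.random() < p else 0
        prob_of_x = p if x == 1 else 1.0 - p
        total += -math.log2(prob_of_x)
    return total / n_draws

p = 0.2
for n in (10, 1_000, 100_000):
    print(n, round(running_mean_surprise(p, n), 4),
          "target", round(binary_entropy(p), 4))
```

With few draws the average can sit well away from $H(X)$; the gap shrinks roughly like $1/\sqrt{n}$.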

[Interactive figure: Sampling: surprise converges to entropy. Each dot is one draw (blue = 0, green = 1). The dark line is the running mean surprise; the orange line is $H(X)$. Readouts: average surprise observed (bits / sample); expected $H(X) = 1.000$ bits / sample; frequency of 1 vs. $p = 0.50$.]

Entropy is not disorder in the everyday sense. It is the minimum expected number of yes/no questions needed to identify the outcome, a limit that an ideal code approaches when many outcomes are encoded together.

2. Joint and conditional entropy

With two variables, $H(X,Y)$ measures uncertainty about the pair. Conditional entropy $H(Y\mid X)$ measures what remains uncertain about $Y$ once $X$ is known.

$$H(X,Y) = H(X) + H(Y\mid X) = H(Y) + H(X\mid Y)$$

The next figure uses a simple channel: $X$ is a bit, and $Y$ is a noisy copy of $X$. At zero noise, knowing $X$ tells you $Y$ exactly. At 50% noise, $Y$ is just a fresh coin flip.
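The chain rule can be verified numerically for this channel. The sketch below assumes a fair input bit and a flip probability of 0.18, a value chosen here only because it gives $H(Y\mid X) \approx 0.680$ bits, matching the readout in the figure; the helper names are illustrative.

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits of an iterable of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def noisy_copy_joint(eps: float, p_one: float = 0.5) -> dict:
    """Joint table p(x, y) when Y is X flipped with probability eps."""
    p_x = {0: 1.0 - p_one, 1: p_one}
    return {(x, y): p_x[x] * (eps if x != y else 1.0 - eps)
            for x in (0, 1) for y in (0, 1)}

def summarize(joint: dict):
    """Return H(X), H(Y|X), H(X,Y) in bits."""
    p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
    h_x = entropy(p_x.values())
    h_xy = entropy(joint.values())
    # H(Y|X) computed directly: p(x)-weighted entropy of each row p(y | x).
    h_y_given_x = sum(
        p_x[x] * entropy([joint[(x, y)] / p_x[x] for y in (0, 1)])
        for x in (0, 1) if p_x[x] > 0
    )
    return h_x, h_y_given_x, h_xy

h_x, h_y_given_x, h_xy = summarize(noisy_copy_joint(0.18))
print(f"H(X) = {h_x:.3f}, H(Y|X) = {h_y_given_x:.3f}, H(X,Y) = {h_xy:.3f}")
print("chain rule gap:", abs(h_xy - (h_x + h_y_given_x)))
```

Because $H(Y\mid X)$ is computed from the rows rather than by subtraction, the near-zero gap is a genuine check of the identity and not a tautology.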

[Interactive figure: Noisy Copy Channel. Joint distribution $p(x,y)$; the shared part is mutual information. Readouts at the displayed noise level, in bits: $H(X) = 1.000$ (uncertainty in input); $H(Y) = 1.000$ (uncertainty in output); $H(Y\mid X) = 0.680$ (noise left after $X$); $H(X,Y) = 1.680$ (uncertainty in pair).]

[Figure: Binary symmetric channel graph]

The binary symmetric channel delivers each input bit unchanged with probability $1-\epsilon$ (the two straight arrows) and flipped with probability $\epsilon$ (the two crossover arrows).

[Readouts: $H(Y\mid X) = 0.680$ bits lost to noise; capacity $0.320$ bits per channel use; limit case: partial (scale runs from clean to useless).]

3. Mutual information

Mutual information is the overlap between the uncertainty in $X$ and the uncertainty in $Y$. It is the amount you learn about one variable by observing the other:

$$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y\mid X)$$

It is always non-negative and symmetric, and it is zero exactly when the variables are independent. For the channel above, mutual information is at most one bit, reached when the input bit is fair and the noise is zero: one clean bit goes in, one clean bit comes out.
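These properties are easy to confirm on small joint tables. The sketch below computes $I(X;Y)$ from the first form of the formula; the three example tables are assumptions made for illustration, with the noisy one corresponding to a flip probability of 0.18.

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits of an iterable of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with joint given as rows of p(x, y)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_xy = entropy(p for row in joint for p in row)
    return entropy(p_x) + entropy(p_y) - h_xy

clean_copy = [[0.5, 0.0], [0.0, 0.5]]        # Y always equals X
independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x, y) = p(x) p(y)
noisy_copy = [[0.41, 0.09], [0.09, 0.41]]    # fair input, 18% flips

for name, table in [("clean", clean_copy),
                    ("independent", independent),
                    ("noisy", noisy_copy)]:
    print(f"{name:12s} I(X;Y) = {mutual_information(table):.3f} bits")
```

Transposing a table swaps the roles of $X$ and $Y$ and leaves the result unchanged, which is the symmetry claimed above.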

Mutual information as KL between two measures

The joint law $P_{XY}$ and the product of marginals $P_XP_Y$ live on the same four-cell outcome space. Mutual information is the KL divergence between them: $I(X;Y)=\mathrm{KL}(P_{XY}\Vert P_XP_Y)$.
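The KL form can be evaluated one cell at a time. In this sketch each cell contributes $p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}$; the table is an assumed example (fair input, flip probability 0.18) and `kl_cells` is a hypothetical helper name.

```python
import math

def kl_cells(joint):
    """Per-cell terms p(x,y) * log2(p(x,y) / (p(x) p(y))), in bits."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return [[p * math.log2(p / (p_x[i] * p_y[j])) if p > 0 else 0.0
             for j, p in enumerate(row)]
            for i, row in enumerate(joint)]

joint = [[0.41, 0.09], [0.09, 0.41]]
cells = kl_cells(joint)
mi = sum(sum(row) for row in cells)

for row in cells:
    print(["%+.4f" % c for c in row])
print(f"KL(P_XY || P_X P_Y) = {mi:.3f} bits")
```

Individual cells can be negative (here the off-diagonal cells, where the joint probability falls below the product of marginals), but the total never is.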

[Interactive readouts: $\mathrm{KL}(P_{XY}\Vert P_XP_Y)$ in bits; $I(X;Y)$, the same number; largest cell contribution in bits.]

Information Venn diagram

The $H(X)$ circle and the $H(Y)$ circle overlap by exactly $I(X;Y)$. The crescents are the conditional entropies $H(X\mid Y)$ and $H(Y\mid X)$; the union is the joint entropy $H(X,Y)$.

[Interactive figure: information Venn diagram with selectable regions, each showing its formula and current value. Readouts: $I(X;Y) = 0.320$ bits (mutual information: bits learned about $X$ from $Y$); normalized MI $0.320$ (fraction of $H(X)$); channel capacity used $0.32$ ($I(X;Y)$ / 1 bit max).]

The output is a noisy copy, so observing it removes part of the input uncertainty but not all of it.

[Interactive figure: Nonlinear dependence: correlation can vanish while MI remains. Readouts: Pearson $r$ (linear association); estimated MI (binned estimate, bits); reading (dependence type).]
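A small experiment illustrates how correlation can vanish while mutual information remains. The sketch below is one possible setup, not the page's own: it assumes $Y = X^2$ plus a little Gaussian noise with $X$ uniform on $[-1, 1]$, and it uses a simple equal-width histogram (plug-in) estimate of MI, which is biased slightly upward. The shuffled series serves as an independent baseline.

```python
import math
import random

def pearson_r(xs, ys) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def binned_mi(xs, ys, bins: int = 8) -> float:
    """Plug-in MI estimate in bits from an equal-width 2-D histogram."""
    def bin_index(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    n = len(xs)
    joint, count_x, count_y = {}, [0] * bins, [0] * bins
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_lo, x_hi), bin_index(y, y_lo, y_hi)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        count_x[i] += 1
        count_y[j] += 1
    # (c/n) * log2( (c/n) / ((cx/n) * (cy/n)) ) summed over occupied cells
    return sum(c / n * math.log2(c * n / (count_x[i] * count_y[j]))
               for (i, j), c in joint.items())

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(5000)]
ys = [x * x + rng.gauss(0, 0.05) for x in xs]   # symmetric, nonlinear link
ys_shuffled = ys[:]
rng.shuffle(ys_shuffled)                        # breaks the dependence

print(f"Pearson r          = {pearson_r(xs, ys):+.3f}")
print(f"binned MI          = {binned_mi(xs, ys):.3f} bits")
print(f"binned MI shuffled = {binned_mi(xs, ys_shuffled):.3f} bits")
```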
Channel capacity sweep

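For a binary symmetric channel with flip probability $\epsilon$, capacity is $1-H_2(\epsilon)$ bits, where $H_2$ is the binary entropy function from section 1. A uniform input reaches capacity; a biased input transmits strictly less whenever $\epsilon \neq 1/2$.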

The upper curve is channel capacity. The lower curve is the information transmitted by the selected input distribution.
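Both curves follow from $I(X;Y) = H(Y) - H(Y\mid X)$. The sketch below assumes an input with $P(X=1)$ given by a hypothetical parameter `p_one`; for the binary symmetric channel $H(Y\mid X) = H_2(\epsilon)$ regardless of the input bias, so only $H(Y)$ depends on it.

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_information(p_one: float, eps: float) -> float:
    """I(X;Y) for input bias P(X=1)=p_one and flip probability eps."""
    p_y_one = p_one * (1 - eps) + (1 - p_one) * eps
    return h2(p_y_one) - h2(eps)   # H(Y) - H(Y|X)

def bsc_capacity(eps: float) -> float:
    """Capacity 1 - H2(eps), attained by the uniform input."""
    return 1.0 - h2(eps)

for eps in (0.0, 0.1, 0.18, 0.5):
    print(f"eps = {eps:.2f}  capacity = {bsc_capacity(eps):.3f}  "
          f"I at p_one=0.8: {bsc_information(0.8, eps):.3f}")
```

At $\epsilon = 0.5$ both curves meet at zero: the output is independent of the input, so no input distribution helps.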

4. Entropy is the convex dual of log-sum-exp

Entropy does more than measure uncertainty: up to sign, it is the convex conjugate of the log-partition function. For any measurable function $f$ with $\int e^{f(x)}\,dx < \infty$, and with entropy measured in nats (natural logarithm), the variational identity says:

$$ \log\!\int e^{f(x)}\,dx \;=\; \sup_q\!\left(\mathbb{E}_q[f(X)] + H(q)\right), $$

with the supremum attained by the Gibbs distribution $q^\ast(x) \propto e^{f(x)}$. Read left to right: the log of a normalizing integral equals the best achievable tradeoff between "tracking $f$" and "keeping $q$ spread out." Read right to left: maximizing $\mathbb{E}_q[f] + H(q)$ forces an exponential answer.
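One way to see why the identity holds is sketched here for densities $q$, with $H(q) = -\int q\log q\,dx$ in nats. The symbol $Z = \int e^{f(x)}\,dx$ is introduced only for this derivation, so that $q^\ast = e^{f}/Z$ and $f = \log Z + \log q^\ast$:

$$
\begin{aligned}
\mathbb{E}_q[f(X)] + H(q)
&= \int q(x)\,\bigl(f(x) - \log q(x)\bigr)\,dx \\
&= \int q(x)\,\bigl(\log Z + \log q^\ast(x) - \log q(x)\bigr)\,dx \\
&= \log Z - \mathrm{KL}(q \Vert q^\ast).
\end{aligned}
$$

Since $\mathrm{KL}(q \Vert q^\ast) \ge 0$, with equality only when $q = q^\ast$, the objective never exceeds $\log Z$ and reaches it exactly at the Gibbs distribution.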

This identity does real work elsewhere on the site. Three examples:

  • Exponential families. Set $f(x) = \eta^\top T(x)$. The left side becomes the log-partition $A(\eta)$ on the Fisher information page; the supremum is attained by the exp-family member with natural parameter $\eta$. Convexity of $A$ is the same statement as positive semidefiniteness of the Fisher information, because the Hessian of $A$ is the covariance of $T$.
  • Maximum-entropy priors. Treat the supremum as a primal problem with moment constraints; the multipliers are the natural parameters. This is the route to the max-entropy section.
  • Variational inference. Set $f(\theta) = \log p(y,\theta)$. The left side is $\log p(y)$, so the evidence is a log-partition function. The objective inside the supremum is the ELBO, and the supremum is attained when $q$ equals the posterior.

[Interactive figure: Log-sum-exp ↔ entropy as a Legendre pair]
$A(\eta) = \log\sum_k e^{\eta k}$ on $\{0,1,\dots,K-1\}$. The tangent at $\eta$ has slope $A'(\eta) = \mathbb{E}_q[k]$.
The conjugate is $A^*(\mu) = \sup_\eta(\eta\mu - A(\eta))$. On the simplex, at the matching $q^*$, it equals $-H(q^*)$.
Readouts: $A(\eta)$ (log-partition); $\mu = \mathbb{E}_q[k] = A'(\eta)$ (mean of sufficient statistic); $H(q^*)$ (entropy at optimum); $A^*(\mu) = \eta\mu - A(\eta) = -H(q^*)$.
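The Legendre pairing can be checked numerically on a small support. The sketch below assumes $K = 5$ and $\eta = 0.7$, both arbitrary, and works in nats; the slope of $A$ is approximated by a central finite difference, and the function names are illustrative.

```python
import math

def log_partition(eta: float, K: int) -> float:
    """A(eta) = log sum_k exp(eta * k) over k = 0..K-1, in nats."""
    terms = [eta * k for k in range(K)]
    m = max(terms)                      # subtract the max for stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def gibbs(eta: float, K: int):
    """q*(k) proportional to exp(eta * k)."""
    a = log_partition(eta, K)
    return [math.exp(eta * k - a) for k in range(K)]

def entropy_nats(q) -> float:
    return sum(-p * math.log(p) for p in q if p > 0)

eta, K = 0.7, 5
q_star = gibbs(eta, K)
mu = sum(k * p for k, p in enumerate(q_star))       # E_q[k]
h = 1e-6
slope = (log_partition(eta + h, K) - log_partition(eta - h, K)) / (2 * h)
conjugate = eta * mu - log_partition(eta, K)         # A*(mu) at matching eta

print(f"A'(eta) ~ {slope:.6f}   E_q[k] = {mu:.6f}")
print(f"A*(mu) = {conjugate:.6f}   -H(q*) = {-entropy_nats(q_star):.6f}")
```

The agreement of the last two numbers follows from $\log q^*(k) = \eta k - A(\eta)$: averaging under $q^*$ gives $-H(q^*) = \eta\mu - A(\eta)$.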

5. What to remember

Entropy is uncertainty in a variable. Conditional entropy is uncertainty that remains after another variable is known. Mutual information is the uncertainty removed by that knowledge.

Correlation asks whether two numbers move together in a particular geometric way. Mutual information asks whether observing one variable changes the distribution you should assign to the other.
