Entropy & Mutual Information

Entropy measures uncertainty; mutual information measures uncertainty removed by another variable.

Entropy measures uncertainty before you observe an outcome. Mutual information measures how much uncertainty about one variable disappears after you observe another. Both are measured in bits when we use log base 2.

1. Entropy: average surprise

A rare event is more surprising than a common one. Information theory writes the surprise of an event with probability $p$ as $-\log_2 p$ bits. Entropy is the probability-weighted average surprise:

$$H(X) = - \sum_x p(x)\log_2 p(x)$$

For a yes/no variable, entropy is zero when the answer is certain and one bit when both answers are equally likely. Move the slider to see the curve.

Interactive figure: Binary Entropy. Slider: probability of 1 (default 0.50). Panel captions: "Entropy peaks when neither outcome is predictable." and "Probability and surprise move in opposite directions." Readouts at the default: H(X) = 1.000 bits per observation; surprise if 0 occurs = 1.000 bits; surprise if 1 occurs = 1.000 bits.
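The binary entropy curve can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names are illustrative and do not come from the page.

```python
import math

def surprise_bits(p):
    """Surprise -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

def entropy_bits(probs):
    """Probability-weighted average surprise; zero-probability outcomes add nothing."""
    return sum(p * surprise_bits(p) for p in probs if p > 0)

def binary_entropy(p):
    """Entropy of a yes/no variable with P(1) = p."""
    return entropy_bits([p, 1.0 - p])

# Entropy is zero at the certain ends and peaks at one bit for p = 0.5.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f} bits")
```

The `if p > 0` filter encodes the usual convention that $0\log_2 0 = 0$, so certain outcomes need no special case.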

Entropy is the average surprise over many draws. Sample from the current distribution and watch the running mean of $-\log_2 p(x)$ approach $H(X)$. Adjust the slider above to change $p$; both lines on the trace will adjust live.

Interactive figure: Sampling: surprise converges to entropy. Control: probability of 1 (default 0.50); the draw counter starts at 0. Each dot is one draw (blue = 0, green = 1). The dark line is the running mean surprise; the orange line is $H(X)$. Readouts: average surprise (bits per sample, observed); H(X) = 1.000 bits per sample (expected); frequency of 1 versus p = 0.50.
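The convergence of average surprise to entropy can also be checked offline. The sketch below assumes a biased coin with $p = 0.2$ and a fixed seed; both are arbitrary choices for illustration, and the helper name is hypothetical.

```python
import math
import random

def running_mean_surprise(p, n_draws, seed=0):
    """Draw n_draws Bernoulli(p) samples; return (mean surprise in bits, frequency of 1)."""
    rng = random.Random(seed)
    total_surprise = 0.0
    ones = 0
    for _ in range(n_draws):
        x = 1 if rng.random() < p else 0
        prob_x = p if x == 1 else 1.0 - p   # probability of what was actually drawn
        total_surprise += -math.log2(prob_x)
        ones += x
    return total_surprise / n_draws, ones / n_draws

p = 0.2
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # exact entropy in bits
for n in (10, 1000, 100000):
    avg, freq = running_mean_surprise(p, n)
    print(f"n = {n:6d}  avg surprise = {avg:.3f}  H(X) = {h:.3f}  freq of 1 = {freq:.3f}")
```

With few draws the average can sit well away from $H(X)$; the gap shrinks roughly like $1/\sqrt{n}$, as the law of large numbers suggests.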

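Entropy is not disorder in the everyday sense. It is a lower bound on the expected number of yes/no questions needed to identify the outcome. An ideal code meets the bound exactly when every probability is a power of 1/2, and otherwise comes within one question of it.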
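The "yes/no questions" reading can be made concrete with a Huffman code, one standard way to build an optimal question strategy. It is brought in here purely as an illustration and is not something the page itself uses; the two example distributions are also assumptions.

```python
import heapq
import math

def entropy_bits(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Codeword length (number of yes/no questions) for each outcome."""
    # Heap items: (probability, unique tie-breaker, outcome indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:          # merging two subtrees adds one question to each leaf
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, a + b))
        counter += 1
    return lengths

def expected_length(probs):
    return sum(p * n for p, n in zip(probs, huffman_lengths(probs)))

for probs in ([0.5, 0.25, 0.125, 0.125], [0.4, 0.3, 0.2, 0.1]):
    print(f"H = {entropy_bits(probs):.3f} bits, "
          f"expected questions = {expected_length(probs):.3f}")
```

For the first distribution every probability is a power of 1/2, so the expected number of questions equals the entropy. For the second it is slightly larger, but still less than one question above it.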

2. Joint and conditional entropy

With two variables, $H(X,Y)$ measures uncertainty about the pair. Conditional entropy $H(Y\mid X)$ measures what remains uncertain about $Y$ once $X$ is known.

$$H(X,Y) = H(X) + H(Y\mid X) = H(Y) + H(X\mid Y)$$

The next figure uses a simple channel: $X$ is a bit, and $Y$ is a noisy copy of $X$. At zero noise, knowing $X$ tells you $Y$ exactly. At 50% noise, $Y$ is just a fresh coin flip.

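Interactive figure: Noisy Copy Channel. Sliders: probability X = 1 (default 0.50) and flip probability (default 0.18). Panels: the joint distribution $p(x,y)$, and an entropy diagram in which the shared part is the mutual information. Readouts at the defaults: H(X) = 1.000 (uncertainty in input); H(Y) = 1.000 (uncertainty in output); H(Y|X) = 0.680 (noise left after X); H(X,Y) = 1.680 (uncertainty in pair).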
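The entropies of the noisy copy channel follow directly from its four-cell joint distribution. The sketch below builds that table for $P(X=1)=0.5$ and a flip probability of 0.18, then applies the chain rule; the helper names are illustrative.

```python
import math

def entropy_bits(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def noisy_copy_joint(p_one, flip):
    """Joint p(x, y) for X ~ Bernoulli(p_one) and Y = X flipped with probability flip."""
    px = {0: 1.0 - p_one, 1: p_one}
    return {(x, y): px[x] * (flip if x != y else 1.0 - flip)
            for x in (0, 1) for y in (0, 1)}

def channel_entropies(p_one, flip):
    joint = noisy_copy_joint(p_one, flip)
    px = [sum(v for (x, _), v in joint.items() if x == b) for b in (0, 1)]
    py = [sum(v for (_, y), v in joint.items() if y == b) for b in (0, 1)]
    h_x, h_y = entropy_bits(px), entropy_bits(py)
    h_xy = entropy_bits(joint.values())
    # Chain rule: H(X,Y) = H(X) + H(Y|X), so H(Y|X) = H(X,Y) - H(X).
    return {"H(X)": h_x, "H(Y)": h_y, "H(X,Y)": h_xy, "H(Y|X)": h_xy - h_x}

for name, value in channel_entropies(0.5, 0.18).items():
    print(f"{name:7s} = {value:.3f} bits")
```

At zero noise `H(Y|X)` drops to 0, and at a flip probability of 0.5 it rises to a full bit, matching the two limit cases described in the text.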

Interactive figure: Binary symmetric channel graph. The binary symmetric channel delivers each input bit unchanged (the two straight arrows) with probability $1-\epsilon$ and flipped (the two crossover arrows) with probability $\epsilon$. Slider: flip probability ε (default 0.18). Readouts at the default: H(Y|X) = 0.680 bits lost to noise; capacity = 0.320 bits per channel use; limit case: "partial" (range: clean to useless).

3. Mutual information

Mutual information is the overlap between the uncertainty in $X$ and the uncertainty in $Y$. It is the amount you learn about one variable by observing the other:

$$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y\mid X)$$

Mutual information is always non-negative and symmetric, $I(X;Y) = I(Y;X)$, and it is zero exactly when the variables are independent. For the channel above, it is at most one bit, reached with a uniform input and zero noise: one clean bit goes in, one clean bit comes out.
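These properties are easy to check numerically. The following sketch computes $I(X;Y) = H(X) + H(Y) - H(X,Y)$ from a joint table stored as a dictionary; the representation and the helper names are assumptions made for the example.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint given as {(x, y): probability}."""
    def h(probs):
        return sum(-p * math.log2(p) for p in probs if p > 0)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p      # marginal of X
        py[y] = py.get(y, 0.0) + p      # marginal of Y
    return h(px.values()) + h(py.values()) - h(joint.values())

def noisy_copy_joint(p_one, flip):
    px = {0: 1.0 - p_one, 1: p_one}
    return {(x, y): px[x] * (flip if x != y else 1.0 - flip)
            for x in (0, 1) for y in (0, 1)}

for flip in (0.0, 0.18, 0.5):
    mi = mutual_information_bits(noisy_copy_joint(0.5, flip))
    print(f"flip = {flip:.2f}  I(X;Y) = {mi:.3f} bits")

# Symmetry: swapping the roles of X and Y leaves the value unchanged.
joint = noisy_copy_joint(0.3, 0.18)
swapped = {(y, x): p for (x, y), p in joint.items()}
print(abs(mutual_information_bits(joint) - mutual_information_bits(swapped)) < 1e-12)
```

The three flip probabilities cover the clean channel, the partial case and the useless channel, where the output is independent of the input.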

Mutual information as KL between two measures

The joint law $P_{XY}$ and the product of marginals $P_XP_Y$ live on the same four-cell outcome space. Mutual information is the KL divergence between them: $I(X;Y)=\mathrm{KL}(P_{XY}\Vert P_XP_Y)$.

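Interactive figure controls: probability X = 1 (default 0.50) and flip probability (default 0.18). Readouts: $\mathrm{KL}(P_{XY}\Vert P_XP_Y)$ in bits; $I(X;Y)$, which is the same number; and the largest single-cell contribution in bits.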
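The KL form can be evaluated cell by cell. The sketch below lists each cell's contribution $p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)}$ for the noisy copy channel with $P(X=1)=0.5$ and flip probability 0.18; the function names are illustrative.

```python
import math

def noisy_copy_joint(p_one, flip):
    px = {0: 1.0 - p_one, 1: p_one}
    return {(x, y): px[x] * (flip if x != y else 1.0 - flip)
            for x in (0, 1) for y in (0, 1)}

def kl_cell_terms(joint):
    """Per-cell terms of KL(P_XY || P_X P_Y) in bits."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return {(x, y): p * math.log2(p / (px[x] * py[y]))
            for (x, y), p in joint.items() if p > 0}

terms = kl_cell_terms(noisy_copy_joint(0.5, 0.18))
for cell, value in sorted(terms.items()):
    print(f"cell {cell}: {value:+.3f} bits")
print(f"KL = I(X;Y) = {sum(terms.values()):.3f} bits")
```

Individual cells can contribute negative amounts (here the two crossover cells, which are rarer under the joint law than under independence), but the total is never negative.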

Information Venn diagram

The $H(X)$ circle and the $H(Y)$ circle overlap by exactly $I(X;Y)$. The crescents are the conditional entropies $H(X\mid Y)$ and $H(Y\mid X)$; the union is the joint entropy $H(X,Y)$.

Interactive figure controls: probability X = 1 (default 0.50) and flip probability (default 0.18). Regions of the diagram can be highlighted to show their formula and current value. Readouts at the defaults: I(X;Y) = 0.320 bits (mutual information: bits learned about X from Y); normalized MI = 0.320 (fraction of H(X)); channel capacity used = 0.32 (I(X;Y) / 1 bit max).

The output is a noisy copy, so observing it removes part of the input uncertainty but not all of it.
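As a worked example of the diagram's bookkeeping, take the default settings $P(X=1)=0.5$ and flip probability $0.18$, and write $H_2(\epsilon) = -\epsilon\log_2\epsilon-(1-\epsilon)\log_2(1-\epsilon)$ for the binary entropy (a symbol introduced here for convenience). All values are in bits and rounded to three decimals:

$$\begin{aligned}
H(X) &= H(Y) = 1, \qquad H(Y\mid X) = H_2(0.18) \approx 0.680,\\
I(X;Y) &= H(Y) - H(Y\mid X) \approx 0.320,\\
H(X\mid Y) &= H(X) - I(X;Y) \approx 0.680,\\
H(X,Y) &= H(X\mid Y) + I(X;Y) + H(Y\mid X) \approx 1.680.
\end{aligned}$$

The last line is the union of the two circles: two crescents plus the overlap.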

Nonlinear dependence: correlation can vanish while MI remains

Interactive figure controls: noise (default 0.15). Readouts: Pearson r (linear association); estimated MI (binned estimate, in bits); reading (dependence type).
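A small simulation shows the effect. The sketch below assumes a parabola, $Y = X^2 + \text{noise}$ with $X$ uniform on $[-1, 1]$, which is only one possible nonlinear relationship and may differ from the one plotted on the page. The MI estimate uses equal-width bins, which is crude and slightly biased upward, so treat its value as indicative.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def binned_mi_bits(xs, ys, bins=8):
    """Plug-in MI estimate from an equal-width bins x bins histogram."""
    def bin_index(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_lo, x_hi), bin_index(y, y_lo, y_hi)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] = px.get(i, 0) + 1
        py[j] = py.get(j, 0) + 1
    # (c/n) * log2( (c/n) / ((px/n) * (py/n)) ) summed over occupied cells
    return sum((c / n) * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

rng = random.Random(0)
noise = 0.15
xs = [rng.uniform(-1.0, 1.0) for _ in range(50000)]
ys = [x * x + noise * rng.gauss(0.0, 1.0) for x in xs]

r = pearson_r(xs, ys)
mi = binned_mi_bits(xs, ys)
print(f"Pearson r = {r:+.3f}, binned MI = {mi:.3f} bits")
```

Because the parabola is symmetric about zero, the linear association cancels and Pearson r stays near zero, while the binned MI remains clearly positive.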

Channel capacity sweep

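For a binary symmetric channel with flip probability $\epsilon$, capacity is $1-H_2(\epsilon)$ bits, where $H_2(\epsilon) = -\epsilon\log_2\epsilon - (1-\epsilon)\log_2(1-\epsilon)$ is the binary entropy. A uniform input achieves it; any biased input leaves some capacity unused unless $\epsilon = 1/2$, where the capacity is zero.

Interactive figure control: input probability P(X = 1) (default 0.50).

The upper curve is channel capacity. The lower curve is the information transmitted by the selected input distribution.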
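The two curves can be tabulated directly. This sketch uses the closed form $I(X;Y) = H(Y) - H(Y\mid X)$ for the binary symmetric channel, where $H(Y\mid X)$ is the binary entropy of the flip probability; the function names are illustrative.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def bsc_information(p_one, flip):
    """I(X;Y) = H(Y) - H(Y|X) for input P(X=1) = p_one."""
    p_y1 = p_one * (1.0 - flip) + (1.0 - p_one) * flip   # P(Y = 1)
    return h2(p_y1) - h2(flip)

def bsc_capacity(flip):
    """Capacity 1 - H2(flip), reached by the uniform input."""
    return 1.0 - h2(flip)

flip = 0.18
print(f"capacity at flip = {flip}: {bsc_capacity(flip):.3f} bits")
for p_one in (0.1, 0.3, 0.5, 0.7, 0.9):
    used = bsc_information(p_one, flip)
    print(f"P(X=1) = {p_one:.1f}  I(X;Y) = {used:.3f}  "
          f"unused = {bsc_capacity(flip) - used:.3f}")
```

Only the uniform input closes the gap; the shortfall grows as the input becomes more biased.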

4. Entropy is the convex dual of log-sum-exp

Entropy doesn't only measure uncertainty. Up to sign, it is also the convex conjugate of the log-partition function. For any function $f$ whose exponential has a finite integral, the variational identity says:

$$ \log\!\int e^{f(x)}\,dx \;=\; \sup_q\!\left(\mathbb{E}_q[f(X)] + H(q)\right), $$

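where the supremum runs over probability densities $q$, the entropy $H(q) = -\int q(x)\log q(x)\,dx$ uses the natural log (so it is in nats, not bits), and the supremum is attained by the Gibbs distribution $q^\ast(x) \propto e^{f(x)}$. Read left to right: the log of a normalizing integral equals the best achievable tradeoff between "tracking $f$" and "keeping $q$ spread out." Read right to left: maximizing $\mathbb{E}_q[f] + H(q)$ forces an exponential answer.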
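On a finite set the integral becomes a sum, and the identity can be checked directly. The sketch below assumes an arbitrary four-point score vector `f` (a made-up example) and compares the Gibbs distribution against randomly drawn alternatives.

```python
import math
import random

def log_sum_exp(f):
    """Numerically stable log(sum(exp(f_i)))."""
    m = max(f)
    return m + math.log(sum(math.exp(v - m) for v in f))

def objective(q, f):
    """E_q[f] + H(q), with the entropy in nats."""
    return sum(qi * (fi - math.log(qi)) for qi, fi in zip(q, f) if qi > 0)

def gibbs(f):
    """q*(x) proportional to exp(f(x))."""
    lse = log_sum_exp(f)
    return [math.exp(v - lse) for v in f]

f = [0.3, -1.2, 2.0, 0.5]            # arbitrary scores, for illustration only
target = log_sum_exp(f)

rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    w = [rng.random() for _ in f]
    total = sum(w)
    q = [wi / total for wi in w]     # a random point on the simplex
    best_random = max(best_random, objective(q, f))

print(f"log-sum-exp        = {target:.6f}")
print(f"objective at Gibbs = {objective(gibbs(f), f):.6f}")
print(f"best of 1000 random q = {best_random:.6f}")
```

The Gibbs distribution hits the log-sum-exp value, while every other distribution falls short of it.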

This identity does real work elsewhere on the site. Three examples:

Exponential families. Set $f(x) = \eta^\top T(x)$. The left side becomes the log-partition function $A(\eta)$ from the Fisher information page; the supremum is attained by the exponential-family member with natural parameter $\eta$. Convexity of $A$ is the same statement as positive semidefiniteness of the Fisher information, because the Hessian of $A$ is the covariance of $T(X)$.
Maximum-entropy priors. Maximize $H(q)$ subject to moment constraints on $\mathbb{E}_q[T(X)]$; the Lagrange multipliers of those constraints are the natural parameters. This is the route to the max-entropy section.
Variational inference. Set $f(\theta) = \log p(y,\theta)$. The left side is $\log p(y)$, so the log evidence is a log-partition function. The objective inside the supremum is the ELBO, and it reaches $\log p(y)$ exactly when $q$ equals the posterior.
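A short derivation shows why the supremum lands on the Gibbs distribution. Write $Z = \int e^{f(x)}\,dx$ and $q^\ast = e^{f}/Z$ (the symbol $Z$ is introduced here for convenience). For any density $q$,

$$\begin{aligned}
\mathrm{KL}(q\,\Vert\,q^\ast) &= \mathbb{E}_q\big[\log q(X) - f(X) + \log Z\big] \\
&= -H(q) - \mathbb{E}_q[f(X)] + \log Z,
\end{aligned}$$

so that

$$\mathbb{E}_q[f(X)] + H(q) = \log Z - \mathrm{KL}(q\,\Vert\,q^\ast) \;\le\; \log Z,$$

with equality exactly when $q = q^\ast$. In the variational inference case, $f(\theta) = \log p(y,\theta)$ gives $Z = p(y)$ and $q^\ast(\theta) = p(\theta\mid y)$, so the same line reads

$$\log p(y) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q\,\Vert\,p(\theta\mid y)\big).$$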

Log-sum-exp ↔ entropy as a Legendre pair

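Interactive figure controls: natural parameter η (default 0.00) and support size K (default 4).

$A(\eta) = \log\sum_k e^{\eta k}$ on $\{0,1,\dots,K-1\}$. The tangent at $\eta$ has slope $A'(\eta) = \mathbb{E}_q[k]$, where $q(k) \propto e^{\eta k}$.

Conjugate $A^*(\mu) = \sup_\eta(\eta\mu - A(\eta))$. At $\mu = A'(\eta)$ the supremum is attained and $A^*(\mu) = -H(q^*)$, where $q^*(k) \propto e^{\eta k}$ is the distribution with mean $\mu$.

Readouts: A(η), the log-partition; μ = E_q[k] = A′(η), the mean of the sufficient statistic; H(q*), the entropy at the optimum; A*(μ) = η·μ − A(η) = −H(q*).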
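The four quantities in this Legendre pair can be computed for any $\eta$ and $K$. The following is a minimal sketch with illustrative function names; logs are natural, so the entropy is in nats.

```python
import math

def log_partition(eta, K):
    """A(eta) = log sum_k exp(eta * k) over k = 0..K-1."""
    vals = [eta * k for k in range(K)]
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def tilted_distribution(eta, K):
    """q(k) proportional to exp(eta * k)."""
    a = log_partition(eta, K)
    return [math.exp(eta * k - a) for k in range(K)]

def legendre_readouts(eta, K):
    q = tilted_distribution(eta, K)
    a = log_partition(eta, K)
    mu = sum(k * qk for k, qk in enumerate(q))          # mean = A'(eta)
    h = sum(-qk * math.log(qk) for qk in q if qk > 0)   # entropy in nats
    conj = eta * mu - a                                 # A*(mu) at the matching mu
    return a, mu, h, conj

for eta in (0.0, 0.7, -1.5):
    a, mu, h, conj = legendre_readouts(eta, 4)
    print(f"eta = {eta:+.1f}  A = {a:.4f}  mu = {mu:.4f}  "
          f"H = {h:.4f}  A* = {conj:.4f}")
```

At $\eta = 0$ the distribution is uniform, so $A$ and $H$ both equal $\log K$ and the mean sits at the midpoint of the support. For every $\eta$, the last two columns are negatives of each other.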

5. What to remember

Entropy is uncertainty in a variable. Conditional entropy is uncertainty that remains after another variable is known. Mutual information is the uncertainty removed by that knowledge.

Pearson correlation asks whether two variables move together along a straight line. Mutual information asks whether observing one variable changes the distribution you should assign to the other.

What next

These pages pick up the same ideas from nearby angles.

Distance Correlation (Dependence). Compare mutual information's distribution-level view of dependence with a distance-based test that catches nonlinear relationships Pearson misses.
Measure Theory & Random Variables (Measure theory). See why mutual information is a KL divergence between the joint law and the product of marginals.
KL Divergence (Divergence). Connect entropy and cross-entropy to directed distribution mismatch.
Free Energy & Variational Inference (Inference). Follow entropy and KL into the optimization identity behind variational Bayesian inference.