Measure Theory & Random Variables

Measurable spaces, measures, pushforwards, and the Radon–Nikodym derivative.

Probability is a measure on a space of outcomes, a random variable is a function on that space, and a density is a ratio of two measures. This page builds that picture step by step: first the space of events, then measures on that space, then random variables as measurable maps, and finally the Radon–Nikodym derivative that turns one measure into another.

1. Measurable spaces, measure spaces, and probability measures

A measurable space is a pair $(\Omega, \mathcal{F})$: a set $\Omega$ of outcomes, and a $\sigma$-algebra $\mathcal{F}$ of subsets of $\Omega$ called measurable sets or events. The $\sigma$-algebra says which questions about the outcome are legitimate: it contains $\Omega$, is closed under complements, and is closed under countable unions.

A measure space adds a measure: $(\Omega,\mathcal{F},\mu)$, where $\mu\colon\mathcal{F}\to[0,\infty]$ satisfies $\mu(\varnothing)=0$ and is countably additive on disjoint unions. A probability measure is the special case $\mathbb{P}(\Omega)=1$, and $(\Omega,\mathcal{F},\mathbb{P})$ is then a probability space.

In the figure below, $\Omega$ is the unit square. Drag the four events to overlap or separate them; the bar chart shows the measure $\mathbb{P}$ of each event (here, area). Notice that $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)$ only when $A$ and $B$ are disjoint; overlap eats some of the sum. Switch views to see complement and intersection identities drawn on the same measurable space.

Figure 1 · A probability measure on measurable events

Drag events to move them.
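The definitions above can be checked by brute force on a finite space. The sketch below is only an illustration under assumed choices that are not in the text: a fair six-sided die as $\Omega$, the power set as $\mathcal{F}$, and the events A, B, C. It verifies the three σ-algebra axioms and contrasts additivity on disjoint events with inclusion-exclusion on overlapping ones.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical finite sample space: one roll of a fair die.
omega = frozenset(range(1, 7))


def power_set(space):
    """All subsets of a finite set, used here as the sigma-algebra 2^Omega."""
    items = list(space)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_algebra(family, space):
    """On a finite space, closure under pairwise unions implies countable unions."""
    has_space = space in family
    closed_complement = all(space - a in family for a in family)
    closed_union = all(a | b in family for a in family for b in family)
    return has_space and closed_complement and closed_union


def prob(event):
    """Uniform probability measure: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(omega))


sigma_algebra = power_set(omega)
A = frozenset({1, 2, 3})  # "roll is at most 3"
B = frozenset({2, 4, 6})  # "roll is even", overlaps A
C = frozenset({5})        # disjoint from A

print(is_sigma_algebra(sigma_algebra, omega))
print(prob(A | C) == prob(A) + prob(C))                # disjoint: additive
print(prob(A | B) == prob(A) + prob(B))                # overlap: sum overcounts
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))  # inclusion-exclusion
```

Exact fractions are used so that the identities hold with equality rather than up to rounding.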

2. Random variables as measurable functions

A random variable is just a measurable function $X\colon\Omega\to\mathbb{R}$. "Measurable" means that whenever $B$ is a measurable set of real numbers, its preimage $X^{-1}(B)$ is an event in $\mathcal{F}$. This is the condition that lets probabilities of numerical statements such as "$X\in B$" be defined.

Whether $X$ is measurable is a relationship between the function and the σ-field (another name for σ-algebra) $\mathcal{F}$ we are working with. The same function can be measurable with respect to one σ-field and fail to be measurable with respect to a coarser one. In the figure below, $\Omega=\{HH,HT,TH,TT\}$ records the outcome of two coin flips and three σ-fields are selectable: $\mathcal{F}_1$ ("only the first coin is observable"), $\mathcal{F}_2$ ("only the second coin"), and $2^\Omega$ (every subset is measurable). Pick an $X$, brush a Borel set $B$ on $\mathbb{R}$, and watch its preimage light up on $\Omega$. The verdict shows whether $X^{-1}(B)\in\mathcal{F}$, which is exactly the condition needed for $\mathbb{P}(X\in B)$ to be defined.

Figure 2 · A pre-image must lie in $\mathcal{F}$

Legend: $\Omega$ outcomes; $\mathcal{F}$ cells; brushed $B$ on $\mathbb{R}$; $X^{-1}(B)$ on $\Omega$. Controls: σ-field, $X$.

Drag on the right number line to select $B$. Click a tick to select that single value. The verdict updates live.
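The measurability verdict in Figure 2 can be reproduced in a few lines. This is a minimal sketch, not the figure's actual code: the helper names and the two example random variables (an indicator of the first coin and the number of heads) are assumptions chosen for illustration. Each σ-field is built as the set of all unions of its atoms, and on a finite space it is enough to test the preimages of single values.

```python
from itertools import chain, combinations

OMEGA = ("HH", "HT", "TH", "TT")  # two coin flips, as in Figure 2


def sigma_field_from_atoms(atoms):
    """Return every union of the given atoms (the atoms partition OMEGA)."""
    events = set()
    for k in range(len(atoms) + 1):
        for chosen in combinations(atoms, k):
            events.add(frozenset(chain.from_iterable(chosen)))
    return events


F1 = sigma_field_from_atoms([("HH", "HT"), ("TH", "TT")])  # first coin only
F2 = sigma_field_from_atoms([("HH", "TH"), ("HT", "TT")])  # second coin only
F_ALL = sigma_field_from_atoms([(w,) for w in OMEGA])       # every subset


def preimage(X, B):
    """X^{-1}(B): the outcomes that X sends into B."""
    return frozenset(w for w in OMEGA if X(w) in B)


def is_measurable(X, sigma_field):
    """Check preimages of single values; unions of these give every X^{-1}(B)."""
    values = {X(w) for w in OMEGA}
    return all(preimage(X, {v}) in sigma_field for v in values)


def first_coin(w):
    """Hypothetical random variable: 1 if the first flip is heads."""
    return 1 if w[0] == "H" else 0


def number_of_heads(w):
    """Hypothetical random variable: total number of heads."""
    return w.count("H")


for name, field in (("F1", F1), ("F2", F2), ("2^Omega", F_ALL)):
    print(name, is_measurable(first_coin, field), is_measurable(number_of_heads, field))
```

The number of heads fails against both $\mathcal{F}_1$ and $\mathcal{F}_2$ because the event "two heads" is the single outcome HH, which neither coarse σ-field can isolate.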

Figure 2b · The σ-field generated by an observation

Legend: outcomes in the same atom of $\sigma(X)$; selected event generated by unions of atoms. Control: union of atoms (1).

Image vs. support. The image $X(\Omega)=\{X(\omega):\omega\in\Omega\}$ is just the range of $X$ as a function: every value $X$ can in principle produce. The support of the distribution $\mathbb{P}_X$ (the pushforward measure defined just below) is a different object: the smallest closed set that carries all of the probability. For $X(\omega)=\omega$ on $\Omega=[0,1]$ with Lebesgue measure, so that $X\sim\mathrm{Uniform}[0,1]$, the image is $[0,1]$ and $\mathbb{P}_X(\{x\})=0$ for every single $x$, yet the support is the whole interval: probability lives on neighborhoods, not on individual points. Image is a set-theoretic fact about the function; support is a probabilistic fact about the pushforward measure.

The "distribution of $X$" is not really a thing that lives on $\Omega$; it lives on $\mathbb{R}$. It is the pushforward measure

$$ \mathbb{P}_X(B) \;=\; \mathbb{P}\bigl(X^{-1}(B)\bigr), \qquad B\subseteq\mathbb{R}\ \text{Borel}. $$

That is, the probability that $X$ lands in $B$ is the $\mathbb{P}$-measure of the set of outcomes that $X$ sends into $B$. Pushing a measure forward by a function is the abstract version of "change of variables."
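On a finite space the pushforward is just bookkeeping: move each outcome's mass to the value $X$ assigns it. The sketch below assumes an example that is not in the text (two fair dice, with $X$ their sum) and checks that $\mathbb{P}_X(B)$ computed from the pushed-forward masses agrees with $\mathbb{P}(X^{-1}(B))$ computed on $\Omega$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical probability space: two fair dice, uniform measure on 36 outcomes.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}


def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]


def pushforward(measure, f):
    """Return the measure f_* measure as a dict {value: mass}."""
    out = defaultdict(Fraction)  # Fraction() is zero
    for w, mass in measure.items():
        out[f(w)] += mass
    return dict(out)


def prob_of_preimage(B):
    """P(X^{-1}(B)), computed directly on the sample space."""
    return sum(mass for w, mass in P.items() if X(w) in B)


P_X = pushforward(P, X)
B = {7, 11}
print(sum(P_X[x] for x in B), prob_of_preimage(B))
```

The distribution P_X is a measure on the values 2 through 12, not on pairs of dice, which is the sense in which the distribution lives on $\mathbb{R}$ rather than on $\Omega$.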

In Figure 3, $\Omega=[0,1]$ carries Lebesgue measure (uniform). Drag the function $X$, and watch the induced distribution on the $y$-axis appear as a histogram. The teal band is a measurable set $B$ in the target space; its highlighted preimage $X^{-1}(B)$ is the event in $\Omega$ whose measure defines $\mathbb{P}_X(B)$. A steep piece of $X$ stretches a small interval in $\Omega$ over a wide range in $\mathbb{R}$, so the pushforward density there is small; a flat piece concentrates mass.

Figure 3 · Random variable and pushforward measure $\mathbb{P}_X = \mathbb{P}\circ X^{-1}$

Legend: $X(\omega)$; density of $\mathbb{P}_X$; one transported interval; target set $B$ and preimage $X^{-1}(B)$.

Drag points to edit the spline. Drag the gold interval on $\Omega$; drag the teal band on the value axis to choose $B$. Hover the density axis to see $X^{-1}(\{y\})$. Option-click to add or remove a point.

Why bother? Two reasons. First, on uncountable spaces you cannot just "assign probability to each outcome": under a continuous distribution every singleton has measure zero, so probability has to be assigned to sets, and a density is what lets us compare the likelihood of individual points. Second, treating $X$ as a function lets us compose cleanly: $g(X)$ is just another random variable, and its distribution is the pushforward of $\mathbb{P}_X$ by $g$.
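The composition claim can be tested directly on a finite example. The sketch below uses assumed choices that are not from the text (two fair dice, $X$ the difference of the dice, $g$ the absolute value) and compares the distribution of $g(X)$ computed in one step from $\mathbb{P}$ with the same distribution computed in two steps through $\mathbb{P}_X$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def pushforward(measure, f):
    """Return f_* measure for a finite measure stored as {point: mass}."""
    out = defaultdict(Fraction)
    for point, mass in measure.items():
        out[f(point)] += mass
    return dict(out)


# Hypothetical setup: two fair dice, X = difference, g = absolute value.
P = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}


def X(w):
    return w[0] - w[1]


def g(x):
    return abs(x)


one_step = pushforward(P, lambda w: g(X(w)))  # (g o X)_* P
P_X = pushforward(P, X)                       # X_* P, a measure on the reals
two_step = pushforward(P_X, g)                # g_* (X_* P)

print(one_step == two_step)
print(sorted(two_step.items()))
```

Once P_X is known, the second step never looks at $\Omega$ again: the distribution of $g(X)$ depends on $X$ only through its law.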

The pushforward is functorial: a single measurable map $X\colon\Omega\to E$ acts simultaneously on three layers. Sample spaces transform covariantly ($\omega\mapsto X(\omega)$), $\sigma$-algebras transform contravariantly (events on the target pull back to events on the source via $X^{-1}$), and measures transform covariantly again (each $\mu$ on $\Omega$ produces $X_*\mu$ on $E$). The diagram below shows all three flavors of the same map.

Figure 3c · Pushforward as a functor: $X$ acting on spaces, $\sigma$-algebras, and measures

Legend: covariant (forward) arrow; contravariant (pulled back) arrow.

Figure 3b · Bayes' theorem as areas in the unit square

Legend: prior event $H$; evidence event $E$; $H\cap E$. Controls: prior $\mathbb{P}(H)$ (0.30), likelihood $\mathbb{P}(E\mid H)$ (0.80), false alarm $\mathbb{P}(E\mid H^c)$ (0.12).
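As a worked instance of Figure 3b, take the three default values listed for it (prior 0.30, likelihood 0.80, false alarm 0.12). The evidence event $E$ is the disjoint union of $E\cap H$ and $E\cap H^c$, so its area is the sum of two rectangles:

$$ \mathbb{P}(E) \;=\; \mathbb{P}(E\mid H)\,\mathbb{P}(H) + \mathbb{P}(E\mid H^c)\,\mathbb{P}(H^c) \;=\; 0.80\cdot 0.30 + 0.12\cdot 0.70 \;=\; 0.324. $$

The posterior is the share of that area lying inside $H$:

$$ \mathbb{P}(H\mid E) \;=\; \frac{\mathbb{P}(H\cap E)}{\mathbb{P}(E)} \;=\; \frac{0.24}{0.324} \;\approx\; 0.741. $$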

3. Absolute continuity between two measures

Now suppose we have two measures on the same measurable space $(\Omega,\mathcal{F})$; call them $\mu$ and $\nu$. We say $\nu$ is absolutely continuous with respect to $\mu$, written $\nu\ll\mu$, if every $\mu$-null set is also $\nu$-null:

$$ \mu(A)=0 \;\Longrightarrow\; \nu(A)=0. $$

Intuitively, $\nu$ cannot put mass where the reference measure $\mu$ refuses to look. If $\mu$ is Lebesgue measure on $\mathbb{R}$ and $\nu$ has a point mass at $x_0$, then $\nu\not\ll\mu$ because $\{x_0\}$ has $\mu$-measure zero but $\nu$-measure positive.
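On a countable space with every subset measurable, the definition reduces to a pointwise test: a set is $\mu$-null exactly when all of its points have zero $\mu$-mass, so $\nu\ll\mu$ holds exactly when every point carrying $\nu$-mass also carries $\mu$-mass. A minimal sketch, with made-up measures stored as dictionaries:

```python
def is_absolutely_continuous(nu, mu):
    """Return True if nu << mu for measures given as {point: mass} dicts.

    Assumes a countable space where every subset is measurable, so it is
    enough to check single points.
    """
    return all(mu.get(x, 0) > 0 for x, mass in nu.items() if mass > 0)


# Hypothetical example measures on the points a, b, c.
mu = {"a": 0.5, "b": 0.5, "c": 0.0}
nu_ok = {"a": 0.9, "b": 0.1}   # mass only where mu has mass
nu_bad = {"a": 0.5, "c": 0.5}  # mass on c, which is mu-null

print(is_absolutely_continuous(nu_ok, mu))
print(is_absolutely_continuous(nu_bad, mu))
print(is_absolutely_continuous(mu, nu_ok))  # the relation is not symmetric in general
```

The point-mass example in the text is the continuous analogue of nu_bad: the singleton $\{x_0\}$ plays the role of the point c.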

Drag the two distributions below. The shaded regions show where each measure puts mass. The red badge lights up when $\nu\ll\mu$ fails: when $\nu$ has positive mass in a region where $\mu$ has none.

Figure 4 · Absolute continuity $\nu \ll \mu$

Legend: $\mu$ (reference); $\nu$. Controls: $\mu$ center (0.30), $\mu$ width (0.18), $\nu$ center (0.55), $\nu$ width (0.12).

Drag the red support hole inside $\mu$.

4. The Radon–Nikodym derivative

Absolute continuity is exactly the condition that makes a density relative to $\mu$ possible. If $\nu\ll\mu$ (and both are $\sigma$-finite), the Radon–Nikodym theorem says there exists a measurable function $f = \dfrac{d\nu}{d\mu}\colon\Omega\to[0,\infty)$, unique up to $\mu$-null sets, such that

$$ \nu(A) \;=\; \int_A f \, d\mu \qquad \text{for every } A\in\mathcal{F}. $$

This $f$ is the density of $\nu$ with respect to $\mu$. When $\mu$ is Lebesgue measure on $\mathbb{R}$, this is the ordinary probability density function. When $\mu$ is counting measure on a discrete set, $f$ is the probability mass function. The Radon–Nikodym derivative unifies these into one object. It is also the ratio inside the measure-theoretic form of KL divergence: when $\nu\ll\mu$, $\mathrm{KL}[\nu\Vert\mu]=\int \log(d\nu/d\mu)\,d\nu$.
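With counting-style reasoning on a finite set, the Radon–Nikodym derivative is simply the pointwise ratio of masses, and the defining integral becomes a weighted sum. The sketch below uses two made-up probability measures on three points (an assumption for illustration) to check $\nu(A)=\int_A f\,d\mu$ and to evaluate the KL formula quoted above.

```python
import math
from fractions import Fraction

# Hypothetical probability measures on the points a, b, c.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
nu = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}


def radon_nikodym(nu, mu):
    """Pointwise ratio d nu / d mu on a finite space; requires nu << mu."""
    if any(mass > 0 and mu.get(x, 0) == 0 for x, mass in nu.items()):
        raise ValueError("nu is not absolutely continuous with respect to mu")
    return {x: nu.get(x, 0) / m for x, m in mu.items() if m > 0}


def measure_of(measure, A):
    return sum(measure.get(x, 0) for x in A)


def integral(f, measure, A):
    """Integral of f over A against a finite measure: a weighted sum."""
    return sum(f[x] * measure[x] for x in A if measure.get(x, 0) > 0)


def kl_divergence(nu, mu):
    """KL[nu || mu] = integral of log(d nu / d mu) against nu, in nats."""
    f = radon_nikodym(nu, mu)
    return sum(
        float(mass) * math.log(float(f[x])) for x, mass in nu.items() if mass > 0
    )


f = radon_nikodym(nu, mu)
A = {"a", "c"}
print(f)
print(measure_of(nu, A), integral(f, mu, A))
print(kl_divergence(nu, mu))
```

If $\nu$ put mass on a point that $\mu$ misses, the ratio would be undefined there, which is why the helper raises instead of returning a density.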

Mass and density are the same object in different coordinates. A probability measure $\mathbb{P}$ is the invariant object; "mass function" and "density function" are two ways of writing it relative to a choice of base measure $\mu$:

$$ \text{PMF} \;=\; \frac{d\mathbb{P}}{d\#}, \qquad \text{PDF} \;=\; \frac{d\mathbb{P}}{d\lambda}, $$

where $\#$ is counting measure and $\lambda$ is Lebesgue. The discrete-vs-continuous split is not a property of $\mathbb{P}$ alone; it is a property of the reference measure you chose to express it. Change the base and the "density" rescales: whenever $\mathbb{P}\ll\mu\ll\tilde\mu$, $\frac{d\mathbb{P}}{d\tilde\mu} = \frac{d\mathbb{P}}{d\mu}\cdot\frac{d\mu}{d\tilde\mu}$. Same vector, new basis.
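A worked instance of the change-of-base rule, using an example chosen here for illustration: let $\mathbb{P}=\mathcal{N}(m,1)$, take the standard Gaussian $\gamma=\mathcal{N}(0,1)$ as the first base, and Lebesgue measure $\lambda$ as the second. Dividing the two Gaussian densities gives the density of $\mathbb{P}$ against $\gamma$:

$$ \frac{d\mathbb{P}}{d\gamma}(x) \;=\; \exp\!\Bigl(mx - \tfrac{1}{2}m^2\Bigr), \qquad \frac{d\gamma}{d\lambda}(x) \;=\; \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}. $$

Multiplying them recovers the familiar density against Lebesgue, because $-\tfrac{1}{2}x^2 + mx - \tfrac{1}{2}m^2 = -\tfrac{1}{2}(x-m)^2$:

$$ \frac{d\mathbb{P}}{d\lambda}(x) \;=\; \frac{d\mathbb{P}}{d\gamma}(x)\cdot\frac{d\gamma}{d\lambda}(x) \;=\; \frac{1}{\sqrt{2\pi}}\,e^{-(x-m)^2/2}. $$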

The figure shows $\mu$ and $\nu$ as densities (top), and their ratio $f = d\nu/d\mu$ (bottom). Where $\nu$ is large relative to $\mu$, $f$ is large. Where $\mu$ is large relative to $\nu$, $f$ is small. Where $\mu$ is zero and $\nu$ is not, $f$ blows up, which is just the statement that $\nu\not\ll\mu$ there.

Figure 5 · The density $d\nu/d\mu$

Legend: $\mu$; $\nu$; $d\nu/d\mu$. Control: parameter (0.5).

Figure 5b · Shrinking-ball view of the Radon-Nikodym derivative

Legend: $\mu(B_r(x))$; $\nu(B_r(x))$; $\nu(B_r)/\mu(B_r)$. Controls: center $x$ (0.8), radius $r$ (0.6).

Notation that bites. The symbol $d\nu/d\mu$ looks like a derivative because in one familiar case ($\mu$ Lebesgue on $\mathbb{R}$) it really is one: $f(x) = F_\nu'(x)$ for almost every $x$, where $F_\nu$ is the CDF. But the definition involves no limit of ratios: $f$ is the function whose integral against $\mu$ reproduces $\nu$, and that makes sense on any measurable space, even one with no intervals or balls to shrink. On $\mathbb{R}^n$, for locally finite $\mu$ and $\nu$, the Lebesgue differentiation theorem links the two views: for $\mu$-a.e. $x$, $f(x) = \lim_{r\to 0} \nu(B_r(x))/\mu(B_r(x))$.
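The shrinking-ball limit is easy to watch numerically. The sketch below assumes a concrete pair that the text does not specify: $\nu$ is the standard Gaussian and $\mu$ is Lebesgue measure on $\mathbb{R}$, so a ball $B_r(x)$ is the interval $(x-r, x+r)$ with $\mu$-measure $2r$. The starting center and radius reuse the values listed for Figure 5b.

```python
import math


def normal_cdf(x):
    """CDF of the standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Density of the standard Gaussian against Lebesgue measure."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def ball_ratio(x, r):
    """nu(B_r(x)) / mu(B_r(x)) with nu = N(0, 1) and mu = Lebesgue."""
    nu_ball = normal_cdf(x + r) - normal_cdf(x - r)
    mu_ball = 2.0 * r
    return nu_ball / mu_ball


x = 0.8
for r in (0.6, 0.1, 0.01, 0.001):
    print(f"r={r:<6} ratio={ball_ratio(x, r):.8f}")
print(f"density at x: {normal_pdf(x):.8f}")
```

At the largest radius the ratio is a visibly coarse average of the density over the ball; as the radius shrinks it settles on the density at the center.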

5. Change of variables: where the Jacobian comes from

Combining the pushforward (Section 2) with the Radon–Nikodym derivative gives the classical density transformation formula. Let $X$ take values in an open set $U\subseteq\mathbb{R}^n$ with density $f_X$ against Lebesgue measure $\lambda$, and let $Y = g(X)$ for a $C^1$ diffeomorphism $g\colon U \to V$. The distribution of $Y$ is the pushforward $\mathbb{P}_Y = g_*\mathbb{P}_X$. For any Borel set $B\subseteq V$,

$$ \mathbb{P}_Y(B) \;=\; \mathbb{P}_X\bigl(g^{-1}(B)\bigr) \;=\; \int_{g^{-1}(B)} f_X(x)\,d\lambda(x). $$

Substituting $x = g^{-1}(y)$ multiplies $d\lambda$ by the Jacobian factor $|\det Dg^{-1}(y)|$, so

$$ f_Y(y) \;=\; f_X\bigl(g^{-1}(y)\bigr)\,\bigl|\det Dg^{-1}(y)\bigr| \;=\; \frac{f_X\bigl(g^{-1}(y)\bigr)}{\bigl|\det Dg\bigl(g^{-1}(y)\bigr)\bigr|}. $$

In one dimension this reduces to $f_Y(y) = f_X(g^{-1}(y))/|g'(g^{-1}(y))|$. The Jacobian is a local volume ratio: a small region around $x$ has image of $\lambda$-volume $|\det Dg(x)|$ times larger, so the same probability mass spreads over more volume on the $Y$-side and the density there is diluted by that factor. Figure 3 already shows this in one dimension: a steep slope of $X$ stretches a small interval in $\Omega$ over a wide range in $\mathbb{R}$, and the pushforward density drops accordingly.
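The one-dimensional formula can be sanity-checked against a CDF. The sketch below assumes a specific example that is not in the text: $X\sim\mathcal{N}(0,1)$ and $g(x)=e^x$, so $Y$ is log-normal. Because $g$ is increasing, $F_Y(y)=F_X(g^{-1}(y))$, and a numerical derivative of $F_Y$ should match the density given by the formula.

```python
import math


def f_X(x):
    """Density of X ~ N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def F_X(x):
    """CDF of X ~ N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def g_inv(y):
    """Inverse of the assumed map g(x) = exp(x), defined for y > 0."""
    return math.log(y)


def g_prime(x):
    """Derivative of g(x) = exp(x)."""
    return math.exp(x)


def f_Y(y):
    """Change of variables: f_X(g^{-1}(y)) / |g'(g^{-1}(y))|."""
    x = g_inv(y)
    return f_X(x) / abs(g_prime(x))


def F_Y(y):
    """CDF of Y = g(X); valid because g is increasing."""
    return F_X(g_inv(y))


y, h = 2.0, 1e-5
numeric_density = (F_Y(y + h) - F_Y(y - h)) / (2.0 * h)  # central difference
print(numeric_density, f_Y(y))
```

Dropping the division by the derivative would give a function that does not integrate to one, which is the quickest way to see that the Jacobian factor is not optional.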

In two dimensions the determinant is unmistakably a matrix quantity, not just a derivative. The figure below shows a reference grid on the source $U$ and its image on the target $V$ under four choices of $g$. By default the source carries a Gaussian-bump input density $f_X$ and the target shows the resulting $f_Y = (f_X/|\det Dg|)\circ g^{-1}$, so for every preset you can see how the map reshapes the input distribution: the shear skews it, uniform scaling squashes it, polar maps an $(r, \theta)$-rectangle bump into a curved swath in the Cartesian plane, and reflection flips it. The draggable focus cell shows the local Jacobian image as a small parallelogram, spanned by the two columns of $Dg$ at that point. Toggling the bump off switches the source panel to a shading of $|\det Dg|$ itself, which is flat for the linear presets (constant Jacobian) and fades from contraction at small $r$ to expansion at large $r$ in the polar case.

Figure 5a · The Jacobian as a local area ratio

Legend: source panel shows $|\det Dg|$ (or $f_X$ when the bump is on); target panel shows $f_Y$; focus cell and its image; columns of $Dg$ at the focus. Controls: map $g$; show orientation marker (an F inside the focus cell, mirrored when $\det Dg < 0$); non-uniform $f_X$ (Gaussian bump).

Drag inside the source panel to move the focus cell.
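The local area ratio can be estimated without any symbolic calculus by differencing the map. The sketch below is an illustration, not the figure's implementation: the shear coefficient, the reflection axis, and the step size are assumptions. It approximates the two columns of $Dg$ by central differences and returns the signed determinant, whose absolute value is the factor that divides the density.

```python
import math


def jacobian_det(g, u, v, h=1e-6):
    """Signed determinant of Dg at (u, v), estimated by central differences."""
    x_u1, y_u1 = g(u + h, v)
    x_u0, y_u0 = g(u - h, v)
    x_v1, y_v1 = g(u, v + h)
    x_v0, y_v0 = g(u, v - h)
    # First column of Dg: derivative with respect to u.
    a, c = (x_u1 - x_u0) / (2 * h), (y_u1 - y_u0) / (2 * h)
    # Second column of Dg: derivative with respect to v.
    b, d = (x_v1 - x_v0) / (2 * h), (y_v1 - y_v0) / (2 * h)
    return a * d - b * c


def polar(r, theta):
    """(r, theta) -> Cartesian coordinates; the exact determinant is r."""
    return (r * math.cos(theta), r * math.sin(theta))


def shear(u, v):
    """Hypothetical shear preset with coefficient 0.5; determinant 1."""
    return (u + 0.5 * v, v)


def reflection(u, v):
    """Hypothetical reflection across the vertical axis; determinant -1."""
    return (-u, v)


print(jacobian_det(polar, 0.5, 0.7))   # contraction: |det| < 1
print(jacobian_det(polar, 2.0, 0.7))   # expansion: |det| > 1
print(jacobian_det(shear, 0.3, 0.4))
print(jacobian_det(reflection, 0.3, 0.4))
```

The reflection returns a negative determinant, yet a density computed as $f_X/|\det Dg|$ is unchanged by the sign.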

Why the absolute value? Probability density is non-negative even when $g$ reverses orientation. The signed determinant $\det Dg$ records orientation, which matters for integrating differential forms but is invisible to a measure. Only the magnitude of the local volume scaling enters the density formula. The reflection preset in Figure 5a makes the distinction visible: enable the orientation marker and the "F" inside the focus cell appears as a mirrored F on the target, but the density shading is unchanged.

This is Radon–Nikodym applied along a coordinate change. Pushing Lebesgue forward under $g$ gives the measure $g_*\lambda$ on $V$ with Radon–Nikodym derivative $d(g_*\lambda)/d\lambda(y) = |\det Dg^{-1}(y)|$; the density formula is the chain rule $\tfrac{d\mathbb{P}_Y}{d\lambda} = \tfrac{d\mathbb{P}_Y}{d(g_*\lambda)} \cdot \tfrac{d(g_*\lambda)}{d\lambda}$ with $d\mathbb{P}_Y/d(g_*\lambda) = f_X \circ g^{-1}$.

6. Mixed and singular: the Lebesgue decomposition

The clean discrete-vs-continuous picture covers the textbook cases, but real distributions don't always pick a side. The Lebesgue decomposition says that relative to any $\sigma$-finite reference $\mu$, every $\sigma$-finite measure $\nu$ splits uniquely as

$$ \nu \;=\; \nu_{\mathrm{ac}} \;+\; \nu_{\mathrm{sing}}, $$

where $\nu_{\mathrm{ac}}\ll\mu$ has a Radon–Nikodym derivative (a true density), and $\nu_{\mathrm{sing}}$ lives on a $\mu$-null set. When $\mu$ is Lebesgue measure on $\mathbb{R}$, the singular part further splits into atomic mass (point masses, $\delta_{x_0}$) and a continuous singular piece (no atoms, but supported on a Lebesgue-null set; the Cantor distribution is the canonical example).

A spike-and-slab mixture is the everyday example: the prior puts probability $\alpha$ on the point $\theta=0$ and spreads the remaining $1-\alpha$ as a Gaussian slab. Its CDF has both a jump (the spike) and a smooth rise (the slab); no single density w.r.t. Lebesgue can represent it.
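The jump and the smooth rise can both be seen in a short simulation. The sketch below assumes the slab is a Gaussian centered at zero with standard deviation $\sigma$ (the text fixes only that it is Gaussian), and reuses the values $\alpha=0.3$ and $\sigma=0.8$ listed for Figure 6. The seed and sample size are arbitrary.

```python
import math
import random


def slab_cdf(x, sigma):
    """CDF of the assumed slab N(0, sigma^2)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def mixture_cdf(x, alpha, sigma):
    """F(x) = alpha * 1[x >= 0] + (1 - alpha) * slab CDF."""
    spike = 1.0 if x >= 0 else 0.0
    return alpha * spike + (1.0 - alpha) * slab_cdf(x, sigma)


def sample_spike_and_slab(alpha, sigma, rng):
    """Draw exactly 0 with probability alpha, otherwise draw from the slab."""
    if rng.random() < alpha:
        return 0.0
    return rng.gauss(0.0, sigma)


alpha, sigma = 0.3, 0.8
eps = 1e-9
jump = mixture_cdf(0.0, alpha, sigma) - mixture_cdf(-eps, alpha, sigma)

rng = random.Random(0)
samples = [sample_spike_and_slab(alpha, sigma, rng) for _ in range(20000)]
zero_fraction = sum(1 for s in samples if s == 0.0) / len(samples)

print(jump)           # size of the CDF jump at 0: the atomic part
print(zero_fraction)  # empirical mass sitting exactly on the point 0
```

A positive fraction of samples equal to zero exactly is something no Lebesgue density can produce, which is the decomposition's atomic part showing up in data.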

Figure 6 · Lebesgue decomposition: spike-and-slab mixture

Legend: atomic mass at $0$; continuous slab; CDF $F(x)$. Controls: atom weight $\alpha$ (0.3), slab width $\sigma$ (0.8).

"Discrete or continuous?" is the wrong organizing question. The right one is how $\nu$ sits relative to $\mu$: fully absolutely continuous, fully singular, or somewhere in between. The same measure can look one way against Lebesgue and a different way against counting measure or a Gaussian reference. The next figure makes that change-of-base concrete.

Figure 7 · Same $\mathbb{P}$, different reference measures

Legend: $\mathbb{P}$ (target); reference $\mu$; $d\mathbb{P}/d\mu$. Controls: $\mathbb{P}$, base $\mu$.

The same $\mathbb{P}$ is one object; the "density" rescales when you change reference. A mismatch (e.g., counting measure $\#$ on a countable set as the base for a Gaussian $\mathbb{P}$) means $\mathbb{P}\not\ll\mu$, so no density exists.

7. Change of measure for expectations

The most useful consequence of the Radon–Nikodym theorem is the change-of-measure formula. If $\nu\ll\mu$, then for any $\nu$-integrable $g$,

$$ \mathbb{E}_\nu[g(X)] \;=\; \int g\,d\nu \;=\; \int g \cdot \frac{d\nu}{d\mu}\, d\mu \;=\; \mathbb{E}_\mu\!\left[g(X)\,\frac{d\nu}{d\mu}(X)\right]. $$

You can compute an expectation under $\nu$ by sampling under $\mu$ and reweighting each sample by the density ratio. This identity underlies importance sampling, likelihood-ratio tests, Girsanov's theorem in stochastic calculus, and off-policy evaluation in reinforcement learning.

The figure below estimates $\mathbb{E}_\nu[g]$ in two ways: directly under $\nu$ (left), and by importance sampling from $\mu$ with weights $d\nu/d\mu$ (right). Both converge to the same number, but with very different variance. When $\mu$ and $\nu$ disagree sharply, the importance weights become heavy-tailed and the right-hand estimator gets noisy. That is the practical bite of "absolute continuity is necessary but not sufficient": $\nu\ll\mu$ makes the estimator well defined, but it does not guarantee low variance.

Figure 8 · Importance sampling via $d\nu/d\mu$

Controls: $\nu$ mean shift $\Delta$ (0.5), number of samples (500).
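The two estimators compared in Figure 8 can be sketched as follows. The concrete choices are assumptions made for illustration: $\mu=\mathcal{N}(0,1)$, $\nu=\mathcal{N}(\Delta,1)$ with $\Delta=0.5$ as listed for the figure, the test function $g(x)=x$ (so the true answer is $\Delta$), and a fixed seed and sample size. For this pair the density ratio has the closed form $d\nu/d\mu(x)=\exp(\Delta x-\Delta^2/2)$.

```python
import math
import random


def density_ratio(x, delta):
    """d nu / d mu for nu = N(delta, 1) and mu = N(0, 1)."""
    return math.exp(delta * x - 0.5 * delta * delta)


def direct_estimate(g, delta, n, rng):
    """Average g over samples drawn from nu itself."""
    return sum(g(rng.gauss(delta, 1.0)) for _ in range(n)) / n


def importance_estimate(g, delta, n, rng):
    """Average g * (d nu / d mu) over samples drawn from mu."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        total += g(x) * density_ratio(x, delta)
    return total / n


def mean_weight(delta, n, rng):
    """Sample mean of the weights; its expectation under mu is 1."""
    return sum(density_ratio(rng.gauss(0.0, 1.0), delta) for _ in range(n)) / n


def g(x):
    return x


delta, n = 0.5, 20000
rng = random.Random(0)
direct = direct_estimate(g, delta, n, rng)
weighted = importance_estimate(g, delta, n, rng)
avg_weight = mean_weight(delta, n, rng)

print(direct, weighted, avg_weight)
```

For this Gaussian pair the second moment of the weights under $\mu$ is $e^{\Delta^2}$, so raising $\Delta$ inflates the variance of the weighted estimator quickly even though absolute continuity never fails.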

Where this goes next

The same measure-change identity appears in several places:

Statistics. The likelihood ratio in a Neyman–Pearson test is a Radon–Nikodym derivative of one hypothesis's law with respect to another.
Stochastic calculus. Girsanov's theorem says that under mild conditions you can change the drift of a Brownian motion by reweighting the measure with an explicit exponential martingale, and that martingale is exactly $dQ/dP$.
Machine learning. The KL divergence is $\mathbb{E}_\nu[\log(d\nu/d\mu)]$. Importance sampling, off-policy correction, and most "reweighting" tricks are change-of-measure in disguise.
Information theory. Mutual information is a KL divergence between the joint and the product of marginals, so it is the expected log of a Radon–Nikodym derivative on a product space.
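The last two items can be made concrete on a finite product space, where the Radon–Nikodym derivative of the joint law with respect to the product of marginals is a ratio of table entries. The sketch below uses a made-up joint distribution of two binary variables; the function names are illustrative.

```python
import math


def kl_divergence(nu, mu):
    """KL[nu || mu] = E_nu[log(d nu / d mu)] for finite measures, in nats."""
    return sum(p * math.log(p / mu[k]) for k, p in nu.items() if p > 0)


def marginals(joint):
    """Marginal laws of X and Y from a joint table {(x, y): mass}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def mutual_information(joint):
    """KL between the joint law and the product of its marginals."""
    px, py = marginals(joint)
    product_law = {(x, y): px[x] * py[y] for x in px for y in py}
    return kl_divergence(joint, product_law)


# Hypothetical joint laws of two binary variables.
correlated = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(mutual_information(correlated))
print(mutual_information(independent))
```

When the variables are independent the joint law equals the product law, the density ratio is identically one, and the mutual information vanishes.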

Once you stop thinking of a density as the probability and start thinking of it as a relationship between two measures, several formulas collapse to one statement: $\int g\, d\nu = \int g \cdot \tfrac{d\nu}{d\mu}\, d\mu$.

What next

The same change-of-measure machinery reappears in information, approximation, and simulation.

Divergence: KL Divergence. Turn the density ratio idea into a directed mismatch between probability laws.
Information: Entropy & Mutual Information. Mutual information is a KL divergence between two measures: the joint law and the product of marginals.
Approximation: Free Energy & Variational Inference. Use KL to turn an intractable posterior measure into a tractable optimization problem.
Simulation: Monte Carlo & MCMC. Importance sampling is the Radon–Nikodym derivative turned into an estimator; the particle-filter version is on the state-space-filtering page.