Measure Theory & Random Variables
Probability is a measure on a space of outcomes, a random variable is a function on that space, and a density is a ratio of two measures. This page builds that picture step by step: first the space of events, then measures on that space, then random variables as measurable maps, and finally the Radon–Nikodym derivative that turns one measure into another.
1. Measurable spaces, measure spaces, and probability measures
A measurable space is a pair $(\Omega, \mathcal{F})$: a set $\Omega$ of outcomes, and a $\sigma$-algebra $\mathcal{F}$ of subsets of $\Omega$ called measurable sets or events. The $\sigma$-algebra says which questions about the outcome are legitimate: it contains $\Omega$, is closed under complements, and is closed under countable unions.
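On a finite outcome set the three axioms can be checked mechanically. The sketch below is only an illustration: the function name `is_sigma_algebra` and the two example families are assumptions made here, and on a finite set closure under countable unions reduces to closure under pairwise unions.

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms on a finite outcome set.

    On a finite set, closure under countable unions is the same as
    closure under pairwise unions.
    """
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    # Axiom 1: the whole space is an event.
    if omega not in family:
        return False
    # Axiom 2: closed under complements.
    if any(omega - a not in family for a in family):
        return False
    # Axiom 3: closed under unions.
    return all(a | b in family for a, b in combinations(family, 2))

omega = {"HH", "HT", "TH", "TT"}
# Hypothetical example: events that depend only on the first coin.
first_coin = [set(), {"HH", "HT"}, {"TH", "TT"}, omega]
# Hypothetical counterexample: the complement of {"HH"} is missing.
not_closed = [set(), {"HH"}, omega]

print(is_sigma_algebra(omega, first_coin))  # True
print(is_sigma_algebra(omega, not_closed))  # False
```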
A measure space adds a measure: $(\Omega,\mathcal{F},\mu)$, where $\mu\colon\mathcal{F}\to[0,\infty]$ satisfies $\mu(\emptyset)=0$ and is countably additive on disjoint unions. A probability measure is the special case $\mathbb{P}(\Omega)=1$, and then $(\Omega,\mathcal{F},\mathbb{P})$ is a probability space.
In the figure below, $\Omega$ is the unit square. Drag the four events to overlap or separate them; the bar chart shows the measure $\mathbb{P}$ of each event (here, area). Notice that $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)$ only when $A$ and $B$ are disjoint; overlap eats some of the sum. Switch views to see complement and intersection identities drawn on the same measurable space.
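The additivity and complement identities can be verified exactly on a discretized stand-in for the unit square. This is a sketch under assumptions: the 10×10 grid, the events `A`, `B`, `C`, and the helper `prob` are all invented for illustration, with normalized cell counts playing the role of area.

```python
from fractions import Fraction

# Assumed toy space: a 10x10 grid of equally likely cells.
omega = {(i, j) for i in range(10) for j in range(10)}

def prob(event):
    """Normalized counting measure: the discrete analogue of area."""
    return Fraction(len(event & omega), len(omega))

A = {(i, j) for i in range(0, 5) for j in range(0, 5)}   # 25 cells
B = {(i, j) for i in range(3, 8) for j in range(3, 8)}   # 25 cells, overlaps A
C = {(i, j) for i in range(5, 10) for j in range(0, 5)}  # disjoint from A

# Overlapping events: the intersection must be subtracted once.
print(prob(A | B) == prob(A) + prob(B) - prob(A & B))  # True
# Disjoint events: plain additivity holds.
print(prob(A | C) == prob(A) + prob(C))                # True
# Complement identity.
print(prob(omega - A) == 1 - prob(A))                  # True
```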
2. Random variables as measurable functions
A random variable is just a measurable function $X\colon\Omega\to\mathbb{R}$. "Measurable" means that whenever $B$ is a measurable set of real numbers, its preimage $X^{-1}(B)$ is an event in $\mathcal{F}$. This is the condition that lets probabilities of numerical statements such as "$X\in B$" be defined.
Whether $X$ is measurable is a relationship between the function and the $\sigma$-algebra $\mathcal{F}$ we are working with. The same function can be measurable with respect to one $\sigma$-algebra and fail with respect to a coarser one. In the figure below, $\Omega=\{HH,HT,TH,TT\}$ is the outcome of two coin flips and three $\sigma$-algebras are selectable: $\mathcal{F}_1$ ("only the first coin is observable"), $\mathcal{F}_2$ ("only the second coin"), and $2^\Omega$ (every subset is measurable). Pick an $X$, brush a Borel set $B$ on $\mathbb{R}$, and watch its preimage light up on $\Omega$. The verdict shows whether $X^{-1}(B)\in\mathcal{F}$, which is exactly the condition needed for $\mathbb{P}(X\in B)$ to be defined.
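The two-coin example is small enough to check by brute force. In the sketch below, the names `is_measurable`, `first_is_heads`, and `num_heads` are assumptions chosen for illustration. On a finite space it suffices to test the preimage of each single value, because the preimage of any Borel set is a finite union of those level sets.

```python
from itertools import chain, combinations

omega = ("HH", "HT", "TH", "TT")

def power_set(points):
    """All subsets of a finite collection, as frozensets."""
    points = list(points)
    subsets = chain.from_iterable(
        combinations(points, r) for r in range(len(points) + 1)
    )
    return {frozenset(s) for s in subsets}

# F1: only the first coin is observable. F2: only the second coin.
F1 = {frozenset(), frozenset({"HH", "HT"}), frozenset({"TH", "TT"}), frozenset(omega)}
F2 = {frozenset(), frozenset({"HH", "TH"}), frozenset({"HT", "TT"}), frozenset(omega)}
F_all = power_set(omega)

def is_measurable(X, sigma_algebra):
    """X maps each outcome to a real number; check every level set is an event."""
    for value in set(X.values()):
        preimage = frozenset(w for w in omega if X[w] == value)
        if preimage not in sigma_algebra:
            return False
    return True

first_is_heads = {w: float(w[0] == "H") for w in omega}
num_heads = {w: float(w.count("H")) for w in omega}

print(is_measurable(first_is_heads, F1))  # True
print(is_measurable(first_is_heads, F2))  # False
print(is_measurable(num_heads, F1))       # False
print(is_measurable(num_heads, F_all))    # True
```

The head count fails against $\mathcal{F}_1$ because $\{HH\}$, the preimage of the value 2, cannot be described by looking at the first coin alone.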
The "distribution of $X$" is not really a thing that lives on $\Omega$; it lives on $\mathbb{R}$. It is the pushforward measure
$$ \mathbb{P}_X(B) \;=\; \mathbb{P}\bigl(X^{-1}(B)\bigr), \qquad B\in\mathcal{B}(\mathbb{R}), $$where $\mathcal{B}(\mathbb{R})$ is the Borel $\sigma$-algebra on $\mathbb{R}$. That is, the probability that $X$ lands in $B$ is the $\mathbb{P}$-measure of the set of outcomes that $X$ sends into $B$. Pushing a measure forward by a function is the abstract version of "change of variables."
In Figure 3, $\Omega=[0,1]$ carries Lebesgue measure (uniform). Drag the function $X$, and watch the induced distribution on the $y$-axis appear as a histogram. The teal band is a measurable set $B$ in the target space; its highlighted preimage $X^{-1}(B)$ is the event in $\Omega$ whose measure defines $\mathbb{P}_X(B)$. A steep piece of $X$ stretches a small interval in $\Omega$ over a wide range in $\mathbb{R}$, so the pushforward density there is small; a flat piece concentrates mass.
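A pushforward can also be estimated by simulation. The sketch below assumes one particular map, $X(\omega)=\omega^2$ on $\Omega=[0,1]$ with Lebesgue measure; the map, the interval $[a,b]$, and the sample size are illustrative choices. For this $X$ the preimage of $[a,b]$ is $[\sqrt{a},\sqrt{b}]$, so $\mathbb{P}_X([a,b])=\sqrt{b}-\sqrt{a}$.

```python
import math
import random

def X(w):
    """Assumed example map on [0, 1]; flat near 0, steep near 1."""
    return w * w

def pushforward_prob(a, b, n=200_000, seed=0):
    """Monte Carlo estimate of P_X([a, b]) = P(X^{-1}([a, b]))."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if a <= X(rng.random()) <= b)
    return hits / n

a, b = 0.25, 0.64
estimate = pushforward_prob(a, b)
# Exact value: Lebesgue measure of the preimage [sqrt(a), sqrt(b)] = [0.5, 0.8].
exact = math.sqrt(b) - math.sqrt(a)

print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

Because this $X$ is flat near $\omega=0$, its pushforward piles mass near $y=0$, matching the remark that a flat piece concentrates mass.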
The pushforward is functorial: a single measurable map $X\colon\Omega\to E$ acts simultaneously on three layers. Sample spaces transform covariantly ($\omega\mapsto X(\omega)$), $\sigma$-algebras transform contravariantly (events on the target pull back to events on the source via $X^{-1}$), and measures transform covariantly again (each $\mu$ on $\Omega$ produces $X_*\mu$ on $E$). The diagram below shows all three flavors of the same map.
3. Absolute continuity between two measures
Now suppose we have two measures on the same measurable space $(\Omega,\mathcal{F})$; call them $\mu$ and $\nu$. We say $\nu$ is absolutely continuous with respect to $\mu$, written $\nu\ll\mu$, if every $\mu$-null set is also $\nu$-null:
$$ \mu(A)=0 \;\Longrightarrow\; \nu(A)=0. $$Intuitively, $\nu$ cannot put mass where the reference measure $\mu$ refuses to look. If $\mu$ is Lebesgue measure on $\mathbb{R}$ and $\nu$ has a point mass at $x_0$, then $\nu\not\ll\mu$ because $\{x_0\}$ has $\mu$-measure zero but $\nu$-measure positive.
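For measures on a countable set, stored as point masses, the definition reduces to a support check: every $\mu$-null set consists of points with zero $\mu$-mass, so $\nu\ll\mu$ holds exactly when $\nu$ puts no mass on such points. A minimal sketch, with the dictionaries and the function name assumed for illustration:

```python
def is_absolutely_continuous(nu, mu):
    """Return True if nu << mu for measures given as {point: mass} dicts.

    Points missing from a dict are treated as having mass zero.
    """
    return all(mu.get(x, 0.0) > 0.0 for x, mass in nu.items() if mass > 0.0)

mu = {"a": 0.5, "b": 0.5, "c": 0.0}
nu_ok = {"a": 0.9, "b": 0.1}    # lives inside the support of mu
nu_bad = {"a": 0.5, "c": 0.5}   # puts mass on "c", which is mu-null

print(is_absolutely_continuous(nu_ok, mu))   # True
print(is_absolutely_continuous(nu_bad, mu))  # False
```

The relation is not symmetric: in this example `mu` is not absolutely continuous with respect to `nu_bad`, since `mu` charges `"b"` and `nu_bad` does not.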
Drag the two distributions below. The shaded regions show where each measure puts mass. The red badge lights up when $\nu\ll\mu$ fails: when $\nu$ has positive mass in a region where $\mu$ has none.
4. The Radon–Nikodym derivative
Absolute continuity is exactly the condition that makes a density relative to $\mu$ possible. If $\nu\ll\mu$ (and both are $\sigma$-finite), the Radon–Nikodym theorem says there exists a measurable function $f = \dfrac{d\nu}{d\mu}\colon\Omega\to[0,\infty)$, unique up to $\mu$-null sets, such that
$$ \nu(A) \;=\; \int_A f \, d\mu \qquad \text{for every } A\in\mathcal{F}. $$This $f$ is the density of $\nu$ with respect to $\mu$. When $\mu$ is Lebesgue measure on $\mathbb{R}$, this is the ordinary probability density function. When $\mu$ is counting measure on a discrete set, $f$ is the probability mass function. The Radon–Nikodym derivative unifies these into one object. It is also the ratio inside the measure-theoretic form of KL divergence: when $\nu\ll\mu$, $\mathrm{KL}[\nu\Vert\mu]=\int \log(d\nu/d\mu)\,d\nu$.
The figure shows $\mu$ and $\nu$ as densities (top), and their ratio $f = d\nu/d\mu$ (bottom). Where $\nu$ is large relative to $\mu$, $f$ is large. Where $\mu$ is large relative to $\nu$, $f$ is small. Where $\mu$ is zero and $\nu$ is not, $f$ blows up, which is just the statement that $\nu\not\ll\mu$ there.
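On a finite space with point masses, the Radon–Nikodym derivative is literally the pointwise ratio of masses, and the defining integral is a weighted sum. The sketch below uses made-up measures `mu` and `nu` and hypothetical helper names; it checks $\nu(A)=\int_A f\,d\mu$ for one set and evaluates the KL formula.

```python
import math

# Assumed example measures on a three-point space.
mu = {"a": 0.5, "b": 0.25, "c": 0.25}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}

def radon_nikodym(nu, mu):
    """Pointwise density d(nu)/d(mu); requires nu << mu."""
    f = {}
    for x in set(mu) | set(nu):
        m, n = mu.get(x, 0.0), nu.get(x, 0.0)
        if m > 0.0:
            f[x] = n / m
        elif n > 0.0:
            raise ValueError(f"nu is not absolutely continuous w.r.t. mu at {x!r}")
        else:
            f[x] = 0.0  # any value works on a mu-null point
    return f

def integrate(f, mu, A):
    """Integral of f over the set A with respect to mu."""
    return sum(f[x] * mu.get(x, 0.0) for x in A)

f = radon_nikodym(nu, mu)
A = {"b", "c"}
nu_A = sum(nu[x] for x in A)
print(nu_A, integrate(f, mu, A))  # the two numbers agree up to rounding

# KL[nu || mu] = integral of log(d nu / d mu) with respect to nu.
kl = sum(nu[x] * math.log(f[x]) for x in nu if nu[x] > 0.0)
print(kl)
```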
5. Change of variables: where the Jacobian comes from
Combining the pushforward (Section 2) with the Radon–Nikodym derivative gives the classical density transformation formula. Let $X$ take values in an open set $U\subseteq\mathbb{R}^n$ with density $f_X$ against Lebesgue measure $\lambda$, and let $Y = g(X)$ for a $C^1$ diffeomorphism $g\colon U \to V$ onto an open set $V\subseteq\mathbb{R}^n$. The distribution of $Y$ is the pushforward $\mathbb{P}_Y = g_*\mathbb{P}_X$. For any Borel set $B\subseteq V$,
$$ \mathbb{P}_Y(B) \;=\; \mathbb{P}_X\bigl(g^{-1}(B)\bigr) \;=\; \int_{g^{-1}(B)} f_X(x)\,d\lambda(x). $$Substituting $x = g^{-1}(y)$ multiplies $d\lambda$ by the Jacobian factor $|\det Dg^{-1}(y)|$, so
$$ f_Y(y) \;=\; f_X\bigl(g^{-1}(y)\bigr)\,\bigl|\det Dg^{-1}(y)\bigr| \;=\; \frac{f_X\bigl(g^{-1}(y)\bigr)}{\bigl|\det Dg\bigl(g^{-1}(y)\bigr)\bigr|}. $$In one dimension this reduces to $f_Y(y) = f_X(g^{-1}(y))/|g'(g^{-1}(y))|$. The Jacobian is a local volume ratio: the image of a small region around $x$ has $\lambda$-volume $|\det Dg(x)|$ times that of the region itself. Where $|\det Dg(x)|>1$, the same probability mass spreads over more volume on the $Y$-side and the density there is diluted by that factor; where $|\det Dg(x)|<1$, the mass is compressed and the density rises. Figure 3 already shows this in one dimension: a steep slope of $X$ stretches a small interval in $\Omega$ over a wide range in $\mathbb{R}$, and the pushforward density drops accordingly.
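The one-dimensional formula is easy to test numerically. The sketch below assumes a specific example, $X\sim N(0,1)$ and $g(x)=e^x$ (so $Y$ is log-normal); these choices, the interval, and the midpoint-rule integrator are illustrative. It compares $\int_a^b f_Y\,dy$ with $\mathbb{P}_X(g^{-1}([a,b]))$ computed from the normal CDF.

```python
import math

def f_X(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def g_inv(y):
    """Inverse of the assumed map g(x) = exp(x)."""
    return math.log(y)

def g_prime(x):
    """Derivative of g(x) = exp(x)."""
    return math.exp(x)

def f_Y(y):
    """Density of Y = g(X): f_X(g^{-1}(y)) / |g'(g^{-1}(y))|."""
    x = g_inv(y)
    return f_X(x) / abs(g_prime(x))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def integrate(f, a, b, n=10_000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

a, b = 0.5, 3.0
via_density = integrate(f_Y, a, b)
# P_Y([a, b]) = P_X(g^{-1}([a, b])) = P(log a <= X <= log b).
via_preimage = normal_cdf(math.log(b)) - normal_cdf(math.log(a))

print(f"{via_density:.6f}  {via_preimage:.6f}")
```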
In two dimensions the determinant is unmistakably a matrix quantity, not just a derivative. The figure below shows a reference grid on the source $U$ and its image on the target $V$ under four choices of $g$. By default the source carries a Gaussian-bump input density $f_X$ and the target shows the resulting $f_Y = f_X/|\det Dg|$, so for every preset you can see how the map reshapes the input distribution — the shear skews it, uniform scaling squashes it, polar maps an $(r, \theta)$-rectangle bump into a curved swath in the Cartesian plane, and reflection flips it. The draggable focus cell shows the local Jacobian image as a small parallelogram, spanned by the two columns of $Dg$ at that point. Toggling the bump off switches the source panel to a shading of $|\det Dg|$ itself — flat for the linear presets (constant Jacobian), and fading from contraction at small $r$ to expansion at large $r$ in the polar case.
This is Radon–Nikodym applied along a coordinate change. Pushing Lebesgue forward under $g$ gives the measure $g_*\lambda$ on $V$ with Radon–Nikodym derivative $d(g_*\lambda)/d\lambda(y) = |\det Dg^{-1}(y)|$; the density formula is the chain rule $\tfrac{d\mathbb{P}_Y}{d\lambda} = \tfrac{d\mathbb{P}_Y}{d(g_*\lambda)} \cdot \tfrac{d(g_*\lambda)}{d\lambda}$ with $d\mathbb{P}_Y/d(g_*\lambda) = f_X \circ g^{-1}$.
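The local-volume reading of the Jacobian in two dimensions can be checked with finite differences. This is a rough sketch: the helper `jacobian_det`, the step size `h`, and the two example maps (polar coordinates and a shear with an arbitrary coefficient) are assumptions made for illustration.

```python
import math

def polar(r, theta):
    """Polar-to-Cartesian map (r, theta) -> (x, y)."""
    return (r * math.cos(theta), r * math.sin(theta))

def shear(u, v, k=0.8):
    """Linear shear with a hypothetical coefficient k."""
    return (u + k * v, v)

def jacobian_det(g, u, v, h=1e-6):
    """Central-difference estimate of det Dg at (u, v) for g: R^2 -> R^2."""
    # Columns of Dg: partial derivatives with respect to u and to v.
    col_u = [(p - m) / (2.0 * h) for p, m in zip(g(u + h, v), g(u - h, v))]
    col_v = [(p - m) / (2.0 * h) for p, m in zip(g(u, v + h), g(u, v - h))]
    # Signed area of the parallelogram spanned by the two columns.
    return col_u[0] * col_v[1] - col_u[1] * col_v[0]

# Polar map: |det Dg| = r, so cells shrink for r < 1 and grow for r > 1.
for r in (0.5, 1.0, 2.0):
    print(r, round(jacobian_det(polar, r, 0.7), 6))

# Shear: area-preserving, determinant 1 everywhere.
print(round(jacobian_det(shear, 0.3, -1.2), 6))
```

Dividing an input density by these values gives $f_Y$ at the image point, as in the density formula above.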
6. Mixed and singular: the Lebesgue decomposition
The clean discrete-vs-continuous picture covers the textbook cases, but real distributions don't always pick a side. The Lebesgue decomposition says that relative to any $\sigma$-finite reference $\mu$, every $\sigma$-finite measure $\nu$ splits uniquely as
$$ \nu \;=\; \nu_{\mathrm{ac}} \;+\; \nu_{\mathrm{sing}}, $$where $\nu_{\mathrm{ac}}\ll\mu$ has a Radon–Nikodym derivative (a true density), and $\nu_{\mathrm{sing}}$ lives on a $\mu$-null set. When $\mu$ is Lebesgue measure, the singular part further splits into an atomic piece (point masses such as $\delta_{x_0}$) and a continuous singular piece (no atoms, but supported on a Lebesgue-null set; the Cantor distribution is the canonical example).
A spike-and-slab mixture is the everyday example: the prior puts probability $\alpha$ on the point $\theta=0$ and spreads the remaining $1-\alpha$ as a Gaussian slab. Its CDF has both a jump (the spike) and a smooth rise (the slab); no single density w.r.t. Lebesgue can represent it.
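The jump-plus-smooth-rise shape of the CDF can be written down directly. In the sketch below, the weight `alpha = 0.3`, the slab scale `sigma = 1.0`, and the function names are illustrative assumptions, not values from the text.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def spike_and_slab_cdf(t, alpha=0.3, sigma=1.0):
    """CDF of alpha * delta_0 + (1 - alpha) * N(0, sigma^2)."""
    spike = alpha if t >= 0.0 else 0.0            # atomic part: a jump at 0
    slab = (1.0 - alpha) * normal_cdf(t / sigma)  # absolutely continuous part
    return spike + slab

eps = 1e-9
jump = spike_and_slab_cdf(0.0) - spike_and_slab_cdf(-eps)
print(f"jump at 0 = {jump:.6f}")  # approximately alpha
```

The jump is the mass of the set $\{0\}$, which has Lebesgue measure zero, so no Lebesgue density can account for it; only the slab term contributes to $\nu_{\mathrm{ac}}$.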
"Discrete or continuous?" is the wrong organizing question. The right one is how $\nu$ sits relative to $\mu$: fully absolutely continuous, fully singular, or somewhere in between. The same measure can look one way against Lebesgue and a different way against counting measure or a Gaussian reference. The next figure makes that change-of-base concrete.
7. Change of measure for expectations
The most useful consequence of the Radon–Nikodym theorem is the change-of-measure formula. If $\nu\ll\mu$, then for any $\nu$-integrable $g$,
$$ \mathbb{E}_\nu[g(X)] \;=\; \int g\,d\nu \;=\; \int g \cdot \frac{d\nu}{d\mu}\, d\mu \;=\; \mathbb{E}_\mu\!\left[g(X)\,\frac{d\nu}{d\mu}(X)\right]. $$You can compute an expectation under $\nu$ by sampling under $\mu$ and reweighting each sample by the density ratio. This identity underlies importance sampling, likelihood-ratio tests, Girsanov's theorem in stochastic calculus, and off-policy evaluation in reinforcement learning.
The figure below estimates $\mathbb{E}_\nu[g]$ in two ways: directly under $\nu$ (left), and by importance sampling from $\mu$ with weights $d\nu/d\mu$ (right). Both converge to the same number, but with very different variance. When $\mu$ and $\nu$ disagree sharply, the importance weights become heavy-tailed and the right-hand estimator gets noisy; that's the practical bite of "absolute continuity is necessary but not sufficient."
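Both estimators fit in a few lines. The sketch below assumes a concrete pair, $\mu=N(0,1)$ and $\nu=N(1,1)$, with $g(x)=x^2$; these choices and the sample size are illustrative. For this pair the density ratio is $d\nu/d\mu(x)=e^{x-1/2}$ and the exact answer is $\mathbb{E}_\nu[X^2]=2$.

```python
import math
import random

def g(x):
    return x * x

def weight(x):
    """d(nu)/d(mu) for nu = N(1, 1) and mu = N(0, 1): exp(x - 1/2)."""
    return math.exp(x - 0.5)

def direct_estimate(n, rng):
    """Average g over samples drawn from nu itself."""
    return sum(g(rng.gauss(1.0, 1.0)) for _ in range(n)) / n

def importance_estimate(n, rng):
    """Average g * (d nu / d mu) over samples drawn from mu."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        total += g(x) * weight(x)
    return total / n

rng = random.Random(0)
n = 200_000
direct = direct_estimate(n, rng)
reweighted = importance_estimate(n, rng)
print(f"direct = {direct:.3f}, importance = {reweighted:.3f}, exact = 2")
```

Both numbers approach 2, but the reweighted estimator has a much larger per-sample variance here, because the weight $e^{x-1/2}$ grows exponentially in the right tail of $\mu$.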
Where this goes next
The same measure-change identity appears in several places:
- Statistics. The likelihood ratio in a Neyman–Pearson test is a Radon–Nikodym derivative of one hypothesis's law with respect to another.
- Stochastic calculus. Girsanov's theorem says that under mild conditions you can change the drift of a Brownian motion by multiplying by an explicit exponential martingale, and that martingale is exactly $dQ/dP$.
- Machine learning. The KL divergence is $\mathbb{E}_\nu[\log d\nu/d\mu]$. Importance sampling, off-policy correction, and most "reweighting" tricks are change-of-measure in disguise.
- Information theory. Mutual information is the KL divergence of the joint law from the product of its marginals, so its integrand is the log of a Radon–Nikodym derivative on a product space.
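As a small instance of the last two items, mutual information on a finite product space is the $\nu$-average of $\log(d\nu/d\mu)$, with $\nu$ the joint law and $\mu$ the product of marginals. The joint tables below are made up for illustration.

```python
import math

def mutual_information(joint):
    """I(X; Y) in nats for a joint pmf given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # The ratio p / (px * py) is d(joint) / d(product of marginals).
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0.0
    )

# Hypothetical joint laws on {0, 1} x {0, 1}.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

print(mutual_information(dependent))    # strictly positive
print(mutual_information(independent))  # 0.0
```

The joint law is always absolutely continuous with respect to the product of its marginals, so the ratio is well defined wherever the joint puts mass.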
Once you stop thinking of a density as the probability and start thinking of it as a relationship between two measures, several formulas collapse to one statement: $\int g\, d\nu = \int g \cdot \tfrac{d\nu}{d\mu}\, d\mu$.