KL Divergence
The Kullback-Leibler divergence compares two probability measures on the same measurable space. If $Q$ is absolutely continuous with respect to $P$, written $Q\ll P$, then
$$ \mathrm{KL}[Q\Vert P] = \int \log\frac{dQ}{dP}\,dQ = \mathbb{E}_Q\!\left[\log\frac{dQ}{dP}\right]. $$
Equivalently, by changing the integrating measure back to $P$, $\mathrm{KL}[Q\Vert P]=\int \frac{dQ}{dP}\log\frac{dQ}{dP}\,dP$. When both measures have densities $q$ and $p$ with respect to a common base measure, this reduces to the familiar formula $\int q(x)\log(q(x)/p(x))\,dx$. These are Lebesgue integrals; the compact form integrates $\log(dQ/dP)$ with respect to the probability measure $Q$.
Read KL as an expectation under the left-hand measure: samples are drawn from $Q$, and each sample is charged $\log(dQ/dP)$, the log-ratio of the mass $Q$ places there to the mass $P$ places there. That is why KL is directed. The distribution on the left decides where the comparison spends its attention.
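The expectation reading can be checked numerically. The sketch below is only an illustration under assumed parameters: it takes $Q=\mathcal{N}(0,1)$ and $P=\mathcal{N}(1,2^2)$ (values chosen here, not taken from the page), estimates $\mathbb{E}_Q[\log(q/p)]$ by sampling from $Q$, and compares that with a midpoint Riemann sum for $\int q\log(q/p)\,dx$. The helper names are hypothetical.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def kl_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, n=200_000, seed=0):
    """Estimate KL[Q||P] by averaging log(q/p) over samples drawn from Q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_q, sigma_q)  # the sample comes from the left-hand measure
        total += log_normal_pdf(x, mu_q, sigma_q) - log_normal_pdf(x, mu_p, sigma_p)
    return total / n

def kl_quadrature(mu_q, sigma_q, mu_p, sigma_p, lo=-12.0, hi=12.0, steps=24_000):
    """Approximate the integral of q*log(q/p) with a midpoint Riemann sum."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_q = log_normal_pdf(x, mu_q, sigma_q)
        log_p = log_normal_pdf(x, mu_p, sigma_p)
        total += math.exp(log_q) * (log_q - log_p) * dx
    return total

# Assumed example: Q = N(0, 1), P = N(1, 2^2).
kl_mc = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
kl_int = kl_quadrature(0.0, 1.0, 1.0, 2.0)
print(f"Monte Carlo under Q: {kl_mc:.4f}")
print(f"Riemann sum:         {kl_int:.4f}")
```

The two numbers should agree up to Monte Carlo noise. Sampling from $P$ instead would estimate a different quantity unless each term were reweighted by $dQ/dP$.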
1. The Radon-Nikodym derivative inside KL
The figure below keeps the base space ordinary, so $dQ/dP$ is just the density ratio $q(x)/p(x)$. The top panel shows the two probability densities, the middle panel shows the Radon-Nikodym derivative, and the bottom panel shows the KL integrand $q(x)\log(q(x)/p(x))$. In the fully measure-theoretic formula, the middle curve is the object being logged.
If $Q\not\ll P$, there is a set of $P$-measure zero that $Q$ gives positive mass, so no density $dQ/dP$ exists on the places $Q$ needs it, and by convention $\mathrm{KL}[Q\Vert P]=+\infty$. This is the measure-theoretic version of the discrete rule that putting positive $q_i$ where $p_i=0$ gives infinite KL.
2. Direction matters
KL is nonnegative and equals zero if and only if the two distributions agree (almost everywhere), but it is not a distance: it is not symmetric and it does not satisfy the triangle inequality. In general $\mathrm{KL}[q\Vert p] \neq \mathrm{KL}[p\Vert q]$. The figure below uses Gaussians because the exact value is available in closed form, while the lower plot shows the pointwise contribution $q(x)\log(q(x)/p(x))$.
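For two univariate Gaussians the closed form is $\log(\sigma_p/\sigma_q) + \bigl(\sigma_q^2 + (\mu_q-\mu_p)^2\bigr)/(2\sigma_p^2) - \tfrac12$. The sketch below evaluates it in both directions. The parameter values are assumptions picked for illustration and are not necessarily those used in the figure.

```python
import math

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] in nats."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2 * sigma_p**2)
            - 0.5)

# Assumed example: a narrow Gaussian N(0, 1) and a wide, shifted one N(1, 2^2).
narrow_vs_wide = kl_gauss(0.0, 1.0, 1.0, 2.0)  # narrow on the left
wide_vs_narrow = kl_gauss(1.0, 2.0, 0.0, 1.0)  # same pair, roles swapped
identical = kl_gauss(0.0, 1.0, 0.0, 1.0)       # matching distributions

print(f"KL[narrow || wide] = {narrow_vs_wide:.4f}")
print(f"KL[wide || narrow] = {wide_vs_narrow:.4f}")
print(f"KL[q || q]         = {identical:.4f}")
```

With these values the two directions come out near 0.443 and 1.307 nats: the wide distribution on the left pays heavily for samples that land in the narrow one's tails.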
3. The left distribution chooses the bill
The discrete case makes the accounting visible. Each outcome contributes $q_i\log(q_i/p_i)$. If $q_i=0$, that outcome contributes nothing, no matter how large $p_i$ is. If $q_i>0$ and $p_i=0$, the KL is infinite: $p$ says an event that actually happens is impossible.
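The accounting rules above can be written out directly. This is a minimal sketch with a hypothetical helper `kl_discrete` and made-up probability vectors; it encodes the two conventions from the text, namely that $q_i=0$ contributes nothing and that $q_i>0$ with $p_i=0$ makes the total infinite.

```python
import math

def kl_discrete(q, p):
    """KL[q||p] in nats for two probability vectors over the same outcomes."""
    if len(q) != len(p):
        raise ValueError("q and p must list the same outcomes")
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue          # outcome never sampled under q: contributes nothing
        if pi == 0.0:
            return math.inf   # q produces an outcome that p calls impossible
        total += qi * math.log(qi / pi)
    return total

# Assumed example: q ignores the third outcome, p spends half its mass on it.
q = [0.5, 0.5, 0.0]
p = [0.25, 0.25, 0.5]

print(kl_discrete(q, p))  # finite: only the first two outcomes are billed
print(kl_discrete(p, q))  # infinite: p puts mass where q has none
```

Here $\mathrm{KL}[q\Vert p]=\log 2$, because each of the two outcomes that $q$ uses costs $0.5\log(0.5/0.25)$, while swapping the roles gives $+\infty$.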
4. Forward KL covers; reverse KL chooses
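When a simple distribution $q$ approximates a multi-modal target $p$, the direction changes the qualitative behavior. Minimizing the forward divergence $\mathrm{KL}[p\Vert q]$ tends to cover all mass that $p$ might generate, because $q$ is charged wherever $p$ produces samples. Minimizing the reverse divergence $\mathrm{KL}[q\Vert p]$ tends to place $q$ where it can be confident, often inside one mode, because $q$ is charged only where it puts its own mass.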
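A small grid search makes the two behaviors concrete. The setup is an assumption made for illustration: the target $p$ is an equal mixture of $\mathcal{N}(-3,1)$ and $\mathcal{N}(3,1)$, the approximating family is a single Gaussian $\mathcal{N}(m,s^2)$, and both divergences are approximated by a Riemann sum on $[-12,12]$. A brute-force search over a coarse grid of $(m,s)$ stands in for a real optimizer.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_target(x):
    """Log density of the assumed target: 0.5*N(-3, 1) + 0.5*N(3, 1)."""
    a = math.log(0.5) + log_normal_pdf(x, -3.0, 1.0)
    b = math.log(0.5) + log_normal_pdf(x, 3.0, 1.0)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))  # log-sum-exp

# Midpoint grid on [-12, 12]; the target's log density is computed once.
STEPS = 600
DX = 24.0 / STEPS
XS = [-12.0 + (i + 0.5) * DX for i in range(STEPS)]
LOG_P = [log_target(x) for x in XS]

def both_kls(m, s):
    """Return (KL[p||q], KL[q||p]) for q = N(m, s^2), via a Riemann sum."""
    forward = 0.0
    reverse = 0.0
    for x, log_p in zip(XS, LOG_P):
        log_q = log_normal_pdf(x, m, s)
        forward += math.exp(log_p) * (log_p - log_q) * DX
        reverse += math.exp(log_q) * (log_q - log_p) * DX
    return forward, reverse

# Candidate means in [-4, 4] (step 0.5) and standard deviations in [0.5, 4] (step 0.25).
candidates = [(-4.0 + 0.5 * i, 0.5 + 0.25 * j) for i in range(17) for j in range(15)]
scores = {c: both_kls(*c) for c in candidates}
best_forward = min(candidates, key=lambda c: scores[c][0])
best_reverse = min(candidates, key=lambda c: scores[c][1])

print("forward KL[p||q] picks (m, s) =", best_forward)
print("reverse KL[q||p] picks (m, s) =", best_reverse)
```

Under these assumptions the forward fit should land near the center with a standard deviation close to $\sqrt{10}$ (moment matching), while the reverse fit should sit on one of the two modes with a standard deviation near 1. Which mode it picks is arbitrary, since the target is symmetric.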
5. KL is locally quadratic, and Fisher is the curvature
Globally, KL is asymmetric and unbounded. Locally, it is neither. Expand $\mathrm{KL}[p_\theta\Vert p_{\theta+\Delta\theta}]$ in $\Delta\theta$. The score $s_\theta = \partial_\theta\log p_\theta$ has mean zero under $p_\theta$, so the linear term vanishes; the second-order term is the Fisher information:
$$ \mathrm{KL}\bigl[p_\theta \,\Vert\, p_{\theta+\Delta\theta}\bigr] \;\approx\; \tfrac{1}{2}\, \Delta\theta^\top I(\theta)\,\Delta\theta. $$
This is the bridge between this page and Fisher information: KL's asymmetry disappears at second order, and the symmetric quadratic that remains is exactly the Fisher metric on the parameter manifold. So statements like "the model is locally sensitive to $\theta$" (Fisher information), "nearby parameters are easy to distinguish" (KL local quadratic), and "the MLE's asymptotic variance is $1/(nI(\theta))$, which attains the Cramér–Rao lower bound" are three views of the same curvature.
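The quadratic approximation is easy to test on a one-parameter family. The sketch below uses the Bernoulli family as an assumed example, where the Fisher information is $I(\theta)=1/(\theta(1-\theta))$, and compares the exact KL with $\tfrac12 I(\theta)\,\Delta\theta^2$ for shrinking steps. The choice $\theta=0.3$ is arbitrary.

```python
import math

def kl_bernoulli(a, b):
    """KL[Bernoulli(a) || Bernoulli(b)] in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def fisher_bernoulli(theta):
    """Fisher information of Bernoulli(theta) with respect to theta."""
    return 1.0 / (theta * (1.0 - theta))

theta = 0.3  # assumed base parameter
rows = []
for delta in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(theta, theta + delta)            # KL[p_theta || p_{theta+delta}]
    swapped = kl_bernoulli(theta + delta, theta)          # the other direction
    quadratic = 0.5 * fisher_bernoulli(theta) * delta**2  # local Fisher approximation
    rows.append((delta, exact, swapped, quadratic))
    print(f"delta={delta:<6} exact={exact:.3e} swapped={swapped:.3e} quadratic={quadratic:.3e}")
```

As the step shrinks, both directions of KL and the Fisher quadratic should agree to more and more leading digits, which is the sense in which the asymmetry disappears at second order.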
6. How KL relates to nearby concepts
Use KL when the question is which distribution is responsible for the samples and which approximation is being charged. It is also the natural choice when support mismatches must not be ignored: placing mass where the reference distribution has none makes the divergence infinite. For the underlying measure machinery, see Radon-Nikodym derivatives. In variational inference, this directed gap becomes the optimization target; see Free Energy & Variational Inference.