Fisher Information

Likelihood geometry, score functions, Fisher information, exponential families, and two routes to uninformative priors.

An "uninformative" prior is not a prior with no assumptions. It is an attempt to avoid privileging one coordinate system for the parameter over another. The usual route to that idea runs through likelihood geometry: the score tells you in which direction the likelihood rises, Fisher information measures local distinguishability, and Jeffreys prior uses the square root of that information as the volume element on parameter space.

One sentence: Fisher information says how much the model changes when the parameter moves; Jeffreys prior puts prior mass proportional to the square root of that local information.

1. Likelihood and MLE

For a model $p(x\mid\theta)$ and data $x_{1:n}$, the likelihood is $L(\theta)=p(x_{1:n}\mid\theta)$ and the log-likelihood is $\ell(\theta)=\log L(\theta)$. The maximum likelihood estimate is the parameter value that makes the observed data most probable under the model.
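As a minimal sketch of these definitions, assume a Bernoulli model with $n=20$ trials and $k=12$ successes (illustrative values). The function name `bernoulli_log_likelihood` and the grid resolution are choices made for this example only. The grid maximizer should agree with the closed-form MLE $k/n$.

```python
import math

def bernoulli_log_likelihood(p, n, k):
    """Log-likelihood of k successes in n Bernoulli trials.

    The binomial coefficient is dropped because it does not depend on p.
    """
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

n, k = 20, 12  # illustrative data

# Interior grid only: log(0) is undefined at the endpoints p = 0 and p = 1.
grid = [i / 1000 for i in range(1, 1000)]
mle_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, n, k))

print(f"grid MLE = {mle_grid:.3f}, closed form k/n = {k / n:.3f}")
```

A grid search is used here only because it mirrors the picture of a curve with a peak; for this model the maximizer is available in closed form.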

Figure 1 · Likelihood geometry and the MLE [interactive figure; legend: log-likelihood, MLE, score; controls: model, sample size n (20), successes k (12), sample mean x̄ (0.8), known σ (1)]

2. The score

The score is the derivative of the log-likelihood: $s(\theta)=\partial_\theta\ell(\theta)$. Positive score means the likelihood rises as $\theta$ increases; negative score means it falls. Interior MLEs occur where the score crosses zero.
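The sign rule can be checked directly. The sketch below assumes a Bernoulli model with $n=24$ trials and $k=15$ successes (illustrative values); `bernoulli_score` is a name introduced for this example.

```python
def bernoulli_score(p, n, k):
    """Derivative of k*log(p) + (n - k)*log(1 - p) with respect to p."""
    return k / p - (n - k) / (1.0 - p)

n, k = 24, 15
p_hat = k / n  # closed-form MLE for this model

score_below = bernoulli_score(0.5, n, k)     # left of the MLE: likelihood still rising
score_at_mle = bernoulli_score(p_hat, n, k)  # at the MLE: score crosses zero
score_above = bernoulli_score(0.8, n, k)     # right of the MLE: likelihood falling

print(score_below, score_at_mle, score_above)
```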

Figure 2 · The score points uphill toward the MLE [interactive figure; legend: score $s(p)$, zero score / MLE, uphill direction; controls: trials n (24), successes k (15)]

Figure 2b · Likelihood landscape with score-arrow field [interactive figure; legend: log-likelihood height, score arrows, MLE; controls: trials n (28), successes k (17)]

3. Fisher information

Fisher information is the expected squared score, $I(\theta)=\mathbb{E}_\theta[s_\theta(X)^2]$. Under regularity conditions it also equals the expected negative curvature, $I(\theta)=-\mathbb{E}_\theta[\partial_\theta^2\log p(X\mid\theta)]$. Large information means nearby parameter values are easier to distinguish.
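For a single Bernoulli observation both definitions can be evaluated exactly, because the expectation is a sum over two outcomes. The sketch below is an illustration under that assumption; the helper names are hypothetical. Both routes should match the closed form $1/[p(1-p)]$.

```python
def score(x, p):
    """d/dp log p(x | p) for a single Bernoulli outcome x in {0, 1}."""
    return x / p - (1 - x) / (1.0 - p)

def second_derivative(x, p):
    """d^2/dp^2 log p(x | p) for a single Bernoulli outcome."""
    return -x / p**2 - (1 - x) / (1.0 - p) ** 2

def expectation(f, p):
    """Exact expectation of f(X, p) when X ~ Bernoulli(p)."""
    return (1.0 - p) * f(0, p) + p * f(1, p)

p = 0.35
info_squared_score = expectation(lambda x, q: score(x, q) ** 2, p)
info_curvature = -expectation(second_derivative, p)
info_closed_form = 1.0 / (p * (1.0 - p))

print(info_squared_score, info_curvature, info_closed_form)
```

For $n$ independent observations the information adds, giving $nI(p)$.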

Figure 3 · Fisher information as curvature and precision [interactive figure; legend: $nI(p)$, selected $p$, quadratic likelihood approximation; controls: probability p (0.5), sample size n (25)]

Fisher information is the curvature of KL divergence. Expand $\mathrm{KL}[p_\theta \,\|\, p_{\theta+\Delta\theta}]$ in $\Delta\theta$. The zeroth-order term is zero, the score has zero mean under $p_\theta$ so the linear term vanishes, and the second-order coefficient is the Fisher information: $$ \mathrm{KL}\bigl[p_\theta \,\|\, p_{\theta+\Delta\theta}\bigr] \;\approx\; \tfrac{1}{2}\, I(\theta)\,(\Delta\theta)^2. $$ KL is asymmetric, but its second-order local form is the symmetric quadratic $\tfrac{1}{2} \Delta\theta^\top I(\theta) \Delta\theta$ (written here for a vector parameter). In coordinates, this positive-definite quadratic form is exactly a Riemannian metric tensor $g_\theta(u,v) = u^\top I(\theta)\, v$ on the manifold of distributions. This is the starting point of information geometry. "Precision of the MLE," "expected squared score," "expected negative curvature of $\log p$," and "local KL distinguishability" are four views of one object.
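The quadratic approximation can be compared with the exact KL divergence for a Bernoulli model, where both have closed forms. This is a sketch under that assumption, with base $\theta = 0.4$ chosen for illustration; the function names are introduced here.

```python
import math

def kl_bernoulli(p, q):
    """Exact KL[Bernoulli(p) || Bernoulli(q)] in nats."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def fisher_bernoulli(p):
    """Fisher information of one Bernoulli observation."""
    return 1.0 / (p * (1.0 - p))

theta = 0.4
ratios = {}
for delta in (0.2, 0.05, 0.01):
    exact = kl_bernoulli(theta, theta + delta)
    quadratic = 0.5 * fisher_bernoulli(theta) * delta**2
    ratios[delta] = exact / quadratic
    print(f"delta={delta}: exact={exact:.6f}, quadratic={quadratic:.6f}")
```

The ratio of exact to quadratic approaches 1 as `delta` shrinks, which is what "second-order local form" means in practice.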

Figure 3c · KL ≈ ½·I·(Δθ)²: Fisher as the curvature of KL [interactive figure; legend: $\mathrm{KL}[p_\theta \,\|\, p_{\theta+\Delta\theta}]$, quadratic $\tfrac{1}{2} I(\theta)(\Delta\theta)^2$, selected $\Delta\theta$; controls: model, base $\theta$ (0.4), $|\Delta\theta|$ shown (0.2)]

Small $\Delta\theta$: the two curves coincide. Large $\Delta\theta$: the quadratic, which is symmetric in $\Delta\theta$, misses the asymmetry of KL.

Figure 3b · MLE sampling spread and the Cramér-Rao scale [interactive figure; legend: simulated MLEs, true parameter, $\mathcal N(\theta, 1/(nI(\theta)))$; controls: true probability $p$ (0.35), sample size n (50)]
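The sampling spread in Figure 3b can be approximated with a small simulation. The sketch below assumes Bernoulli data with true $p = 0.35$ and $n = 50$; the repetition count, the seed, and the function name `simulate_mles` are arbitrary choices for this example. It compares the empirical variance of the MLE with $1/(nI(p)) = p(1-p)/n$.

```python
import random
import statistics

def simulate_mles(p, n, reps, seed=0):
    """Draw `reps` datasets of n Bernoulli(p) trials and return each MLE k/n."""
    rng = random.Random(seed)
    mles = []
    for _ in range(reps):
        k = sum(1 for _ in range(n) if rng.random() < p)
        mles.append(k / n)
    return mles

p_true, n = 0.35, 50
mles = simulate_mles(p_true, n, reps=4000)

empirical_var = statistics.pvariance(mles)
cramer_rao_var = p_true * (1.0 - p_true) / n  # 1 / (n * I(p))

print(f"empirical variance = {empirical_var:.5f}")
print(f"1 / (n I(p))       = {cramer_rao_var:.5f}")
```

For the Bernoulli MLE the variance equals the Cramér-Rao bound exactly, so any gap seen here is simulation noise.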

4. Score as infinitesimal tilting; exponential families

The Radon–Nikodym derivative $\frac{d p_{\theta+\Delta\theta}}{d p_\theta} = \exp\!\bigl(\Delta\theta\cdot s_\theta(x)\bigr) + O(\Delta\theta^2)$ says that the score is the direction of infinitesimal tilting: changing $\theta$ a little reweights the measure by a factor that is linear in the score. Fisher information measures how much that tilt varies under $p_\theta$: $I(\theta) = \mathrm{Var}_\theta\bigl[s_\theta(X)\bigr]$. Models where the tilt has a closed-form structure are the exponential families:

$$ p(x\mid\theta) \;=\; h(x)\,\exp\!\bigl(\eta(\theta)\,T(x) - A(\theta)\bigr). $$

Here $T(x)$ is the sufficient statistic, the only function of the data the likelihood depends on. The score becomes $s_\theta(x) = \eta'(\theta)\bigl(T(x) - \mathbb{E}_\theta T\bigr)$, and Fisher information collapses to a variance: $I(\theta) = \eta'(\theta)^2 \,\mathrm{Var}_\theta T(X)$. Updating on data tilts the measure by $\eta(\theta) T(x)$; a posterior built on a conjugate prior stays in the same exponential family (its natural parameter just shifts). Conjugate pairs exist for this reason.
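As a concrete check, take the Bernoulli model written with $\theta = p$, so that $\eta(p) = \log\frac{p}{1-p}$, $T(x) = x$, and $\mathbb{E}_p T = p$. The sketch below (helper names are hypothetical) confirms that the exponential-family form of the score matches the direct derivative, and that $\eta'(p)^2\,\mathrm{Var}_p T$ reproduces $1/[p(1-p)]$.

```python
def eta_prime(p):
    """Derivative of the natural parameter eta(p) = log(p / (1 - p))."""
    return 1.0 / (p * (1.0 - p))

def score_direct(x, p):
    """d/dp log p(x | p), differentiated directly."""
    return x / p - (1 - x) / (1.0 - p)

def score_expfam(x, p):
    """Score in exponential-family form: eta'(p) * (T(x) - E[T])."""
    return eta_prime(p) * (x - p)

p = 0.3
var_T = p * (1.0 - p)                  # variance of T(X) = X
info_expfam = eta_prime(p) ** 2 * var_T
info_closed_form = 1.0 / (p * (1.0 - p))

for x in (0, 1):
    print(x, score_direct(x, p), score_expfam(x, p))
print(info_expfam, info_closed_form)
```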

Figure 3d · Score reweights samples; Fisher info = variance of the tilt [interactive figure; legend: $p_\theta$ (baseline), tilted $p_{\theta+\Delta\theta}$, score $s_\theta(x)$ (tilt direction); controls: family, base $\theta$ (1), $\Delta\theta$ (0.3)]

The tilted density equals baseline times $\exp(\Delta\theta\cdot s_\theta)$ (to first order). Where the score is positive, samples are upweighted; where negative, downweighted.
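The first-order claim can be tested numerically. The sketch below assumes an exponential distribution with rate $\theta$, $p(x\mid\theta) = \theta e^{-\theta x}$, which is one possible choice of family and not necessarily the one in the figure. It compares the exact density ratio with $\exp(\Delta\theta\cdot s_\theta(x))$.

```python
import math

def density(x, theta):
    """Exponential density with rate theta (assumed family for this sketch)."""
    return theta * math.exp(-theta * x)

def score(x, theta):
    """d/dtheta log density = 1/theta - x."""
    return 1.0 / theta - x

def tilt_relative_error(theta, delta, x):
    """Relative gap between the exact ratio and the first-order tilt."""
    exact_ratio = density(x, theta + delta) / density(x, theta)
    first_order = math.exp(delta * score(x, theta))
    return exact_ratio / first_order - 1.0

theta = 1.0
for delta in (0.3, 0.01):
    print(delta, tilt_relative_error(theta, delta, x=2.0))

# Positive score (x well below 1/theta) means upweighting; negative means downweighting.
ratio_low_x = density(0.2, theta + 0.3) / density(0.2, theta)
ratio_high_x = density(2.0, theta + 0.3) / density(2.0, theta)
print(ratio_low_x, ratio_high_x)
```

The relative error shrinks roughly like $\Delta\theta^2$, which is the sense in which the tilt is exact "to first order".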

The log-partition function does triple duty. Inside the exponential family $p(x\mid\eta) = h(x)\exp(\eta^\top T(x) - A(\eta))$, the normalizer $A(\eta) = \log\int h(x)\exp(\eta^\top T(x))\,dx$, the log-partition function, is a single object that encodes the whole geometry of the family in natural coordinates: $$ \nabla A(\eta) = \mathbb{E}_\eta[T(X)], \qquad \nabla^2 A(\eta) = \mathrm{Cov}_\eta(T(X)) = I(\eta). $$ $A$ is convex (its Hessian is a covariance matrix, hence positive semidefinite). Its first derivative gives the mean of the sufficient statistic; its second derivative gives the Fisher information in natural coordinates. The Legendre dual of $A$ is (negative) entropy, which is the foundation of variational inference.
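For the Bernoulli family in natural coordinates, $A(\eta) = \log(1 + e^\eta)$. The sketch below differentiates $A$ by finite differences (step sizes are arbitrary choices for this example) and compares the results with the mean $\sigma(\eta)$ and the variance $\sigma(\eta)(1-\sigma(\eta))$ of $T(X) = X$.

```python
import math

def log_partition(eta):
    """A(eta) for the Bernoulli family: log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def first_derivative(f, x, h=1e-5):
    """Central finite difference for f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite difference for f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

for eta in (0.0, 0.7):
    mean_T = sigmoid(eta)
    var_T = mean_T * (1.0 - mean_T)
    print(eta, first_derivative(log_partition, eta), mean_T)
    print(eta, second_derivative(log_partition, eta), var_T)
```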

Figure 3e · Log-partition $A(\eta)$: slope = mean, curvature = Fisher info [interactive figure; legend: $A(\eta)$, $A'(\eta) = \mathbb{E}_\eta[T]$, $A''(\eta) = \mathrm{Var}_\eta(T) = I(\eta)$, tangent at selected $\eta$; controls: family, natural parameter $\eta$ (0)]

Tangent slope at $\eta$ is $\mathbb{E}_\eta[T]$ (red dot); convex curvature is $I(\eta)$ (purple curve). Convexity of $A$ is just "covariance is positive semidefinite."

5. Jeffreys prior

A flat prior is only flat in the coordinate you chose. For Bernoulli $p$, a flat prior on $p$ is not flat on the log-odds $\phi=\log(p/(1-p))$. Jeffreys prior avoids that coordinate dependence by using Fisher information:

$$ \pi_J(\theta) \propto \sqrt{I(\theta)}. $$

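For Bernoulli data, $I(p)=1/[p(1-p)]$, so $\pi_J(p)\propto[p(1-p)]^{-1/2}$: the Beta$(1/2,1/2)$ prior. It is not flat in $p$, but it assigns the same probability to every event under any smooth reparameterization: transforming $\pi_J$ to a new coordinate with the Jacobian gives exactly $\sqrt{I}$ computed in that coordinate.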
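The coordinate dependence of a flat prior, and its absence for Jeffreys prior, can be checked on the log-odds coordinate $\phi=\log(p/(1-p))$. The sketch below uses unnormalized densities and hypothetical helper names. It pushes each prior from $p$ to $\phi$ with the Jacobian $|dp/d\phi| = p(1-p)$ and compares the result with $\sqrt{I(\phi)}$, where $I(\phi) = p(1-p)$ for the Bernoulli model in log-odds.

```python
import math

def sigmoid(phi):
    return 1.0 / (1.0 + math.exp(-phi))

def jeffreys_in_p(p):
    """Unnormalized Jeffreys density in p: sqrt(I(p))."""
    return 1.0 / math.sqrt(p * (1.0 - p))

def flat_in_p(p):
    return 1.0

def pushforward_to_log_odds(density_in_p, phi):
    """Density in phi implied by a density in p, via |dp/dphi| = p(1 - p)."""
    p = sigmoid(phi)
    return density_in_p(p) * p * (1.0 - p)

def jeffreys_in_phi(phi):
    """Jeffreys recipe applied directly in phi: sqrt(I(phi)) = sqrt(p(1 - p))."""
    p = sigmoid(phi)
    return math.sqrt(p * (1.0 - p))

for phi in (-2.0, 0.0, 1.5):
    print(phi,
          pushforward_to_log_odds(jeffreys_in_p, phi),
          jeffreys_in_phi(phi),
          pushforward_to_log_odds(flat_in_p, phi))
```

The first two columns agree at every $\phi$, while the third varies with $\phi$: a prior flat in $p$ is not flat in log-odds.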

Figure 4 · Flat priors depend on coordinates; Jeffreys prior transforms [interactive figure; legend: flat in p, flat in transformed coordinate mapped back, Jeffreys prior; controls: transformation]

Figure 4b · Jeffreys prior survives a change of coordinate [interactive figure; legend: Jeffreys density in $p$, pushforward to chosen coordinate and back, flat coordinate prior mapped back; controls: coordinate]

6. The Fisher metric: Jeffreys in multiple parameters

With one parameter, $\sqrt{I(\theta)}$ looks like just a square root. With two or more, $I(\theta)$ is a positive-definite matrix — the Fisher information metric on the parameter space — and Jeffreys' prior is

$$ \pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)}, $$

the volume element of that metric. The square root is no longer cosmetic: it is the same factor that turns a length into an area on a curved surface. For the Normal family $\mathcal{N}(\mu, \sigma^2)$,

$$ I(\mu, \sigma) \;=\; \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}, \qquad \sqrt{\det I(\mu, \sigma)} \;=\; \frac{\sqrt 2}{\sigma^2}. $$

So Jeffreys on $(\mu, \sigma)$ is $\propto 1/\sigma^2$: flat in $\mu$, blowing up as $\sigma \to 0$. It is improper, but it transforms covariantly under reparameterization — and the figure below makes that visible. The "Fisher ellipses" mode draws the unit ball of $I$ at a grid of $(\mu, \sigma)$ values (the Cramér–Rao ellipse for a unit-sample estimator, up to scale); the "Jeffreys density" mode is a heatmap of $\sqrt{\det I}$; and the "coordinate warp" mode overlays the $(\mu, \log \sigma)$ grid on the $(\mu, \sigma)$ grid.
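The Fisher matrix for $\mathcal{N}(\mu,\sigma^2)$ in $(\mu,\sigma)$ coordinates can be recovered numerically as the expected outer product of the score vector. The sketch below uses a plain trapezoid rule; the integration range, step count, and function names are assumptions made for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def scores(x, mu, sigma):
    """Partial derivatives of log N(x | mu, sigma^2) w.r.t. mu and sigma."""
    s_mu = (x - mu) / sigma**2
    s_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return (s_mu, s_sigma)

def fisher_matrix(mu, sigma, half_width=10.0, steps=4000):
    """E[s s^T] by the trapezoid rule on mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    info = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(steps + 1):
        x = lo + i * h
        weight = h * (0.5 if i in (0, steps) else 1.0)
        s = scores(x, mu, sigma)
        d = normal_pdf(x, mu, sigma)
        for a in range(2):
            for b in range(2):
                info[a][b] += weight * d * s[a] * s[b]
    return info

mu, sigma = 0.0, 2.0
info = fisher_matrix(mu, sigma)
sqrt_det = math.sqrt(info[0][0] * info[1][1] - info[0][1] * info[1][0])

print(info)                                 # approx [[1/sigma^2, 0], [0, 2/sigma^2]]
print(sqrt_det, math.sqrt(2.0) / sigma**2)  # Jeffreys density up to a constant
```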

Figure 4d · Fisher metric on the $(\mu, \sigma)$ plane [interactive figure; layers: Fisher unit-ball ellipse field, $\sqrt{\det I}$ density (Jeffreys), $(\mu, \log\sigma)$ grid (warp), focus point and CRLB ellipse, Fisher geodesic (3D view); controls: mode, show CRLB at sample size n = 25, show as surface (3D); the focus point moves when you click in the plane]

Why $\sqrt 2$, and why σ rather than σ²? The $\sigma$-precision in $I(\mu, \sigma)$ is exactly twice the $\mu$-precision, so the Fisher ellipses are not circular: their semi-axis along $\mu$ is $\sigma$ and their semi-axis along $\sigma$ is $\sigma/\sqrt 2$, so they are longer along $\mu$ by a factor of $\sqrt 2$ everywhere. The same family parameterized by $(\mu, \sigma^2)$ has a different Fisher matrix and different ellipses, but Jeffreys' prior on the family assigns the same probabilities; that invariance is the point.

With the metric in hand, three downstream constructions become geometric. The KL divergence between two nearby normals is, to leading order, $\mathrm{KL}(p_\theta \,\|\, p_{\theta+\delta}) \approx \tfrac12 \delta^\top I(\theta)\, \delta$: half the squared Fisher length of $\delta$. The natural gradient of a loss $\mathcal L(\theta)$ is $I(\theta)^{-1} \nabla \mathcal L(\theta)$, the Riemannian gradient under this metric rather than the Euclidean one. And for exponential families, the canonical and expectation parameters give the same manifold a pair of dually flat affine structures whose Bregman geometry is governed by the log-partition function $A(\eta)$ and its Legendre dual.
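Two of these constructions are easy to sketch for the Normal family in $(\mu,\sigma)$ coordinates, using the diagonal Fisher matrix $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$. The function names and the example numbers below are illustrative assumptions, not part of any library.

```python
import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    """Exact KL[N(mu1, sigma1^2) || N(mu2, sigma2^2)]."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2) ** 2) / (2.0 * sigma2**2)
            - 0.5)

def fisher_length_squared(d_mu, d_sigma, sigma):
    """delta^T I delta with I = diag(1/sigma^2, 2/sigma^2)."""
    return (d_mu**2 + 2.0 * d_sigma**2) / sigma**2

def natural_gradient(grad, sigma):
    """I^{-1} grad; the inverse is elementwise because I is diagonal here."""
    g_mu, g_sigma = grad
    return (sigma**2 * g_mu, 0.5 * sigma**2 * g_sigma)

mu, sigma = 0.0, 1.0
d_mu, d_sigma = 0.01, 0.01

exact_kl = kl_normal(mu, sigma, mu + d_mu, sigma + d_sigma)
quadratic_kl = 0.5 * fisher_length_squared(d_mu, d_sigma, sigma)
print(exact_kl, quadratic_kl)

# The same Euclidean gradient is rescaled differently at different sigma.
print(natural_gradient((1.0, 1.0), sigma=2.0))
```

The natural gradient stretches steps where the model is insensitive to the parameter (large $\sigma$) and shrinks them where it is sensitive.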

7. Maximum-entropy priors

Jeffreys' answer to "uninformative" is geometric: pick the prior that is invariant under smooth reparameterization. There is a second answer, due to Jaynes, that is constraint-based: pick the prior that commits the least given what you explicitly know. Formally, maximize the entropy of $\pi(\theta)$ subject to constraints like fixed support, mean, or variance. Lagrange multipliers turn the optimization into an exponential family: the multipliers are the natural parameters, and the constraint functions are the sufficient statistics.

The two recipes answer different questions and often disagree. Jeffreys on an exponential rate gives $\pi_J(\lambda)\propto 1/\lambda$; max-entropy on $[0,\infty)$ with fixed mean gives the exponential distribution itself, not a prior on its rate. The choice depends on whether you want invariance to coordinate change (Jeffreys) or minimal commitment beyond stated facts (max-entropy).

Why max-entropy keeps producing exponential families. The Lagrangian $\mathcal{L}[q] = -\int q\log q + \sum_i \lambda_i\bigl(\int T_i q - c_i\bigr)$, together with one more multiplier for the normalization constraint $\int q = 1$, is stationary where $-\log q(x) + \sum_i \lambda_i T_i(x)$ is constant in $x$, that is, at $q(x)\propto \exp\bigl(\sum_i \lambda_i T_i(x)\bigr)$, exactly the exponential-family form. The multipliers $\lambda_i$ are the natural parameters; the constraint functions $T_i$ are the sufficient statistics; the log-partition $A(\lambda)$ enforces normalization. This is one half of the Legendre duality between entropy and log-sum-exp.
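A discrete version makes the recipe concrete. The sketch below assumes a finite support $\{0,1,\dots,20\}$ and a single mean constraint with target 1.5; the support, bisection bounds, and helper names are choices for this example. It solves for the multiplier $\lambda$ in $q(x)\propto e^{\lambda x}$ by bisection, then compares the entropy with another distribution that has the same mean.

```python
import math

def tilted_distribution(support, lam):
    """q(x) proportional to exp(lam * x), computed stably."""
    exponents = [lam * x for x in support]
    shift = max(exponents)
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def mean_of(support, probs):
    return sum(x * q for x, q in zip(support, probs))

def entropy(probs):
    return -sum(q * math.log(q) for q in probs if q > 0.0)

def solve_multiplier(support, target_mean, lo=-20.0, hi=20.0, iters=200):
    """Bisection works because the mean increases with lam (its slope is a variance)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_of(support, tilted_distribution(support, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

support = list(range(21))
target_mean = 1.5

lam = solve_multiplier(support, target_mean)
max_ent = tilted_distribution(support, lam)

# A competitor with the same mean: half the mass on 0, half on 3.
competitor = [0.0] * len(support)
competitor[0], competitor[3] = 0.5, 0.5

print(lam, mean_of(support, max_ent))
print(entropy(max_ent), entropy(competitor))
```

The multiplier comes out negative because the target mean is below the mean of the uniform distribution on this support.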

Figure 4c · Maximum-entropy distributions from explicit constraints [interactive figure; legend: max-entropy density, constraint targets, natural parameters (Lagrange multipliers); controls: support & constraints, target mean μ (1.5), target variance σ² (Normal only) (1)]

Constraints in, exponential-family member out. The natural parameters are the Lagrange multipliers; the sufficient statistics are the constraint functions ($1$, $x$, $x^2$).

8. Priors, posteriors, and MLE

MLE uses only the likelihood. Bayesian updating multiplies the likelihood by a prior. With Bernoulli data, a Beta$(a,b)$ prior gives a Beta$(a+k,b+n-k)$ posterior. The flat prior, Jeffreys prior, and a weakly informative prior agree more as $n$ grows, but they behave differently near boundaries.
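The conjugate update is a two-line computation. The sketch below compares posterior means under a flat Beta$(1,1)$ prior, the Jeffreys Beta$(1/2,1/2)$ prior, and a Beta$(2,2)$ prior standing in for "weakly informative" (that last choice is an assumption made here), first for $n=20$, $k=12$ and then for a boundary case with $k=0$.

```python
def beta_posterior(a, b, n, k):
    """Beta(a, b) prior with k successes in n trials gives Beta(a + k, b + n - k)."""
    return a + k, b + n - k

def beta_mean(a, b):
    return a / (a + b)

priors = {
    "flat": (1.0, 1.0),
    "jeffreys": (0.5, 0.5),
    "weak": (2.0, 2.0),  # illustrative stand-in for a weakly informative prior
}

def posterior_means(n, k):
    return {name: beta_mean(*beta_posterior(a, b, n, k))
            for name, (a, b) in priors.items()}

interior = posterior_means(n=20, k=12)   # MLE is 12/20 = 0.6
boundary = posterior_means(n=20, k=0)    # MLE is 0, on the boundary

print(interior)
print(boundary)
```

In the interior case all three posterior means sit close to the MLE; in the boundary case the MLE is exactly 0 while each prior pulls the estimate away from the edge by a different amount.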

Figure 5 · MLE versus posterior under flat, Jeffreys, and weak priors [interactive figure; legend: likelihood, posterior, MLE, posterior mean; controls: trials n (20), successes k (12), prior]

Figure 5b · Prior to posterior shrinkage as samples arrive [interactive figure; legend: prior, early posterior, later posterior, data-generating $p$; controls: true $p$ (0.62), observations shown (45)]

What to remember

Likelihood: How plausible the observed data are as a function of the parameter.

Score: The local slope of log-likelihood; zero at an interior MLE.

Fisher information: The model's local sensitivity to the parameter.

Jeffreys prior: The prior density proportional to $\sqrt{I(\theta)}$, invariant under smooth reparameterization.

MLE: The likelihood maximizer, recovered as posterior mode in large samples under mild priors.

This page is the local geometry view of Bayesian modeling. For the global distributional mismatch view, see KL Divergence. For posterior approximation, see Free Energy & Variational Inference.