Fisher Information
An "uninformative" prior is not a prior with no assumptions. It is an attempt to avoid privileging one coordinate system for the parameter over another. The usual route to that idea runs through likelihood geometry: the score tells which direction the likelihood rises, Fisher information measures local distinguishability, and Jeffreys prior uses that information as the volume element on parameter space.
1. Likelihood and MLE
For a model $p(x\mid\theta)$ and data $x_{1:n}$, the likelihood is $L(\theta)=p(x_{1:n}\mid\theta)$ and the log-likelihood is $\ell(\theta)=\log L(\theta)$. The maximum likelihood estimate is the parameter value that makes the observed data most probable under the model.
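As a minimal sketch of the definition, the Bernoulli log-likelihood can be maximized on a grid and compared with the closed-form answer $k/n$. The data vector and the names `log_likelihood`, `p_grid`, and `p_closed` are illustrative assumptions, not part of the original page.

```python
import numpy as np

# Hypothetical data: 7 successes in 10 Bernoulli trials.
x = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

def log_likelihood(p, x):
    """Bernoulli log-likelihood: k*log(p) + (n - k)*log(1 - p)."""
    k, n = x.sum(), x.size
    return k * np.log(p) + (n - k) * np.log1p(-p)

# Grid search over the open interval (0, 1) to avoid log(0).
grid = np.linspace(0.001, 0.999, 999)
p_grid = grid[np.argmax(log_likelihood(grid, x))]

# Closed-form MLE for the Bernoulli model: the sample mean k / n.
p_closed = x.mean()

print(p_grid, p_closed)
```

The grid maximizer and the sample mean should agree up to the grid spacing.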
2. The score
The score is the derivative of the log-likelihood: $s(\theta)=\partial_\theta\ell(\theta)$. Positive score means the likelihood rises as $\theta$ increases; negative score means it falls. Interior MLEs occur where the score crosses zero.
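The sign behaviour described above can be checked on a Bernoulli example. This is a sketch under assumed counts ($k=7$ successes in $n=10$ trials); the function name `score` is illustrative.

```python
def score(p, k, n):
    """Bernoulli score: d/dp [k*log(p) + (n - k)*log(1 - p)]."""
    return k / p - (n - k) / (1 - p)

k, n = 7, 10                        # hypothetical counts
below = score(0.5, k, n)            # left of the MLE: likelihood still rising
at_mle = score(k / n, k, n)         # at the MLE: score is (numerically) zero
above = score(0.9, k, n)            # right of the MLE: likelihood falling

print(below, at_mle, above)
```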
3. Fisher information
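Fisher information is the expected squared score, $I(\theta)=\mathbb{E}_\theta[s_\theta(X)^2]$, where $s_\theta(x)=\partial_\theta\log p(x\mid\theta)$ is the score of a single observation. Under regularity conditions the score has mean zero under $p_\theta$, so $I(\theta)$ is also the variance of the score, and it equals the expected negative curvature, $I(\theta)=-\mathbb{E}_\theta[\partial_\theta^2\log p(X\mid\theta)]$. Large information means nearby parameter values are easier to distinguish.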
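For a Bernoulli observation the expectation is a two-term sum, so both expressions for $I(p)$ can be evaluated exactly. The sketch below (with the illustrative name `fisher_two_ways`) compares them with the known value $1/[p(1-p)]$.

```python
import numpy as np

def fisher_two_ways(p):
    """Return (E[score^2], -E[curvature]) for one Bernoulli(p) observation."""
    xs = np.array([0.0, 1.0])                 # possible outcomes
    probs = np.array([1 - p, p])              # their probabilities
    score = xs / p - (1 - xs) / (1 - p)       # d/dp log p(x | p)
    curvature = -xs / p**2 - (1 - xs) / (1 - p) ** 2   # d^2/dp^2 log p(x | p)
    return np.sum(probs * score**2), -np.sum(probs * curvature)

sq, curv = fisher_two_ways(0.3)
print(sq, curv, 1 / (0.3 * 0.7))
```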
3.5. Score as infinitesimal tilting; exponential families
The Radon–Nikodym derivative $\frac{d p_{\theta+\Delta\theta}}{d p_\theta} = \exp\!\bigl(\Delta\theta\cdot s_\theta(x)\bigr) + O(\Delta\theta^2)$ says that the score is the direction of infinitesimal tilting: changing $\theta$ a little reweights the measure by a factor that is linear in the score. Fisher information measures how much that tilt varies under $p_\theta$: $I(\theta) = \mathrm{Var}_\theta\bigl[s_\theta(X)\bigr]$. Models where the tilt has a closed-form structure are the exponential families:
$$ p(x\mid\theta) \;=\; h(x)\,\exp\!\bigl(\eta(\theta)\,T(x) - A(\theta)\bigr). $$
Here $T(x)$ is the sufficient statistic, the only function of the data the likelihood depends on. The score becomes $s_\theta(x) = \eta'(\theta)\bigl(T(x) - \mathbb{E}_\theta T\bigr)$, and Fisher information collapses to a variance: $I(\theta) = \eta'(\theta)^2 \,\mathrm{Var}_\theta T(X)$. Updating on data tilts the measure by $\eta(\theta) T(x)$; a posterior built on a conjugate prior stays in the same exponential family (its natural parameter just shifts). Conjugate pairs exist for this reason.
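As a worked instance of these formulas, take a single Bernoulli observation with parameter $p$. Writing $p^x(1-p)^{1-x}$ in the exponential-family form above gives $h(x)=1$, $T(x)=x$, and
$$ \eta(p) = \log\frac{p}{1-p}, \qquad A(p) = -\log(1-p), \qquad \eta'(p) = \frac{1}{p(1-p)}. $$
Since $\mathbb{E}_p T = p$ and $\mathrm{Var}_p T = p(1-p)$, the score and information are
$$ s_p(x) = \frac{x - p}{p(1-p)}, \qquad I(p) = \eta'(p)^2\,\mathrm{Var}_p T(X) = \frac{p(1-p)}{[p(1-p)]^2} = \frac{1}{p(1-p)}. $$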
4. Jeffreys prior
A flat prior is only flat in the coordinate you chose. For Bernoulli $p$, a flat prior on $p$ is not flat on the log-odds $\phi=\log(p/(1-p))$. Jeffreys prior avoids that coordinate dependence by using Fisher information:
$$ \pi_J(\theta) \propto \sqrt{I(\theta)}. $$
For Bernoulli data, $I(p)=1/[p(1-p)]$, so $\pi_J(p)\propto[p(1-p)]^{-1/2}$: the Beta$(1/2,1/2)$ prior. It is not flat in $p$, but the rule is consistent across coordinates: applying it in any smooth reparameterization, such as the log-odds, and transforming back yields the same distribution.
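The invariance claim can be checked numerically for the log-odds $\phi$. In this sketch (unnormalized densities, illustrative variable names) route 1 applies the Jeffreys rule in $p$ and changes variables with $|dp/d\phi| = p(1-p)$; route 2 applies the rule directly in $\phi$, where $I(\phi) = I(p)\,(dp/d\phi)^2 = p(1-p)$.

```python
import numpy as np

phi = np.linspace(-4.0, 4.0, 9)          # a few log-odds values
p = 1.0 / (1.0 + np.exp(-phi))           # inverse of phi = log(p / (1 - p))
jacobian = p * (1 - p)                   # |dp/dphi|

# Route 1: Jeffreys in p, then push the density to phi.
route1 = (p * (1 - p)) ** -0.5 * jacobian

# Route 2: Jeffreys computed directly in phi, sqrt(I(phi)).
route2 = np.sqrt(p * (1 - p))

# Contrast: a flat prior in p, pushed to phi, is not flat.
flat_in_phi = 1.0 * jacobian

print(np.allclose(route1, route2), flat_in_phi.min(), flat_in_phi.max())
```

Both routes give the same unnormalized density in $\phi$, while the flat prior on $p$ becomes a peaked density on $\phi$.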
4.25. The Fisher metric: Jeffreys in multiple parameters
With one parameter, $\sqrt{I(\theta)}$ looks like just a square root. With two or more, $I(\theta)$ is a positive-definite matrix — the Fisher information metric on the parameter space — and Jeffreys' prior is
$$ \pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)}, $$
the volume element of that metric. The square root is no longer cosmetic: it is the same factor that turns a length into an area on a curved surface. For the Normal family $\mathcal{N}(\mu, \sigma^2)$,
$$ I(\mu, \sigma) \;=\; \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}, \qquad \sqrt{\det I(\mu, \sigma)} \;=\; \frac{\sqrt 2}{\sigma^2}. $$
So Jeffreys on $(\mu, \sigma)$ is $\propto 1/\sigma^2$: flat in $\mu$, blowing up as $\sigma \to 0$. It is improper, but it transforms covariantly under reparameterization, and the figure below makes that visible. The "Fisher ellipses" mode draws the unit ball of $I$ at a grid of $(\mu, \sigma)$ values (the Cramér–Rao ellipse for a unit-sample estimator, up to scale); the "Jeffreys density" mode is a heatmap of $\sqrt{\det I}$; and the "coordinate warp" mode overlays the $(\mu, \log \sigma)$ grid on the $(\mu, \sigma)$ grid.
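The Normal information matrix can be recovered as the expected outer product of the score vector. The sketch below is one way to do that, using Gauss–Hermite quadrature (`numpy.polynomial.hermite_e.hermegauss`) in place of sampling; the function name `normal_fisher` and the chosen $(\mu,\sigma)$ are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def normal_fisher(mu, sigma, order=20):
    """E[s s^T] for N(mu, sigma^2) in (mu, sigma) coordinates, by quadrature."""
    z, w = hermegauss(order)       # nodes/weights for the weight exp(-z^2 / 2)
    w = w / w.sum()                # normalize to probabilities under N(0, 1)
    x = mu + sigma * z
    s_mu = (x - mu) / sigma**2                           # d/dmu log p
    s_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3    # d/dsigma log p
    S = np.stack([s_mu, s_sigma])                        # shape (2, order)
    return (S * w) @ S.T                                 # weighted outer product

fisher = normal_fisher(mu=1.0, sigma=2.0)
jeffreys_density = np.sqrt(np.linalg.det(fisher))

print(fisher)
print(jeffreys_density, np.sqrt(2) / 2.0**2)
```

With $\sigma = 2$ the result should match $\mathrm{diag}(1/4,\,1/2)$ and $\sqrt{2}/\sigma^2$.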
With the metric in hand, three downstream constructions become geometric. The KL divergence between two nearby members of the family is, to leading order, $\mathrm{KL}(p_\theta \,\|\, p_{\theta+\delta}) \approx \tfrac12 \delta^\top I(\theta)\, \delta$: half the squared Fisher length of $\delta$. The natural gradient of a loss $\mathcal L(\theta)$ is $I(\theta)^{-1} \nabla \mathcal L(\theta)$, which is the Riemannian gradient under this metric rather than the Euclidean one. And for exponential families, the canonical and expectation parameters give the same manifold a pair of dually flat affine structures whose Bregman geometry is governed by the log-partition function $A(\eta)$ and its Legendre dual. Subsequent figures will build on these three constructions.
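The first two constructions are easy to check for the Normal family, where KL has a closed form. This sketch compares the exact KL with the quadratic approximation for a small step, then rescales a gradient by $I^{-1}$; the step `delta` and the gradient `grad` are made-up values for illustration.

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Exact KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

mu, sigma = 1.0, 2.0
fisher = np.diag([1 / sigma**2, 2 / sigma**2])   # I(mu, sigma) from the text
delta = np.array([0.01, 0.01])                   # small step in (mu, sigma)

kl_exact = kl_normal(mu, sigma, mu + delta[0], sigma + delta[1])
kl_quad = 0.5 * delta @ fisher @ delta           # half the squared Fisher length

# Natural gradient: solve I * g_nat = g instead of forming the inverse.
grad = np.array([0.3, -0.8])                     # hypothetical Euclidean gradient
nat_grad = np.linalg.solve(fisher, grad)

print(kl_exact, kl_quad, nat_grad)
```

The two KL values should agree to within a few percent at this step size, with the gap shrinking as `delta` shrinks.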
4.5. Maximum-entropy priors
Jeffreys' answer to "uninformative" is geometric: pick the prior that is invariant under smooth reparameterization. There is a second answer, due to Jaynes, that is constraint-based: pick the prior that commits the least given what you explicitly know. Formally, maximize the entropy of $\pi(\theta)$ subject to constraints like fixed support, mean, or variance. Lagrange multipliers turn the optimization into an exponential family: the multipliers are the natural parameters, and the constraint functions are the sufficient statistics.
The two recipes answer different questions and often disagree. Jeffreys on an exponential rate gives $\pi_J(\lambda)\propto 1/\lambda$; max-entropy on $[0,\infty)$ with fixed mean gives the exponential distribution itself, not a prior on its rate. The choice depends on whether you want invariance to coordinate change (Jeffreys) or minimal commitment beyond stated facts (max-entropy).
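The max-entropy claim can be illustrated, though not proved, by comparing closed-form differential entropies of three densities on $[0,\infty)$ that share the same mean $m$. The choice of comparison densities (uniform on $[0, 2m]$ and half-normal) and the variable names are assumptions for this sketch.

```python
import numpy as np

m = 1.0  # common mean (hypothetical value)

# Exponential with mean m: entropy 1 + log(m).
h_exponential = 1.0 + np.log(m)

# Uniform on [0, 2m] has mean m: entropy log(2m).
h_uniform = np.log(2.0 * m)

# Half-normal with scale s has mean s*sqrt(2/pi); match the mean, then use
# its entropy 0.5*log(pi * s^2 / 2) + 0.5.
s = m * np.sqrt(np.pi / 2.0)
h_half_normal = 0.5 * np.log(np.pi * s**2 / 2.0) + 0.5

print(h_exponential, h_uniform, h_half_normal)
```

The exponential has the largest entropy of the three, consistent with it being the max-entropy density on $[0,\infty)$ for a fixed mean.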
5. Priors, posteriors, and MLE
MLE uses only the likelihood. Bayesian updating multiplies the likelihood by a prior. With Bernoulli data showing $k$ successes in $n$ trials, a Beta$(a,b)$ prior gives a Beta$(a+k,b+n-k)$ posterior. The flat prior, Jeffreys prior, and a weakly informative prior converge as $n$ grows, because the likelihood comes to dominate, but they behave differently near the boundaries $p=0$ and $p=1$.
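A small sketch of the conjugate update makes the boundary behaviour concrete. It compares posterior means under three priors when no successes are observed ($k=0$); the Beta$(2,2)$ stand-in for a "weakly informative" prior and all names here are assumptions for illustration.

```python
def posterior_mean(a, b, k, n):
    """Mean of the Beta(a + k, b + n - k) posterior."""
    return (a + k) / (a + b + n)

priors = {
    "flat": (1.0, 1.0),        # Beta(1, 1)
    "jeffreys": (0.5, 0.5),    # Beta(1/2, 1/2)
    "weak": (2.0, 2.0),        # assumed example of a weakly informative prior
}

def spread(k, n):
    """Range of posterior means across the three priors."""
    means = [posterior_mean(a, b, k, n) for a, b in priors.values()]
    return max(means) - min(means)

# Boundary case: zero successes, so the MLE k / n is exactly 0.
small = spread(0, 5)
large = spread(0, 500)
jeffreys_small = posterior_mean(0.5, 0.5, 0, 5)

print(small, large, jeffreys_small)
```

With five trials the priors disagree noticeably and every posterior mean stays above the MLE of zero; with five hundred trials the disagreement is much smaller.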
What to remember
This page is the local geometry view of Bayesian modeling. For the global distributional mismatch view, see KL Divergence. For posterior approximation, see Free Energy & Variational Inference.