Attention

A weighted sum, three ways. The same operation read as a statistic, a kernel regressor, and a maximum-entropy retrieval rule.

Attention is the operation that lets a transformer mix information across positions of a sequence: each output position pulls a weighted combination of the positions it attends to, with weights that depend on the content at each position rather than its index. In a single line of math, $\operatorname{Attn}(q, K, V) = \sum_i \alpha_i(q)\, v_i$, with weights $\alpha_i(q) \propto \exp(q \cdot k_i / \tau)$ normalized to sum to one. Here $\tau$ is a temperature; standard transformers set $\tau = \sqrt{d_k}$, the square root of the key dimension.

The lens of this page. Most introductions to attention approach it as an architectural component — queries, keys, values, heads, layers. This page takes a different route. It treats the $\sum_i \alpha_i(q)\, v_i$ formula as the latest member of a family that begins with the classical idea of a sufficient statistic and passes through kernel regression on the way. From that angle the design choices stop looking arbitrary: why a sum is statistics, why softmax is a maximum-entropy retrieval rule, and why a cache of raw vectors is what kernel methods have always needed.

If you want the standard architecture overview — multi-head shapes, layer composition, positional encodings, KV-cache memory in practice — see LLM Inference. This page complements rather than replaces that material; it explains what attention is in terms of older ideas, not how it is wired into a model.

The page makes the equivalences concrete. We start with three forms side by side — the classical sufficient statistic $T(x) = \sum_i \phi(x_i)$, Nadaraya–Watson kernel regression $\hat f(q) = \sum_i w_i(q)\,y_i$, and attention $\sum_i \alpha_i(q)\,v_i$ — and show that they differ only in how the weights are chosen. From there: attention as kernel regression with a learned similarity (§2); softmax as the unique entropy-regularized retrieval rule (§3); a small zoo of alternative kernels (§4); multi-head as many statistics at once (§5); causal masking and the KV cache (§6). A reader arriving from sufficient statistics will recognize this as the adaptive cousin of $T(x)$; a reader arriving cold can read the chain forward without that background.

1. Three forms

Lay the three forms side by side. Only the weights change.

Sufficient statistic

T(x) = Σᵢ φ(xᵢ)

Fixed weights (all equal). The summary is a function of the data alone.

Kernel regression

f̂(q) = Σᵢ wᵢ(q)·yᵢ

Weights depend on a query point. Nearby data contributes more.

Attention

Attn(q) = Σᵢ αᵢ(q)·vᵢ

Weights are softmax over learned dot-product similarities.

The classical statistic compresses with fixed weights; kernel regression varies the weights with the query but uses a hand-chosen similarity; attention varies the weights with the query and learns the similarity. Each step adds one degree of adaptivity. The rest of this page makes the equivalences concrete.
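The three forms can be sketched side by side in plain Python. This is a minimal illustration, not a model implementation: the data, the Gaussian kernel, the bandwidth, and the scalar stand-ins `w_q` and `w_k` for the learned projections are all assumptions made for the example, and the single list `ys` plays the role of $\phi(x_i)$, $y_i$, and $v_i$ in turn.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    """Return sum_i weights[i] * values[i] for scalar values."""
    return sum(w * v for w, v in zip(weights, values))

# Toy 1-D data (hypothetical numbers): positions xs with associated values ys.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 5.0]
q = 1.2  # query point

# 1. Sufficient-statistic style: fixed, equal weights; no query involved.
stat_weights = [1.0] * len(ys)
T = weighted_sum(stat_weights, ys)

# 2. Kernel regression: hand-chosen Gaussian kernel of distance to q, normalized.
bandwidth = 0.5
kernel = [math.exp(-(q - x) ** 2 / (2 * bandwidth ** 2)) for x in xs]
nw_weights = [k / sum(kernel) for k in kernel]
f_hat = weighted_sum(nw_weights, ys)

# 3. Attention: softmax over dot-product scores in projected coordinates.
#    w_q and w_k stand in for learned projections; here they are set by hand.
w_q, w_k, tau = 1.0, 1.0, 1.0
scores = [(w_q * q) * (w_k * x) / tau for x in xs]
attn_weights = softmax(scores)
attn_out = weighted_sum(attn_weights, ys)

print(T, f_hat, attn_out)
```

All three results come from the same `weighted_sum` call; only the list of weights passed in differs.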

2. Attention is Nadaraya–Watson kernel regression

The Nadaraya–Watson estimator $\hat f(q) = \sum_i K(q, x_i)\, y_i \;/\; \sum_j K(q, x_j)$ is a weighted average of $y$-values, with weights given by a kernel of distance from the query. Soft attention with a Gaussian-shaped softmax ($\alpha_i \propto \exp(-\|q - k_i\|^2/2\sigma^2)$, up to a query-only normalization) is the same operation.

Figure 2 · Same operation, read two ways. Shown: data $(x_i, y_i)$, read as tokens with key $k_i = x_i$ and value $v_i = y_i$; the kernel bump at the query (kernel-regression view); the attention weights $\alpha_i$ (attention view); and the predicted output $\hat f(q) = \sum_i \alpha_i y_i$. Controls: query position $q$ and $\log_2$ temperature (bandwidth).

At small $\tau$ the kernel is sharp and the prediction at $q$ is essentially the $y$-value of the nearest data point. Attention puts all its mass on one token, the entropy of the weight distribution is near zero, and the smoothed curve becomes a jagged step function. At large $\tau$ the kernel spans the whole interval, the weight distribution is nearly uniform, and the prediction flattens toward the global mean. Everything in between is a tradeoff between fidelity and stability, the bias–variance dial you already know from non-parametric regression, wearing a softmax hat.
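The two extremes can be checked numerically. The sketch below assumes a Gaussian kernel whose variance is the temperature ($\sigma^2 = \tau$), and uses made-up data points; the function name `nadaraya_watson` and the helper `entropy` are illustrative, not taken from any library.

```python
import math

def nadaraya_watson(q, xs, ys, tau):
    """Gaussian-kernel weighted average of ys at query q.

    Written as softmax attention with scores -(q - x_i)^2 / (2 * tau),
    so tau plays the role of the kernel variance. Returns (prediction, weights).
    """
    scores = [-(q - x) ** 2 / (2.0 * tau) for x in xs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    pred = sum(w * y for w, y in zip(weights, ys))
    return pred, weights

def entropy(weights):
    """Shannon entropy in nats, with 0 * log 0 taken as 0."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Hypothetical data on the unit interval.
xs = [0.1, 0.3, 0.45, 0.7, 0.9]
ys = [1.0, 2.0, 0.5, 3.0, 1.5]
q = 0.5

# Small tau: essentially nearest-neighbour lookup (nearest x is 0.45, y = 0.5).
sharp_pred, sharp_w = nadaraya_watson(q, xs, ys, tau=2.0 ** -12)
# Large tau: weights are nearly uniform, prediction is close to the global mean.
flat_pred, flat_w = nadaraya_watson(q, xs, ys, tau=2.0 ** 8)

print(sharp_pred, entropy(sharp_w))
print(flat_pred, entropy(flat_w))
```

Sweeping `tau` between these two values traces out the bias–variance dial described above.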

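Real attention uses $q \cdot k_i$ with $q = W_Q x_q$ and $k_i = W_K x_i$. Since $q \cdot k_i = -\tfrac12\|q - k_i\|^2 + \tfrac12\|q\|^2 + \tfrac12\|k_i\|^2$, the exponential kernel $\exp(q \cdot k_i / \tau)$ in those learned coordinates factors into three pieces: a Gaussian RBF kernel with $\sigma^2 = \tau$, a query-only factor that cancels in the normalization, and a key-norm factor $\exp(\|k_i\|^2 / 2\tau)$. When the keys have roughly equal norm, the last factor cancels too, and dot-product attention gives the same weights as the Gaussian kernel; otherwise it acts as a per-key prior that favors long keys. The projection matrices $W_Q, W_K$ are how the network chooses what counts as "near."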
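The relationship between the dot-product kernel and the Gaussian kernel is easy to test numerically. Expanding $\|q - k\|^2$ gives $\exp(q \cdot k/\tau) = \exp(-\|q-k\|^2/2\tau)\,\exp(\|q\|^2/2\tau)\,\exp(\|k\|^2/2\tau)$, so the two sets of normalized weights should agree exactly when every key has the same norm and diverge when norms differ. The sketch below checks both cases; the vectors and the value of `tau` are arbitrary choices for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot_product_weights(q, keys, tau):
    """Attention weights from scores q . k / tau."""
    return softmax([dot(q, k) / tau for k in keys])

def rbf_weights(q, keys, tau):
    """Normalized Gaussian kernel weights with variance sigma^2 = tau."""
    return softmax([-sq_dist(q, k) / (2.0 * tau) for k in keys])

def max_gap(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

tau = 0.7
q = [0.3, -1.1]

# Keys on the unit circle: equal norms, so the key-norm factor cancels.
unit_keys = [[math.cos(t), math.sin(t)] for t in (0.0, 1.0, 2.5, 4.0)]
gap_unit = max_gap(dot_product_weights(q, unit_keys, tau),
                   rbf_weights(q, unit_keys, tau))

# Keys with different norms: exp(|k|^2 / (2 tau)) no longer cancels.
mixed_keys = [[0.2, 0.1], [1.0, 0.0], [2.0, 2.0], [-0.5, 3.0]]
gap_mixed = max_gap(dot_product_weights(q, mixed_keys, tau),
                    rbf_weights(q, mixed_keys, tau))

print(gap_unit, gap_mixed)
```

The first gap is at floating-point noise level; the second is not, which is why the Gaussian reading is exact only for equal-norm keys.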

3. Softmax is the unique entropy-regularized retrieval rule

Why softmax? Because among all retrieval distributions $a$ over the memory items, with $s_i$ the relevance score of item $i$ (for attention, $s_i = q \cdot k_i$), the softmax $a_i \propto \exp(s_i/\tau)$ is the one that maximizes $$\mathbb{E}_a[s] \;+\; \tau\, H(a) \;=\; \sum_i a_i s_i \;-\; \tau \sum_i a_i \log a_i$$ subject to $a_i \ge 0$ and $\sum_i a_i = 1$. It is the unique balance between relevance (pick the high-score item) and spread (don't be overconfident). The temperature $\tau$ is the price you pay per nat of certainty.

Figure 3 · Scores → softmax retrieval, with the objective on display. Shown: scores $s_i$ (relevance of each memory item); the retrieval distribution $a_i = \operatorname{softmax}(s/\tau)_i$; and the signed contributions $a_i s_i$. Controls: $\log_2$ temperature and score pattern.

Three regimes:

$\tau \to 0$ (argmax). The total objective collapses to $\max_i s_i$; the retrieval distribution puts all mass on the highest-scoring item; entropy is zero. Brittle but maximally relevant.
$\tau \to \infty$ (uniform). The objective is dominated by the entropy term; the retrieval distribution is uniform; the effective $k$, defined as $\exp H(a)$, equals $N$, the number of memory items. Maximally spread, scores don't matter.
Intermediate $\tau$. The softmax interpolates. Multiple items contribute; the effective $k$ tells you roughly how many memory items you're actually pulling from.
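The three regimes can be reproduced with a few lines of Python. This is a sketch with a made-up score vector; `effective_k` is a name chosen here for $\exp H(a)$, with entropy measured in nats.

```python
import math

def softmax(scores, tau):
    """Softmax of scores / tau, computed stably."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def effective_k(a):
    """exp of the Shannon entropy (nats) of distribution a."""
    h = -sum(p * math.log(p) for p in a if p > 0.0)
    return math.exp(h)

# Hypothetical "single peak" score pattern over five memory items.
scores = [2.0, 1.0, 0.5, 0.0, -1.0]

k_cold = effective_k(softmax(scores, tau=0.01))    # near argmax
k_mid = effective_k(softmax(scores, tau=1.0))      # several items contribute
k_hot = effective_k(softmax(scores, tau=1000.0))   # near uniform

print(k_cold, k_mid, k_hot)
```

The effective $k$ runs from about 1 at low temperature to about $N = 5$ at high temperature, with the intermediate value falling strictly in between.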

The softmax is the unique distribution maximizing $\mathbb{E}_a[s] + \tau H(a)$. Setting up the Lagrangian with a multiplier $\lambda$ for $\sum_i a_i = 1$ and differentiating gives $s_i - \tau(\log a_i + 1) - \lambda = 0$, so $\log a_i = (s_i - \lambda)/\tau - 1$, i.e. $a_i \propto e^{s_i/\tau}$. The objective is strictly concave in $a$ for $\tau > 0$, so this stationary point is the unique maximizer. Equivalently: among distributions with a fixed expected score $\mathbb{E}_a[s] = \mu$, the maximum-entropy one is an exponential family with $s$ as the sufficient statistic. That's the same algebra that gives Boltzmann distributions in statistical mechanics and exponential families in §7 of the sufficient statistics page: three faces of one identity.
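A quick numerical sanity check of the optimality claim, under stated assumptions: the scores and temperature are arbitrary, and the comparison points are random distributions drawn by normalizing exponential samples, which is one convenient way to sample the simplex and not part of the argument. The check also uses the fact that plugging the softmax back into the objective gives $\tau \log \sum_i e^{s_i/\tau}$.

```python
import math
import random

def softmax(scores, tau):
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def objective(a, scores, tau):
    """E_a[s] + tau * H(a), with 0 * log 0 taken as 0."""
    expected = sum(p * s for p, s in zip(a, scores))
    entropy = -sum(p * math.log(p) for p in a if p > 0.0)
    return expected + tau * entropy

def log_sum_exp(scores, tau):
    """tau * log sum_i exp(s_i / tau), computed stably."""
    m = max(scores)
    return m + tau * math.log(sum(math.exp((s - m) / tau) for s in scores))

scores = [1.5, 0.2, -0.3, 0.9]
tau = 0.5

a_star = softmax(scores, tau)
best = objective(a_star, scores, tau)

# Compare against random distributions on the simplex.
rng = random.Random(0)
smallest_gap = float("inf")
for _ in range(1000):
    raw = [rng.expovariate(1.0) for _ in scores]
    total = sum(raw)
    a = [r / total for r in raw]
    smallest_gap = min(smallest_gap, best - objective(a, scores, tau))

print(best, log_sum_exp(scores, tau), smallest_gap)
```

No random candidate beats the softmax, and the gap for any candidate $a$ is $\tau\,\mathrm{KL}(a \,\|\, a^\star)$, which is why it is never negative.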

4. A small kernel zoo

The dot-product softmax isn't the only choice. Each attention variant in common use corresponds to a different kernel, and the kernel shape is what controls the retrieval behavior. The slider sets the bandwidth / temperature; the kernel curves show how weight falls off with distance from the query in their respective coordinate systems.

Figure 4 · Kernel shapes for common attention variants. Shown: standard softmax, $\exp(q\cdot k / \tau)$, Gaussian-like; linear attention, $\phi(q)^\top\phi(k)$ via random features; local-window, the indicator of $|i - j| \le w$; positional bias (ALiBi-style), linear decay. Control: bandwidth / window.

Each kernel encodes a prior about which positions should be neighbors. The standard softmax gives smooth, global retrieval at $O(N^2)$ cost. Linear attention swaps the kernel for one with an explicit feature map $\phi$, which lets you rearrange $\sum_i \phi(q)^\top \phi(k_i)\, v_i^\top = \phi(q)^\top \big(\sum_i \phi(k_i) v_i^\top\big)$ and pre-aggregate the bracketed sum once for all queries, bringing the cost to $O(N)$. Local-window attention restricts the kernel to a sliding interval. Position-bias schemes like ALiBi and T5 relative bias add a position-only decay to the scores before the softmax. That is equivalent to multiplying the kernel by a position prior $r_i$, which gives the KL-regularized form $a_i \propto r_i \exp(s_i/\tau)$ from the Aside in §3 (the maximizer of $\mathbb{E}_a[s] - \tau\,\mathrm{KL}(a \,\|\, r)$).
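The linear-attention rearrangement can be verified directly. In the sketch below the feature map is $\mathrm{elu}(x) + 1$ applied elementwise, a simple positive map chosen for illustration rather than the random-feature map named in the figure; the normalizer $\sum_i \phi(q)^\top\phi(k_i)$ is handled the same way as the numerator. The example is non-causal, and all vectors are made up.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1, elementwise (an illustrative choice)."""
    return [t + 1.0 if t > 0 else math.exp(t) for t in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pairwise_form(q, keys, values):
    """Kernel attention via explicit weights: O(N) per query, O(N^2) for N queries."""
    fq = phi(q)
    w = [dot(fq, phi(k)) for k in keys]
    z = sum(w)
    d_v = len(values[0])
    return [sum(w[i] * values[i][j] for i in range(len(keys))) / z
            for j in range(d_v)]

def aggregate(keys, values):
    """Pre-aggregate S = sum_i phi(k_i) v_i^T and u = sum_i phi(k_i), once."""
    d_f, d_v = len(phi(keys[0])), len(values[0])
    S = [[0.0] * d_v for _ in range(d_f)]
    u = [0.0] * d_f
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d_f):
            u[a] += fk[a]
            for j in range(d_v):
                S[a][j] += fk[a] * v[j]
    return S, u

def linear_form(q, S, u):
    """Same output from the aggregates; per-query cost does not depend on N."""
    fq = phi(q)
    z = dot(fq, u)
    return [sum(fq[a] * S[a][j] for a in range(len(fq))) / z
            for j in range(len(S[0]))]

keys = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8], [2.0, -0.5]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
queries = [[1.0, 0.5], [-0.7, 0.3]]

S, u = aggregate(keys, values)
for q in queries:
    print(pairwise_form(q, keys, values), linear_form(q, S, u))
```

Both forms print the same vectors. The aggregates `S` and `u` have fixed size, which is what makes this variant behave like a classical accumulated statistic rather than a cache of raw vectors.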

5. Multi-head: many statistics at once

A single attention head computes one weighted statistic of the sequence. A multi-head layer computes several in parallel, each with its own learned $W_Q, W_K, W_V$. Interpretability work has found that heads in trained transformers reliably specialize: some track syntactic dependencies, some track positional patterns, some attend to punctuation or sentence boundaries. The illustrations below are stylized patterns inspired by attested phenomena, not real model outputs, but they capture the qualitative shapes you actually see when probing GPT-2 or BERT.

Figure 5 · Four stylized heads over three sentences. Row $i$ shows how token $i$ distributes attention over the sequence; greyed cells mark the causal mask (attention to future positions). Controls: sentence (for example "the cat sat on the mat"), display (all four heads or a single head), and head index.

Heads decompose the problem. H1 implements a "look at the previous token" rule, a building block of the induction circuits that let transformers do in-context copying. H2 acts as a syntactic governor: each verb attends to its subject, each preposition to its head verb. H3 is purely positional: weights decay with distance regardless of content. H4 attends to determiners and modifiers. Each head computes a different summary statistic of the prefix; their outputs are concatenated, mixed by the layer's output projection, and passed on to later layers, where they can be combined into still higher-level statistics. Multi-head = multiple sufficient statistics in parallel.
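Two of these stylized heads are simple enough to write down by hand. The sketch below builds a previous-token head and a distance-decay head under a causal mask and concatenates their outputs. Everything here is an assumption for illustration: the score functions are hard-coded rather than learned, the value vectors are made up, and both heads share one set of values, whereas a real layer gives each head its own $W_V$.

```python
import math

NEG_INF = float("-inf")

def masked_softmax(row_scores):
    """Softmax that gives zero weight to positions scored -inf (masked)."""
    m = max(row_scores)  # finite, since the diagonal is never masked
    exps = [0.0 if s == NEG_INF else math.exp(s - m) for s in row_scores]
    z = sum(exps)
    return [e / z for e in exps]

def head_pattern(n, score_fn):
    """Row i is token i's attention over positions 0..i; j > i is masked."""
    return [masked_softmax([score_fn(i, j) if j <= i else NEG_INF
                            for j in range(n)])
            for i in range(n)]

def previous_token(i, j):
    # Strongly prefer position i - 1; token 0 can only see itself.
    return 10.0 if j == i - 1 else 0.0

def distance_decay(i, j):
    # Content-independent: score falls linearly with distance.
    return -1.0 * (i - j)

def apply_head(pattern, values):
    """Each output row is the weighted sum of value rows under that row's weights."""
    d = len(values[0])
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(d)]
            for row in pattern]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Hypothetical 2-d value vectors, one per token.
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.0, 3.0]]

patterns = [head_pattern(len(tokens), f) for f in (previous_token, distance_decay)]
head_outputs = [apply_head(p, values) for p in patterns]
# Concatenate per-token outputs across heads: one 4-d vector per token.
multi_head = [sum((h[i] for h in head_outputs), []) for i in range(len(tokens))]

for tok, vec in zip(tokens, multi_head):
    print(tok, [round(x, 3) for x in vec])
```

Each token ends up with two different summaries of its prefix in one vector: the first two coordinates are close to the previous token's value, and the last two are a recency-weighted average.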

6. Causal masking and the KV cache

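At generation time a transformer produces one token at a time. The causal mask restricts the sum at step $t$ to positions $i \le t$, so the attention operation $\sum_{i \le t} \alpha_i(q_t)\, v_i$ needs every previous $(k_i, v_i)$, and those entries never change once computed. An efficient decoder therefore caches them as it goes instead of recomputing them. This is the famous "KV cache." The cache stores raw keys and values rather than an accumulated summary, because each new query reweights the cached entries differently.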
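A decoding loop with a cache can be sketched in a few lines. The `KVCache` class and the per-step (query, key, value) triples below are hypothetical, standing in for the projected vectors of a single head; the point is only that appending to the cache gives the same outputs as recomputing attention over the full prefix at every step.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values, tau=1.0):
    """sum_i alpha_i(q) v_i with alpha = softmax(q . k_i / tau)."""
    weights = softmax([dot(q, k) / tau for k in keys])
    d = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d)]

class KVCache:
    """Append-only store of raw keys and values for one attention head."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

# Hypothetical per-token (query, key, value) triples, as if already projected.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.5, 0.5], [0.0, 1.0]),
    ([1.0, 1.0], [0.0, 2.0], [3.0, 0.0]),
    ([0.5, -0.5], [1.0, -1.0], [2.0, 2.0]),
]

cache = KVCache()
cached_outputs = []
for q, k, v in steps:
    cache.append(k, v)  # token t attends to itself and everything before it
    cached_outputs.append(attend(q, cache.keys, cache.values))

# Reference: rebuild the prefix from scratch at every step, with no cache.
reference = [
    attend(steps[t][0],
           [s[1] for s in steps[: t + 1]],
           [s[2] for s in steps[: t + 1]])
    for t in range(len(steps))
]

print(cached_outputs == reference, len(cache.keys))
```

The cache grows by one key and one value per step, which is the memory cost paid for being able to serve any future query.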

Figure 6 · Token-by-token attention with a growing KV cache. Shown: tokens already in the cache; the current query; attention weights $\alpha_i(q_t)$; and the output $\sum_i \alpha_i v_i$. Controls: generation step $t$, temperature $\tau$, and head pattern.

The "broadcast / uniform" pattern is what an unconditional summary statistic would look like: the same weights regardless of the query, so the cache could be collapsed to one accumulated $\sum v_i$. The non-uniform patterns (subject-pointer, previous-token) reweight differently for each query, which is exactly why a real KV cache stores the raw vectors. Attention is a family of sufficient statistics, one per query, and the cache is the data structure that lets you compute any member of that family on demand. The KV cache memory page explores the practical consequences (memory growth, paging, attention sinks).
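The contrast between a collapsible summary and a raw cache can be made concrete. In the sketch below, `RunningSummary` is a hypothetical constant-size accumulator: it reproduces attention when the weights are uniform (approximated here by a very large temperature), and cannot do so when the weights depend on the query. The keys, values, and queries are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(q, keys, values, tau):
    weights = softmax([sum(a * b for a, b in zip(q, k)) / tau for k in keys])
    d = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d)]

class RunningSummary:
    """Constant-size summary: sufficient only when weights ignore the query."""
    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def add(self, v):
        self.total = [t + x for t, x in zip(self.total, v)]
        self.count += 1

    def mean(self):
        return [t / self.count for t in self.total]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 2.0]]
q1, q2 = [1.0, 0.0], [0.0, 1.0]

summary = RunningSummary(dim=2)
for v in values:
    summary.add(v)

# Near-uniform weights: every query gets (almost exactly) the running mean.
uniform_out_1 = attend(q1, keys, values, tau=1e9)
uniform_out_2 = attend(q2, keys, values, tau=1e9)

# Query-dependent weights: the two queries need different reweightings.
sharp_out_1 = attend(q1, keys, values, tau=0.5)
sharp_out_2 = attend(q2, keys, values, tau=0.5)

print(summary.mean(), uniform_out_1, uniform_out_2)
print(sharp_out_1, sharp_out_2)
```

In the uniform case two numbers replace the whole cache. In the query-dependent case the outputs differ by query, so the raw keys and values have to be kept.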

7. Back to sufficiency

Read attention as an adaptive sufficient statistic and the weighted sum is just compression of the sequence into a summary chosen for the current prediction. Read it as kernel regression and the softmax-weighted sum is Nadaraya–Watson with a learned similarity. From the maximum-entropy side, the softmax is the unique solution to "be relevant, but don't be overconfident." Each view explains a different design choice: why a sum, why softmax, why a cache of raw vectors.

A transformer layer stacks attention with a feed-forward block; depth lets later layers build sufficient statistics of the sufficient statistics from earlier layers. The story of what those stacked summaries actually compute belongs on its own page about representations, coming next.

What next

Foundations

Sufficient Statistics

Where the weighted-sum-as-summary intuition starts. Attention is the adaptive cousin of $T(x) = \sum \phi(x_i)$.

Systems

KV Cache Memory

The practical consequences of caching keys and values: memory growth, paging, and the cost of long contexts.

Likelihood

Fisher Information

Softmax-as-Gibbs and exponential families also drive the geometry of likelihood inference.