LLM Inference Explorer
Attention
See how each token decides what to pay attention to — the core mechanism of transformers.
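The mechanism above can be sketched in a few lines of NumPy: a minimal single-head version with random matrices standing in for learned projection weights (the function names and shapes are illustrative, not the explorer's actual code).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # project the input embeddings into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    # scaled dot-product scores: one row per query token
    scores = Q @ K.T / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 16, 8          # e.g. "The cat sat on"
X = rng.standard_normal((seq, d_model))  # stand-in token embeddings
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out, weights = attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)          # (4, 8) (4, 4)
```

Each row of `weights` is the attention pattern the explorer visualizes: how much query token i attends to each key token.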
Input Sequence
Head Dimension (d_head): 4, 8, or 16
Hover over a row in any matrix to see which tokens it attends to.
Q, K, V Projections
Presets
Explore
Questions
What Next
Tips
Click a preset to load an interesting configuration.
The cat sat on
I love dogs and cats
To be or not to be
Click different query tokens (rows) — which keys does each token attend to most?
Switch between Scores and Softmax views — notice how softmax sharpens the attention pattern.
Try a longer sequence — does each token attend to everything or just nearby tokens?
Change the head dimension — how does it affect the score magnitudes?
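To see the sharpening from the Scores vs. Softmax comparison concretely, here is a tiny numerical check (the score values are made up for illustration):

```python
import numpy as np

def softmax(x):
    # stable softmax over a single row of scores
    e = np.exp(x - x.max())
    return e / e.sum()

# raw scores for one query row
scores = np.array([2.0, 1.0, 0.5, -1.0])
weights = softmax(scores)
# softmax exaggerates gaps: a 1.0 score gap becomes a factor of e ≈ 2.72,
# so the top-scoring key grabs most of the attention mass
print(weights.round(3))
```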
Why divide by sqrt(d_head) before softmax? What would happen without the scaling?
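One way to see why the scaling matters (a quick numerical check, not part of the explorer): dot products of d_head-dimensional random unit-variance vectors have standard deviation about sqrt(d_head), so without the division the scores grow with head size and push softmax toward near-one-hot weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head = 64
# 1000 independent q·k dot products between random unit-variance vectors
q = rng.standard_normal((1000, d_head))
k = rng.standard_normal((1000, d_head))
raw = (q * k).sum(axis=1)
scaled = raw / np.sqrt(d_head)
# raw scores spread out as sqrt(d_head) ≈ 8; scaled scores stay near std 1,
# keeping softmax in a regime where gradients don't vanish
print(raw.std(), scaled.std())
```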
In causal (decoder) attention, the upper triangle of the score matrix is masked. Why?
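The causal mask from the question above can be sketched as follows (illustrative NumPy, using the standard trick of setting masked scores to -inf before softmax):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq = 4
scores = np.zeros((seq, seq))            # pretend all raw scores are equal
# mask the upper triangle: token i may only attend to positions <= i,
# so during generation a token cannot peek at words that don't exist yet
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
scores[mask] = -np.inf                   # exp(-inf) = 0 after softmax
weights = softmax(scores)
print(weights)                           # lower-triangular attention pattern
```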
Multi-head attention runs several heads in parallel. Why might different heads learn different patterns?
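A minimal multi-head sketch, under the usual convention that d_model = n_heads * d_head and each head gets its own projections (all names, shapes, and random weights here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, n_heads, d_head = 4, 2, 8
d_model = n_heads * d_head
X = rng.standard_normal((seq, d_model))
# one projection per head: separate Wq/Wk/Wv slices let each head
# learn a different attention pattern over the same tokens
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
Wv = rng.standard_normal((n_heads, d_model, d_head))
Q = np.einsum('sd,hdk->hsk', X, Wq)
K = np.einsum('sd,hdk->hsk', X, Wk)
V = np.einsum('sd,hdk->hsk', X, Wv)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
weights = softmax(scores, axis=-1)       # (n_heads, seq, seq): one map per head
# concatenate the per-head outputs back into d_model features
out = (weights @ V).transpose(1, 0, 2).reshape(seq, d_model)
print(out.shape)                         # (4, 16)
```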
Attention uses softmax — see how temperature controls it → Softmax & Temperature
See how Flash Attention makes this fast with tiling → Flash Attention