LLM Inference Explorer
Attention
See how each token decides what to pay attention to — the core mechanism of transformers.
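The mechanism above can be sketched in a few lines of NumPy: a minimal single-head version with random matrices standing in for learned projection weights (the function names and shapes are illustrative, not the explorer's actual code).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the row max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # project the input embeddings into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    # scaled dot-product scores: one row per query token
    scores = Q @ K.T / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 16, 8          # e.g. "The cat sat on"
X = rng.standard_normal((seq, d_model))  # stand-in token embeddings
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out, weights = attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)          # (4, 8) (4, 4)
```

Each row of `weights` is the attention pattern the explorer visualizes: how much query token i attends to each key token.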
Input Sequence
Head Dimension (d_head): 4, 8, or 16
Hover over a row in any matrix to see which tokens it attends to.
Q, K, V Projections
Presets
Explore
Questions
What Next
Tips
Click a preset to load an interesting configuration.
The cat sat on
I love dogs and cats
To be or not to be
Click different query tokens (rows) — which keys does each token attend to most?
Switch between Scores and Softmax views — notice how softmax sharpens the attention pattern.
Try a longer sequence — does each token attend to everything or just nearby tokens?
Change the head dimension — how does it affect the score magnitudes?
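To see the sharpening from the Scores vs. Softmax comparison concretely, here is a tiny numerical check (the score values are made up for illustration):

```python
import numpy as np

def softmax(x):
    # stable softmax over a single row of scores
    e = np.exp(x - x.max())
    return e / e.sum()

# raw scores for one query row
scores = np.array([2.0, 1.0, 0.5, -1.0])
weights = softmax(scores)
# softmax exaggerates gaps: a 1.0 score gap becomes a factor of e ≈ 2.72,
# so the top-scoring key grabs most of the attention mass
print(weights.round(3))
```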
Why divide by sqrt(d_head) before softmax? What would happen without the scaling?
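One way to see why the scaling matters (a quick numerical check, not part of the explorer): dot products of d_head-dimensional random unit-variance vectors have standard deviation about sqrt(d_head), so without the division the scores grow with head size and push softmax toward near-one-hot weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head = 64
# 1000 independent q·k dot products between random unit-variance vectors
q = rng.standard_normal((1000, d_head))
k = rng.standard_normal((1000, d_head))
raw = (q * k).sum(axis=1)
scaled = raw / np.sqrt(d_head)
# raw scores spread out as sqrt(d_head) ≈ 8; scaled scores stay near std 1,
# keeping softmax in a regime where gradients don't vanish
print(raw.std(), scaled.std())
```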
In causal (decoder) attention, the upper triangle of the score matrix is masked. Why?
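The causal mask from the question above can be sketched as follows (illustrative NumPy, using the standard trick of setting masked scores to -inf before softmax):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq = 4
scores = np.zeros((seq, seq))            # pretend all raw scores are equal
# mask the upper triangle: token i may only attend to positions <= i,
# so during generation a token cannot peek at words that don't exist yet
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
scores[mask] = -np.inf                   # exp(-inf) = 0 after softmax
weights = softmax(scores)
print(weights)                           # lower-triangular attention pattern
```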
Multi-head attention runs several heads in parallel. Why might different heads learn different patterns?
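A minimal multi-head sketch, under the usual convention that d_model = n_heads * d_head and each head gets its own projections (all names, shapes, and random weights here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, n_heads, d_head = 4, 2, 8
d_model = n_heads * d_head
X = rng.standard_normal((seq, d_model))
# one projection per head: separate Wq/Wk/Wv slices let each head
# learn a different attention pattern over the same tokens
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
Wv = rng.standard_normal((n_heads, d_model, d_head))
Q = np.einsum('sd,hdk->hsk', X, Wq)
K = np.einsum('sd,hdk->hsk', X, Wk)
V = np.einsum('sd,hdk->hsk', X, Wv)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
weights = softmax(scores, axis=-1)       # (n_heads, seq, seq): one map per head
# concatenate the per-head outputs back into d_model features
out = (weights @ V).transpose(1, 0, 2).reshape(seq, d_model)
print(out.shape)                         # (4, 16)
```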
Attention uses softmax — see how temperature controls it → Softmax & Temperature
See how Flash Attention makes this fast with tiling → Flash Attention