LLM Inference Explorer
How it works
Tokens
Embed
Position
Attention
Softmax
Generation
Shapes
Making it fast
Model
Layer
Tiling
Flash Attention
KV Cache
Batching
Speculative Decode
Cost
Generation
Watch tokens appear one at a time: autoregressive generation. Each new token requires a full forward pass through every layer.
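A minimal sketch of that loop, assuming a hypothetical model callable that takes the token ids produced so far and returns logits for the next token (names and shapes here are illustrative, not the tool's internals):

import numpy as np

def generate(model, prompt_tokens, max_new_tokens):
    # Autoregressive decoding: the prompt is known up front (prefill),
    # then each new token is produced one at a time (decode).
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)               # full forward pass through every layer
        next_token = int(np.argmax(logits))  # greedy choice; sampling works too
        tokens.append(next_token)            # the new token feeds the next step
    return tokens

Each iteration depends on the token produced by the previous one, which is why the output side cannot be parallelized the way the prompt can.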
Model
Hardware
Prompt
Controls
Play
Step
Reset
Speed: 5
Phase: Idle
Token Sequence
KV Cache
Timing
Presets
Explore
Questions
What Next
Tips
Click a preset to load an interesting configuration.
Short prompt
Medium prompt
Long prompt
Watch the prefill phase: all prompt tokens are processed together in one pass, then decode produces one token at a time. Why the difference?
Look at the KV cache growing: each new token adds one column across ALL layers (see the sketch after these tips).
Try a longer prompt: prefill time grows, but decode time per token stays the same.
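A rough sketch of that cache growth, using placeholder sizes (layer count, head count, and head dimension are assumptions, not the tool's actual model):

import numpy as np

n_layers, n_heads, head_dim = 32, 32, 128   # placeholder sizes, purely illustrative
prompt_len = 10

# One K and one V buffer per layer; the sequence axis starts at the prompt length.
kv_cache = [
    {"k": np.zeros((n_heads, prompt_len, head_dim)),
     "v": np.zeros((n_heads, prompt_len, head_dim))}
    for _ in range(n_layers)
]

def append_token(kv_cache, k_new, v_new):
    # Each decode step appends one position (one "column") to every layer's cache.
    for layer, k, v in zip(kv_cache, k_new, v_new):
        layer["k"] = np.concatenate([layer["k"], k[:, None, :]], axis=1)
        layer["v"] = np.concatenate([layer["v"], v[:, None, :]], axis=1)

append_token(kv_cache,
             np.zeros((n_layers, n_heads, head_dim)),
             np.zeros((n_layers, n_heads, head_dim)))
print(kv_cache[0]["k"].shape)   # (32, 11, 128): one position longer after one step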
Each decode step loads all model weights from memory. At batch=1, what fraction of the GPU's compute is actually used? (A rough estimate follows these questions.)
Why can't we generate all output tokens in parallel like we process input tokens?
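One way to think about the first question is a back-of-the-envelope estimate with assumed round numbers: a 7B-parameter model in fp16, 2 TB/s of memory bandwidth, and 300 TFLOP/s of peak compute. None of these figures come from the tool; they only illustrate the ratio.

params = 7e9                          # assumed model size
weight_bytes = params * 2             # fp16: 2 bytes per parameter (~14 GB)
mem_bw = 2e12                         # assumed HBM bandwidth, bytes/s
peak_flops = 300e12                   # assumed peak fp16 throughput, FLOP/s

time_memory = weight_bytes / mem_bw       # time just to stream the weights once
time_compute = 2 * params / peak_flops    # roughly 2 FLOPs per parameter per token
print(f"weight streaming : {time_memory * 1e3:.2f} ms per token")
print(f"actual math      : {time_compute * 1e3:.3f} ms per token")
print(f"compute utilization at batch=1: {time_compute / time_memory:.1%}")

Under those assumptions, streaming the weights takes over 100x longer than the math itself, which is why batching multiple requests (reusing one weight read for many tokens) helps so much.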
Each generation step runs a full forward pass — see the tensor shapes →
Tensor Shapes