LLM Inference Explorer
How it works
Tokens
Embed
Position
Attention
Softmax
Generation
Shapes
Making it fast
Model
Layer
Tiling
Flash Attention
KV Cache
Batching
Speculative Decode
Cost
Generation
Watch tokens appear one at a time: autoregressive generation. Each new token requires a full forward pass through every layer.
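A minimal sketch of that loop, assuming a hypothetical model callable that takes the token ids produced so far and returns logits for the next token (names and shapes here are illustrative, not the tool's internals):

import numpy as np

def generate(model, prompt_tokens, max_new_tokens):
    # Autoregressive decoding: the prompt is known up front (prefill),
    # then each new token is produced one at a time (decode).
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)               # full forward pass through every layer
        next_token = int(np.argmax(logits))  # greedy choice; sampling works too
        tokens.append(next_token)            # the new token feeds the next step
    return tokens

Each iteration depends on the token produced by the previous one, which is why the output side cannot be parallelized the way the prompt can.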
Model
Hardware
Prompt
Controls
Play
Step
Reset
Speed: 5
Phase: Idle
Token Sequence
KV Cache
Timing
Presets
Explore
Questions
What Next
Tips
Click a preset to load an interesting configuration.
Short prompt
Medium prompt
Long prompt
Watch the prefill phase: all prompt tokens are processed together in one pass, then decode produces one token at a time. Why the difference?
Look at the KV cache growing: each new token adds one column across ALL layers (see the sketch after these tips).
Try a longer prompt: prefill time grows, but decode time per token stays the same.
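A rough sketch of that cache growth, using placeholder sizes (layer count, head count, and head dimension are assumptions, not the tool's actual model):

import numpy as np

n_layers, n_heads, head_dim = 32, 32, 128   # placeholder sizes, purely illustrative
prompt_len = 10

# One K and one V buffer per layer; the sequence axis starts at the prompt length.
kv_cache = [
    {"k": np.zeros((n_heads, prompt_len, head_dim)),
     "v": np.zeros((n_heads, prompt_len, head_dim))}
    for _ in range(n_layers)
]

def append_token(kv_cache, k_new, v_new):
    # Each decode step appends one position (one "column") to every layer's cache.
    for layer, k, v in zip(kv_cache, k_new, v_new):
        layer["k"] = np.concatenate([layer["k"], k[:, None, :]], axis=1)
        layer["v"] = np.concatenate([layer["v"], v[:, None, :]], axis=1)

append_token(kv_cache,
             np.zeros((n_layers, n_heads, head_dim)),
             np.zeros((n_layers, n_heads, head_dim)))
print(kv_cache[0]["k"].shape)   # (32, 11, 128): one position longer after one step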
Each decode step loads all model weights from memory. At batch=1, what fraction of the GPU's compute is actually used? (A rough estimate follows these questions.)
Why can't we generate all output tokens in parallel like we process input tokens?
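One way to think about the first question is a back-of-the-envelope estimate with assumed round numbers: a 7B-parameter model in fp16, 2 TB/s of memory bandwidth, and 300 TFLOP/s of peak compute. None of these figures come from the tool; they only illustrate the ratio.

params = 7e9                          # assumed model size
weight_bytes = params * 2             # fp16: 2 bytes per parameter (~14 GB)
mem_bw = 2e12                         # assumed HBM bandwidth, bytes/s
peak_flops = 300e12                   # assumed peak fp16 throughput, FLOP/s

time_memory = weight_bytes / mem_bw       # time just to stream the weights once
time_compute = 2 * params / peak_flops    # roughly 2 FLOPs per parameter per token
print(f"weight streaming : {time_memory * 1e3:.2f} ms per token")
print(f"actual math      : {time_compute * 1e3:.3f} ms per token")
print(f"compute utilization at batch=1: {time_compute / time_memory:.1%}")

Under those assumptions, streaming the weights takes over 100x longer than the math itself, which is why batching multiple requests (reusing one weight read for many tokens) helps so much.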
Each generation step runs a full forward pass — see the tensor shapes →
Tensor Shapes