LLM Inference Explorer
Decoding
The model outputs logits for every token in its vocabulary. Softmax converts these to probabilities, then a sampling strategy selects the next token.
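The logits-to-token pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the page's own implementation: the five-token vocabulary, the logit values, and the names `softmax`, `vocab`, and `logits` are all assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()              # subtract the max so exp() cannot overflow
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 5-token vocabulary and logits, for illustration only.
vocab = ["the", "a", "cat", "dog", "ran"]
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

probs = softmax(logits)

# Greedy decoding: always take the most probable token.
greedy_id = int(np.argmax(probs))

# Stochastic decoding: draw one token id according to the probabilities.
rng = np.random.default_rng(0)   # seeded so repeated runs match
sampled_id = int(rng.choice(len(vocab), p=probs))

print("greedy :", vocab[greedy_id])
print("sampled:", vocab[sampled_id])
```

Greedy selection is deterministic because `argmax` depends only on the logits, while the sampled token can change with each draw from the random generator.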
Temperature: T = 1.0. Range: 0.1 (peaked), 1.0 (default), 3.0 (flat).
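Temperature is commonly applied by dividing the logits by T before the softmax. The sketch below assumes that convention and uses hypothetical logits; it shows the top-1 probability at the three temperatures named above.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / T). T must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()              # stabilize exp()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits, for illustration only.
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

# Top-1 probability at a low, default, and high temperature.
top1 = {T: float(softmax_with_temperature(logits, T).max())
        for T in (0.1, 1.0, 3.0)}

for T, p in top1.items():
    print(f"T={T}: top-1 probability = {p:.3f}")
```

Low T stretches the gaps between logits, so nearly all probability mass lands on the best token; high T shrinks the gaps and flattens the distribution.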
Sampling Strategy: Greedy, Top-k, Top-p, or Top-k + p. Controls: top-k = 5, top-p = 0.90.
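The filtering strategies can be sketched as functions that zero out rejected tokens and renormalize what remains. The function names and the example distributions are assumptions for illustration. Applying top-k before top-p in the combined case is one common convention, not necessarily the order this page uses.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    probs = np.asarray(probs, dtype=np.float64)
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Count of tokens needed to reach mass p (always at least one).
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

# Hypothetical distributions: one moderately spread, one sharply peaked.
probs = [0.50, 0.30, 0.15, 0.04, 0.01]
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]

# Assumed order for the combined strategy: top-k first, then top-p.
combined = top_p_filter(top_k_filter(probs, 5), 0.90)

print("top-k=2        :", np.count_nonzero(top_k_filter(probs, 2)), "candidates")
print("top-p=0.9      :", np.count_nonzero(top_p_filter(probs, 0.90)), "candidates")
print("top-p=0.9 peak :", np.count_nonzero(top_p_filter(peaked, 0.90)), "candidates")
```

Greedy decoding is the special case `top_k_filter(probs, 1)`. Top-k always keeps the same number of candidates, whereas the top-p candidate count shrinks when the distribution is peaked and grows when it is flat.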
Statistics: Top-1 Prob, Candidates, Filtered, Entropy (bits), Eff. Vocab (tokens), Strategy.
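Two of these statistics can be computed directly from a probability vector. The sketch below assumes that entropy is the Shannon entropy in bits and that the effective vocabulary is 2 raised to that entropy (the perplexity of the distribution). The page does not state its exact definitions, so treat both as assumptions.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits; terms with zero probability contribute 0."""
    probs = np.asarray(probs, dtype=np.float64)
    nz = probs[probs > 0]
    h = -(nz * np.log2(nz)).sum()
    return float(max(0.0, h))    # clamp -0.0 and tiny negative rounding noise

def effective_vocab(probs):
    """Assumed definition: 2 ** entropy, the number of equally likely
    tokens that would give the same entropy."""
    return 2.0 ** entropy_bits(probs)

# Hypothetical distributions, for illustration only.
uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
skewed = [0.50, 0.30, 0.15, 0.04, 0.01]

for name, dist in [("uniform", uniform), ("one_hot", one_hot), ("skewed", skewed)]:
    print(f"{name}: top-1 = {max(dist):.2f}, "
          f"entropy = {entropy_bits(dist):.3f} bits, "
          f"effective vocab = {effective_vocab(dist):.2f} tokens")
```

Under these definitions a uniform distribution over four tokens has 2 bits of entropy and an effective vocabulary of 4, while a one-hot distribution has 0 bits and an effective vocabulary of 1.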
Panels: Top Tokens; Logits (raw model scores, fixed); Sampling Distribution; Samples (with a Resample button).
Presets
Click a preset to load an interesting configuration: Greedy, Nucleus (p=0.9), Top-k (k=5), Creative.
Explore and Questions
Set top-k to 5 and click Resample repeatedly — how many distinct tokens appear?
Compare top-p=0.9 at T=0.5 vs T=2.0 — how does the candidate count change?
Why does greedy always produce the same token, while top-p gives variety?
Set top-k=20 (all tokens) — how does it compare to no filtering?
Why does top-p adapt to the distribution shape while top-k doesn't?
If T=0.1 makes the distribution very peaked, does top-k even matter at low temperature?
How do top-k and top-p interact when combined? Which filter is applied first?
Now see how the model generates tokens one at a time → Autoregressive Generation
See how attention uses softmax for weight computation → Attention