LLM Inference Explorer
GEMM & Tiling Explorer
Try this:
Pick a model — see how the matrix sizes differ
Pick a GPU — watch the roofline and wave map change
Drag the batch/seq sliders — the M dimension grows
Change the tile size — see wave efficiency shift
Click rows in the table to inspect each op (the shape arithmetic behind the sliders is sketched below)
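The sliders all feed the same shape arithmetic. Here is a minimal sketch of it in Python, assuming a single dense GEMM C[M,N] = A[M,K] @ B[K,N] with hypothetical model dimensions; the explorer's exact formulas may differ:

```python
# Minimal sketch of the shape arithmetic behind the sliders.
# Assumption (hypothetical, not necessarily the explorer's model): one dense
# GEMM C[M,N] = A[M,K] @ B[K,N], where the weight matrix B is fixed by the
# model and M grows with the workload.

batch_size = 1
seq_len = 1024
d_model = 4096              # hypothetical hidden size

M = batch_size * seq_len    # rows of A: one per token in the batch
K = d_model                 # inner (reduction) dimension
N = d_model                 # columns of the weight matrix

flops = 2 * M * K * N       # one multiply + one add per (m, k, n) triple
print(f"M={M}, K={K}, N={N}, FLOPs={flops:.3e}")
# Doubling batch or sequence length doubles M; FLOPs scale linearly with M.
```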
Controls
Hardware: GPU, Model, Layer Operation (default: All Layers summary)
Workload: Batch Size (default 1), Sequence Length (default 1024)
Tiling: Tile Size (M × N)
Optimizations: FlashAttention toggle; Quantization (FP16 / INT8 / INT4)
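To see what the Quantization setting changes, note that fewer bytes per element means less data moved for the same FLOPs, so arithmetic intensity rises. A minimal sketch, assuming the whole GEMM is stored at the chosen precision and each matrix is touched exactly once (real deployments often quantize weights only, which this ignores):

```python
# Rough effect of quantization on GEMM memory traffic and arithmetic
# intensity. Simplified traffic model: read A and B once, write C once.
M, K, N = 1024, 4096, 4096

for fmt, bytes_per_elem in {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}.items():
    flops = 2 * M * K * N
    traffic = bytes_per_elem * (M * K + K * N + M * N)
    print(f"{fmt}: bytes={traffic:.3e}, AI={flops / traffic:.1f} FLOP/B")
# Halving bytes per element roughly doubles arithmetic intensity, which can
# push a memory-bound op toward the compute-bound regime.
```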
Summary (selected op): FLOPs, Bytes (tiled), Arithmetic Intensity, and Bound, updated as you change the configuration.
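The Bound readout follows from the roofline model: an op is memory-bound when its arithmetic intensity sits below the GPU's ridge point, the ratio of peak FLOP/s to peak bandwidth. A sketch with illustrative A100-class numbers (312 TFLOP/s FP16, 2 TB/s HBM); the each-matrix-read-once traffic model and the summarize helper are assumptions of this sketch:

```python
# Classify a GEMM as compute- or memory-bound via the roofline model.
# Hardware numbers are illustrative (A100-class FP16 peak and HBM bandwidth).
PEAK_FLOPS = 312e12            # FLOP/s
PEAK_BW = 2.0e12               # bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW   # FLOP/B where the two limits cross (~156)

def summarize(M, K, N, bytes_per_elem=2.0):
    flops = 2 * M * K * N
    traffic = bytes_per_elem * (M * K + K * N + M * N)  # idealized: each matrix touched once
    ai = flops / traffic
    bound = "compute" if ai >= RIDGE else "memory"
    time_s = max(flops / PEAK_FLOPS, traffic / PEAK_BW)  # whichever limit dominates
    return ai, bound, time_s * 1e3

for M in (1, 1024):
    ai, bound, ms = summarize(M, 4096, 4096)
    print(f"M={M:5d}: AI={ai:7.1f} FLOP/B, {bound}-bound, ~{ms:.3f} ms")
```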
Visualizations
Matrix Dimensions (of the selected op)
Tiling Grid & Wave Mapping (with a Waves animation)
Single Tile Accumulation
Memory Traffic: Naive vs Tiled
Roofline Model
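Both the wave map and the traffic bars reduce to counting tiles. Each output tile is computed by one thread block, so the grid executes in roughly ceil(tiles / SMs) waves, and a partial final wave is where efficiency leaks. A sketch, assuming one resident block per SM and that each tile re-reads a full strip of A and B:

```python
import math

# Tiling grid, wave count, and naive-vs-tiled memory traffic for
# C[M,N] = A[M,K] @ B[K,N]. Shapes and tile sizes are illustrative.
M, K, N = 4096, 4096, 4096
TILE_M, TILE_N = 128, 128
NUM_SMS = 108            # A100 has 108 SMs; one resident block per SM assumed
BYTES = 2                # FP16

tiles = math.ceil(M / TILE_M) * math.ceil(N / TILE_N)
waves = math.ceil(tiles / NUM_SMS)
wave_eff = tiles / (waves * NUM_SMS)   # fraction of SM slots busy across all waves

# Naive: every output element re-reads its row of A and column of B.
naive = BYTES * (2 * M * N * K + M * N)
# Tiled: each C tile reads a TILE_M x K strip of A and a K x TILE_N strip of B.
tiled = BYTES * (math.ceil(N / TILE_N) * M * K
                 + math.ceil(M / TILE_M) * K * N
                 + M * N)

print(f"tiles={tiles}, waves={waves}, wave efficiency={wave_eff:.1%}")
print(f"naive={naive:.3e} B, tiled={tiled:.3e} B, reduction={naive / tiled:.0f}x")
```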
Per-Layer GEMM Breakdown
Columns: Operation, M, K, N, FLOPs, Bytes (tiled), AI (FLOP/B), Bound, Time (ms)
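The table rows can be reconstructed from shapes alone. Below is a sketch for one transformer layer with hypothetical 7B-class dimensions (d_model 4096, FFN 11008, 32 heads); both these dimensions and the exact set of GEMMs the explorer counts are assumptions here:

```python
# Rebuild a per-layer GEMM table from shapes alone.
# Hypothetical 7B-class dims: d_model=4096, d_ff=11008, 32 heads; batch=1, seq=1024.
B, S, D, D_FF, H = 1, 1024, 4096, 11008, 32
HD = D // H              # head dimension
M = B * S                # tokens in flight

# (name, M, K, N, count) -- count folds in per-head / per-batch GEMMs
ops = [
    ("QKV proj",    M, D,    3 * D, 1),
    ("scores QK^T", S, HD,   S,     B * H),
    ("attn @ V",    S, S,    HD,    B * H),
    ("out proj",    M, D,    D,     1),
    ("FFN up",      M, D,    D_FF,  1),
    ("FFN down",    M, D_FF, D,     1),
]

print(f"{'Operation':<12}{'M':>6}{'K':>7}{'N':>7}{'FLOPs':>12}")
for name, m, k, n, count in ops:
    flops = 2 * m * k * n * count
    print(f"{name:<12}{m:>6}{k:>7}{n:>7}{flops:>12.3e}")
```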
Presets
Click a preset to load an interesting configuration:
Small tiles
Optimal tiles
H100
Explore
Try 64×64 tiles and watch wave efficiency drop — now switch to 128×128
Switch between GPUs (T4 → A100 → H100) with the same config — how does the number of waves change? (A sketch of this comparison follows the list.)
Look at the memory traffic bars — what's the reduction factor from naive to tiled?
Increase batch or sequence length until the tiling grid shows multiple colorful waves
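For the GPU-switching experiment above, the wave count changes because the SM count does. The SM counts below are the published ones (T4: 40, A100: 108, H100 SXM: 132); one resident thread block per SM is a simplifying assumption:

```python
import math

# Same tiling grid on different GPUs: the wave count tracks the SM count.
M, N, TILE = 4096, 4096, 128
tiles = math.ceil(M / TILE) * math.ceil(N / TILE)   # 1024 tiles

for gpu, sms in {"T4": 40, "A100": 108, "H100 (SXM)": 132}.items():
    waves = math.ceil(tiles / sms)
    eff = tiles / (waves * sms)
    print(f"{gpu:<11}: {waves:3d} waves, wave efficiency {eff:.1%}")
```

More SMs means fewer waves for the same grid, but the efficiency of the last, partial wave depends on how evenly the tile count divides the SM count, which is why the best tile size is GPU-dependent.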
Questions
Does increasing batch size change the arithmetic intensity of an operation? Why or why not? (A worked calculation follows.)
Why does the best tile size depend on which GPU you're using?
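For the first question, a quick calculation is more convincing than prose. With FLOPs = 2·M·K·N and idealized traffic proportional to M·K + K·N + M·N, growing M (batch × sequence) raises arithmetic intensity until the activation terms overtake the weight term; a sketch with illustrative shapes:

```python
# Does batch size change arithmetic intensity? Compute AI = FLOPs / bytes
# for a weight GEMM as M = batch * seq grows (FP16, idealized traffic).
K = N = 4096

for M in (1, 16, 256, 4096, 65536):
    flops = 2 * M * K * N
    traffic = 2 * (M * K + K * N + M * N)
    print(f"M={M:6d}: AI = {flops / traffic:8.1f} FLOP/B")
# Small M: traffic is dominated by the K*N weight read, so AI grows roughly
# linearly with M and the op is memory-bound. Large M: the M*K and M*N
# activation terms dominate and AI saturates near K*N/(K+N) (~2048 here).
```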
What Next
See tiling applied to attention specifically → Flash Attention
KV caching avoids redundant work during decoding → KV Cache & Memory