Every LLM forward pass is dominated by matrix multiplications (GEMMs). Tiling breaks these into blocks that fit in fast on-chip SRAM, so each value loaded from slow HBM is reused many times. Fewer HBM accesses mean higher throughput. Explore how tile size, model shape, and GPU specs interact.
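To make the idea concrete, here is a minimal pure-Python sketch of a tiled GEMM that also counts how many elements it reads from "slow" memory. The function names (`tiled_matmul`, `naive_loads`) and the load-counting model are illustrative assumptions, not the code behind this page; real kernels do the same thing with shared memory and registers on the GPU.

```python
def tiled_matmul(A, B, tile_m, tile_n, tile_k):
    """Multiply A (M x K) by B (K x N) tile by tile and count slow-memory loads."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    loads = 0  # elements read from "slow" memory (a stand-in for HBM)
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            i1, j1 = min(i0 + tile_m, M), min(j0 + tile_n, N)
            # Accumulator for one output tile, kept in "fast" memory.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, K, tile_k):
                k1 = min(k0 + tile_k, K)
                # Load one tile of A and one tile of B, once each.
                a_tile = [row[k0:k1] for row in A[i0:i1]]
                b_tile = [row[j0:j1] for row in B[k0:k1]]
                loads += (i1 - i0) * (k1 - k0) + (k1 - k0) * (j1 - j0)
                # Every loaded element is reused across the whole tile.
                for i, a_row in enumerate(a_tile):
                    for k, a in enumerate(a_row):
                        b_row = b_tile[k]
                        for j in range(j1 - j0):
                            acc[i][j] += a * b_row[j]
            # Write the finished tile back to the output matrix.
            for i in range(i1 - i0):
                C[i0 + i][j0:j1] = acc[i]
    return C, loads


def naive_loads(M, K, N):
    # Assumed no-reuse model: each of the M*N outputs reads K elements
    # of A and K elements of B straight from slow memory.
    return 2 * M * N * K
```

Under this simple model, square `t × t` tiles cut operand loads by a factor of `t` compared with `naive_loads`, which is the kind of reduction the memory traffic comparison visualizes.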
Controls: Hardware, Model, Layer Operation, Workload, Tiling, Optimizations
Summary (selected op): FLOPs, Bytes (tiled), Arithmetic Intensity, Bound
Panels: Matrix Dimensions; Tiling Grid & Wave Mapping; Single Tile Accumulation; Memory Traffic: Naive vs Tiled; Roofline Model
Per-Layer GEMM Breakdown
Operation | M | K | N | FLOPs | Bytes (tiled) | AI (FLOP/B) | Bound | Time (ms)
Click a preset to load an interesting configuration.
Try 64×64 tiles and watch wave efficiency drop, then switch to 128×128 and compare.
Switch between GPUs (T4 → A100 → H100) with the same config. How does the number of waves change?
Look at the memory traffic bars. What is the reduction factor from naive to tiled?
Increase batch or sequence length until the tiling grid shows multiple colorful waves.
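The wave experiments above can be approximated with a few lines of arithmetic. This is a rough sketch that assumes each SM works on `tiles_per_sm` output tiles at a time (one by default) and that all tiles take equal time; the scheduling model used by the page may differ. The helper name `wave_stats` is hypothetical, and the SM counts are the published figures for each GPU (132 is the SXM variant of the H100).

```python
import math


def wave_stats(M, N, tile_m, tile_n, num_sms, tiles_per_sm=1):
    """Return (output tiles, waves, wave efficiency) for an M x N output."""
    tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
    slots = num_sms * tiles_per_sm      # tiles the GPU can run concurrently
    waves = math.ceil(tiles / slots)    # the last wave may be only partly full
    efficiency = tiles / (waves * slots)
    return tiles, waves, efficiency


# Published SM counts; a partly filled last wave leaves some SMs idle.
SM_COUNTS = {"T4": 40, "A100": 108, "H100 (SXM)": 132}

if __name__ == "__main__":
    for gpu, sms in SM_COUNTS.items():
        for tile in (64, 128):
            tiles, waves, eff = wave_stats(2048, 4096, tile, tile, sms)
            print(f"{gpu:>10} {tile}x{tile}: {tiles} tiles, "
                  f"{waves} waves, efficiency {eff:.3f}")
```

Whether a smaller tile helps or hurts wave efficiency depends on how the tile count divides into the number of SMs, which is why the same configuration behaves differently across GPUs.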
Does increasing batch size change the arithmetic intensity of an operation? Why or why not?
Why does the best tile size depend on which GPU you're using?
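One way to explore the batch-size question is to compute arithmetic intensity directly and place it on a roofline. The sketch below assumes an idealized traffic model in which each operand is read once and the output is written once, with 2-byte elements. The page's "Bytes (tiled)" figure may use a different model, and the peak throughput and bandwidth values are hypothetical round numbers, not the specs of any particular GPU.

```python
def gemm_intensity(M, K, N, bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of an M x K by K x N GEMM."""
    flops = 2 * M * K * N  # one multiply and one add per inner-product term
    # Idealized traffic: read A and B once, write C once.
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved


def roofline(ai, peak_flops, mem_bw):
    """Return (attainable FLOP/s, bound) for a given arithmetic intensity."""
    ridge = peak_flops / mem_bw  # intensity where the two limits meet
    attainable = min(peak_flops, ai * mem_bw)
    return attainable, ("compute" if ai >= ridge else "memory")


if __name__ == "__main__":
    PEAK, BW = 300e12, 2e12  # hypothetical: 300 TFLOP/s, 2 TB/s
    for m in (1, 32, 512):   # M = batch * sequence length for a weight GEMM
        ai = gemm_intensity(m, 4096, 4096)
        _, bound = roofline(ai, PEAK, BW)
        print(f"M={m:4d}  AI={ai:7.1f} FLOP/B  {bound}-bound")
```

For a weight GEMM where batch and sequence length fold into `M`, a larger batch reuses the same weights across more rows, so intensity rises and the operation moves from memory-bound toward compute-bound. The ridge point `peak_flops / mem_bw` differs between GPUs, which is one reason the best tile size is hardware-dependent.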
See tiling applied to attention specifically → Flash Attention
KV caching avoids redundant work during decoding → KV Cache & Memory