GEMM & Tiling Explorer

Every LLM forward pass is dominated by matrix multiplications (GEMMs). Tiling breaks these into blocks that fit in fast on-chip GPU SRAM, so each value loaded from slow HBM is reused many times. Fewer HBM accesses means higher throughput. See how tile size, model shape, and GPU specs interact.
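The blocking idea can be sketched in plain Python. This is a minimal illustration, not how a real GPU kernel is written: the function name `tiled_matmul` and the `tile_k` parameter are assumptions made for this example, and the local `acc` list only stands in for an on-chip accumulator tile.

```python
def tiled_matmul(A, B, tile_m, tile_n, tile_k):
    """Multiply A (M x K) by B (K x N) one output tile at a time.

    Illustrative only: `acc` plays the role of the accumulator that a GPU
    kernel would keep in registers or SRAM while it streams K-slices of
    A and B from HBM.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]

    for m0 in range(0, M, tile_m):          # tile rows of the output
        rows = min(tile_m, M - m0)
        for n0 in range(0, N, tile_n):      # tile columns of the output
            cols = min(tile_n, N - n0)
            acc = [[0.0] * cols for _ in range(rows)]

            # Walk along K in slices; each slice adds a partial product
            # into the same accumulator tile.
            for k0 in range(0, K, tile_k):
                k1 = min(k0 + tile_k, K)
                for i in range(rows):
                    for j in range(cols):
                        partial = 0.0
                        for k in range(k0, k1):
                            partial += A[m0 + i][k] * B[k][n0 + j]
                        acc[i][j] += partial

            # Write the finished tile back once.
            for i in range(rows):
                C[m0 + i][n0:n0 + cols] = acc[i]
    return C
```

Each output tile is written back exactly once, after all of its K-slices have been accumulated. That single write-back is the property that lets a real kernel keep the partial sums on chip.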

Hardware

GPU

Model

Layer Operation

Workload

Batch Size: 1
Sequence Length: 1024

Tiling

Tile Size (M × N)

Optimizations

FlashAttention Quantization

Summary (selected op)

FLOPs

—

Bytes (tiled)

—

Arith. Intensity

—

Bound

—

Matrix Dimensions —

Tiling Grid & Wave Mapping

Single Tile Accumulation

Memory Traffic: Naive vs Tiled

Roofline Model

Per-Layer GEMM Breakdown

Operation	M	K	N	FLOPs	Bytes (tiled)	AI (FLOP/B)	Bound	Time (ms)

Click a preset to load an interesting configuration.

Try 64×64 tiles and watch wave efficiency drop, then switch to 128×128 and compare.
Switch between GPUs (T4 → A100 → H100) with the same config. How does the number of waves change?
Look at the memory traffic bars. What is the reduction factor from naive to tiled?
Increase batch size or sequence length until the tiling grid shows multiple colored waves.
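The experiments above can also be approximated offline. The sketch below rests on simplifying assumptions that may differ from the model this page uses: one output tile per SM per wave, a "naive" kernel that re-reads a full row of A and column of B for every output element, and a tiled kernel that reads one row-panel of A and one column-panel of B per output tile. The names `wave_stats` and `traffic_elements` are hypothetical. The SM counts are the commonly quoted figures (T4: 40, A100: 108, H100 SXM: 132) and should be checked against the GPU you actually target.

```python
import math

# Assumed SM counts; verify for your exact GPU variant.
SM_COUNT = {"T4": 40, "A100": 108, "H100": 132}


def wave_stats(M, N, tile_m, tile_n, num_sms):
    """Return (tiles, waves, efficiency), assuming one tile per SM per wave."""
    tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
    waves = math.ceil(tiles / num_sms)
    # Fraction of SM slots doing useful work across all waves.
    efficiency = tiles / (waves * num_sms)
    return tiles, waves, efficiency


def traffic_elements(M, K, N, tile_m, tile_n):
    """Return (naive, tiled) element counts moved to or from HBM.

    Simplified model: no cache reuse between tiles, output written once.
    """
    writes = M * N
    # Naive: every output element reads K values of A and K values of B.
    naive = 2 * M * N * K + writes
    # Tiled: A is re-read once per tile column, B once per tile row.
    tiles_m = math.ceil(M / tile_m)
    tiles_n = math.ceil(N / tile_n)
    tiled = M * K * tiles_n + K * N * tiles_m + writes
    return naive, tiled


if __name__ == "__main__":
    M, K, N = 1024, 4096, 4096
    for gpu, sms in SM_COUNT.items():
        for t in (64, 128):
            tiles, waves, eff = wave_stats(M, N, t, t, sms)
            print(f"{gpu} {t}x{t}: {tiles} tiles, {waves} waves, {eff:.1%} efficient")
    naive, tiled = traffic_elements(M, K, N, 128, 128)
    print(f"traffic reduction: {naive / tiled:.1f}x")
```

Under this model the last wave is usually only partly full, which is why efficiency depends on how the tile count divides by the SM count rather than on tile size alone.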

Does increasing batch size change the arithmetic intensity of an operation? Why or why not?
Why does the best tile size depend on which GPU you're using?
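One way to explore both questions numerically is to compute arithmetic intensity directly and compare it with a GPU's ridge point (peak FLOP/s divided by memory bandwidth). The sketch below assumes the idealized minimum traffic, where each of A, B, and C crosses HBM exactly once. The page's "Bytes (tiled)" figure may use a different model. The function names are hypothetical, and the A100 numbers in the example (about 312 TFLOP/s FP16 and about 1.555 TB/s) are approximate published figures used only for illustration.

```python
def gemm_intensity(M, K, N, bytes_per_elem=2):
    """FLOPs per byte for C = A @ B with ideal (read-once, write-once) traffic."""
    flops = 2 * M * K * N                       # one multiply + one add per term
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
    return flops / bytes_moved


def roofline_bound(intensity, peak_flops, bandwidth):
    """Classify an op as 'memory' or 'compute' bound on a simple roofline."""
    ridge = peak_flops / bandwidth              # FLOP/B where the two roofs meet
    return "compute" if intensity >= ridge else "memory"


if __name__ == "__main__":
    # Assumed, approximate A100 figures; substitute your own GPU's specs.
    PEAK, BW = 312e12, 1.555e12
    K = N = 4096
    for batch, seq in [(1, 1), (1, 1024), (8, 1024)]:
        M = batch * seq                         # rows of the activation matrix
        ai = gemm_intensity(M, K, N)
        print(f"M={M}: AI={ai:.1f} FLOP/B, {roofline_bound(ai, PEAK, BW)}-bound")
```

Varying `batch` and `seq` changes only M here, so the output shows how intensity responds to workload size for a fixed weight shape. Swapping in another GPU's peak and bandwidth moves the ridge point.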

See tiling applied to attention specifically → Flash Attention
KV caching avoids redundant work during decoding → KV Cache & Memory