GEMM & Tiling Explorer

Every LLM forward pass is dominated by matrix multiplications (GEMMs). Tiling breaks these into blocks that fit in fast on-chip GPU SRAM, so each value loaded from slow HBM is reused many times; fewer HBM accesses means higher throughput. See how tile size, model shape, and GPU specs interact.
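To make the idea concrete, here is a minimal pure-Python sketch of a tiled GEMM. It is an illustration only, not the explorer's implementation: the function names and the tile sizes `tile_m`, `tile_n`, `tile_k` are assumptions, and Python lists stand in for HBM and SRAM. Each output tile keeps a small accumulator (the stand-in for fast memory) and walks along the K dimension one slab at a time.

```python
def matmul_naive(A, B):
    """Reference GEMM: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def matmul_tiled(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Tiled GEMM sketch; tile sizes are illustrative, not tuned values."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile_m):
        i1 = min(i0 + tile_m, M)
        for j0 in range(0, N, tile_n):
            j1 = min(j0 + tile_n, N)
            # Accumulator for one output tile (would live in registers/SRAM).
            acc = [[0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, K, tile_k):
                k1 = min(k0 + tile_k, K)
                # "Load" one slab of A and one slab of B into fast memory.
                a_tile = [row[k0:k1] for row in A[i0:i1]]
                b_tile = [row[j0:j1] for row in B[k0:k1]]
                # Every loaded value is reused across the whole tile.
                for i in range(i1 - i0):
                    for j in range(j1 - j0):
                        for k in range(k1 - k0):
                            acc[i][j] += a_tile[i][k] * b_tile[k][j]
            # Write the finished tile back to the output once.
            for i in range(i1 - i0):
                C[i0 + i][j0:j1] = acc[i]
    return C
```

The tiled version produces the same result as the naive one; only the order of the work, and therefore the amount of data reuse per load, changes.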

Hardware

Model

Layer Operation

Workload

Tiling

Optimizations

Summary (selected op)

FLOPs
Bytes (tiled)
Arith. Intensity
Bound

Matrix Dimensions

Tiling Grid & Wave Mapping
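One common way to reason about a tiling grid is to count output tiles and group them into "waves" of concurrently running tiles. The sketch below assumes one output tile per SM at a time, which is a simplification; the function name `wave_mapping` and its parameters are hypothetical and not taken from the explorer.

```python
import math


def wave_mapping(M, N, tile_m, tile_n, num_sms):
    """Count output tiles and waves, assuming one tile per SM per wave."""
    tiles = math.ceil(M / tile_m) * math.ceil(N / tile_n)
    waves = math.ceil(tiles / num_sms)
    # Tiles left for the final wave; a partly filled last wave idles SMs.
    last_wave_tiles = tiles - (waves - 1) * num_sms
    return {
        "tiles": tiles,
        "waves": waves,
        "last_wave_utilization": last_wave_tiles / num_sms,
    }


# Example: a 4096 x 4096 output with 128 x 128 tiles on a hypothetical 108-SM GPU.
print(wave_mapping(4096, 4096, 128, 128, 108))
```

Under these assumptions the example yields 1024 tiles in 10 waves, with the last wave occupying only 52 of the 108 SMs. That tail effect is why tile size and SM count interact.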

Single Tile Accumulation

Memory Traffic: Naive vs Tiled
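A simple traffic model shows why tiling helps. The sketch below is an assumption-laden estimate, not necessarily the model the explorer uses: the naive case reloads one row of A and one column of B for every output element with no caching, while the tiled case reads all of A once per column of tiles and all of B once per row of tiles. The function names and the default `dtype_bytes=2` (a 16-bit type) are illustrative.

```python
import math


def naive_bytes(M, K, N, dtype_bytes=2):
    """Each of the M*N outputs reads K values of A and K of B, then writes once."""
    return dtype_bytes * (2 * M * N * K + M * N)


def tiled_bytes(M, K, N, tile_m, tile_n, dtype_bytes=2):
    """A is re-read once per column of tiles, B once per row of tiles."""
    tiles_m = math.ceil(M / tile_m)
    tiles_n = math.ceil(N / tile_n)
    elements = M * K * tiles_n + K * N * tiles_m + M * N
    return dtype_bytes * elements


M = K = N = 4096
ratio = naive_bytes(M, K, N) / tiled_bytes(M, K, N, 128, 128)
print(f"naive / tiled traffic: {ratio:.1f}x")
```

In this model, larger tiles cut traffic roughly in proportion to the tile edge, until the tile no longer fits in SRAM.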

Roofline Model
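The roofline model classifies a GEMM by comparing its arithmetic intensity (FLOPs per byte moved) with the hardware's ridge point (peak FLOP/s divided by memory bandwidth). The sketch below uses the standard 2*M*K*N FLOP count; the function name, the hardware figures (300 TFLOP/s, 2 TB/s), and the byte counts in the example are placeholder assumptions, not values from any particular GPU or from the explorer.

```python
def roofline(M, K, N, bytes_moved, peak_flops, mem_bw):
    """Return arithmetic intensity, bound type, and a lower-bound time estimate."""
    flops = 2 * M * K * N                  # one multiply and one add per MAC
    ai = flops / bytes_moved               # FLOP per byte
    ridge = peak_flops / mem_bw            # AI where both limits meet
    attainable = min(peak_flops, ai * mem_bw)
    return {
        "flops": flops,
        "ai": ai,
        "bound": "compute" if ai >= ridge else "memory",
        "time_ms": flops / attainable * 1e3,
    }


PEAK, BW = 300e12, 2e12   # hypothetical GPU: 300 TFLOP/s, 2 TB/s

# Prefill-like square GEMM, each 16-bit matrix touched once.
print(roofline(1024, 1024, 1024, 2 * 3 * 1024 * 1024, PEAK, BW))

# Decode-like GEMM with M = 1: the weight matrix dominates the bytes.
print(roofline(1, 4096, 4096, 2 * (4096 + 4096 * 4096 + 4096), PEAK, BW))
```

With these placeholder numbers the square GEMM lands on the compute-bound side of the ridge, while the M = 1 case has an intensity near 1 FLOP/B and is memory bound.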

Per-Layer GEMM Breakdown

Operation | M | K | N | FLOPs | Bytes (tiled) | AI (FLOP/B) | Bound | Time (ms)

Click a preset to load an interesting configuration.