Layer Overview

This view traces every operation inside a single transformer layer: how input tokens flow through attention and the feed-forward network (FFN), and the matrix multiplications each step performs.

Click any block in the diagram to see its details. Change the model or workload to see how dimensions scale.

Model

Hardware

Workload

Optimizations

Selected Operation

Name
Category
Matrix (M×K×N)
FLOPs
Traffic
Arith. Intensity
Bound
Time
Explore tiling for this op →
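The per-operation metrics listed above can be derived from the matmul shape alone. The sketch below shows one plausible way to compute them for a single M×K×N matrix multiply; the dtype size and the peak-compute/bandwidth figures are illustrative placeholders, not values tied to any particular model or GPU in this tool.

```python
# Hedged sketch: derive FLOPs, traffic, arithmetic intensity, bound, and
# time for one M x K x N matmul. bytes_per_elem, peak_flops, and peak_bw
# are assumed placeholder values (fp16 elements, a ~312 TFLOP/s,
# 2 TB/s accelerator), not taken from this tool's hardware presets.

def matmul_metrics(M, K, N, bytes_per_elem=2,
                   peak_flops=312e12, peak_bw=2.0e12):
    flops = 2 * M * K * N                          # one multiply + one add per MAC
    traffic = bytes_per_elem * (M*K + K*N + M*N)   # read A, read B, write C once
    ai = flops / traffic                           # FLOPs per byte moved
    t_compute = flops / peak_flops
    t_memory = traffic / peak_bw
    bound = "compute" if t_compute >= t_memory else "memory"
    return {"flops": flops, "traffic": traffic, "ai": ai,
            "bound": bound, "time_ms": max(t_compute, t_memory) * 1e3}
```

For example, a square 4096×4096×4096 GEMM lands compute-bound under these assumptions, while an M=1 matrix-vector shape (typical of decode) lands memory-bound.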

Transformer Layer — Operation Flow

FLOPs Breakdown (per layer)

Memory Traffic Breakdown (per layer)

All Operations

Operation | Type | M | K | N | FLOPs | Traffic | AI (F/B) | Bound | Time (ms) | % FLOPs

Roofline Model

Click a preset to load an interesting configuration.