Flash Attention

Standard attention materializes the full S×S score matrix in GPU high-bandwidth memory (HBM). Flash Attention tiles the computation so that each block of scores stays in fast on-chip SRAM: same math, but the full score matrix is never written to HBM.
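To make "same math" concrete, here is a minimal sketch in plain Python of one query row computed two ways: with all scores at once, and block by block with a running (online) softmax. The function names are hypothetical, and the sketch illustrates the tiling idea only, not the actual GPU kernel.

```python
import math
import random


def standard_attention_row(q, K, V):
    """Reference: build every score for one query, then softmax, then mix V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def tiled_attention_row(q, K, V, block_size):
    """Same result, but only block_size scores exist at any one time."""
    d = len(q)
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    random.seed(0)
    S, d = 16, 4
    q = [random.gauss(0, 1) for _ in range(d)]
    K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
    V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
    ref = standard_attention_row(q, K, V)
    tiled = tiled_attention_row(q, K, V, block_size=4)
    print(max(abs(a - b) for a, b in zip(ref, tiled)))
```

The two results should agree up to floating-point rounding. The rescaling step is what lets the tiled version skip storing the full row of scores.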

Controls: Sequence Length (S = 512), Head Dimension (d_head), Block Size (Flash).

Readouts: HBM Traffic (Standard, Flash, Reduction) and Peak Memory of the score matrix (Standard, Flash).


Click a preset to load an interesting configuration.

Compare S=64 with S=4096: standard attention's score matrix grows as S² while Flash's tile stays the same size, so Flash's peak-memory advantage grows quadratically.
Change the block size and watch how the number of tiles and the SRAM usage change.
At what sequence length does Flash Attention's memory savings become dramatic?
Notice that the FLOPs are essentially identical. When Flash is faster, it is because the S×S score matrix never makes a round trip through HBM, not because it performs fewer operations.
Try d_head=128, block_size=32: Flash actually uses more HBM traffic because it re-reads the K and V blocks many times. Now try d_head=32, block_size=128: fewer, larger tiles mean fewer re-reads, and Flash wins on traffic too.
The HBM traffic tradeoff depends on the ratio d_head/block_size. When d_head is large relative to the block size, the K and V data that Flash re-reads outweighs what standard attention writes to its score matrix.
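The tradeoff can be checked with a back-of-the-envelope count of elements moved to and from HBM. The formulas below are an assumed model, not the demo's own: standard attention writes and re-reads the score matrix across separate passes, while the tiled version streams all of K and V once per Q block. Both function names are hypothetical.

```python
def standard_hbm_elements(S, d):
    """Elements moved by unfused attention (assumed three-pass model)."""
    qk_read = 2 * S * d        # read Q and K to form the scores
    score_write = S * S        # write the S x S score matrix
    softmax_rw = 2 * S * S     # read scores, write probabilities
    pv_read = S * S + S * d    # read probabilities and V
    out_write = S * d          # write the output
    return qk_read + score_write + softmax_rw + pv_read + out_write


def flash_hbm_elements(S, d, block_size):
    """Elements moved when every Q block re-reads all of K and V."""
    num_q_blocks = -(-S // block_size)    # ceiling division
    q_read = S * d                        # each Q block is read once
    kv_reads = num_q_blocks * 2 * S * d   # K and V streamed per Q block
    out_write = S * d                     # output written once
    return q_read + kv_reads + out_write


if __name__ == "__main__":
    S = 512
    for d, block in [(128, 32), (32, 128)]:
        std = standard_hbm_elements(S, d)
        fl = flash_hbm_elements(S, d, block)
        print(f"d_head={d} block_size={block} standard={std} flash={fl}")
```

Under these assumptions the K and V re-read term scales as S²·d_head/block_size, against a fixed multiple of S² for the standard score matrix, which is why the ratio d_head/block_size decides the winner. In this sketch the crossover sits near d_head/block_size = 2; the demo's own accounting may place it elsewhere.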

Flash Attention always wins on peak memory. Does it always win on HBM traffic too? Try different d_head and block_size values.
What happens if the block size is too large for SRAM? What if it's too small?
Why does standard attention's memory grow as O(S²) while Flash stays O(block_size²)?
When d_head > block_size, Flash's HBM traffic can exceed standard's. Why? (Hint: count how many times each K,V block is re-read from HBM.)
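As a starting point for these questions, the sketch below estimates the peak score memory for both approaches and the SRAM working set of one tile. The working-set formula (Q, K, V and output blocks plus one score tile, all at the same precision), the 2-byte element size, and the 100 KiB SRAM budget are illustrative assumptions, not properties of any particular GPU or of the demo.

```python
def score_memory_bytes(S, block_size, bytes_per_element=2):
    """Peak bytes held for scores: the full matrix vs a single tile."""
    standard = S * S * bytes_per_element
    flash = block_size * block_size * bytes_per_element
    return standard, flash


def tile_sram_bytes(d, block_size, bytes_per_element=2):
    """Assumed per-tile working set: Q, K, V, output blocks and score tile."""
    blocks = 4 * block_size * d         # Q, K, V and output accumulator
    scores = block_size * block_size    # one tile of scores
    return (blocks + scores) * bytes_per_element


def fits_in_sram(d, block_size, sram_bytes=100 * 1024, bytes_per_element=2):
    """True if one tile's working set fits the (hypothetical) SRAM budget."""
    return tile_sram_bytes(d, block_size, bytes_per_element) <= sram_bytes


if __name__ == "__main__":
    for S in (64, 4096):
        std, fl = score_memory_bytes(S, block_size=64)
        print(f"S={S}: standard={std} B, flash={fl} B, ratio={std // fl}")
    for block in (64, 256):
        print(f"block_size={block}: tile={tile_sram_bytes(64, block)} B, "
              f"fits={fits_in_sram(64, block)}")
```

The standard figure depends only on S, while the Flash figure depends only on the block size, which is the O(S²) versus O(block_size²) contrast. A block that is too large fails the SRAM check; a block that is too small fits easily but produces more tiles, and therefore more re-reads of K and V from HBM.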

See how temperature controls the softmax in attention → Decoding & Temperature
Understand the full attention mechanism → Attention
See how tiling applies to matrix multiplications → GEMM Tiling