Standard attention materializes the full S×S score matrix in GPU high-bandwidth memory (HBM). Flash Attention tiles the computation so the working set stays in fast on-chip SRAM — same math, far less memory traffic.
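The tiling idea can be sketched in a few lines of NumPy. This is a simplified single-head sketch (only K and V are tiled, and everything lives in host memory rather than SRAM), not the real kernel; the function names and block size are illustrative:

```python
import numpy as np

def standard_attention(Q, K, V):
    """Materializes the full S x S score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # (S, S) scores held at once
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block=4):
    """Tiled attention with online softmax: only one score tile at a time."""
    S_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(S_len, -np.inf)   # running row max
    l = np.zeros(S_len)           # running softmax denominator
    for j in range(0, S_len, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S_tile = Q @ Kj.T * scale                 # only an (S, block) tile
        m_new = np.maximum(m, S_tile.max(axis=-1))
        P = np.exp(S_tile - m_new[:, None])
        alpha = np.exp(m - m_new)                 # rescale old accumulators
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(standard_attention(Q, K, V), flash_attention(Q, K, V))
```

The `alpha` rescaling is the online-softmax trick: it corrects earlier partial sums whenever a later tile raises the running row maximum, which is why the tiled result is exact rather than approximate.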
Sequence Length
Head Dimension (d_head)
Block Size (Flash)
HBM Traffic
Standard: —
Flash: —
Reduction: —
Peak Memory (score matrix)
Standard: —
Flash: —
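The readouts above can be approximated with a first-order cost model in the spirit of FlashAttention's IO analysis. The constants here (fp16 tensors, Q and O re-read once per K/V block) are illustrative assumptions, not exact kernel counts:

```python
import math

DTYPE_BYTES = 2  # assume fp16 tensors

def standard_hbm_bytes(S, d):
    # Read Q, K, V and write O: 4*S*d elements. Write the S x S scores,
    # read them back for softmax, write the probabilities, read them
    # again for P @ V: roughly 4*S*S elements.
    return (4 * S * d + 4 * S * S) * DTYPE_BYTES

def flash_hbm_bytes(S, d, B):
    # K and V stream through SRAM once; Q and O make one pass per
    # K/V block, i.e. ceil(S/B) passes (FlashAttention-v1 style loop).
    passes = math.ceil(S / B)
    return (2 * S * d + 2 * passes * S * d) * DTYPE_BYTES

def score_matrix_bytes(S):
    # Standard attention's peak extra memory: the materialized scores.
    return S * S * DTYPE_BYTES

s, f = standard_hbm_bytes(4096, 64), flash_hbm_bytes(4096, 64, 128)
print(f"standard ~{s / 2**20:.0f} MiB, flash ~{f / 2**20:.0f} MiB, "
      f"reduction ~{s / f:.1f}x")
```

With these assumptions the S² score traffic dominates the standard path, while Flash's traffic scales with the number of K/V passes — which is what the Reduction readout tracks.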
Click a preset to load an interesting configuration.
Compare memory traffic at S=64 vs S=4096 — the absolute savings grow roughly quadratically with S.
Change the block size and watch how the number of tiles and the SRAM usage change.
At what sequence length does Flash Attention's memory savings become dramatic?
Notice that FLOPs are identical — Flash is faster because of fewer HBM accesses, not fewer operations.
Flash Attention does the same FLOPs but fewer memory accesses — why is it faster? (Hint: memory bandwidth is the bottleneck)
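One way to convince yourself the arithmetic is unchanged: count the dominant matmul FLOPs both ways. A small sketch (softmax work ignored, and B assumed to divide S for simplicity):

```python
def standard_flops(S, d):
    # Q @ K^T costs 2*S*S*d multiply-adds; P @ V costs another 2*S*S*d.
    return 2 * S * S * d + 2 * S * S * d

def flash_flops(S, d, B):
    # Each (S x B) tile: 2*S*B*d for Q @ Kj^T and 2*S*B*d for P @ Vj.
    # Summed over the S // B tiles, the total is identical.
    tiles = S // B
    return tiles * (2 * S * B * d + 2 * S * B * d)

assert standard_flops(4096, 64) == flash_flops(4096, 64, 128)
```

Tiling only reorders the arithmetic; the speedup comes entirely from moving fewer bytes through the bandwidth-bound HBM interface.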
What happens if the block size is too large for SRAM? What if it's too small?
Why does standard attention's memory grow as O(S²) while Flash stays O(S)?
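A back-of-envelope sketch of that scaling, under the assumption that Flash keeps only per-row running softmax statistics, the output accumulator, and one constant-size score tile (fp16 throughout; the constants are illustrative):

```python
def standard_peak_bytes(S, dtype_bytes=2):
    # The S x S score matrix must exist all at once: O(S^2).
    return S * S * dtype_bytes

def flash_peak_bytes(S, d, B, dtype_bytes=2):
    # Per-row running max and denominator (2 floats) plus the output
    # accumulator (d floats) grow linearly in S; the B x B score tile
    # is constant-sized.
    return (S * (d + 2) + B * B) * dtype_bytes

for S in (64, 1024, 4096):
    ratio = standard_peak_bytes(S) / flash_peak_bytes(S, 64, 128)
    print(S, round(ratio, 1))
```

Doubling S quadruples the standard footprint but only doubles Flash's, so the ratio between them grows linearly in S — which is why the savings look modest at S=64 and dramatic at S=4096.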