LLM Inference Explorer
Tensor Shapes
Follow a tensor through one transformer layer. See how dimensions change at each stage; this is why FFN dominates FLOPs.
Controls: Model · Batch Size: 1 · Sequence Length: 64
Tensor Flow Through One Layer
FLOPs Summary
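The two panels above are interactive; as a static stand-in, here is a minimal sketch of what they show: the shape and matmul FLOP count at each stage of one decoder layer. The dimensions (d_model = 768, 12 heads, 4× FFN expansion) are an assumed GPT-2-small-sized configuration, not values read from the explorer, and only matmul FLOPs are counted.

```python
# Sketch of the per-stage shapes and matmul FLOPs in one decoder layer.
# Counting rule: an [m, k] @ [k, n] matmul costs ~2*m*k*n FLOPs.
# Dims below are an assumption (GPT-2-small-like), not read from the explorer.

def trace_layer(B=1, S=64, d=768, h=12):
    d_head = d // h
    stages = []

    def matmul(name, a, b):
        *batch, m, k = a
        n = b[-1]
        reps = 1
        for r in batch:
            reps *= r
        flops = 2 * reps * m * k * n
        out = (*batch, m, n)
        stages.append((name, a, b, out, flops))
        return out

    x = (B, S, d)                                          # layer input
    matmul("Q proj",    x, (d, d))                         # [B,S,d] @ [d,d]  -> [B,S,d]
    matmul("K proj",    x, (d, d))
    matmul("V proj",    x, (d, d))
    # split into heads: [B, h, S, d_head]
    matmul("QK^T",      (B, h, S, d_head), (d_head, S))    # -> [B,h,S,S], grows as S^2
    matmul("Score x V", (B, h, S, S), (S, d_head))         # -> [B,h,S,d_head], also S^2
    matmul("O proj",    x, (d, d))                         # heads merged back to [B,S,d] first
    matmul("FFN up",    x, (d, 4 * d))                     # [B,S,d] @ [d,4d] -> [B,S,4d]
    matmul("FFN down",  (B, S, 4 * d), (4 * d, d))         # back to [B,S,d]

    total = sum(s[-1] for s in stages)
    for name, a, b, out, f in stages:
        print(f"{name:10s} {str(a):>17} @ {str(b):>12} -> {str(out):>17} "
              f"{f / 1e9:6.3f} GFLOPs ({100 * f / total:4.1f}%)")
    ffn = sum(s[-1] for s in stages if s[0].startswith("FFN"))
    print(f"total: {total / 1e9:.3f} GFLOPs  (FFN share ~{100 * ffn / total:.0f}%)")

trace_layer()
```

With the default B=1, S=64, the two FFN matmuls account for roughly two thirds of the layer's matmul FLOPs, which is the point the FLOPs Summary panel makes.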
Presets
Click a preset to load an interesting configuration.
GPT-2 decode
LLaMA 7B prefill
LLaMA 70B batch
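For orientation, here is a sketch of the sort of configuration each preset plausibly loads. The model dimensions are the published sizes of each model; the batch and sequence values are illustrative guesses about what each preset is meant to demonstrate, not values taken from the explorer.

```python
# Published model dims; batch/seq values are illustrative guesses, not the explorer's.
PRESETS = {
    "GPT-2 decode":     dict(d=768,  heads=12, layers=12, batch=1,  seq=1),    # one new token at a time
    "LLaMA 7B prefill": dict(d=4096, heads=32, layers=32, batch=1,  seq=2048), # long prompt, S^2 attention bites
    "LLaMA 70B batch":  dict(d=8192, heads=64, layers=80, batch=16, seq=128),  # many requests served together
}
```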
Explore
Increase sequence length: which operations grow fastest? (QKᵀ and Score×V scale as S².)
Compare FFN Up dimensions to Q Projection dimensions — FFN Up is [d, 4d] vs Q's [d, d]. That's why FFN dominates FLOPs.
Switch between GPT-2 and LLaMA 70B — how do the relative proportions change?
Questions
Why does FFN typically account for ~65% of FLOPs? Look at the weight matrix dimensions.
At what sequence length do the attention FLOPs (which scale as S²) overtake the FFN FLOPs (which scale as S)? (Worked out in the sketch below.)
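A back-of-the-envelope take on the last two questions, counting only matmul FLOPs per layer at batch 1 and using the [m, k] @ [k, n] ≈ 2·m·k·n rule from the sketch above: the Q/K/V/O projections cost about 8·S·d² FLOPs and the two FFN matmuls about 16·S·d², so the FFN holds 16/24 = two thirds of the linear FLOPs, which works out to roughly 65% of the layer total at short sequence lengths. The two S²-scaling score matmuls (QKᵀ and Score×V) cost about 4·S²·d, so they overtake the FFN when 4·S²·d > 16·S·d², i.e. around S = 4·d.

```python
# Crossover where the S^2 score matmuls (~4*S^2*d FLOPs) pass the FFN (~16*S*d^2 FLOPs):
# 4*S^2*d > 16*S*d^2  =>  S > 4*d
for name, d in [("GPT-2 small", 768), ("LLaMA 7B", 4096), ("LLaMA 70B", 8192)]:
    print(f"{name:12s} S ~ {4 * d}")
```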
What Next
Now zoom out to see all layers stacked →
Model Overview