How it works

Making it fast

Model Overview

See the full transformer model — every layer stacked. Click any layer to drill into per-operation detail.

Model

Hardware

Quantization

Workload

Batch Size 1 Sequence Length 1 (decode)

Summary

Transformer Layer Stack

Model Totals (per decode token, batch=1)

Click a preset to load an interesting configuration.

Load GPT-2 — only 12 layers. Now switch to 405B to see the scale difference
Click any transformer layer to drill into per-operation detail
Set 70B to INT4 and watch weight memory drop by 4×
Compare the attention-vs-FFN split across different model sizes

How does the attention-vs-FFN FLOPs split change between small and large models?
Why does weight memory scale linearly with parameters but FLOPs scale differently?

Drill into one layer to see per-operation costs → Layer Detail