Under the Hood: LLM Inference Costs
How language models work under the hood — from tokens to tensors to inference costs.
What happens when you send a prompt?
When you type a prompt into ChatGPT, Claude, or any LLM, the same
sequence of events unfolds:
1. Tokenize your input into a sequence of integer token IDs.
2. Prefill — feed all input tokens through the model's
transformer layers in parallel. This produces the internal
representation and the first output token.
3. Decode — generate output tokens one at a time, each
requiring a full pass through every layer.
4. Repeat step 3 until the model produces a stop token or hits the
length limit.
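The four steps above can be sketched as a toy loop. The `ToyModel` below is an invented stub (its "prefill" and "decode" just manipulate integers) — it only illustrates the control flow, not a real inference API:

```python
# Toy illustration of tokenize -> prefill -> decode -> stop.
# ToyModel is a stand-in: it returns made-up token ids, not real logits.

class ToyModel:
    def prefill(self, ids):
        # Process all prompt tokens "in parallel"; return state + first token.
        return list(ids), (sum(ids) % 5) + 1

    def decode(self, state, last_id):
        # One full pass through the model per generated token.
        state.append(last_id)
        return state, (last_id + 1) % 5   # 0 acts as the stop token here

def generate(model, prompt_ids, max_new=10, stop_id=0):
    state, tok = model.prefill(prompt_ids)      # prefill phase
    out = [tok]
    while tok != stop_id and len(out) < max_new:
        state, tok = model.decode(state, tok)   # decode phase, one token/step
        out.append(tok)
    return out

print(generate(ToyModel(), [3, 1, 4]))
```

The structure is the point: one parallel prefill call, then a serial loop where each iteration produces exactly one token.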
Prefill phase
Processes all input tokens at once.
Compute-bound — the GPU's arithmetic units are the
bottleneck. Lots of data to process, but it's all available up
front.
Decode phase
Generates one token per step.
Memory-bound — the GPU must load all model weights
from memory for each token, but only does a small amount of
compute.
The central tension: the GPU has enormous compute power, but getting data
to the compute units is the bottleneck. Most of LLM inference engineering
is about managing this tension.
Matrix multiplication is everything
Inside each transformer layer, almost all the work is matrix
multiplications (GEMMs). There are two main groups:
Attention
- Q, K, V projections — three weight matrices multiply
the input to produce queries, keys, and values.
- QKᵀ — queries times keys (transposed) to
compute attention scores. This is the one that scales with sequence
length squared.
- Score × V — attention-weighted sum of values.
- Output projection — one more weight matrix.
Feed-Forward Network (FFN)
- Up projection — expands from hidden size to ~4×
hidden size.
- Gate projection — (in models like LLaMA) a parallel
expansion for gated activation.
- Down projection — contracts back to hidden size.
For an M×K matrix times a K×N matrix, the compute cost is:
FLOPs = 2 × M × K × N
A model like LLaMA 7B has hidden=4096 and
FFN dim=11008. That single FFN up projection is a 4096×11008
weight matrix — roughly 45 million multiply-accumulates per token,
billions of operations per layer over a full sequence, 32 layers deep.
The FFN layers typically account for ~65% of all FLOPs. The attention
projections add ~30%. The actual attention matmuls (QKᵀ,
Score×V) are only ~5% — until sequence length gets very long.
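These shares can be checked with the FLOPs formula above. The sketch below uses LLaMA-7B-like shapes (hidden=4096, FFN dim=11008); the sequence lengths are illustrative assumptions:

```python
# Sanity-check of the FLOPs breakdown using FLOPs = 2 x M x K x N.
# Shapes are LLaMA-7B-like; sequence lengths are assumptions.

HIDDEN, FFN = 4096, 11008

def gemm_flops(m, k, n):
    return 2 * m * k * n                         # FLOPs = 2 x M x K x N

def flop_shares(seq):
    proj = 4 * gemm_flops(seq, HIDDEN, HIDDEN)   # Q, K, V, output projections
    ffn  = 3 * gemm_flops(seq, HIDDEN, FFN)      # up, gate, down projections
    attn = 2 * gemm_flops(seq, HIDDEN, seq)      # QK^T and Score x V: grows with seq^2
    total = proj + ffn + attn
    return {name: round(100 * f / total)
            for name, f in [("ffn", ffn), ("proj", proj), ("attn", attn)]}

print(flop_shares(1024))    # FFN dominates at moderate lengths
print(flop_shares(16384))   # attention matmuls take over as seq grows
```

At seq=1024 the split comes out near the ~65/30/5 quoted above; at seq=16384 the quadratic attention terms rival the FFN.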
The two numbers that matter
Every operation has two costs:
- FLOPs — how many floating-point operations it
requires (compute).
- Bytes — how much data must be moved to/from memory
(traffic).
The ratio between them is arithmetic intensity:
Arithmetic Intensity = FLOPs / Bytes (FLOP/byte)
Every GPU has a ridge point — the arithmetic intensity
where compute speed and memory bandwidth are in balance. Below the ridge,
the operation is memory-bound
(waiting for data). Above it, the operation is
compute-bound (GPU is fully
utilized).
For example, the A100 has 312 TFLOP/s compute and ~1.7 TB/s effective
bandwidth, giving a ridge point of about 180 FLOP/byte.
Any operation with arithmetic intensity below 180 wastes GPU compute
power.
Decode at batch=1 has an arithmetic intensity of about 1 FLOP/byte — the
GPU is doing almost nothing while it waits for weights to load.
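Both numbers fall out of simple division, using the A100 figures quoted above:

```python
# Roofline arithmetic with the A100 numbers quoted above
# (312 TFLOP/s FP16 compute, ~1.7 TB/s effective bandwidth).

compute   = 312e12   # FLOP/s
bandwidth = 1.7e12   # bytes/s

ridge = compute / bandwidth
print(f"ridge point ~ {ridge:.0f} FLOP/byte")

# Decode at batch=1, FP16: ~2 FLOPs per parameter, 2 bytes per parameter.
intensity = 1.0      # FLOP/byte
print(f"compute utilization ~ {100 * intensity / ridge:.2f}%")
```

An intensity of 1 against a ridge near 184 means batch-1 decode uses well under 1% of the GPU's peak compute.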
This is why every optimization technique in this workshop exists.
Two learning paths
How LLMs Work (fundamentals)
Tokens → Embed → Position → Attention → Softmax → Generation → Shapes
Making It Fast (optimization)
Model → Layer → Tiling → Flash Attention → KV Cache → Batching → Speculative Decode → Cost
How LLMs Work
Start here to build intuition for what happens inside a language model.
Making It Fast
Once you understand the model, explore the engineering tricks that make inference affordable.
Optimization 0
Model Overview
See all layers of a transformer model stacked —
click any layer to drill into operations.
Optimization 1
Layer Detail
See every operation in a transformer layer — which
are GEMMs, where the FLOPs go, and which are compute-bound vs.
memory-bound.
Optimization 2
GEMM & Tiling
Watch how tiling divides matrices into blocks,
maps them to GPU SMs, and reduces memory traffic by reusing data in
fast memory.
Optimization 3
Flash Attention
Compare standard vs Flash Attention side by side — same math, far less memory traffic.
Optimization 4
KV Cache & Memory
Explore how KV cache memory scales with sequence
length, batch size, and attention variants (MHA, GQA, MQA).
See which models fit on which GPUs.
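The scaling that page explores reduces to one product. A back-of-envelope sketch, with all shapes illustrative (LLaMA-7B-like MHA; GQA/MQA would shrink `kv_heads`):

```python
# Back-of-envelope KV-cache size. kv_heads < num_heads models GQA/MQA.
# All shapes here are illustrative assumptions, not a fixed API.

def kv_cache_bytes(layers, kv_heads, head_dim, seq, batch, bytes_per=2):
    # 2 tensors (K and V), per layer, per head, per token, per request
    return 2 * layers * kv_heads * head_dim * seq * batch * bytes_per

# LLaMA-7B-like MHA: 32 layers, 32 KV heads of dim 128, FP16
gb = kv_cache_bytes(32, 32, 128, seq=4096, batch=8) / 1e9
print(f"{gb:.1f} GB")
```

At 4096 tokens and batch 8 the cache alone exceeds the ~14 GB of FP16 weights — which is why GQA/MQA cut `kv_heads` aggressively.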
Optimization 5
Batching Simulator
Watch GPU utilization climb from ~1% at batch=1
to near 100% as you add concurrent requests. Compare static vs.
continuous batching.
Optimization 6
Speculative Decoding
A small draft model proposes tokens, a large model
verifies them in one pass. See when this speeds up generation and
when it doesn't.
Optimization 7
Inference Cost Estimator
The capstone: estimate prefill speed, decode speed,
memory requirements, and $/million tokens for any model on any
GPU.
Rules of thumb to take with you
- FLOPs per token ≈ 2 × num_parameters (forward pass).
- Weight memory ≈ num_parameters × bytes_per_param (FP16 = 2 bytes, INT8 = 1, INT4 = 0.5).
- Decode throughput ≈ memory_bandwidth / weight_memory tokens/sec (memory-bound).
- Prefill throughput ≈ GPU_TFLOPS / (2 × num_params) tokens/sec (compute-bound).
- Arithmetic intensity of decode ≈ batch_size FLOP/byte (for FP16 weights) — this is why batching matters.
Try these on the Cost Estimator page and see how close they
get.
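As a worked instance of these rules, here they are applied to a 7B-parameter FP16 model on the A100 figures used earlier (the model size and hardware numbers are the assumptions):

```python
# Rules of thumb applied to a 7B-parameter FP16 model on an A100
# (312 TFLOP/s FP16, ~1.7 TB/s effective bandwidth -- assumed figures).

params     = 7e9
bytes_per  = 2            # FP16
peak_flops = 312e12       # FLOP/s
bandwidth  = 1.7e12       # bytes/s

weight_mem  = params * bytes_per            # weight memory
decode_tps  = bandwidth / weight_mem        # memory-bound decode
prefill_tps = peak_flops / (2 * params)     # compute-bound prefill

print(f"weights: {weight_mem / 1e9:.0f} GB")
print(f"decode:  {decode_tps:,.0f} tok/s (batch=1)")
print(f"prefill: {prefill_tps:,.0f} tok/s")
```

The ~180× gap between prefill and batch-1 decode throughput is the ridge-point story again, seen from the cost side.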