Under the Hood: LLM Inference Costs
How language models work under the hood — from tokens to tensors to inference costs.
What happens when you send a prompt?
When you type a prompt into ChatGPT, Claude, or any LLM, the same
sequence of events unfolds:
1. Tokenize your input into a sequence of integer token IDs.
2. Prefill — feed all input tokens through the model's
transformer layers in parallel. This produces the internal
representation and the first output token.
3. Decode — generate output tokens one at a time, each
requiring a full pass through every layer.
4. Repeat step 3 until the model produces a stop token or hits the
length limit.
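The four steps above can be sketched as a toy loop. The `ToyModel` below is an invented stub (its "prefill" and "decode" just manipulate integers) — it only illustrates the control flow, not a real inference API:

```python
# Toy illustration of tokenize -> prefill -> decode -> stop.
# ToyModel is a stand-in: it returns made-up token ids, not real logits.

class ToyModel:
    def prefill(self, ids):
        # Process all prompt tokens "in parallel"; return state + first token.
        return list(ids), (sum(ids) % 5) + 1

    def decode(self, state, last_id):
        # One full pass through the model per generated token.
        state.append(last_id)
        return state, (last_id + 1) % 5   # 0 acts as the stop token here

def generate(model, prompt_ids, max_new=10, stop_id=0):
    state, tok = model.prefill(prompt_ids)      # prefill phase
    out = [tok]
    while tok != stop_id and len(out) < max_new:
        state, tok = model.decode(state, tok)   # decode phase, one token/step
        out.append(tok)
    return out

print(generate(ToyModel(), [3, 1, 4]))
```

The structure is the point: one parallel prefill call, then a serial loop where each iteration produces exactly one token.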
Prefill phase
Processes all input tokens at once.
Compute-bound — the GPU's arithmetic units are the
bottleneck. Lots of data to process, but it's all available up
front.
Decode phase
Generates one token per step.
Memory-bound — the GPU must load all model weights
from memory for each token, but only does a small amount of
compute.
The central tension: the GPU has enormous compute power, but getting data
to the compute units is the bottleneck. Most of LLM inference engineering
is about managing this tension.
Matrix multiplication is everything
Inside each transformer layer, almost all the work is matrix
multiplications (GEMMs). There are two main groups:
Attention
- Q, K, V projections — three weight matrices multiply
the input to produce queries, keys, and values.
- QKᵀ — queries times keys (transposed) to
compute attention scores. This is the one that scales with sequence
length squared.
- Score × V — attention-weighted sum of values.
- Output projection — one more weight matrix.
Feed-Forward Network (FFN)
- Up projection — expands from hidden size to ~4×
hidden size.
- Gate projection — (in models like LLaMA) a parallel
expansion for gated activation.
- Down projection — contracts back to hidden size.
For an M×K matrix times a K×N matrix, the compute cost is:
FLOPs = 2 × M × K × N
A model like LLaMA 7B has hidden=4096 and
FFN dim=11008. That single FFN up projection is a 4096×11008
weight matrix — roughly 45 million multiply-accumulates per token,
billions of operations per layer over a full sequence, 32 layers deep.
The FFN layers typically account for ~65% of all FLOPs. The attention
projections add ~30%. The actual attention matmuls (QKᵀ,
Score×V) are only ~5% — until sequence length gets very long.
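These shares can be checked with the FLOPs formula above. The sketch below uses LLaMA-7B-like shapes (hidden=4096, FFN dim=11008); the sequence lengths are illustrative assumptions:

```python
# Sanity-check of the FLOPs breakdown using FLOPs = 2 x M x K x N.
# Shapes are LLaMA-7B-like; sequence lengths are assumptions.

HIDDEN, FFN = 4096, 11008

def gemm_flops(m, k, n):
    return 2 * m * k * n                         # FLOPs = 2 x M x K x N

def flop_shares(seq):
    proj = 4 * gemm_flops(seq, HIDDEN, HIDDEN)   # Q, K, V, output projections
    ffn  = 3 * gemm_flops(seq, HIDDEN, FFN)      # up, gate, down projections
    attn = 2 * gemm_flops(seq, HIDDEN, seq)      # QK^T and Score x V: grows with seq^2
    total = proj + ffn + attn
    return {name: round(100 * f / total)
            for name, f in [("ffn", ffn), ("proj", proj), ("attn", attn)]}

print(flop_shares(1024))    # FFN dominates at moderate lengths
print(flop_shares(16384))   # attention matmuls take over as seq grows
```

At seq=1024 the split comes out near the ~65/30/5 quoted above; at seq=16384 the quadratic attention terms rival the FFN.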
The two numbers that matter
Every operation has two costs:
- FLOPs — how many floating-point operations it
requires (compute).
- Bytes — how much data must be moved to/from memory
(traffic).
The ratio between them is arithmetic intensity:
Arithmetic Intensity = FLOPs / Bytes (FLOP/byte)
Every GPU has a ridge point — the arithmetic intensity
where compute speed and memory bandwidth are in balance. Below the ridge,
the operation is memory-bound
(waiting for data). Above it, the operation is
compute-bound (GPU is fully
utilized).
For example, the A100 has 312 TFLOP/s compute and ~1.7 TB/s effective
bandwidth, giving a ridge point of about 180 FLOP/byte.
Any operation with arithmetic intensity below 180 wastes GPU compute
power.
Decode at batch=1 has an arithmetic intensity of about 1 FLOP/byte — the
GPU is doing almost nothing while it waits for weights to load.
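Both numbers fall out of simple division, using the A100 figures quoted above:

```python
# Roofline arithmetic with the A100 numbers quoted above
# (312 TFLOP/s FP16 compute, ~1.7 TB/s effective bandwidth).

compute   = 312e12   # FLOP/s
bandwidth = 1.7e12   # bytes/s

ridge = compute / bandwidth
print(f"ridge point ~ {ridge:.0f} FLOP/byte")

# Decode at batch=1, FP16: ~2 FLOPs per parameter, 2 bytes per parameter.
intensity = 1.0      # FLOP/byte
print(f"compute utilization ~ {100 * intensity / ridge:.2f}%")
```

An intensity of 1 against a ridge near 184 means batch-1 decode uses well under 1% of the GPU's peak compute.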
This is why every optimization technique in this workshop exists.
Two learning paths
How LLMs Work (fundamentals)
Tokens → Embed → Position → Attention → Softmax → Generation → Shapes
Making It Fast (optimization)
Model → Layer → Tiling → Flash Attention → KV Cache → Batching → Speculative Decode → Cost
How LLMs Work
Start here to build intuition for what happens inside a language model.
Making It Fast
Once you understand the model, explore the engineering tricks that make inference affordable.
Optimization 0
Model Overview
See all layers of a transformer model stacked —
click any layer to drill into operations.
Optimization 1
Layer Detail
See every operation in a transformer layer — which
are GEMMs, where the FLOPs go, and which are compute-bound vs.
memory-bound.
Optimization 2
GEMM & Tiling
Watch how tiling divides matrices into blocks,
maps them to GPU SMs, and reduces memory traffic by reusing data in
fast memory.
Optimization 3
Flash Attention
Compare standard vs Flash Attention side by side — same math, far less memory traffic.
Optimization 4
KV Cache & Memory
Explore how KV cache memory scales with sequence
length, batch size, and attention variants (MHA, GQA, MQA).
See which models fit on which GPUs.
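The scaling that page explores reduces to one product. A back-of-envelope sketch, with all shapes illustrative (LLaMA-7B-like MHA; GQA/MQA would shrink `kv_heads`):

```python
# Back-of-envelope KV-cache size. kv_heads < num_heads models GQA/MQA.
# All shapes here are illustrative assumptions, not a fixed API.

def kv_cache_bytes(layers, kv_heads, head_dim, seq, batch, bytes_per=2):
    # 2 tensors (K and V), per layer, per head, per token, per request
    return 2 * layers * kv_heads * head_dim * seq * batch * bytes_per

# LLaMA-7B-like MHA: 32 layers, 32 KV heads of dim 128, FP16
gb = kv_cache_bytes(32, 32, 128, seq=4096, batch=8) / 1e9
print(f"{gb:.1f} GB")
```

At 4096 tokens and batch 8 the cache alone exceeds the ~14 GB of FP16 weights — which is why GQA/MQA cut `kv_heads` aggressively.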
Optimization 5
Batching Simulator
Watch GPU utilization climb from ~1% at batch=1
to near 100% as you add concurrent requests. Compare static vs.
continuous batching.
Optimization 6
Speculative Decoding
A small draft model proposes tokens, a large model
verifies them in one pass. See when this speeds up generation and
when it doesn't.
Optimization 7
Inference Cost Estimator
The capstone: estimate prefill speed, decode speed,
memory requirements, and $/million tokens for any model on any
GPU.
Rules of thumb to take with you
- FLOPs per token ≈ 2 × num_parameters (forward pass).
- Weight memory ≈ num_parameters × bytes_per_param (FP16 = 2 bytes, INT8 = 1, INT4 = 0.5).
- Decode throughput ≈ memory_bandwidth / weight_memory tokens/sec (memory-bound).
- Prefill throughput ≈ GPU_TFLOPS / (2 × num_params) tokens/sec (compute-bound).
- Arithmetic intensity of decode ≈ batch_size FLOP/byte (for FP16 weights) — this is why batching matters.
Try these on the Cost Estimator page and see how close they
get.
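As a worked instance of these rules, here they are applied to a 7B-parameter FP16 model on the A100 figures used earlier (the model size and hardware numbers are the assumptions):

```python
# Rules of thumb applied to a 7B-parameter FP16 model on an A100
# (312 TFLOP/s FP16, ~1.7 TB/s effective bandwidth -- assumed figures).

params     = 7e9
bytes_per  = 2            # FP16
peak_flops = 312e12       # FLOP/s
bandwidth  = 1.7e12       # bytes/s

weight_mem  = params * bytes_per            # weight memory
decode_tps  = bandwidth / weight_mem        # memory-bound decode
prefill_tps = peak_flops / (2 * params)     # compute-bound prefill

print(f"weights: {weight_mem / 1e9:.0f} GB")
print(f"decode:  {decode_tps:,.0f} tok/s (batch=1)")
print(f"prefill: {prefill_tps:,.0f} tok/s")
```

The ~180× gap between prefill and batch-1 decode throughput is the ridge-point story again, seen from the cost side.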