Under the Hood: LLM Inference Costs

How language models work under the hood — from tokens to tensors to inference costs.

What happens when you send a prompt?

When you type a prompt into ChatGPT, Claude, or any LLM, the same sequence of events fires:

  1. Tokenize your input into a sequence of integer token IDs.
  2. Prefill — feed all input tokens through the model's transformer layers in parallel. This produces the internal representation and the first output token.
  3. Decode — generate output tokens one at a time, each requiring a full pass through every layer.
  4. Repeat step 3 until the model produces a stop token or hits the length limit.
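The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not a real runtime: `model_forward` is a hypothetical stand-in for the model's forward pass, which in practice would also manage a KV cache.

```python
def generate(prompt_ids, model_forward, max_new_tokens, stop_id):
    """Sketch of the prefill/decode loop (illustrative, not a real runtime)."""
    # Prefill: one parallel pass over all prompt tokens; yields the
    # first output token and the model's internal state (e.g. KV cache).
    next_id, state = model_forward(prompt_ids, state=None)
    output = [next_id]

    # Decode: one full pass through every layer per generated token,
    # until the stop token appears or the length limit is hit.
    while len(output) < max_new_tokens and next_id != stop_id:
        next_id, state = model_forward([next_id], state=state)
        output.append(next_id)
    return output
```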

Prefill phase

Processes all input tokens at once.

Compute-bound — the GPU's arithmetic units are the bottleneck. Lots of data to process, but it's all available up front.

Decode phase

Generates one token per step.

Memory-bound — the GPU must load all model weights from memory for each token, but only does a small amount of compute.

The central tension: the GPU has enormous compute power, but getting data to the compute units is the bottleneck. Most of LLM inference engineering is about managing this tension.

Matrix multiplication is everything

Inside each transformer layer, almost all the work is matrix multiplications (GEMMs). There are two main groups:

  - Attention — the Q, K, V, and output projections, plus the attention matmuls themselves.
  - Feed-Forward Network (FFN) — the up and down projections.

For an M×K matrix times a K×N matrix, the compute cost is:

FLOPs = 2 × M × K × N
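A quick sanity check of this formula, using the LLaMA 7B dimensions discussed below (a sketch of the per-matmul cost only, ignoring biases and everything else in the layer):

```python
def gemm_flops(m, k, n):
    # M×K @ K×N: one multiply and one add per (m, k, n) triple.
    return 2 * m * k * n

# LLaMA-7B FFN up projection (hidden=4096, ffn=11008):
one_token = gemm_flops(1, 4096, 11008)      # ≈ 9.0e7 FLOPs per token
prompt_512 = gemm_flops(512, 4096, 11008)   # ≈ 4.6e10 FLOPs for a 512-token prompt
```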

A model like LLaMA 7B has hidden=4096 and FFN dim=11008. That single FFN up projection is a 4096×11008 weight matrix — roughly 45 million multiply-accumulates (90 million FLOPs) per token, billions of operations per layer once a full prompt flows through, 32 layers deep.

The FFN layers typically account for ~65% of all FLOPs. The attention projections add ~30%. The actual attention matmuls (QK^T, Score×V) are only ~5% — until sequence length gets very long.
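Those shares can be reproduced from the shapes alone. A sketch for a LLaMA-7B-style layer, assuming a SwiGLU FFN with gate/up/down projections and a 512-token sequence (exact percentages shift with sequence length):

```python
h, ffn, seq = 4096, 11008, 512

proj  = 4 * 2 * seq * h * h      # Q, K, V, O projections (h×h each)
ffn_f = 3 * 2 * seq * h * ffn    # gate, up, down projections (h×ffn each)
attn  = 2 * 2 * seq * seq * h    # the attention matmuls: QK^T and Score×V

total = proj + ffn_f + attn
shares = {name: flops / total
          for name, flops in
          {"ffn": ffn_f, "projections": proj, "attention": attn}.items()}
# At seq=512: ffn ≈ 65%, projections ≈ 32%, attention ≈ 2%.
# The attention share grows with seq, since it scales as seq² rather than seq.
```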

The two numbers that matter

Every operation has two costs: the compute it performs (FLOPs) and the data it moves to and from memory (bytes).

The ratio between them is arithmetic intensity:

Arithmetic Intensity = FLOPs / Bytes   (FLOP/byte)

Every GPU has a ridge point — the arithmetic intensity where compute speed and memory bandwidth are in balance. Below the ridge, the operation is memory-bound (waiting for data). Above it, the operation is compute-bound (GPU is fully utilized).

For example, the A100 has 312 TFLOP/s compute and ~1.7 TB/s effective bandwidth, giving a ridge point of about 180 FLOP/byte. Any operation with arithmetic intensity below 180 wastes GPU compute power.

Decode at batch=1 has an arithmetic intensity of about 1 FLOP/byte — the GPU is doing almost nothing while it waits for weights to load. This is why every optimization technique in this workshop exists.
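Plugging in the A100 numbers from above makes the gap concrete. The 1.7 TB/s effective-bandwidth figure is the same assumption used earlier, and FP16 weights are assumed for decode:

```python
compute   = 312e12   # A100 FP16 tensor-core peak, FLOP/s
bandwidth = 1.7e12   # effective memory bandwidth, bytes/s

ridge = compute / bandwidth   # ≈ 183 FLOP/byte

# Decode at batch=1: each 2-byte FP16 weight is loaded once and used for
# 2 FLOPs (one multiply, one add), so intensity ≈ 1 FLOP/byte.
decode_intensity = 2 / 2
utilization = decode_intensity / ridge   # well under 1% of peak compute
```

Batching raises arithmetic intensity toward the ridge: each loaded weight is reused across every sequence in the batch, multiplying FLOPs per byte by the batch size.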

Two learning paths

How LLMs Work (fundamentals)

Start here to build intuition for what happens inside a language model:
Tokens → Embed → Position → Attention → Softmax → Generation → Shapes

Making It Fast (optimization)

Once you understand the model, explore the engineering tricks that make inference affordable:
Model → Layer → Tiling → Flash Attention → KV Cache → Batching → Speculative Decode → Cost

Rules of thumb to take with you

Try these on the Cost Estimator page and see how close they get.
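As a sketch of the kind of estimate such a page performs, two standard approximations go a long way: memory-bound decode costs roughly (weight bytes ÷ bandwidth) seconds per token, and compute-bound prefill costs roughly 2 × parameters × tokens FLOPs. The model size, precision, and hardware figures below are illustrative assumptions:

```python
def decode_seconds_per_token(params, bytes_per_param, bandwidth_Bps):
    # Memory-bound decode: every weight byte must be read once per token.
    return params * bytes_per_param / bandwidth_Bps

def prefill_flops(params, prompt_tokens):
    # Compute-bound prefill: ~2 FLOPs per parameter per token.
    return 2 * params * prompt_tokens

# Assumed setup: LLaMA-7B in FP16 on an A100-class GPU (~1.7 TB/s effective):
t = decode_seconds_per_token(7e9, 2, 1.7e12)  # ≈ 0.0082 s/token → ~120 tokens/s
f = prefill_flops(7e9, 512)                   # ≈ 7.2e12 FLOPs for a 512-token prompt
```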