Decode is memory-bound at batch size 1: to generate each token, the GPU must stream every model weight from memory, yet performs only about two FLOPs (a multiply and an add) per weight loaded.
Increasing the batch size reuses each loaded weight across more tokens, raising arithmetic intensity (FLOPs per byte moved).
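This tradeoff can be sketched with a toy roofline model. All hardware numbers below are illustrative assumptions (roughly A100-class FP16 figures), not values taken from the visualization:

```python
# Toy roofline model for transformer decode (illustrative, not measured).
# Decoding one token touches every weight once and does ~2 FLOPs per
# weight; a batch of B tokens reuses each loaded weight B times.

def arithmetic_intensity(batch, bytes_per_weight=2.0):
    """FLOPs per byte of weight traffic for one decode step."""
    return 2.0 * batch / bytes_per_weight

def ridge_batch(peak_flops, mem_bw, bytes_per_weight=2.0):
    """Batch size at which decode stops being memory-bound:
    intensity reaches the hardware ridge point peak_flops / mem_bw."""
    ridge_intensity = peak_flops / mem_bw
    return ridge_intensity * bytes_per_weight / 2.0

# Assumed A100-like numbers: 312 TFLOP/s FP16, ~2 TB/s HBM bandwidth.
PEAK, BW = 312e12, 2.0e12

print(arithmetic_intensity(1))   # 1 FLOP per byte at batch=1
print(ridge_batch(PEAK, BW))     # batch needed to reach the ridge point
```

At batch=1 the intensity is far below the ridge point, which is why the compute units sit mostly idle.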
Model
Hardware
Workload
Quantization
GPU Compute Utilization
Throughput & Latency
Arithmetic Intensity vs Batch Size
Static vs Continuous Batching
Memory Budget
Click a preset to load an interesting configuration.
Start at batch=1; the utilization gauge reads ~1%. Then drag the batch slider up to 128 and watch how utilization responds.
Switch to INT4 and watch the arithmetic intensity curve shift down (counterintuitive!)
Find the batch size that crosses the ridge point (color changes from orange to green)
Watch throughput vs latency as you increase batch — when does the tradeoff stop being worthwhile?
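The throughput/latency question in the last step can be reasoned about with a simple max(memory time, compute time) step model. The model size and hardware numbers below are assumptions for illustration (a 7B-parameter FP16 model on A100-like hardware), not the visualization's internals:

```python
# Sketch of the throughput/latency tradeoff as batch grows, using a toy
# roofline step model: the decode step takes whichever is longer, the
# time to stream the weights or the time to do the math.

N_PARAMS    = 7e9          # parameters (assumed 7B model)
BYTES_PER_W = 2.0          # FP16 weights
PEAK, BW    = 312e12, 2.0e12   # assumed A100-like peak FLOP/s and bandwidth

def decode_step(batch):
    t_mem     = N_PARAMS * BYTES_PER_W / BW      # load all weights once
    t_compute = 2.0 * batch * N_PARAMS / PEAK    # ~2 FLOPs/weight/token
    t = max(t_mem, t_compute)                    # slower side wins
    throughput = batch / t                       # tokens/s across the batch
    latency    = t                               # s per generated token
    return throughput, latency

for b in (1, 32, 128, 512):
    tp, lat = decode_step(b)
    print(f"batch={b:4d}  {tp:9.0f} tok/s  {lat * 1e3:6.2f} ms/token")
```

Below the ridge point, per-token latency stays flat while throughput grows almost linearly, so batching is nearly free; past it, extra batch buys throughput only by lengthening every user's latency.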
Switching from FP16 to INT4 makes the arithmetic intensity curve shift down — why does cheaper memory make it harder to saturate compute?
At batch=1, the A100 utilization gauge shows ~1%. What does this mean for the economics of serving a single user?
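The economics question above comes down to cost per token. The GPU price below is a hypothetical assumption ($2/hour for an A100-class card), and the throughput figures come from the same toy roofline model, not from measurements:

```python
# Rough serving-cost arithmetic for the batch=1 question. All inputs are
# assumptions: a hypothetical $2/hour GPU and toy-model throughputs.

GPU_PRICE_PER_HOUR  = 2.00     # USD/hour, assumed
TOKENS_PER_SEC_B1   = 143      # ~7 ms/token at batch=1 (toy model)
TOKENS_PER_SEC_B128 = 18286    # same hardware at batch=128 (toy model)

def cost_per_million_tokens(tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_PRICE_PER_HOUR / tokens_per_hour * 1e6

print(f"batch=1:   ${cost_per_million_tokens(TOKENS_PER_SEC_B1):.2f} per 1M tokens")
print(f"batch=128: ${cost_per_million_tokens(TOKENS_PER_SEC_B128):.4f} per 1M tokens")
```

Under these assumptions, serving one user at a time costs on the order of 100x more per token than serving a full batch on the same GPU, which is why providers batch requests from many users.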