Decode is memory-bound at batch size 1: to generate each token, the GPU must stream every model weight from memory, yet performs only about two FLOPs (a multiply and an add) per weight loaded.
Increasing the batch size reuses each loaded weight across more tokens, raising arithmetic intensity (FLOPs per byte moved).
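This tradeoff can be sketched with a toy roofline model. All hardware numbers below are illustrative assumptions (roughly A100-class FP16 figures), not values taken from the visualization:

```python
# Toy roofline model for transformer decode (illustrative, not measured).
# Decoding one token touches every weight once and does ~2 FLOPs per
# weight; a batch of B tokens reuses each loaded weight B times.

def arithmetic_intensity(batch, bytes_per_weight=2.0):
    """FLOPs per byte of weight traffic for one decode step."""
    return 2.0 * batch / bytes_per_weight

def ridge_batch(peak_flops, mem_bw, bytes_per_weight=2.0):
    """Batch size at which decode stops being memory-bound:
    intensity reaches the hardware ridge point peak_flops / mem_bw."""
    ridge_intensity = peak_flops / mem_bw
    return ridge_intensity * bytes_per_weight / 2.0

# Assumed A100-like numbers: 312 TFLOP/s FP16, ~2 TB/s HBM bandwidth.
PEAK, BW = 312e12, 2.0e12

print(arithmetic_intensity(1))   # 1 FLOP per byte at batch=1
print(ridge_batch(PEAK, BW))     # batch needed to reach the ridge point
```

At batch=1 the intensity is far below the ridge point, which is why the compute units sit mostly idle.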
Model
Hardware
Workload
Quantization
GPU Compute Utilization
Throughput & Latency
Arithmetic Intensity vs Batch Size
Static vs Continuous Batching
Memory Budget
Click a preset to load an interesting configuration.
Start at batch=1; the utilization gauge reads ~1%. Then drag the batch slider up to 128 and watch how utilization responds.
Switch to INT4 and watch the arithmetic intensity curve shift down (counterintuitive!)
Find the batch size that crosses the ridge point (color changes from orange to green)
Watch throughput vs latency as you increase batch — when does the tradeoff stop being worthwhile?
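The throughput/latency question in the last step can be reasoned about with a simple max(memory time, compute time) step model. The model size and hardware numbers below are assumptions for illustration (a 7B-parameter FP16 model on A100-like hardware), not the visualization's internals:

```python
# Sketch of the throughput/latency tradeoff as batch grows, using a toy
# roofline step model: the decode step takes whichever is longer, the
# time to stream the weights or the time to do the math.

N_PARAMS    = 7e9          # parameters (assumed 7B model)
BYTES_PER_W = 2.0          # FP16 weights
PEAK, BW    = 312e12, 2.0e12   # assumed A100-like peak FLOP/s and bandwidth

def decode_step(batch):
    t_mem     = N_PARAMS * BYTES_PER_W / BW      # load all weights once
    t_compute = 2.0 * batch * N_PARAMS / PEAK    # ~2 FLOPs/weight/token
    t = max(t_mem, t_compute)                    # slower side wins
    throughput = batch / t                       # tokens/s across the batch
    latency    = t                               # s per generated token
    return throughput, latency

for b in (1, 32, 128, 512):
    tp, lat = decode_step(b)
    print(f"batch={b:4d}  {tp:9.0f} tok/s  {lat * 1e3:6.2f} ms/token")
```

Below the ridge point, per-token latency stays flat while throughput grows almost linearly, so batching is nearly free; past it, extra batch buys throughput only by lengthening every user's latency.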
Switching from FP16 to INT4 makes the arithmetic intensity curve shift down — why does cheaper memory make it harder to saturate compute?
At batch=1, the A100 utilization gauge shows ~1%. What does this mean for the economics of serving a single user?
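The economics question above comes down to cost per token. The GPU price below is a hypothetical assumption ($2/hour for an A100-class card), and the throughput figures come from the same toy roofline model, not from measurements:

```python
# Rough serving-cost arithmetic for the batch=1 question. All inputs are
# assumptions: a hypothetical $2/hour GPU and toy-model throughputs.

GPU_PRICE_PER_HOUR  = 2.00     # USD/hour, assumed
TOKENS_PER_SEC_B1   = 143      # ~7 ms/token at batch=1 (toy model)
TOKENS_PER_SEC_B128 = 18286    # same hardware at batch=128 (toy model)

def cost_per_million_tokens(tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_PRICE_PER_HOUR / tokens_per_hour * 1e6

print(f"batch=1:   ${cost_per_million_tokens(TOKENS_PER_SEC_B1):.2f} per 1M tokens")
print(f"batch=128: ${cost_per_million_tokens(TOKENS_PER_SEC_B128):.4f} per 1M tokens")
```

Under these assumptions, serving one user at a time costs on the order of 100x more per token than serving a full batch on the same GPU, which is why providers batch requests from many users.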