During generation, each new token attends to the Key and Value matrices of all prior tokens. Storing them in a KV cache avoids recomputing that history at every step, but the cache grows linearly with sequence length and often becomes the memory bottleneck.
GPU Memory Budget
KV Cache Formula
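A minimal sketch of the usual KV cache size estimate, assuming a standard decoder-only transformer that stores one Key and one Value vector per layer, per KV head, per token. The names `KVConfig` and `kv_cache_bytes` are hypothetical, and the model dimensions (32 layers, head dimension 128, FP16) are illustrative assumptions, not values from this page. The three attention variants differ only in the number of KV heads.

```python
from dataclasses import dataclass


@dataclass
class KVConfig:
    """Hypothetical container for the dimensions that set KV cache size."""
    n_layers: int
    n_kv_heads: int          # equals the query head count for MHA, 1 for MQA
    head_dim: int
    bytes_per_elem: int = 2  # assumption: FP16/BF16 storage


def kv_cache_bytes(cfg: KVConfig, seq_len: int, batch_size: int = 1) -> int:
    """Bytes needed to cache K and V for every layer, KV head, and token."""
    # The leading 2 accounts for storing both Key and Value.
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_elem
    return per_token * seq_len * batch_size


# Illustrative model: 32 layers, 32 query heads, head_dim 128, FP16.
variants = {
    "MHA (32 KV heads)": KVConfig(n_layers=32, n_kv_heads=32, head_dim=128),
    "GQA (8 KV heads)": KVConfig(n_layers=32, n_kv_heads=8, head_dim=128),
    "MQA (1 KV head)": KVConfig(n_layers=32, n_kv_heads=1, head_dim=128),
}

for name, cfg in variants.items():
    gib = kv_cache_bytes(cfg, seq_len=4096) / 2**30
    print(f"{name}: {gib:.4f} GiB at 4096 tokens, batch size 1")
```

Under these assumptions the MHA configuration needs 0.5 MiB per token, so a single 4096-token sequence occupies 2 GiB. Cutting the KV heads to 8 or to 1 reduces that in direct proportion.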
Attention Variant Comparison
Does It Fit?
Table columns: GPU, VRAM, Weights, KV Cache, Total, Fits?, GPUs Needed
KV Cache vs Sequence Length