KV Cache & Memory

During generation, each new token attends to the Key and Value tensors already computed for all prior tokens. Storing these tensors in a KV cache avoids recomputing them at every decoding step, but the cache grows linearly with sequence length and often becomes the memory bottleneck.
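The caching behavior described above can be illustrated with a minimal sketch. It assumes a single attention head and uses identity projections for Q, K, and V (a toy stand-in for learned weight matrices); the function names `attend` and `decode` are hypothetical and not taken from any library.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode(token_vectors):
    """Process tokens one at a time, appending each token's K and V to the cache."""
    k_cache, v_cache = [], []
    outputs, cache_sizes = [], []
    for x in token_vectors:
        # Assumption: identity projections instead of learned W_q, W_k, W_v.
        q, k, v = x, x, x
        # Only the new token's K and V are computed; earlier ones are reused.
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
        # Number of cached vectors (keys plus values) after this step.
        cache_sizes.append(len(k_cache) + len(v_cache))
    return outputs, cache_sizes

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, cache_sizes = decode(tokens)
print(cache_sizes)  # [2, 4, 6]
```

Each step adds exactly one key and one value vector, so the cache size grows linearly with the number of tokens processed.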

Model

Hardware

Workload

Attention Variant

Precision

Summary

Weights
KV Cache
Total
GPU Memory

GPU Memory Budget

KV Cache Formula
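
As a hedged reference, the commonly used estimate for KV cache size is given below. The symbol names are assumptions chosen for this sketch: B is the batch size, S the sequence length, L the number of layers, H_kv the number of key/value heads, d_head the per-head dimension, and b the bytes per element set by the precision. The leading factor of 2 accounts for storing both keys and values.

```latex
\[
M_{\mathrm{KV}} = 2 \cdot B \cdot S \cdot L \cdot H_{\mathrm{kv}} \cdot d_{\mathrm{head}} \cdot b
\]
% Worked example with hypothetical values:
% B = 1, S = 4096, L = 32, H_kv = 32, d_head = 128, b = 2 (16-bit precision)
\[
M_{\mathrm{KV}} = 2 \cdot 1 \cdot 4096 \cdot 32 \cdot 32 \cdot 128 \cdot 2
= 2{,}147{,}483{,}648 \ \text{bytes} = 2\ \text{GiB}
\]
```

The attention variant enters only through H_kv: multi-head attention keeps one key/value head per query head, grouped-query attention shares each key/value head across a group of query heads, and multi-query attention uses a single key/value head.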

Attention Variant Comparison

Does It Fit?

GPU | VRAM | Weights | KV Cache | Total | Fits? | GPUs Needed

KV Cache vs Sequence Length
