KV Cache & Memory

During generation, each new token attends over the key and value vectors of all prior tokens. The KV cache stores those vectors so they are not recomputed at every step, but it grows linearly with sequence length and batch size, and it often becomes the memory bottleneck.
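The caching idea can be illustrated with a toy single-head decoder. This is a minimal sketch under stated assumptions: `project` is a hypothetical stand-in for the learned key and value projections, and each token vector doubles as its own query. It only shows that appending one key/value pair per step gives the same outputs as recomputing the whole prefix.

```python
import math

def attend(q, keys, values):
    """Single-head softmax attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def project(x, scale):
    """Toy stand-in for a learned K or V projection (assumption: plain scaling)."""
    return [scale * xi for xi in x]

def decode_with_cache(tokens):
    """Compute each token's K and V once, append to the cache, and reuse them."""
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        outputs.append(attend(x, k_cache, v_cache))
    return outputs, k_cache

def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every step."""
    outputs = []
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached, k_cache = decode_with_cache(tokens)
recomputed = decode_without_cache(tokens)
print(cached == recomputed, len(k_cache))
```

The cache holds one key and one value entry per processed token, which is why its size scales with sequence length.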

Model

Hardware

Workload

Batch Size: 1, Sequence Length: 1024

Attention Variant

Precision

Weight Quantization

KV Cache Precision

Summary: Weights, KV Cache, Total, GPU Memory

GPU Memory Budget

KV Cache Formula
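A commonly used form of the formula is given below. It is an assumption about what the calculator computes, not its exact expression, and the symbol names are chosen here for illustration.

```latex
\[
M_{\text{KV}} = 2 \cdot L \cdot H_{\text{kv}} \cdot d_{\text{head}} \cdot S \cdot B \cdot b
\]
```

Here $L$ is the number of layers, $H_{\text{kv}}$ the number of key/value heads, $d_{\text{head}}$ the per-head dimension, $S$ the sequence length, $B$ the batch size, and $b$ the bytes per cached element (2 for FP16, 1 for INT8). The leading factor of 2 accounts for storing both keys and values.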

Attention Variant Comparison
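The variants differ only in how many key/value heads are cached. The sketch below assumes a hypothetical 32-layer model with 32 query heads, head dimension 128, an FP16 cache, and a 4096-token sequence. The GQA setting of 8 KV heads is an illustrative choice, and `kv_cache_bytes` is a helper defined here, not part of the page.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

N_QUERY_HEADS = 32

# MHA caches one KV head per query head, GQA shares each KV head across a
# group of query heads, and MQA shares a single KV head across all of them.
variants = {"MHA": N_QUERY_HEADS, "GQA": 8, "MQA": 1}

sizes_mib = {
    name: kv_cache_bytes(32, kv_heads, 128, 4096, 1) / 2**20
    for name, kv_heads in variants.items()
}
for name, mib in sizes_mib.items():
    print(f"{name}: {mib:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, while the weight memory is almost unchanged.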

Does It Fit?

GPU	VRAM	Weights	KV Cache	Total	Fits?	GPUs Needed
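A rough version of the fit check can be sketched as follows. The assumptions are loud ones: decimal gigabytes, a nominal 70B-parameter model with a LLaMA-2-70B-style shape (80 layers, 8 KV heads, head dimension 128), a naive `ceil(total / vram)` GPU count, and no allowance for activations or framework overhead. The GPU list is illustrative and `fit_report` is a hypothetical helper.

```python
import math

GB = 10**9  # decimal gigabytes (assumption; the calculator may use GiB)

def weight_bytes(n_params, bits_per_param):
    """Memory for the weights at a given precision."""
    return n_params * bits_per_param // 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

def fit_report(total_bytes, gpus_gb):
    """Map each GPU name to (fits on one GPU, naive number of GPUs needed)."""
    return {
        name: (total_bytes <= vram * GB, math.ceil(total_bytes / (vram * GB)))
        for name, vram in gpus_gb.items()
    }

gpus = {"RTX 4090": 24, "A100 80GB": 80, "H100 80GB": 80}
kv = kv_cache_bytes(80, 8, 128, seq_len=1024, batch_size=1)  # FP16 cache

fp16 = fit_report(weight_bytes(70 * 10**9, 16) + kv, gpus)
int4 = fit_report(weight_bytes(70 * 10**9, 4) + kv, gpus)
print("FP16:", fp16)
print("INT4:", int4)
```

With these numbers the FP16 model needs two 80 GB GPUs, while the INT4 model fits on one. Real deployments need extra headroom beyond this estimate.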

KV Cache vs Sequence Length

Click a preset to load an interesting configuration.

Load 70B FP16 and check which GPUs can fit it. Then switch to INT4.
Toggle between MHA → GQA → MQA and watch the KV cache size in the formula box change.
Set batch to 64 with LLaMA 7B and watch the memory bar overflow. What's the max batch that fits?
Switch KV cache precision from FP16 to INT8. Is it more or less impactful than weight quantization?
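The batch-size exercise can be approximated by hand. The sketch below assumes a 24 GB GPU (the page does not say which hardware is selected), a nominal 7B-parameter model in FP16 with a LLaMA-2-7B-style shape (32 layers, 32 KV heads, head dimension 128), and 1024 tokens per sequence. `max_batch` is a hypothetical helper that ignores activations and other overhead.

```python
GB = 10**9  # decimal gigabytes (assumption)

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

def max_batch(vram_bytes, weights_bytes, kv_bytes_per_seq):
    """Largest batch whose KV cache fits in the memory left after the weights."""
    free = vram_bytes - weights_bytes
    return max(0, free // kv_bytes_per_seq)

vram = 24 * GB
weights_fp16 = 7 * 10**9 * 2  # 2 bytes per parameter

per_seq_fp16 = kv_cache_bytes(32, 32, 128, 1024, 1, bytes_per_elem=2)
per_seq_int8 = kv_cache_bytes(32, 32, 128, 1024, 1, bytes_per_elem=1)

batch_fp16 = max_batch(vram, weights_fp16, per_seq_fp16)
batch_int8 = max_batch(vram, weights_fp16, per_seq_int8)
overflows_at_64 = weights_fp16 + 64 * per_seq_fp16 > vram

print(batch_fp16, batch_int8, overflows_at_64)
# 18 37 True
```

Under these assumptions, halving the cache precision roughly doubles the batch that fits, because the cache is the only term that scales with batch size.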

For 100 concurrent users with 4K context, is the bottleneck weight memory or KV cache memory? Does the answer change at 8K?
Which factors in the KV cache formula are set by the model architect vs controllable at inference time?
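One way to explore the first question is to compare the two terms directly. The sketch below uses two assumed FP16 configurations (a 7B MHA model and a 70B GQA model with LLaMA-2-style shapes) and assumes every user fills the full context. It is one possible scenario, so the answer for other models or precisions may differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

def bottleneck(weights_bytes, kv_bytes_total):
    """Name the larger of the two memory consumers."""
    return "KV cache" if kv_bytes_total > weights_bytes else "weights"

# Assumed model shapes; parameter counts are nominal.
MODELS = {
    "7B, MHA": dict(n_params=7 * 10**9, n_layers=32, n_kv_heads=32, head_dim=128),
    "70B, GQA": dict(n_params=70 * 10**9, n_layers=80, n_kv_heads=8, head_dim=128),
}
USERS = 100

results = {}
for name, m in MODELS.items():
    weights = m["n_params"] * 2  # FP16: 2 bytes per parameter
    for ctx in (4096, 8192):
        kv = kv_cache_bytes(m["n_layers"], m["n_kv_heads"], m["head_dim"], ctx, USERS)
        results[(name, ctx)] = bottleneck(weights, kv)
        print(f"{name} @ {ctx}: weights {weights / 1e9:.0f} GB, "
              f"KV {kv / 1e9:.1f} GB -> {results[(name, ctx)]}")
```

In this scenario the 7B MHA model is dominated by the KV cache at both context lengths, while the 70B GQA model flips from weights at 4K to KV cache at 8K.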

Batching reuses weights across requests to fill the GPU → Batching Simulator