How it works

Making it fast

Inference Cost Estimator

Model

Hardware

Number of GPUs 1

Workload

Batch Size 1 Input Length (tokens) 1024 Output Length (tokens) 256

Quantization

Estimator

Memory Budget

Rules of Thumb

Throughput Estimates

Per-Layer Op Breakdown (Decode, 1 token)

GPU Comparison

GPU	VRAM	Fits?	GPUs	Decode tok/s	Prefill tok/s	Total Time	$/hr	$/1M tokens

Worked Example

Click a preset to load an interesting configuration.

Load 70B FP16 — note decode throughput and $/1M tokens. Now switch quantization to INT8 — it should roughly double
Walk the estimator ladder: start at #1 and click through to #6 to see each refinement
Set production config (batch=32, input=1K, output=64) and check $/1M tokens
Compare prefill vs decode time: set input=4096, output=64, then flip them

Why is prefill compute-bound but decode memory-bound, when both process tokens through the same layers?
If you double the output length, does the cost per million tokens double? Why or why not?

Back to the overview → Home

Or start from the fundamentals → Tokenizer