Speculative Decoding

A small draft model quickly proposes K tokens. The large target model then verifies all K in a single forward pass. If every proposal is accepted, you get K tokens for roughly the cost of one large-model step, plus the much cheaper draft steps.
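The draft/verify cycle can be sketched as a small simulation. This is a minimal sketch, not the page's own implementation. It assumes each draft token is accepted independently with probability α and that drafting stops counting at the first rejection. It also assumes the standard formulation in which the target's verification pass always contributes one token of its own (a replacement at the first rejection, or a bonus token when all K are accepted), which the summary above does not mention. The function names are hypothetical.

```python
import random


def expected_tokens_per_cycle(k: int, alpha: float) -> float:
    """Expected tokens emitted per cycle: 1 + alpha + alpha**2 + ... + alpha**k.

    Assumption: every draft token is accepted independently with probability alpha.
    """
    return sum(alpha ** i for i in range(k + 1))


def simulate_cycle(k: int, alpha: float, rng: random.Random) -> int:
    """Simulate one draft/verify cycle and return the number of tokens emitted."""
    accepted = 0
    for _ in range(k):
        if rng.random() < alpha:
            accepted += 1
        else:
            break  # the first rejection discards the remaining draft tokens
    # Assumption: the verification pass always yields one target-model token,
    # either the replacement for the rejected token or a bonus token.
    return accepted + 1


def average_tokens(k: int, alpha: float, cycles: int = 20000, seed: int = 0) -> float:
    """Average tokens per cycle over many simulated cycles."""
    rng = random.Random(seed)
    return sum(simulate_cycle(k, alpha, rng) for _ in range(cycles)) / cycles


if __name__ == "__main__":
    print("closed form:", expected_tokens_per_cycle(4, 0.80))
    print("simulated:  ", average_tokens(4, 0.80))
```

Under these assumptions, K=4 and α=0.80 give 1 + 0.8 + 0.64 + 0.512 + 0.4096 = 3.3616 expected tokens per cycle, and the simulated average should land close to that.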

Inputs: target model, draft model, hardware.

Parameters: draft tokens (K) = 4, acceptance rate (α) = 0.80

[Animation: Speculative Decoding, Cycle View (Draft, Verify, Target)]

[Charts: Time Breakdown per Cycle; Speedup vs Acceptance Rate]

Run at K=4, α=0.80 and step through the animation — watch the accept/reject pattern
Drag α from 0.95 down to 0.30 — at what point does the speedup drop below 1.1×?
Set K=6, α=0.40 then try K=2, α=0.40 — shorter speculation wins at low acceptance
Try different draft models — how does the draft-to-target size ratio affect speedup?
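The experiments above can also be approximated offline. The sketch below uses the standard analytical model from the speculative decoding literature, which may differ from the formula behind the page's charts. It assumes independent per-token acceptance with probability α, one target-model token per cycle in addition to the accepted draft tokens, and a hypothetical parameter `c` for the cost of one draft step relative to one target step. The value c = 0.05 is an illustrative assumption, not a measurement.

```python
def expected_tokens(k: int, alpha: float) -> float:
    """Expected tokens per cycle, 1 + alpha + ... + alpha**k (i.i.d. acceptance)."""
    return sum(alpha ** i for i in range(k + 1))


def speedup(k: int, alpha: float, c: float) -> float:
    """Expected speedup over plain decoding.

    One cycle costs k draft steps plus one target step, measured in target steps.
    c is the (hypothetical) draft-to-target cost ratio per step.
    """
    return expected_tokens(k, alpha) / (k * c + 1.0)


def alpha_for_speedup(k: int, c: float, target: float, tol: float = 1e-6):
    """Smallest alpha at which the speedup reaches `target`, or None if it never does."""
    if speedup(k, 1.0, c) < target:
        return None
    lo, hi = 0.0, 1.0
    while hi - lo > tol:  # bisection: speedup increases with alpha
        mid = (lo + hi) / 2
        if speedup(k, mid, c) >= target:
            hi = mid
        else:
            lo = mid
    return hi


def crossover_alpha(k_short: int, k_long: int, c: float, tol: float = 1e-6) -> float:
    """Alpha above which the longer draft beats the shorter one (assumes 0 < c < 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if speedup(k_long, mid, c) > speedup(k_short, mid, c):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    c = 0.05  # hypothetical draft-to-target cost ratio
    for alpha in (0.95, 0.80, 0.60, 0.40, 0.30):
        row = "  ".join(f"K={k}: {speedup(k, alpha, c):.2f}x" for k in (2, 4, 6))
        print(f"alpha={alpha:.2f}  {row}")
    print("alpha where K=4 reaches 1.1x:", alpha_for_speedup(4, c, 1.1))
    print("alpha where K=6 overtakes K=2:", crossover_alpha(2, 6, c))
```

In this model, a larger `c` (a draft model closer in size to the target) shrinks every speedup and pushes the crossover toward higher α, which is one way to reason about the draft-to-target size ratio.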

Why can the target model verify K draft tokens in roughly the same time it takes to generate 1 token?
At what acceptance rate does K=2 actually outperform K=6? Why does shorter speculation win when acceptance is low?
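For the first question, a rough roofline estimate is one way to see the answer. The sketch below assumes a forward pass must read every weight from memory once regardless of how many token positions it processes, and that it performs about 2 FLOPs per parameter per token. It ignores attention, the KV cache, and activation traffic. The model size and hardware figures are hypothetical placeholders, not values from the page.

```python
def forward_pass_time(
    params: float,
    tokens: int,
    mem_bandwidth: float,
    peak_flops: float,
    bytes_per_param: float = 2.0,
) -> float:
    """Rough roofline estimate (seconds) for one forward pass over `tokens` positions."""
    bytes_moved = params * bytes_per_param  # weights are read once per pass
    flops = 2.0 * params * tokens           # ~2 FLOPs per parameter per token
    memory_time = bytes_moved / mem_bandwidth
    compute_time = flops / peak_flops
    # The pass takes as long as the slower of the two resources.
    return max(memory_time, compute_time)


if __name__ == "__main__":
    # Hypothetical numbers: a 7e9-parameter model in 16-bit weights on hardware
    # with 1e12 bytes/s of memory bandwidth and 1e14 FLOP/s of peak compute.
    params, bandwidth, flops = 7e9, 1e12, 1e14
    for tokens in (1, 4, 8, 200):
        t = forward_pass_time(params, tokens, bandwidth, flops)
        print(f"{tokens:>3} tokens: {t * 1e3:.1f} ms")
```

With these placeholder numbers, the pass is limited by memory bandwidth for small token counts, so verifying 4 or 8 positions is estimated to take the same time as generating 1. Only at much larger counts does compute become the bottleneck.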
