# Speculative Decoding
Parent: [[Efficient Transformer Architectures for Edge]]
An inference-time trick that exploits a structural inefficiency of autoregressive generation. The big model must produce tokens one at a time, each conditioned on everything before it. But most tokens in most sequences are easy — a smaller, faster model could predict them correctly. Why not let it?
The protocol: a cheap draft model generates several candidate tokens ahead of time. The expensive target model then verifies them all in parallel, in a single forward pass. Each accepted token is kept; at the first rejection, keep the prefix up to that point, take the target model's own prediction there, and restart drafting. Under greedy decoding, "accept" simply means the two models agree; under sampling, a draft token is accepted with probability min(1, p_target/p_draft), and a rejection is replaced by a sample from the normalized residual distribution. The key theoretical result, due to Leviathan et al., is that with this rule the output distribution is mathematically identical to what the target model would have produced on its own — no quality loss, only latency savings.
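The accept/resample rule fits in a few lines. A minimal sketch, assuming we already have each model's full probability vector at every drafted position (function and variable names here are invented for illustration, not from any particular library):

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """Verify one round of drafted tokens.

    draft_probs[i] / target_probs[i]: the two models' probability
    vectors at position i; draft_tokens[i] was sampled from
    draft_probs[i] (so its draft probability is nonzero).
    Returns the list of tokens actually emitted this round.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            # Accept with probability min(1, p_target/p_draft).
            accepted.append(tok)
        else:
            # Reject: sample from the normalized residual
            # max(0, p_target - p_draft), which corrects the
            # distribution so the output matches the target exactly.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break  # everything after a rejection is discarded
    return accepted
```

When the two distributions are identical, every token is accepted; when the target assigns zero probability to a drafted token, it is always rejected and replaced. This is what makes the "no quality loss" claim exact rather than approximate.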
The speedup depends on how often the draft model is right. On easy tokens (function words, common completions), it is right almost always and you get multiple tokens per target-model forward pass. On hard tokens (named entities, technical terms), it is often wrong and you fall back to single-token generation. In practice, 2-3x end-to-end speedups are routine, and more is possible with better drafters.
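The payoff can be estimated with a simple model: if each draft token is accepted independently with rate alpha and gamma tokens are drafted per round, the expected tokens emitted per target forward pass follow a geometric series. This is an idealized i.i.d. assumption, not a measurement:

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target forward pass, assuming each
    draft token is accepted independently with probability alpha and
    gamma tokens are drafted per round. The +1 accounts for the token
    the target itself contributes on a rejection (or after a fully
    accepted round)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)
```

At alpha = 0.8 with gamma = 4 drafted tokens, this gives roughly 3.4 tokens per target pass, which is in line with the 2-3x end-to-end speedups quoted above once draft-model overhead is subtracted. At alpha = 0 it degrades gracefully to 1, i.e. plain autoregressive decoding.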
The engineering is subtle. The draft model needs to be fast enough that its cost is negligible next to the target model's. The KV cache has to be rolled back after rejected speculations. Tree-structured speculation (drafting multiple candidate continuations at once) can squeeze out more throughput at some implementation cost.
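The rollback bookkeeping amounts to truncating the cache to the accepted prefix. A toy sketch (class and method names invented; real systems slice preallocated tensors rather than Python lists):

```python
class KVCache:
    """Toy per-layer KV cache: entries are appended during speculation
    and truncated on rollback when draft tokens are rejected."""

    def __init__(self):
        self.keys, self.values = [], []  # one entry per cached position

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def rollback(self, accepted_len):
        # Drop cache entries for positions past the accepted prefix,
        # so the next drafting round starts from a consistent state.
        del self.keys[accepted_len:]
        del self.values[accepted_len:]
        return len(self.keys)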
## Related
- [[Grouped Query Attention (GQA)]]
- [[KV Cache Compression and Eviction]]
---
Tags: #ai #inference #edge #kp