What is Speculative Decoding and When Does It Help?
THE LATENCY BOTTLENECK
LLM generation is autoregressive: each token depends on all previous tokens. You cannot parallelize token generation within a single request. A 100-token response requires 100 sequential forward passes, each taking 10-50ms depending on model size.
This sequential nature means latency scales linearly with output length. A 1000-token response takes 10x longer than a 100-token response, regardless of how much hardware you have.
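The linear scaling can be made concrete with a small sketch. The 25 ms per forward pass below is an illustrative assumption in the 10-50 ms range cited above; `generation_latency_ms` is a hypothetical helper, not a real API.

```python
# Sketch: autoregressive latency scales linearly with output length,
# because each token requires one sequential forward pass.
# The 25 ms per-pass figure is an illustrative assumption.

def generation_latency_ms(num_tokens: int, ms_per_pass: float = 25.0) -> float:
    """Total latency = number of sequential forward passes x per-pass cost."""
    return num_tokens * ms_per_pass

print(generation_latency_ms(100))   # 2500.0 ms -> 2.5 s
print(generation_latency_ms(1000))  # 25000.0 ms -> 25 s, 10x the 100-token case
```

More hardware shrinks the per-pass cost somewhat, but cannot remove the sequential dependency itself.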
HOW SPECULATIVE DECODING WORKS
Use a smaller, faster draft model to generate candidate tokens. Then verify multiple candidates in parallel with the target model.
Process:
1. Draft model generates K candidate tokens (fast, e.g., 5ms per token)
2. Target model processes all K candidates in a single forward pass (parallel verification)
3. For each candidate token x, accept it with probability min(1, p(x)/q(x)), where p is the target model's probability for x and q is the draft model's
4. On the first rejection at position i, discard candidates i through K
5. Sample a replacement token at the rejection point from the adjusted distribution norm(max(0, p - q))
6. Repeat
If the draft model is good, many candidates are accepted, and you effectively generate multiple tokens per target model forward pass.
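The loop above can be sketched end to end. This is a toy model, not a real serving stack: `target_probs` and `draft_probs` stand in for actual model forward passes, the three-token vocabulary and all distributions are illustrative, and the target's "single parallel forward pass" is simulated by scoring each prefix in turn.

```python
import random

random.seed(0)
VOCAB = [0, 1, 2]

# Toy stand-ins for real models: each returns a distribution over VOCAB
# given the context. Names and numbers are illustrative assumptions.
def target_probs(context):
    return [0.6, 0.3, 0.1]

def draft_probs(context):
    return [0.5, 0.4, 0.1]

def sample(probs):
    return random.choices(VOCAB, weights=probs)[0]

def speculative_step(context, k=4):
    """One round: draft K tokens, verify with the target, accept a prefix."""
    # Step 1: draft model proposes K candidates autoregressively (cheap).
    drafted = []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q)
        drafted.append((tok, q))
        ctx.append(tok)

    # Step 2: target model scores all K positions (in practice, one
    # parallel forward pass; simulated here prefix by prefix).
    accepted = []
    ctx = list(context)
    for tok, q in drafted:
        p = target_probs(ctx)
        # Step 3: accept with probability min(1, p/q); this rule is what
        # preserves the target model's exact output distribution.
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Steps 4-5: discard the rest, resample from norm(max(0, p - q)).
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            accepted.append(sample([r / total for r in residual]))
            return accepted
    # All K accepted: take a free bonus token from the target's next position.
    accepted.append(sample(target_probs(ctx)))
    return accepted

print(speculative_step([], k=4))
```

Each round costs one target forward pass and yields between 1 and K+1 tokens, which is where the speedup comes from.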
SPEEDUP FACTORS
Speedup depends on draft model quality. If the draft model matches the target model 80% of the time, you get roughly 3-4x speedup on latency. If it matches only 50%, speedup is closer to 1.5-2x.
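These figures follow from a standard expected-value argument: with a per-token acceptance rate alpha and K drafted tokens per round, the expected number of tokens produced per target forward pass is the geometric sum 1 + alpha + ... + alpha^K. A quick sketch (function name is mine):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per target forward pass when each drafted
    token is accepted independently with probability alpha.
    Geometric series: 1 + alpha + alpha^2 + ... + alpha^k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_pass(0.8, 5))  # ~3.7 tokens per pass -> ~3-4x speedup
print(expected_tokens_per_pass(0.5, 5))  # ~2.0 tokens per pass -> ~1.5-2x speedup
```

The actual wall-clock speedup is a bit lower, since the draft model's own forward passes are fast but not free.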
Speculative decoding preserves the target model's output distribution exactly: the target model always has final say. This is mathematically guaranteed by the accept/resample rule, so the output distribution is identical to sampling from the target model alone.
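The guarantee can be checked empirically for a single token. Below, samples are drawn from a deliberately different draft distribution, passed through the accept/resample rule, and the empirical frequencies converge to the target distribution. Both distributions are illustrative assumptions.

```python
import random

random.seed(1)

p = [0.6, 0.3, 0.1]   # target distribution (illustrative)
q = [0.3, 0.5, 0.2]   # draft distribution (deliberately different)

def speculative_sample():
    # Propose from the draft; accept with probability min(1, p/q);
    # otherwise resample from the normalized residual max(0, p - q).
    x = random.choices(range(3), weights=q)[0]
    if random.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    return random.choices(range(3), weights=[r / total for r in residual])[0]

n = 200_000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample()] += 1
freqs = [c / n for c in counts]
print(freqs)  # approaches [0.6, 0.3, 0.1], the target distribution, not the draft's
```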
WHEN TO USE
Best for latency-sensitive applications where throughput is less critical. Interactive chatbots benefit; batch processing does not: the draft model's passes and the target compute spent on rejected tokens add cost per accepted token, and at high batch sizes continuous batching matters more.