
What is Speculative Decoding and When Does It Help?

Speculative decoding accelerates autoregressive generation by using a fast draft model to propose multiple tokens, then verifying them in a single forward pass with the larger target model. When the draft is correct, multiple tokens are accepted at once, reducing the number of expensive target-model iterations. Crucially, this is a lossless optimization: it preserves the exact output distribution of the target model because incorrect draft tokens are rejected and regenerated.

The approach works because the draft model generates k tokens per sequence (typically 4 to 8) much faster than the target model could. The target model then constructs a verification input that aligns these draft tokens so it can evaluate all of them in one forward pass. If the draft matches what the target model would have produced, all k tokens are accepted. If there is a mismatch at position i, tokens up to position i-1 are accepted and generation continues from the corrected token at position i.

The speedup depends critically on the acceptance rate and verification efficiency. Production systems report 1.5x to 2.5x throughput gains for decode-dominated workloads when acceptance rates are high. Published work shows that with efficient verification kernels and a well-matched draft model, speculative decoding delivers these speedups without changing output quality. The technique is most effective when decode latency dominates total latency, when the draft model fits in the memory budget alongside the target model, and when the scheduler can batch verification steps efficiently.

The tradeoff is complexity versus speedup. Speculative decoding adds another model to deploy, extra control flow for verification, and additional KV-cache management, because both draft and target need a cache. If the acceptance rate drops below 30 to 40 percent due to domain shift, strict syntax requirements in code generation, or long-range reasoning tasks, the verification overhead can outweigh the benefits and speculative decoding becomes a net negative. Colocating draft and target on the same device avoids PCIe bottlenecks but requires careful memory planning. Reusing KV blocks across verification and subsequent decode is essential to avoid redundant computation.
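The propose/verify cycle can be summarized with a short sketch. This is a minimal greedy version under stated assumptions, not a production implementation: draft_model and target_model are hypothetical callables, and real systems use rejection sampling over probabilities (rather than exact token matching) to preserve the target's sampling distribution, plus batched kernels and KV-cache reuse.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],         # hypothetical: greedy next token for a prefix
    target_model: Callable[[List[int]], List[int]],  # hypothetical: one pass over prefix + drafts,
                                                     # returning its greedy token after the prefix and
                                                     # after each draft extension (k + 1 predictions)
    k: int = 4,
) -> List[int]:
    """One propose/verify cycle; returns the tokens to append to the sequence."""
    # 1) Draft model proposes k tokens autoregressively (cheap per step).
    drafts: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2) Target model verifies all k drafts in a single forward pass.
    #    target_preds[i] is the target's choice after prefix + drafts[:i].
    target_preds = target_model(prefix + drafts)

    # 3) Accept the longest matching prefix of drafts; at the first mismatch,
    #    substitute the target's own token so output matches plain target decoding.
    accepted: List[int] = []
    for i, d in enumerate(drafts):
        if d == target_preds[i]:
            accepted.append(d)
        else:
            accepted.append(target_preds[i])
            return accepted
    # All k drafts matched: the same pass also yields one extra "bonus" target token.
    accepted.append(target_preds[k])
    return accepted
```

In the greedy case, exact token matching is equivalent to the full acceptance rule; with temperature sampling, acceptance becomes probabilistic and the corrected token is drawn from an adjusted distribution so the target's output distribution is still preserved.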
💡 Key Takeaways
Speculative decoding uses a fast draft model to propose k tokens (typically 4 to 8) that are verified in one target model pass, accepting correct prefixes and rejecting mismatches
This is a lossless optimization preserving exact output distribution of the target model because incorrect drafts are rejected and regenerated, unlike other approximations
Production systems report 1.5x to 2.5x throughput gains for decode phases when acceptance rates are high and verification is efficiently implemented with KV reuse
Acceptance rate is critical: above 60 percent yields strong gains, below 30 to 40 percent makes verification overhead outweigh benefits and can hurt performance (a simple speedup model is sketched after this list)
The technique is most effective when decode dominates latency, when the draft model fits in memory alongside the target, and when the scheduler batches verification efficiently
Failure modes include domain shift reducing acceptance, strict syntax in code generation causing rejections, and poorly colocated models creating PCIe bottlenecks
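A rough way to see where these acceptance-rate thresholds come from is the back-of-the-envelope model below. It is an illustrative sketch, not a benchmark: it assumes each draft token is accepted independently with probability alpha, and that one target verification pass over k + 1 positions costs about the same as one plain decode step.

```python
def expected_speedup(alpha: float, k: int, t_draft: float, t_target: float) -> float:
    """Expected decode speedup of speculative decoding over plain decoding.

    alpha    : per-token draft acceptance probability (assumed independent per token)
    k        : draft tokens proposed per cycle
    t_draft  : draft model time per token
    t_target : target model time per forward pass (~ time per plain decode step)
    """
    # Expected tokens produced per cycle: accepted drafts plus the one token the
    # target always contributes (a bonus token on full accept, a correction otherwise).
    tokens_per_cycle = (1 - alpha ** (k + 1)) / (1 - alpha)
    cycle_time = k * t_draft + t_target
    return tokens_per_cycle * (t_target / cycle_time)

# Illustrative (assumed) timings: 5 ms per draft token, 80 ms per target pass, k = 4.
# expected_speedup(0.7, 4, 5, 80) -> ~2.2x
# expected_speedup(0.3, 4, 5, 80) -> ~1.1x, i.e. barely worth the added complexity
```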
📌 Examples
Draft model proposes 4 tokens in 20ms, target verifies in 80ms (100ms total). Acceptance rate 60% yields 2.4 tokens per 100ms versus 2 tokens per 160ms (2 target iterations), giving 1.92x speedup (reproduced in the sketch after this list)
Published research: Efficient verification with KV reuse shows lossless 1.5x to 2x gains on dialog tasks with high acceptance rates for well matched draft models
Pathological case: Code generation with strict syntax has 25% acceptance rate. 4 token draft takes 20ms, verification 80ms, but only 1 token accepted. 100ms per token versus 80ms without speculation is a net loss
Memory planning: 7B target model uses 14 GB, 1B draft model uses 2 GB, combined KV cache for both needs 40 GB, fits on single 80 GB GPU with colocation
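The first and third examples above can be reproduced with the same simple accounting used in the text. Timings are the assumed values from the examples; here the accepted-token count is the plain fraction of drafts kept, not the independent-acceptance model sketched earlier.

```python
def tokens_per_ms(k: int, accept_rate: float, t_draft_total: float, t_verify: float) -> float:
    """Throughput of one speculative cycle using the simple accounting above:
    accepted tokens = k * accept_rate, cycle time = draft time + verify time."""
    return (k * accept_rate) / (t_draft_total + t_verify)

baseline = 1 / 80                        # plain decoding: 1 token per 80 ms target pass
good = tokens_per_ms(4, 0.60, 20, 80)    # 2.4 tokens per 100 ms
bad = tokens_per_ms(4, 0.25, 20, 80)     # 1.0 token  per 100 ms

print(f"good case: {good / baseline:.2f}x")   # ~1.92x speedup
print(f"bad case:  {bad / baseline:.2f}x")    # ~0.80x, a net loss versus plain decoding
```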