Speculative Decoding and Latency Optimization
How It Works
The draft model (7B parameters) generates 5 candidate tokens in 25ms. The target model (70B parameters) would take 250ms to generate those same 5 tokens sequentially. Instead, run the target model once over all 5 draft tokens in parallel: 60ms. The whole step costs 25ms + 60ms = 85ms, and if 4 of the drafts are accepted (the rejected position is replaced by the target model's own token), you emit 5 tokens in 85ms instead of 250ms, saving 165ms.
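Under the illustrative timings above (and they are just assumptions: 5ms per draft token, 50ms per target token, one 60ms verification pass), the per-step arithmetic looks like this sketch:

```python
# Back-of-the-envelope latency model for one draft-and-verify step.
# All timings are illustrative assumptions, not measurements.
DRAFT_MS_PER_TOKEN = 5.0    # 7B draft model
TARGET_MS_PER_TOKEN = 50.0  # 70B target model, decoding sequentially
VERIFY_MS = 60.0            # one parallel target pass over all draft tokens
K = 5                       # draft tokens proposed per step

def speculative_step_ms() -> float:
    """Cost of drafting K tokens plus one parallel verification pass."""
    return K * DRAFT_MS_PER_TOKEN + VERIFY_MS

def tokens_emitted(accepted: int) -> int:
    """Accepted drafts plus the target's own token at the first rejection
    (or a bonus token when every draft is accepted)."""
    return accepted + 1

accepted = 4
step_ms = speculative_step_ms()                               # 85 ms
baseline_ms = tokens_emitted(accepted) * TARGET_MS_PER_TOKEN  # 250 ms
print(f"step: {step_ms} ms, sequential baseline: {baseline_ms} ms, "
      f"saved: {baseline_ms - step_ms} ms")
```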
Verification: For each draft token, check if the target model agrees. If the target model would have generated the same token (or accepts it probabilistically), keep it. At the first rejection, discard that token and all following drafts. Generate the correct token from the target model and restart drafting.
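A minimal greedy verification loop might look like the sketch below. verify_drafts is a hypothetical helper, not any particular library's API, and a full implementation would also handle the probabilistic acceptance case by comparing draft and target distributions rather than argmax tokens.

```python
import torch

def verify_drafts(target_logits: torch.Tensor, draft_tokens: list[int]) -> list[int]:
    """Greedy verification sketch.

    target_logits: [k, vocab] logits from one parallel target pass over the k drafts.
    draft_tokens:  the k tokens proposed by the draft model.
    Returns the accepted prefix plus one token chosen by the target model."""
    output = []
    for i, drafted in enumerate(draft_tokens):
        target_choice = int(torch.argmax(target_logits[i]))
        if target_choice == drafted:
            output.append(drafted)        # target agrees: keep the draft token
        else:
            output.append(target_choice)  # first rejection: take the target's token
            break                         # and discard the remaining drafts
    return output
```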
When It Works Well
Speculative decoding shines when the draft model has high agreement with the target. For predictable text (code with clear patterns, formulaic language), acceptance rates hit 80-90%. For creative text where the target model might choose any of many valid continuations, acceptance drops to 40-50%.
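A rough way to quantify this: if each draft token is accepted independently with probability p, a step with k drafts accepts p + p² + … + pᵏ tokens on average, plus one token from the target itself, so the achievable speedup falls off quickly as p drops. Reusing the illustrative timings from above (again, assumptions rather than measurements):

```python
def expected_speedup(p_accept: float, k: int = 5, draft_ms: float = 5.0,
                     target_ms: float = 50.0, verify_ms: float = 60.0) -> float:
    """Expected speedup over sequential target decoding for one draft-and-verify
    step, assuming each draft token is accepted independently with probability
    p_accept and every step also emits one token from the target model."""
    expected_accepted = sum(p_accept ** i for i in range(1, k + 1))
    tokens_per_step = expected_accepted + 1
    step_ms = k * draft_ms + verify_ms
    return (tokens_per_step * target_ms) / step_ms

for p in (0.85, 0.45):  # roughly the predictable-text vs creative-text regimes
    print(f"acceptance {p:.0%}: ~{expected_speedup(p):.1f}x vs sequential decoding")
```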
Other Latency Optimizations
Model parallelism: Split the model across multiple GPUs so each handles part of the computation. Reduces per-token latency but adds inter-GPU communication overhead (1-5ms per synchronization).
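A toy latency model makes that tradeoff concrete (all numbers here are illustrative assumptions):

```python
def per_token_latency_ms(single_gpu_compute_ms: float, n_gpus: int,
                         syncs_per_token: int, sync_ms: float) -> float:
    """Toy tensor-parallel latency model: compute divides across the GPUs,
    while each synchronization point adds a fixed communication cost."""
    return single_gpu_compute_ms / n_gpus + syncs_per_token * sync_ms

print(per_token_latency_ms(50.0, 1, 0, 0.0))   # 1 GPU:  50.0 ms/token
print(per_token_latency_ms(50.0, 4, 8, 2.0))   # 4 GPUs: 28.5 ms/token
print(per_token_latency_ms(50.0, 4, 20, 2.0))  # 4 GPUs, sync-heavy: 52.5 ms/token
```

Once the synchronization cost per token approaches the compute time saved, adding GPUs stops helping, which is why the communication overhead matters.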
Quantization: Convert 32-bit weights to 8-bit or 4-bit. Cuts weight memory footprint and bandwidth demands by 4-8×, typically speeding up inference 2-3× at a 1-2% quality loss. Essential for fitting large models in limited GPU memory.
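As a minimal sketch of what quantization does to a weight matrix, here is symmetric per-tensor int8 quantization in PyTorch (production systems typically use finer-grained per-channel or per-group schemes, and often start from 16-bit rather than 32-bit weights):

```python
import torch

def quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric per-tensor int8 quantization: store 8-bit integers plus one
    floating-point scale, and dequantize on the fly during inference."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)      # one fp32 weight matrix: 64 MB
q, scale = quantize_int8(w)      # int8 copy: 16 MB, a 4x reduction
err = (dequantize(q, scale) - w).abs().mean()
print(f"mean absolute quantization error: {err.item():.5f}")
```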