
Speculative Decoding and Latency Optimization

Key Insight
Speculative decoding uses a small, fast model to draft multiple tokens, then verifies them in parallel with the large model. If the draft is accepted, you get multiple tokens for the cost of one large-model forward pass.

How It Works

The draft model (7B parameters) generates 5 candidate tokens in 25ms. The target model (70B parameters) would take 250ms to generate those same 5 tokens sequentially. Instead, run the target model once over all 5 draft tokens in parallel: 60ms. If 4 drafts are accepted, the same verification pass supplies the corrected fifth token, so 85ms of work (25ms drafting + 60ms verifying) replaces 250ms of sequential generation, saving 165ms.

Verification: For each draft token, check whether the target model agrees. If the target model would have generated the same token (or accepts it probabilistically), keep it. At the first rejection, discard that token and all following drafts, take the target model's own token at that position (it comes from the same verification pass), and restart drafting from there.
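To make the draft/verify loop concrete, here is a minimal Python sketch of one speculative step under greedy decoding. The `draft_next` and `target_next` callables are hypothetical stand-ins for the two models; a real system verifies all draft positions in a single batched forward pass of the target model and, when sampling, uses a probabilistic accept/reject rule rather than exact matching.

```python
from typing import Callable, List

# One token prefix -> next token id (hypothetical stand-in for a model).
NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 5) -> List[int]:
    # 1) Draft k candidate tokens autoregressively with the cheap model.
    drafts: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # 2) Verify: the target model scores every drafted position (in practice
    #    this is one parallel forward pass over prefix + drafts).
    out: List[int] = []
    for i, tok in enumerate(drafts):
        target_tok = target_next(prefix + drafts[:i])
        if target_tok == tok:
            out.append(tok)          # draft agrees -> keep it
        else:
            out.append(target_tok)   # first mismatch -> take the target's token
            break                    # and discard the remaining drafts
    else:
        # All k drafts accepted: the same verification pass yields one bonus token.
        out.append(target_next(prefix + drafts))
    return out

# Toy demo: the "draft" agrees with the "target" except when the prefix sum
# is divisible by 7, so most drafts are accepted and one gets corrected.
draft = lambda p: (sum(p) + 1) % 100
target = lambda p: (sum(p) + 1) % 100 if sum(p) % 7 else (sum(p) + 2) % 100
print(speculative_step([5, 9, 13], draft, target))  # several tokens per step
```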

When It Works Well

Speculative decoding shines when the draft model has high agreement with the target. For predictable text (code with clear patterns, formulaic language), acceptance rates hit 80-90%. For creative text where the target model might choose any of many valid continuations, acceptance drops to 40-50%.

💡 Speedup Math: At an 80% acceptance rate with 5 draft tokens, you expect 4 accepted drafts, and the verification pass supplies one more token, so one 85ms draft + verify step produces what sequential generation needs 250ms for: a 2.9× speedup. At 50% acceptance, the speedup drops to roughly 1.5×.
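A quick back-of-the-envelope check of that arithmetic, using the illustrative latencies from above (not measurements):

```python
# Latencies are the illustrative figures from this article.
DRAFT_MS_PER_TOKEN = 5     # 25 ms to draft 5 tokens with the small model
VERIFY_MS = 60             # one parallel target pass over all 5 drafts
TARGET_MS_PER_TOKEN = 50   # 250 ms for 5 sequential large-model steps
K = 5                      # number of draft tokens per step

def speedup(acceptance_rate: float) -> float:
    # Accepted drafts plus the token the verification pass supplies.
    tokens_per_step = acceptance_rate * K + 1
    spec_ms = K * DRAFT_MS_PER_TOKEN + VERIFY_MS   # 25 + 60 = 85 ms
    return tokens_per_step * TARGET_MS_PER_TOKEN / spec_ms

print(f"{speedup(0.8):.1f}x")   # ~2.9x, matching the worked example above
# Lower acceptance shrinks the gain; how much depends on how rejections are
# modeled (the figure quoted above is roughly 1.5x at 50% acceptance).
```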

Other Latency Optimizations

Model parallelism: Split model across multiple GPUs. Each GPU handles part of the computation. Reduces per-token latency but adds inter-GPU communication overhead (1-5ms per synchronization).
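A minimal sketch of the idea, assuming PyTorch and two visible GPUs: each half of the network lives on its own device, and the activation crossing between devices is where the per-step synchronization cost comes from. (This is layer-wise, pipeline-style splitting; tensor parallelism splits individual matrix multiplies across GPUs instead.)

```python
import torch
import torch.nn as nn

class TwoGPUMLP(nn.Module):
    def __init__(self, d: int = 4096):
        super().__init__()
        # First half of the model on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(d, d), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(d, d).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.part1(x.to("cuda:0"))
        h = h.to("cuda:1")   # inter-GPU transfer: the communication overhead
        return self.part2(h)

if torch.cuda.device_count() >= 2:
    model = TwoGPUMLP()
    out = model(torch.randn(8, 4096))
    print(out.shape, out.device)
```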

Quantization: Convert 32-bit weights to 8-bit or 4-bit. Reduces memory bandwidth by 4-8×, speeding up inference 2-3× with 1-2% quality loss. Essential for fitting large models on limited GPU memory.
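As a rough illustration of what quantization does to a weight tensor, here is a naive symmetric int8 round-to-nearest sketch; production systems use per-channel scales and calibrated methods (e.g., GPTQ, AWQ) rather than this.

```python
import numpy as np

w = np.random.randn(4096, 4096).astype(np.float32)   # stand-in fp32 weights

scale = np.abs(w).max() / 127.0                       # one scale per tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale         # used at inference time

# 4x less memory (and memory bandwidth) at the cost of a small rounding error.
print(f"fp32: {w.nbytes / 2**20:.0f} MiB, int8: {w_int8.nbytes / 2**20:.0f} MiB")
print(f"mean abs error: {np.abs(w - w_dequant).mean():.5f}")
```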

💡 Key Takeaways
Speculative decoding: small model drafts tokens, large model verifies in parallel
With 80% acceptance rate and 5 draft tokens, expect 2.9× speedup over sequential generation
Works best for predictable text (code, formulaic language) with 80-90% acceptance rates
Model parallelism splits model across GPUs, adds 1-5ms sync overhead per step
Quantization (8-bit/4-bit) gives 2-3× speedup with 1-2% quality loss
📌 Interview Tips
1. Explain speculative decoding mechanics: draft 5 tokens fast, verify in parallel, accept matches
2. Show the speedup math: 85ms draft+verify vs 250ms sequential = 2.9× at 80% acceptance
3. Mention the quantization trade-off: 4-8× memory reduction, 2-3× speed, 1-2% quality loss