Speculative Decoding and Latency Optimization
How It Works
The draft model (7B parameters) generates 5 candidate tokens in 25ms. The target model (70B parameters) would take 250ms to generate those same 5 tokens sequentially. Instead, run the target model once over all 5 draft tokens in parallel: 60ms. The whole step costs 25ms + 60ms = 85ms, and if 4 of the drafts are accepted (the rejected position is replaced by the target model's own token), you emit 5 tokens in 85ms instead of 250ms, saving 165ms.
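Under the illustrative timings above (and they are just assumptions: 5ms per draft token, 50ms per target token, one 60ms verification pass), the per-step arithmetic looks like this sketch:

```python
# Back-of-the-envelope latency model for one draft-and-verify step.
# All timings are illustrative assumptions, not measurements.
DRAFT_MS_PER_TOKEN = 5.0    # 7B draft model
TARGET_MS_PER_TOKEN = 50.0  # 70B target model, decoding sequentially
VERIFY_MS = 60.0            # one parallel target pass over all draft tokens
K = 5                       # draft tokens proposed per step

def speculative_step_ms() -> float:
    """Cost of drafting K tokens plus one parallel verification pass."""
    return K * DRAFT_MS_PER_TOKEN + VERIFY_MS

def tokens_emitted(accepted: int) -> int:
    """Accepted drafts plus the target's own token at the first rejection
    (or a bonus token when every draft is accepted)."""
    return accepted + 1

accepted = 4
step_ms = speculative_step_ms()                               # 85 ms
baseline_ms = tokens_emitted(accepted) * TARGET_MS_PER_TOKEN  # 250 ms
print(f"step: {step_ms} ms, sequential baseline: {baseline_ms} ms, "
      f"saved: {baseline_ms - step_ms} ms")
```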
Verification: For each draft token, check if the target model agrees. If the target model would have generated the same token (or accepts it probabilistically), keep it. At the first rejection, discard that token and all following drafts. Generate the correct token from the target model and restart drafting.
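A minimal greedy verification loop might look like the sketch below. verify_drafts is a hypothetical helper, not any particular library's API, and a full implementation would also handle the probabilistic acceptance case by comparing draft and target distributions rather than argmax tokens.

```python
import torch

def verify_drafts(target_logits: torch.Tensor, draft_tokens: list[int]) -> list[int]:
    """Greedy verification sketch.

    target_logits: [k, vocab] logits from one parallel target pass over the k drafts.
    draft_tokens:  the k tokens proposed by the draft model.
    Returns the accepted prefix plus one token chosen by the target model."""
    output = []
    for i, drafted in enumerate(draft_tokens):
        target_choice = int(torch.argmax(target_logits[i]))
        if target_choice == drafted:
            output.append(drafted)        # target agrees: keep the draft token
        else:
            output.append(target_choice)  # first rejection: take the target's token
            break                         # and discard the remaining drafts
    return output
```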
When It Works Well
Speculative decoding shines when the draft model has high agreement with the target. For predictable text (code with clear patterns, formulaic language), acceptance rates hit 80-90%. For creative text where the target model might choose any of many valid continuations, acceptance drops to 40-50%.
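A rough way to quantify this: if each draft token is accepted independently with probability p, a step with k drafts accepts p + p² + … + pᵏ tokens on average, plus one token from the target itself, so the achievable speedup falls off quickly as p drops. Reusing the illustrative timings from above (again, assumptions rather than measurements):

```python
def expected_speedup(p_accept: float, k: int = 5, draft_ms: float = 5.0,
                     target_ms: float = 50.0, verify_ms: float = 60.0) -> float:
    """Expected speedup over sequential target decoding for one draft-and-verify
    step, assuming each draft token is accepted independently with probability
    p_accept and every step also emits one token from the target model."""
    expected_accepted = sum(p_accept ** i for i in range(1, k + 1))
    tokens_per_step = expected_accepted + 1
    step_ms = k * draft_ms + verify_ms
    return (tokens_per_step * target_ms) / step_ms

for p in (0.85, 0.45):  # roughly the predictable-text vs creative-text regimes
    print(f"acceptance {p:.0%}: ~{expected_speedup(p):.1f}x vs sequential decoding")
```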
Other Latency Optimizations
Model parallelism: Split the model across multiple GPUs so each handles part of the computation. Reduces per-token latency but adds inter-GPU communication overhead (1-5ms per synchronization).
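A toy latency model makes that tradeoff concrete (all numbers here are illustrative assumptions):

```python
def per_token_latency_ms(single_gpu_compute_ms: float, n_gpus: int,
                         syncs_per_token: int, sync_ms: float) -> float:
    """Toy tensor-parallel latency model: compute divides across the GPUs,
    while each synchronization point adds a fixed communication cost."""
    return single_gpu_compute_ms / n_gpus + syncs_per_token * sync_ms

print(per_token_latency_ms(50.0, 1, 0, 0.0))   # 1 GPU:  50.0 ms/token
print(per_token_latency_ms(50.0, 4, 8, 2.0))   # 4 GPUs: 28.5 ms/token
print(per_token_latency_ms(50.0, 4, 20, 2.0))  # 4 GPUs, sync-heavy: 52.5 ms/token
```

Once the synchronization cost per token approaches the compute time saved, adding GPUs stops helping, which is why the communication overhead matters.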
Quantization: Convert 32-bit weights to 8-bit or 4-bit. Cuts weight memory footprint and bandwidth demands by 4-8×, typically speeding up inference 2-3× at a 1-2% quality loss. Essential for fitting large models in limited GPU memory.
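As a minimal sketch of what quantization does to a weight matrix, here is symmetric per-tensor int8 quantization in PyTorch (production systems typically use finer-grained per-channel or per-group schemes, and often start from 16-bit rather than 32-bit weights):

```python
import torch

def quantize_int8(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric per-tensor int8 quantization: store 8-bit integers plus one
    floating-point scale, and dequantize on the fly during inference."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)      # one fp32 weight matrix: 64 MB
q, scale = quantize_int8(w)      # int8 copy: 16 MB, a 4x reduction
err = (dequantize(q, scale) - w).abs().mean()
print(f"mean absolute quantization error: {err.item():.5f}")
```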