What Are the Key Trade-offs in LLM Serving Optimizations?
LLM serving optimizations involve fundamental trade-offs among latency, throughput, memory, accuracy, and operational complexity. Larger batches increase aggregate tokens per second but also increase per-request latency, because each decode step's kernels take longer and every sequence waits for the whole batch to advance. Continuous batching reduces idle time and improves throughput but can increase time to first token (TTFT) if prefill is chunked too aggressively or if the scheduler prioritizes throughput over immediacy. Tuning batch size and chunk size means balancing these competing goals against service level objectives (SLOs).
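To make the tension concrete, the sketch below uses a hypothetical latency model; the `step_overhead_ms` and `per_seq_ms` constants are illustrative assumptions, not measurements from any real system.

```python
# Back-of-envelope batching model. The timing constants are assumptions for
# illustration only; real decode-step costs depend on the model, hardware,
# and sequence lengths.

def decode_step_ms(batch_size: int,
                   step_overhead_ms: float = 4.0,
                   per_seq_ms: float = 0.25) -> float:
    """Hypothetical time for one decode step: a fixed cost to read the
    weights plus a per-sequence compute component."""
    return step_overhead_ms + per_seq_ms * batch_size

for batch_size in (1, 16, 64, 256):
    step = decode_step_ms(batch_size)
    throughput = batch_size / step * 1000.0   # tokens/sec across the whole batch
    print(f"batch={batch_size:3d}  step={step:6.1f} ms  "
          f"throughput={throughput:8.0f} tok/s  latency per generated token={step:6.1f} ms")
```

Under these assumed costs, throughput climbs with batch size while the time each request waits per generated token grows linearly, which is exactly the tension the scheduler must resolve against its SLOs.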
Memory versus recomputation is the central tension. KV caching saves massive amounts of compute by avoiding quadratic attention recomputation but consumes large amounts of GPU memory, which limits batch size and sequence length. Evicting KV pages reduces memory pressure but forces recomputation or context truncation. Sliding window attention caps memory growth but loses long-range dependencies that some models rely on for quality. Each approach makes sense in different scenarios: use full KV caching when memory allows, apply compression when concurrency demands it, and use sliding windows only when the model architecture supports it.
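A rough sizing helper makes the memory pressure visible. The model shape below (32 layers, 32 KV heads, head dimension 128, FP16 values) is an assumed 7B-class configuration, so the absolute numbers are illustrative rather than tied to any specific model.

```python
# Rough KV-cache sizing. The model dimensions are assumptions for a
# 7B-class dense model with full (non-grouped) KV heads stored in FP16.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    # Factor of 2 covers keys and values, stored per layer and per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

batch, seq_len, window = 64, 800, 256
full = kv_cache_bytes(batch * seq_len)
capped = kv_cache_bytes(batch * min(seq_len, window))
print(f"full cache:     {full / 2**30:5.1f} GiB")   # ~25 GiB, on the order of the example below
print(f"sliding window: {capped / 2**30:5.1f} GiB") # memory capped, but distant tokens are gone
```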
Accuracy versus memory compression introduces subtle quality risks. Quantizing the KV cache to FP8 or INT8 cuts memory 2x to 4x but can degrade quality on tasks with rare or long-range dependencies. Cache eviction based on attention scores removes tokens that might become important later, causing coherence issues in long conversations. Techniques like retaining sink tokens and recent windows improve robustness, but tuning is workload-specific and requires careful A/B testing. The failure mode is silent quality degradation that users notice over time.
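The sketch below shows the shape of one such policy, retaining a few initial "sink" tokens plus a recent window, in the spirit of sink-token retention schemes; the `num_sink` and `recent_window` values are hypothetical knobs, not recommendations.

```python
# Minimal sink-plus-recent-window eviction sketch. The retention sizes are
# hypothetical; the right values are workload-specific.

def positions_to_keep(seq_len: int,
                      num_sink: int = 4,
                      recent_window: int = 1020) -> list:
    """Return the token positions retained in the KV cache after eviction."""
    if seq_len <= num_sink + recent_window:
        return list(range(seq_len))                    # nothing evicted yet
    sinks = list(range(num_sink))                      # always keep the earliest tokens
    recent = list(range(seq_len - recent_window, seq_len))
    return sinks + recent                              # the middle of the context is dropped

kept = positions_to_keep(seq_len=5000)
print(f"{len(kept)} of 5000 positions kept")           # 1024 kept; the rest are evicted
```

Anything evicted from the middle is unrecoverable without recomputation, which is where silent coherence loss in long conversations comes from.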
Speculative decoding trades simplicity for potential speedup. Adding a draft model introduces another deployment artifact, extra control flow, verification kernels, and dual KV cache management. If acceptance rates drop below 30 to 40 percent due to domain shift or task mismatch, the overhead outweighs the benefits and can increase latency. Production systems must monitor acceptance rates continuously and fall back to standard decoding when speculation becomes counterproductive. Colocating the draft and target models on the same GPU avoids PCIe bottlenecks but requires careful memory budgeting.
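A production guardrail can be as simple as tracking a rolling acceptance rate and disabling speculation below a break-even point. The threshold and window below are illustrative assumptions; the real break-even depends on draft-model cost and verification overhead.

```python
# Sketch of acceptance-rate monitoring with a fallback to standard decoding.
# The threshold and window size are illustrative assumptions.
from collections import deque

class SpeculationController:
    def __init__(self, threshold: float = 0.35, window: int = 500):
        self.threshold = threshold
        self.accepted = deque(maxlen=window)   # rolling record of accepted/rejected draft tokens

    def record(self, num_accepted: int, num_proposed: int) -> None:
        self.accepted.extend([1] * num_accepted + [0] * (num_proposed - num_accepted))

    @property
    def acceptance_rate(self) -> float:
        return sum(self.accepted) / len(self.accepted) if self.accepted else 1.0

    def use_speculation(self) -> bool:
        # Fall back to plain autoregressive decoding when the rolling
        # acceptance rate drops below the assumed break-even threshold.
        return self.acceptance_rate >= self.threshold

ctrl = SpeculationController()
ctrl.record(num_accepted=1, num_proposed=4)    # e.g. heavy domain shift
print(ctrl.acceptance_rate, ctrl.use_speculation())
```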
Resource isolation versus utilization creates product-level tension. Mixing diverse workloads (short chat turns alongside long document analysis) increases GPU utilization but complicates fairness. Without admission control and scheduling policies, short requests can be delayed by long prefill jobs, violating latency SLOs. In multi-tenant clusters this becomes a customer satisfaction issue. The right balance depends on whether the service optimizes for cost efficiency through high utilization or user experience through strict latency guarantees.
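One way to protect short requests is a per-step prefill token budget with shortest-remaining-work-first admission, sketched below. The `Request` type, `PREFILL_TOKEN_BUDGET` value, and the scheduling policy are illustrative assumptions, not any particular engine's implementation.

```python
# Toy admission-control sketch: each scheduler step caps how many prefill
# tokens run alongside decode, so a long document cannot starve short chats.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    remaining_prefill: int            # prompt tokens not yet prefilled

PREFILL_TOKEN_BUDGET = 512            # assumed cap on prefill tokens per scheduler step

def schedule_step(queue: deque) -> list:
    """Admit prefill chunks shortest-remaining-work first, within the budget."""
    budget, work = PREFILL_TOKEN_BUDGET, []
    pending = sorted(queue, key=lambda r: r.remaining_prefill)
    queue.clear()
    for req in pending:
        chunk = min(req.remaining_prefill, budget)
        if chunk:
            work.append((req.rid, chunk))
            req.remaining_prefill -= chunk
            budget -= chunk
        if req.remaining_prefill:
            queue.append(req)          # unfinished prefill re-enters the queue
    return work

q = deque([Request("long-doc", 10_000), Request("chat-1", 40), Request("chat-2", 60)])
print(schedule_step(q))                # chat turns admitted in full; long-doc gets the leftover budget
```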
💡 Key Takeaways
• Larger batches increase tokens-per-second throughput but also increase per-request latency as sequences wait longer; continuous batching trades TTFT for overall utilization
• KV caching eliminates quadratic recomputation but consumes large memory; eviction forces recomputation or truncation; sliding windows cap memory but lose long-range context
• KV quantization to FP8 or INT8 cuts memory 2x to 4x but risks quality loss on rare dependencies; cache eviction can harm coherence in long conversations, with silent degradation
• Speculative decoding adds deployment complexity with a draft model and verification; below a 30 to 40 percent acceptance rate, overhead outweighs the speedup and increases latency
• Mixing workloads improves utilization but complicates fairness; short requests delayed by long prefill violate SLOs without admission control and scheduling policies
• Cost versus quality: a smaller target model with speculation can match a larger model's throughput; a limited budget and strict SLAs may favor a smaller model with batching over a large model with speculation
📌 Examples
Batch size 64 achieves 10,000 tokens/second but 200ms per-request latency; batch size 16 yields 6,000 tokens/second but 80ms latency for latency-sensitive applications
Full KV cache for 64 sequences at 800 tokens uses 25.6 GB; FP8 quantization reduces this to 12.8 GB, doubling concurrency but causing a 2% quality drop on reasoning tasks
Speculative decoding with 70% acceptance gives 1.8x speedup; domain shift reduces acceptance to 25%, and verification overhead causes a 1.2x slowdown versus standard decode (see the speedup sketch after these examples)
Multi-tenant cluster: a long 10,000-token prefill takes 5 seconds, blocking 50 short requests and violating a 500ms TTFT SLA; chunking prefill to 512 tokens keeps TTFT under 200ms
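The speculative decoding figures above follow the shape of a standard expected-speedup estimate. The sketch below uses the usual geometric acceptance model with assumed values for draft length and relative draft cost, so it reproduces the qualitative crossover rather than the exact numbers.

```python
# Expected-speedup estimate for speculative decoding, assuming an independent
# per-token acceptance probability. Draft length and relative draft cost are
# assumed values for illustration.

def expected_speedup(acceptance: float,
                     draft_len: int = 4,
                     draft_cost_ratio: float = 0.1) -> float:
    """Expected tokens emitted per verification round, divided by the round's
    cost relative to one plain target-model decode step."""
    # Expected tokens per round, counting the bonus token from the target model.
    tokens_per_round = (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)
    cost_per_round = draft_len * draft_cost_ratio + 1.0   # k draft steps + 1 target step
    return tokens_per_round / cost_per_round

for rate in (0.70, 0.40, 0.25):
    print(f"acceptance={rate:.2f}  estimated speedup={expected_speedup(rate):.2f}x")
```

Under these assumptions the estimate drops below 1.0x as acceptance approaches the low end; real systems also pay verification-kernel and scheduling overhead, which pushes the break-even point up toward the 30 to 40 percent range cited above.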