
What Are Common Failure Modes in Production LLM Serving?

Production LLM serving faces several critical failure modes that cause user-visible errors or silent quality degradation.

The most common is out-of-memory during decode, when the KV cache grows beyond device capacity. If the scheduler admits too many sequences or underestimates output lengths, allocation fails mid-generation. Recovery typically requires aborting requests, which creates noisy retries and a poor user experience. The fix is conservative admission control that tracks KV occupancy and predicted lengths, erring on the side of rejecting new requests rather than crashing existing ones.

Memory bandwidth bottlenecks cause decode steps to stall even when compute is available. Each decode iteration must fetch large KV tensors per layer from memory. On modern GPUs, streaming multiprocessor (SM) utilization can drop to 30 to 40 percent because compute units idle while waiting for memory, and this worsens with large batches of long sequences. The symptom is low GPU utilization despite high request volume. Mitigations include KV quantization to reduce bytes transferred and paged attention to improve memory access patterns.

Fragmentation and stranded memory occur when naive contiguous KV buffers leave unusable gaps between sequences of different lengths. Without paged allocation, memory waste can exceed 50 percent under mixed workloads, reducing effective concurrency by half. Paged KV with fixed block sizes solves this by allowing non-contiguous allocation, but it requires careful implementation of the logical-to-physical mapping and block recycling.

Tail latency blowups happen when a single long prefill monopolizes GPU kernels, causing inter-token latency spikes for dozens of decode jobs. Static batching exacerbates this because all requests wait for the slowest one. Chunked prefill and continuous batching mitigate the issue but can increase TTFT for the long prompt itself. The scheduler must balance fairness across requests with throughput goals.

Cache eviction can harm coherence in subtle ways. Evicting tokens based on accumulated attention scores seems reasonable but can remove tokens that become important later in the conversation; users report off-topic replies or contradictions in long sessions. Retaining sink tokens from the beginning plus a recent window reduces this risk, but the right policy is workload dependent.

Prefix caching introduces a dangerous failure mode: cache mixing across users, caused by incorrect tokenization boundaries or hidden personalization tokens, is a security incident that exposes one user's context to another. Strict namespace isolation and validation are mandatory.
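As an illustration of that last point, here is a minimal sketch (in Python) of namespaced prefix-cache keys. The field names such as tenant_id and tokenizer_version are hypothetical, not taken from any particular serving stack; the point is that the cache key binds the exact token ids to every attribute that could make reuse unsafe.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PrefixCacheKey:
    """Binds a cached KV prefix to everything that makes reuse safe."""
    tenant_id: str          # hard isolation boundary: never shared across tenants/users
    model_id: str           # different weights => different KV values
    tokenizer_version: str  # token boundaries must match exactly
    prefix_digest: str      # digest of the exact token ids, not the raw text

def make_prefix_key(tenant_id: str, model_id: str,
                    tokenizer_version: str, token_ids: list[int]) -> PrefixCacheKey:
    # Hash the token ids themselves, so "same-looking" text with different
    # tokenization (or hidden personalization tokens) can never collide.
    digest = hashlib.sha256(repr(token_ids).encode("utf-8")).hexdigest()
    return PrefixCacheKey(tenant_id, model_id, tokenizer_version, digest)

# Lookup becomes a plain dict access; a key built for user A can never match an
# entry written for user B, even if their prompts render as identical text.
cache: dict[PrefixCacheKey, object] = {}
```

Whether identical prompts from different tenants may ever share an entry is a policy decision; the conservative default sketched here never shares across the tenant_id boundary.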
💡 Key Takeaways
Out-of-memory during decode is the most common failure; it occurs when the scheduler admits too many sequences or underestimates output lengths, and recovery requires aborting requests mid-generation
Memory bandwidth bottlenecks cause GPU compute to idle waiting for large KV tensor fetches; SM utilization drops to 30 to 40 percent despite high request volume
Fragmentation from contiguous KV buffers wastes 50 percent or more of memory under mixed workloads; paged allocation with fixed blocks reduces waste to under 4 percent (see the block-table sketch after this list)
Tail latency spikes occur when a long prefill monopolizes kernels, inflating inter-token latency for other users' decode requests; chunked prefill interleaves the work to maintain fairness
Cache eviction based on attention scores can remove tokens that become important later, causing off-topic replies or contradictions in long conversations (a sink-plus-recent-window policy is sketched after this list)
Prefix-cache mixing across users due to tokenization errors or personalization bugs is a critical security incident requiring strict namespace isolation
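Since the takeaways mention paged allocation, here is a minimal sketch of the logical-to-physical block mapping and block recycling it requires; the class name and the 16-token block size are illustrative assumptions, not a specific library's API.

```python
class PagedKVAllocator:
    """Maps each sequence's logical KV blocks to physical blocks in a shared pool."""

    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size                          # tokens per block
        self.free_blocks = list(range(num_physical_blocks))   # simple free list
        self.block_tables: dict[int, list[int]] = {}          # seq_id -> physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted; scheduler must stop admitting")
            table.append(self.free_blocks.pop())

    def free_sequence(self, seq_id: int) -> None:
        # Recycling: a finished sequence returns whole blocks to the pool, so the
        # only internal waste is the unused tail of its last block (< block_size tokens).
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

With small fixed blocks, that per-sequence tail is the only internal waste, which is where figures like "under 4 percent" come from for reasonably long sequences.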
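The eviction takeaway also lends itself to a short sketch: a policy that always retains the initial sink tokens and a recent window, leaving only the middle of the sequence eligible for eviction. The sink count and window size below are assumed, workload-dependent values.

```python
def tokens_to_keep(seq_len: int, num_sink: int = 4, recent_window: int = 1024) -> list[int]:
    """Return the token positions to retain when the KV cache must shrink.

    Keeps the first `num_sink` positions (attention sinks) plus the most recent
    `recent_window` positions; everything in between may be evicted."""
    if seq_len <= num_sink + recent_window:
        return list(range(seq_len))              # nothing needs to be evicted yet
    sinks = list(range(num_sink))
    recent = list(range(seq_len - recent_window, seq_len))
    return sinks + recent
```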
📌 Examples
OOM failure: Scheduler admits 80 sequences averaging 700 prompt tokens and predicts 200-token outputs. Actual outputs average 400 tokens; KV grows from 40 GB to 56 GB against a 50 GB budget and allocation crashes mid-generation (see the admission-control sketch below)
Memory bandwidth: A batch of 64 sequences with 1000-token histories fetches 64 × 1000 × 0.5 MB = 32 GB per decode step. At 2 TB/s bandwidth that takes 16 ms for memory alone, so compute idles
Tail latency: An 8000-token prefill takes 3 seconds with static batching; the 40 decode requests in the same batch stall for 3 seconds before their next token; chunked prefill keeps the stall under 100 ms (see the chunked-prefill sketch below)
Cache mixing: A system prompt "You are a helpful assistant for [USER_ID]" is tokenized with the ID embedded; the prefix-cache entry is reused across users and exposes user A's ID to user B
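To connect the OOM example to the admission-control fix described above, here is a hedged sketch; the 0.5 MB-per-token KV size, the 50 GB budget, and the 90 percent safety margin are illustrative assumptions, not measured values.

```python
KV_BYTES_PER_TOKEN = 512 * 1024    # assumed ~0.5 MB of KV per token; model dependent
KV_BUDGET_BYTES = 50 * 2**30       # e.g. 50 GB of device memory reserved for KV
SAFETY_MARGIN = 0.9                # admit only up to 90% of the budget

def projected_kv_bytes(active: list[tuple[int, int]]) -> int:
    """active = [(prompt_tokens, predicted_output_tokens), ...] for admitted sequences."""
    return sum((p + o) * KV_BYTES_PER_TOKEN for p, o in active)

def can_admit(active: list[tuple[int, int]],
              prompt_tokens: int, predicted_output_tokens: int) -> bool:
    # Project KV occupancy as if this request were admitted and all length
    # predictions held; reject rather than risk crashing running sequences.
    projected = projected_kv_bytes(active + [(prompt_tokens, predicted_output_tokens)])
    return projected <= SAFETY_MARGIN * KV_BUDGET_BYTES
```

The design point is that admission is based on projected occupancy (prompt plus predicted output tokens), and the safety margin together with re-projection as sequences grow is what absorbs underprediction like the 200-versus-400-token miss in the example.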
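For the tail-latency example, a minimal sketch of the chunked-prefill idea: the scheduler spends a bounded prefill token budget per step and keeps every decode sequence advancing. The 512-token budget is an assumed value.

```python
PREFILL_TOKEN_BUDGET = 512  # max prefill tokens processed per scheduler step (assumed)

def schedule_steps(prompt_tokens: int, decode_seqs: int) -> list[tuple[int, int]]:
    """Simulate interleaving one long prefill with ongoing decodes.

    Returns one (prefill_tokens, decode_tokens) pair per scheduler step."""
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, PREFILL_TOKEN_BUDGET)
        # Each step also advances every active decode sequence by one token, so
        # inter-token latency is bounded by one step rather than the whole prefill.
        steps.append((chunk, decode_seqs))
        remaining -= chunk
    return steps

# schedule_steps(8000, 40) -> 16 steps (15 chunks of 512 tokens plus a final
# 320-token chunk), each of which also advances all 40 decode requests.
```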