
What Are Common Failure Modes in Production LLM Serving?

Production LLM serving faces several critical failure modes that cause user-visible errors or silent quality degradation.

The most common is out-of-memory during decode, when the KV cache grows beyond device capacity. If the scheduler admits too many sequences or underestimates output lengths, allocation fails mid-generation. Recovery typically requires aborting requests, which creates noisy retries and a poor user experience. The fix is conservative admission control that tracks KV occupancy and predicted lengths, erring on the side of rejecting new requests rather than crashing existing ones.

Memory bandwidth bottlenecks cause decode steps to stall even when compute is available. Each decode iteration must fetch large KV tensors per layer from memory. On modern GPUs, streaming multiprocessor (SM) utilization can drop to 30 to 40 percent because compute units idle while waiting for memory, and this worsens with large batches of long sequences. The symptom is low GPU utilization despite high request volume. Mitigations include KV quantization to reduce bytes transferred and paged attention to improve memory access patterns.

Fragmentation and stranded memory occur when naive contiguous KV buffers leave unusable gaps between sequences of different lengths. Without paged allocation, memory waste can exceed 50 percent under mixed workloads, reducing effective concurrency by half. Paged KV with fixed block sizes solves this by allowing non-contiguous allocation, but it requires careful implementation of the logical-to-physical mapping and block recycling.

Tail latency blowups happen when a single long prefill monopolizes GPU kernels, causing inter-token latency spikes for dozens of decode jobs. Static batching exacerbates this because all requests wait for the slowest one. Chunked prefill and continuous batching mitigate the issue but can increase TTFT for the long prompt itself. The scheduler must balance fairness across requests with throughput goals.

Cache eviction can harm coherence in subtle ways. Evicting tokens based on accumulated attention scores seems reasonable but can remove tokens that become important later in the conversation; users report off-topic replies or contradictions in long sessions. Retaining sink tokens from the beginning plus a recent window reduces this risk, but the right policy is workload dependent.

Prefix caching introduces a dangerous failure mode: cache mixing across users, caused by incorrect tokenization boundaries or hidden personalization tokens, is a security incident that exposes one user's context to another. Strict namespace isolation and validation are mandatory.
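As an illustration of that last point, here is a minimal sketch (in Python) of namespaced prefix-cache keys. The field names such as tenant_id and tokenizer_version are hypothetical, not taken from any particular serving stack; the point is that the cache key binds the exact token ids to every attribute that could make reuse unsafe.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PrefixCacheKey:
    """Binds a cached KV prefix to everything that makes reuse safe."""
    tenant_id: str          # hard isolation boundary: never shared across tenants/users
    model_id: str           # different weights => different KV values
    tokenizer_version: str  # token boundaries must match exactly
    prefix_digest: str      # digest of the exact token ids, not the raw text

def make_prefix_key(tenant_id: str, model_id: str,
                    tokenizer_version: str, token_ids: list[int]) -> PrefixCacheKey:
    # Hash the token ids themselves, so "same-looking" text with different
    # tokenization (or hidden personalization tokens) can never collide.
    digest = hashlib.sha256(repr(token_ids).encode("utf-8")).hexdigest()
    return PrefixCacheKey(tenant_id, model_id, tokenizer_version, digest)

# Lookup becomes a plain dict access; a key built for user A can never match an
# entry written for user B, even if their prompts render as identical text.
cache: dict[PrefixCacheKey, object] = {}
```

Whether identical prompts from different tenants may ever share an entry is a policy decision; the conservative default sketched here never shares across the tenant_id boundary.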
💡 Key Takeaways
Out-of-memory during decode is the most common failure; it occurs when the scheduler admits too many sequences or underestimates output lengths, and recovery requires aborting requests mid-generation
Memory bandwidth bottlenecks cause GPU compute to idle waiting for large KV tensor fetches; SM utilization drops to 30 to 40 percent despite high request volume
Fragmentation from contiguous KV buffers wastes 50 percent or more of memory under mixed workloads; paged allocation with fixed blocks reduces waste to under 4 percent (see the block-table sketch after this list)
Tail latency spikes occur when a long prefill monopolizes kernels, inflating inter-token latency for other users' decode requests; chunked prefill interleaves the work to maintain fairness
Cache eviction based on attention scores can remove tokens that become important later, causing off-topic replies or contradictions in long conversations (a sink-plus-recent-window policy is sketched after this list)
Prefix-cache mixing across users due to tokenization errors or personalization bugs is a critical security incident requiring strict namespace isolation
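Since the takeaways mention paged allocation, here is a minimal sketch of the logical-to-physical block mapping and block recycling it requires; the class name and the 16-token block size are illustrative assumptions, not a specific library's API.

```python
class PagedKVAllocator:
    """Maps each sequence's logical KV blocks to physical blocks in a shared pool."""

    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size                          # tokens per block
        self.free_blocks = list(range(num_physical_blocks))   # simple free list
        self.block_tables: dict[int, list[int]] = {}          # seq_id -> physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new physical block only when the previous one is full.
        if num_tokens_so_far % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted; scheduler must stop admitting")
            table.append(self.free_blocks.pop())

    def free_sequence(self, seq_id: int) -> None:
        # Recycling: a finished sequence returns whole blocks to the pool, so the
        # only internal waste is the unused tail of its last block (< block_size tokens).
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

With small fixed blocks, that per-sequence tail is the only internal waste, which is where figures like "under 4 percent" come from for reasonably long sequences.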
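The eviction takeaway also lends itself to a short sketch: a policy that always retains the initial sink tokens and a recent window, leaving only the middle of the sequence eligible for eviction. The sink count and window size below are assumed, workload-dependent values.

```python
def tokens_to_keep(seq_len: int, num_sink: int = 4, recent_window: int = 1024) -> list[int]:
    """Return the token positions to retain when the KV cache must shrink.

    Keeps the first `num_sink` positions (attention sinks) plus the most recent
    `recent_window` positions; everything in between may be evicted."""
    if seq_len <= num_sink + recent_window:
        return list(range(seq_len))              # nothing needs to be evicted yet
    sinks = list(range(num_sink))
    recent = list(range(seq_len - recent_window, seq_len))
    return sinks + recent
```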
📌 Examples
OOM failure: Scheduler admits 80 sequences averaging 700 prompt tokens and predicts 200-token outputs. Actual outputs average 400 tokens; KV grows from 40 GB to 56 GB against a 50 GB budget and allocation crashes mid-generation (see the admission-control sketch below)
Memory bandwidth: A batch of 64 sequences with 1000-token histories fetches 64 × 1000 × 0.5 MB = 32 GB per decode step. At 2 TB/s bandwidth that takes 16 ms for memory alone, so compute idles
Tail latency: An 8000-token prefill takes 3 seconds with static batching; the 40 decode requests in the same batch stall for 3 seconds before their next token; chunked prefill keeps the stall under 100 ms (see the chunked-prefill sketch below)
Cache mixing: A system prompt "You are a helpful assistant for [USER_ID]" is tokenized with the ID embedded; the prefix-cache entry is reused across users and exposes user A's ID to user B
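To connect the OOM example to the admission-control fix described above, here is a hedged sketch; the 0.5 MB-per-token KV size, the 50 GB budget, and the 90 percent safety margin are illustrative assumptions, not measured values.

```python
KV_BYTES_PER_TOKEN = 512 * 1024    # assumed ~0.5 MB of KV per token; model dependent
KV_BUDGET_BYTES = 50 * 2**30       # e.g. 50 GB of device memory reserved for KV
SAFETY_MARGIN = 0.9                # admit only up to 90% of the budget

def projected_kv_bytes(active: list[tuple[int, int]]) -> int:
    """active = [(prompt_tokens, predicted_output_tokens), ...] for admitted sequences."""
    return sum((p + o) * KV_BYTES_PER_TOKEN for p, o in active)

def can_admit(active: list[tuple[int, int]],
              prompt_tokens: int, predicted_output_tokens: int) -> bool:
    # Project KV occupancy as if this request were admitted and all length
    # predictions held; reject rather than risk crashing running sequences.
    projected = projected_kv_bytes(active + [(prompt_tokens, predicted_output_tokens)])
    return projected <= SAFETY_MARGIN * KV_BUDGET_BYTES
```

The design point is that admission is based on projected occupancy (prompt plus predicted output tokens), and the safety margin together with re-projection as sequences grow is what absorbs underprediction like the 200-versus-400-token miss in the example.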
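For the tail-latency example, a minimal sketch of the chunked-prefill idea: the scheduler spends a bounded prefill token budget per step and keeps every decode sequence advancing. The 512-token budget is an assumed value.

```python
PREFILL_TOKEN_BUDGET = 512  # max prefill tokens processed per scheduler step (assumed)

def schedule_steps(prompt_tokens: int, decode_seqs: int) -> list[tuple[int, int]]:
    """Simulate interleaving one long prefill with ongoing decodes.

    Returns one (prefill_tokens, decode_tokens) pair per scheduler step."""
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, PREFILL_TOKEN_BUDGET)
        # Each step also advances every active decode sequence by one token, so
        # inter-token latency is bounded by one step rather than the whole prefill.
        steps.append((chunk, decode_seqs))
        remaining -= chunk
    return steps

# schedule_steps(8000, 40) -> 16 steps (15 chunks of 512 tokens plus a final
# 320-token chunk), each of which also advances all 40 decode requests.
```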