Natural Language Processing Systems › LLM Serving (KV-cache, Continuous Batching, Speculative Decoding) · Hard · ⏱️ ~3 min

What Are Common Failure Modes in Production LLM Serving?

OUT-OF-MEMORY (OOM) FAILURES

The most common failure in LLM serving. The KV cache grows linearly with sequence length, so a burst of long-context requests can exhaust GPU memory and crash the serving process.
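To see why long contexts are dangerous, it helps to estimate KV-cache size per token: two tensors (K and V) per layer, one head_dim vector per head. A sketch with illustrative 7B-class model parameters (32 layers, 32 KV heads, head_dim 128, fp16); the function name and defaults are assumptions, not any particular framework's API:

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache footprint: K and V tensors per layer,
    one head_dim vector per KV head per token (fp16 = 2 bytes)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

# With these assumed dimensions, one token costs 512 KiB of cache,
# so a single 4096-token request reserves 2 GiB on its own.
print(kv_cache_bytes(4096) / 2**30)  # 2.0 (GiB)
```

At roughly half a megabyte per token, a handful of concurrent long-context requests can consume tens of gigabytes, which is why a burst of them can OOM a GPU that comfortably serves short prompts.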

Prevention: Set hard limits on max sequence length. Reserve memory headroom (20-30%). Implement request queuing with admission control. Use paged attention to reduce fragmentation.
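The sequence-length limit, headroom reservation, and admission control can be combined into one gate. A minimal sketch (class name, per-token cost, and thresholds are all hypothetical, chosen to match the fp16 7B-class estimate above):

```python
class AdmissionController:
    """Admit a request only if its worst-case KV-cache footprint fits
    within the memory budget, keeping a headroom fraction in reserve."""

    def __init__(self, total_bytes, headroom=0.25, max_seq_len=4096,
                 bytes_per_token=512 * 1024):
        self.budget = int(total_bytes * (1 - headroom))  # usable after headroom
        self.max_seq_len = max_seq_len
        self.bytes_per_token = bytes_per_token
        self.reserved = 0  # bytes currently promised to in-flight requests

    def try_admit(self, prompt_len, max_new_tokens):
        total_len = prompt_len + max_new_tokens
        if total_len > self.max_seq_len:
            return False  # hard limit on sequence length
        need = total_len * self.bytes_per_token
        if self.reserved + need > self.budget:
            return False  # queue or reject instead of risking OOM
        self.reserved += need
        return True

    def release(self, prompt_len, max_new_tokens):
        self.reserved -= (prompt_len + max_new_tokens) * self.bytes_per_token
```

Reserving the worst case (prompt plus full max_new_tokens) is conservative; systems with paged attention can admit more optimistically because pages are allocated on demand and requests can be preempted.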

Recovery: Auto-restart crashed processes. Implement circuit breakers that reject requests when memory is critically low.
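A memory-based circuit breaker can be sketched as a small state machine. The thresholds and hysteresis gap below are illustrative, not prescriptive:

```python
class MemoryCircuitBreaker:
    """Open (reject requests) when free GPU memory falls below a
    critical threshold; close only once memory recovers past a higher
    threshold, so the breaker does not flap at the boundary."""

    def __init__(self, open_below=0.10, close_above=0.25):
        self.open_below = open_below    # open when <10% of memory is free
        self.close_above = close_above  # close only when >25% is free
        self.is_open = False

    def allow(self, free_fraction):
        if self.is_open:
            if free_fraction > self.close_above:
                self.is_open = False  # memory recovered; resume serving
        elif free_fraction < self.open_below:
            self.is_open = True       # critically low; start rejecting
        return not self.is_open
```

The gap between the open and close thresholds (hysteresis) matters: without it, the breaker would oscillate as rejected requests free memory and new ones immediately consume it again.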

LATENCY SPIKES

Sudden latency increases caused by GC pauses, memory swapping to disk, batch-size fluctuations, or model loading. Users hit timeouts or see degraded responses.

Detection: Monitor p99 latency continuously. Alert on deviations from baseline. Track GPU memory utilization and swap activity.
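Continuous p99 monitoring can be as simple as a sliding window with a baseline comparison. A sketch; the window size, alert factor, and class name are assumptions:

```python
from collections import deque

class P99Monitor:
    """Track p99 latency over a sliding window of recent requests and
    flag deviations from a fixed baseline."""

    def __init__(self, baseline_ms, window=1000, alert_factor=2.0):
        self.samples = deque(maxlen=window)  # oldest samples fall off
        self.baseline_ms = baseline_ms
        self.alert_factor = alert_factor

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p99(self):
        ordered = sorted(self.samples)
        return ordered[int(0.99 * (len(ordered) - 1))]

    def should_alert(self):
        return self.p99() > self.alert_factor * self.baseline_ms
```

Sorting the window on every check is fine at this scale; production systems typically use streaming quantile sketches (e.g. t-digest) instead.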

Mitigation: Use dedicated GPU memory pools. Disable swap on serving nodes. Pre-warm model weights on startup. Implement request timeouts with graceful degradation.
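Request timeouts with graceful degradation can be expressed as a thin wrapper around the generation call. A sketch assuming an async serving stack; `gen_coro` and the fallback are placeholders:

```python
import asyncio

async def generate_with_timeout(gen_coro, timeout_s, fallback):
    """Bound how long a generation call may run; on timeout, return a
    fallback (e.g. a cached or truncated response) instead of letting
    the caller hang past its SLO."""
    try:
        return await asyncio.wait_for(gen_coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback
```

Note that `asyncio.wait_for` cancels the wrapped task on timeout; the serving layer still needs to release that request's KV-cache memory, or the timeout merely hides the leak.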

QUALITY DEGRADATION

Model outputs become worse without visible errors. Causes: quantization issues, temperature drift, prompt template bugs, or tokenizer mismatches.

Detection: Monitor output quality metrics (perplexity on test set, task-specific metrics). Track output length distribution. Human evaluation sampling.
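One cheap signal from the list above is the output-length distribution: a broken prompt template or tokenizer mismatch often shifts it sharply. A crude drift check, with an illustrative z-score threshold (function name and threshold are assumptions):

```python
import statistics

def length_drift(baseline_lens, recent_lens, z_threshold=3.0):
    """Flag when the mean recent output length deviates from the
    baseline mean by more than z_threshold baseline standard
    deviations. Crude, but catches gross regressions cheaply."""
    mu = statistics.mean(baseline_lens)
    sigma = statistics.stdev(baseline_lens) or 1.0  # avoid divide-by-zero
    return abs(statistics.mean(recent_lens) - mu) / sigma > z_threshold
```

This only catches coarse shifts; subtler regressions still require the perplexity, task-metric, and human-evaluation checks mentioned above.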

Prevention: Version everything (model weights, tokenizer, prompts). Test quality after each deployment. A/B test model changes.

CASCADING FAILURES

One slow request backs up the queue. Queue grows. Memory pressure increases. More requests fail. System becomes unresponsive.

Prevention: Implement load shedding—reject requests above capacity rather than queuing indefinitely. Set queue size limits. Prioritize requests based on SLO tier.
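Load shedding with SLO tiers can be sketched as a bounded priority queue that rejects, rather than enqueues, when full (class and method names are hypothetical; lower tier number means higher priority):

```python
import heapq

class SheddingQueue:
    """Bounded priority queue for load shedding: when the queue is
    full, reject the new request outright instead of letting the
    backlog (and memory pressure) grow without bound."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []
        self._seq = 0  # tie-breaker: FIFO order within a tier

    def offer(self, tier, request):
        if len(self._heap) >= self.max_size:
            return False  # shed load; caller returns 503 / retry-after
        heapq.heappush(self._heap, (tier, self._seq, request))
        self._seq += 1
        return True

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A refinement worth mentioning in interviews: when full, evict the lowest-priority queued request in favor of a higher-tier arrival, instead of rejecting unconditionally.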

✅ Best Practice: Build defense in depth: memory limits, queue limits, timeouts, circuit breakers, and auto-restart. Assume failures will happen; design for rapid recovery.
💡 Key Takeaways
OOM: most common failure; prevent with max sequence limits, 20-30% memory headroom, admission control
Latency spikes from GC, swap, batch fluctuation; monitor p99, disable swap, pre-warm models on startup
Cascading failures from queue backup; implement load shedding, queue limits, priority tiers
📌 Interview Tips
1. Explain the cascading-failure pattern: slow request → queue growth → memory pressure → system unresponsive.
2. Describe defense in depth: memory limits + queue limits + timeouts + circuit breakers + auto-restart.