What Are Common Failure Modes in Production LLM Serving?
OUT-OF-MEMORY (OOM) FAILURES
The most common failure in LLM serving. The KV cache grows linearly with sequence length and with batch size, so a burst of long-context requests can exhaust GPU memory and crash the serving process.
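A back-of-envelope calculation shows why long contexts are dangerous. The sketch below estimates per-request KV cache size from standard transformer dimensions; the 7B-class model numbers are illustrative assumptions, not measurements of any particular model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size in bytes for one sequence.

    The factor of 2 covers the separate K and V tensors;
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative 7B-class dimensions (assumed, not measured):
per_request = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                             head_dim=128, seq_len=32_000)
print(f"{per_request / 2**30:.1f} GiB per 32k-token request")  # ~15.6 GiB
```

At roughly 16 GiB per 32k-token request, a handful of concurrent long-context requests can consume an entire accelerator's memory on their own.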
Prevention: Set hard limits on max sequence length. Reserve memory headroom (20-30%). Implement request queuing with admission control. Use paged attention to reduce fragmentation.
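One way to enforce the headroom is admission control at the front of the queue: estimate a request's KV cache footprint before admitting it, and queue or reject it if admission would eat into the reserve. A minimal sketch using PyTorch's GPU memory query; the 25% headroom fraction is an assumed policy, and the estimated size would come from a helper like kv_cache_bytes above.

```python
import torch

HEADROOM_FRACTION = 0.25  # reserve 25% of total GPU memory (assumed policy)

def can_admit(estimated_kv_bytes: int, device: int = 0) -> bool:
    """Admit a request only if its KV cache fits above the headroom line."""
    free, total = torch.cuda.mem_get_info(device)
    reserve = int(total * HEADROOM_FRACTION)
    return free - estimated_kv_bytes > reserve
```

Requests that fail the check go to a bounded queue (or are rejected outright) rather than being scheduled into memory that is about to run out.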
Recovery: Auto-restart crashed processes. Implement circuit breakers that reject requests when memory is critically low.
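The circuit breaker can be as simple as a memory watermark checked per request: above the threshold, fail fast with a retryable error instead of letting the process hit an OOM. A sketch; the 0.95 watermark and the error message format are assumptions.

```python
import torch

class MemoryCircuitBreaker:
    """Reject new work when GPU memory utilization crosses a watermark."""

    def __init__(self, threshold: float = 0.95, device: int = 0):
        self.threshold = threshold
        self.device = device

    def check(self) -> None:
        free, total = torch.cuda.mem_get_info(self.device)
        if (total - free) / total >= self.threshold:
            # Fail fast with a retryable error rather than crashing on OOM;
            # the caller maps this to an HTTP 503 with a Retry-After header.
            raise RuntimeError("at memory capacity, retry later")
```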
LATENCY SPIKES
Sudden latency increases caused by GC pauses, memory swapping to disk, batch size fluctuations, or cold model loading. Users experience timeouts or a degraded experience.
Detection: Monitor p99 latency continuously. Alert on deviations from baseline. Track GPU memory utilization and swap activity.
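A lightweight way to track p99 continuously is a sliding window of recent request latencies with an alert when p99 exceeds a multiple of baseline. A sketch; the window size, the baseline value, and the 2x alert rule are all assumptions to be tuned from load testing.

```python
from collections import deque
import statistics

class LatencyMonitor:
    """Sliding-window p99 tracker with a simple baseline-deviation alert."""

    def __init__(self, window: int = 1000, baseline_p99_s: float = 1.5):
        self.samples = deque(maxlen=window)
        self.baseline = baseline_p99_s  # assumed baseline from load testing

    def record(self, latency_s: float) -> None:
        self.samples.append(latency_s)

    def p99(self) -> float:
        if len(self.samples) < 100:
            return 0.0  # not enough data for a meaningful p99 yet
        return statistics.quantiles(list(self.samples), n=100)[98]

    def should_alert(self) -> bool:
        # Alert when p99 exceeds 2x the established baseline (assumed rule).
        return self.p99() > 2 * self.baseline
```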
Mitigation: Use dedicated GPU memory pools. Disable swap on serving nodes. Pre-warm model weights on startup. Implement request timeouts with graceful degradation.
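Timeouts with graceful degradation are easiest to enforce at the request handler: cancel generation at the deadline and return whatever has streamed so far, flagged as incomplete, instead of hanging or returning a bare 504. A sketch using asyncio (Python 3.11+); generate_stream is a hypothetical stand-in for the model backend's streaming generator, and the 30 s budget is an assumption.

```python
import asyncio

async def generate_stream(prompt: str):
    """Placeholder for the model backend's token stream (hypothetical)."""
    for tok in ["Hello", ",", " world"]:
        await asyncio.sleep(0.01)
        yield tok

async def handle_request(prompt: str, timeout_s: float = 30.0) -> dict:
    """Run generation under a deadline; degrade gracefully on timeout."""
    tokens: list[str] = []
    try:
        async with asyncio.timeout(timeout_s):  # cancels generation at deadline
            async for tok in generate_stream(prompt):
                tokens.append(tok)
        return {"text": "".join(tokens), "complete": True}
    except TimeoutError:
        # Return the partial result rather than an opaque error.
        return {"text": "".join(tokens), "complete": False,
                "error": "deadline exceeded"}

# A tight deadline demonstrates the degraded-but-useful partial response:
print(asyncio.run(handle_request("hi", timeout_s=0.015)))
```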
QUALITY DEGRADATION
Model outputs become worse without visible errors. Causes: quantization issues, temperature drift, prompt template bugs, or tokenizer mismatches.
Detection: Monitor output quality metrics (perplexity on a held-out test set, task-specific metrics). Track the output length distribution. Sample outputs for human evaluation.
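Output length is a cheap canary: tokenizer and template bugs often shift the length distribution before any quality metric moves. A sketch comparing recent lengths against a known-good reference window with a two-sample Kolmogorov-Smirnov test; the window size and alpha threshold are assumptions.

```python
from collections import deque
from scipy.stats import ks_2samp

class LengthDriftDetector:
    """Flag drift in the output token-length distribution vs. a reference set."""

    def __init__(self, reference_lengths: list[int], window: int = 500):
        self.reference = reference_lengths  # lengths from a known-good period
        self.recent = deque(maxlen=window)

    def record(self, num_tokens: int) -> None:
        self.recent.append(num_tokens)

    def drifted(self, alpha: float = 0.01) -> bool:
        if len(self.recent) < self.recent.maxlen:
            return False  # wait for a full window before testing
        # Two-sample KS test; alpha=0.01 is an assumed significance threshold.
        return ks_2samp(self.reference, list(self.recent)).pvalue < alpha
```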
Prevention: Version everything (model weights, tokenizer, prompts). Test quality after each deployment. A/B test model changes.
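"Version everything" becomes concrete when each deployment writes a single manifest of content hashes that can be diffed across releases; a tokenizer or prompt change then shows up as a hash change even if the model weights are identical. A minimal sketch; the artifact paths are illustrative.

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: str) -> str:
    """Content hash of a deployment artifact."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Illustrative artifact paths; write one manifest per deployment.
manifest = {
    "model_weights": file_sha256("model.safetensors"),
    "tokenizer": file_sha256("tokenizer.json"),
    "prompt_template": file_sha256("prompts/chat_template.j2"),
}
Path("deploy_manifest.json").write_text(json.dumps(manifest, indent=2))
```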
CASCADING FAILURES
One slow request backs up the queue. Queue grows. Memory pressure increases. More requests fail. System becomes unresponsive.
Prevention: Implement load shedding: reject requests above capacity rather than queuing indefinitely. Set queue size limits. Prioritize requests by SLO tier, as in the sketch below.
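Load shedding falls out naturally from a bounded priority queue: enqueue by SLO tier, and when the queue is full, either evict the lowest-priority queued request or reject the new one immediately. A sketch; the queue size and tier semantics (tier 0 = premium) are assumptions.

```python
import heapq
import itertools

class SheddingQueue:
    """Bounded priority queue: lower tier = higher priority (tier 0 = premium)."""

    def __init__(self, max_size: int = 64):
        self.max_size = max_size
        self.heap: list = []
        self.counter = itertools.count()  # FIFO tie-break within a tier

    def submit(self, request, tier: int) -> bool:
        """Returns False if the request was shed (caller returns 429/503)."""
        if len(self.heap) >= self.max_size:
            # Full: evict the lowest-priority queued item only if the new
            # request outranks it; otherwise shed the new request.
            worst = max(self.heap)  # largest (tier, seq) = lowest priority
            if tier < worst[0]:
                self.heap.remove(worst)
                heapq.heapify(self.heap)
            else:
                return False
        heapq.heappush(self.heap, (tier, next(self.counter), request))
        return True

    def next_request(self):
        """Pop the highest-priority queued request, or None if empty."""
        return heapq.heappop(self.heap)[2] if self.heap else None
```

Because the queue is bounded, memory pressure from the queue itself is capped, and overload produces fast, explicit rejections of low-tier traffic instead of a silent, system-wide stall.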