Model Serving & Inference • Latency Optimization (Batching, Caching, Quantization) • Hard • ⏱️ ~3 min
What Are the Critical Failure Modes in Production Inference Optimization?
Production inference systems fail in predictable ways when optimization strategies interact badly with traffic patterns, resource limits, or model characteristics. Understanding these failure modes is essential for designing robust serving infrastructure that degrades gracefully rather than catastrophically.
Memory exhaustion from the KV cache is the most common catastrophic failure. When concurrent requests with long contexts exceed device memory, the system either crashes with an out-of-memory (OOM) error or triggers emergency eviction that destroys in-progress sessions. A 7B model with 14 GB of weights and 0.5 MB of KV cache per token can support only 20 concurrent 1,000-token sessions on a 24 GB Graphics Processing Unit (GPU) before hitting the limit. Traffic bursts that arrive faster than requests complete cause memory to climb until failure. Without admission control, this manifests as a sudden service outage rather than graceful degradation. Production systems must monitor memory utilization and start rejecting or queuing requests when utilization exceeds 80%, preserving capacity for in-progress sessions.
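A minimal sketch of that admission-control policy is shown below: it reserves a worst-case KV cache budget per request and refuses admission once projected utilization crosses the 80% threshold, so in-progress sessions keep their memory. The class and method names are hypothetical (not from any serving framework), and the constants use the figures quoted above (24 GB GPU, 14 GB of weights, 0.5 MB per token).

```python
# Hypothetical KV-cache-aware admission controller, using the figures from the text.
GPU_MEMORY_GB = 24.0
WEIGHT_MEMORY_GB = 14.0
KV_BYTES_PER_TOKEN = 0.5 * 1024 * 1024   # ~0.5 MB of KV cache per token
ADMISSION_THRESHOLD = 0.80               # queue or reject above 80% utilization


class AdmissionController:
    def __init__(self):
        self.reserved_bytes = 0.0
        # Memory left for KV cache after loading the model weights.
        self.budget_bytes = (GPU_MEMORY_GB - WEIGHT_MEMORY_GB) * 1024**3

    def try_admit(self, prompt_tokens: int, max_new_tokens: int) -> bool:
        """Admit a request only if its worst-case KV cache fits under the threshold."""
        worst_case = (prompt_tokens + max_new_tokens) * KV_BYTES_PER_TOKEN
        utilization = (self.reserved_bytes + worst_case) / self.budget_bytes
        if utilization > ADMISSION_THRESHOLD:
            return False  # caller should queue or reject; in-progress sessions keep their memory
        self.reserved_bytes += worst_case
        return True

    def release(self, prompt_tokens: int, max_new_tokens: int) -> None:
        """Return a finished request's worst-case reservation to the pool."""
        self.reserved_bytes -= (prompt_tokens + max_new_tokens) * KV_BYTES_PER_TOKEN
```

Reserving the worst case is deliberately conservative; engines with paged KV memory can reclaim unused headroom, but the admission decision at 80% is the part that converts an OOM crash into graceful queuing.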
Batching introduces head-of-line blocking and tail latency explosions when not carefully managed. Static batching forces all requests in a batch to wait for the longest sequence to complete: if one request generates 2,000 tokens while the others need only 100, the short requests experience 20× longer latency than necessary. Bursty arrival patterns interact badly with fixed micro-batching windows: if 50 requests arrive simultaneously during a normally quiet period, the batching window fills instantly and subsequent arrivals must wait for the entire batch to finish before starting, creating a latency spike that persists for seconds. Continuous batching mitigates this but requires careful tuning of the maximum batch size and per-request token budgets to prevent one runaway generation from degrading all others.
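As a rough illustration of the continuous-batching idea, the sketch below admits queued requests between decode steps and retires or caps each request independently under a per-request token budget, rather than holding the whole batch until the longest sequence finishes. The `Request` structure, `model_step` placeholder, and budget values are hypothetical; production engines such as vLLM add paged KV cache management and far more sophisticated scheduling.

```python
# Minimal sketch of a continuous-batching decode loop with per-request token budgets.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH_SIZE = 32
PER_REQUEST_TOKEN_BUDGET = 1024  # cap runaway generations


@dataclass
class Request:
    prompt: str
    generated: list = field(default_factory=list)
    done: bool = False


def model_step(batch):
    """Placeholder: one decode step for every active request in the batch."""
    for req in batch:
        req.generated.append("<tok>")
        req.done = len(req.generated) >= 8  # pretend these are short completions


def serve(waiting: deque):
    active = []
    while waiting or active:
        # Admit new requests between decode steps instead of waiting for the
        # whole batch to drain (this is what removes head-of-line blocking).
        while waiting and len(active) < MAX_BATCH_SIZE:
            active.append(waiting.popleft())

        model_step(active)

        # Retire finished requests and cap runaway generations immediately,
        # freeing their slots for queued requests on the next iteration.
        active = [
            r for r in active
            if not r.done and len(r.generated) < PER_REQUEST_TOKEN_BUDGET
        ]
```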
Quantization failures manifest as silent quality degradation that is difficult to detect in monitoring. Outlier activations in certain layers cause large errors when quantized aggressively. These errors accumulate over long sequences, producing late-token quality problems: a response might start coherently but degrade into repetition or nonsense after 1,000 tokens. Quantizing the KV cache to INT8 or lower can cause attention distribution drift, where the model attends to slightly wrong tokens and produces subtly incorrect continuations. This is particularly insidious because aggregate metrics like perplexity may show only small changes while specific reasoning tasks fail significantly.
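Because aggregate metrics hide this drift, one practical check is to score the same held-out sequences with both the full-precision and quantized model and compare per-token log-probabilities bucketed by position: late-token degradation shows up as a widening gap in the later buckets even when overall perplexity barely moves. The sketch below assumes you can obtain such per-token scores; the function names are illustrative, not part of any specific toolkit.

```python
# Position-bucketed regression check for quantization quality drift (sketch).
import math


def late_token_drift(fp16_logprobs, quant_logprobs, bucket_size=500):
    """Average per-token log-prob gap (FP16 minus quantized), bucketed by position.

    Aggregate perplexity can look flat while the last buckets degrade, so the
    bucket breakdown is what surfaces late-token failures (e.g. past 1,000 tokens).
    """
    buckets = {}
    for pos, (fp16_lp, quant_lp) in enumerate(zip(fp16_logprobs, quant_logprobs)):
        buckets.setdefault(pos // bucket_size, []).append(fp16_lp - quant_lp)
    return {
        f"tokens {b * bucket_size}-{(b + 1) * bucket_size}": sum(gaps) / len(gaps)
        for b, gaps in sorted(buckets.items())
    }


def overall_perplexity_gap(fp16_logprobs, quant_logprobs):
    """The aggregate metric that can hide the late-token drift measured above."""
    return math.exp(-sum(quant_logprobs) / len(quant_logprobs)) - \
           math.exp(-sum(fp16_logprobs) / len(fp16_logprobs))
```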
Prefix caching and response caching introduce correctness risks. Incorrect cache-key construction leads to serving stale or wrong results: a prompt that differs only in a parameter value might match a cached prefix and return an answer for the wrong parameter. Model updates invalidate cached responses, but if invalidation fails, users receive outputs from the old model. Personalization makes caching difficult because user-specific context drives hit rates toward zero unless it is carefully abstracted, yet over-aggressive abstraction risks leaking information across users.
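A minimal sketch of safer cache-key construction follows, assuming the serving model's tokenizer is available: hash token IDs rather than raw text so whitespace variants do not alias, include every sampling parameter, tag the model version so updates invalidate old entries, and scope by user where personalization applies. The `tokenize` placeholder and function names are illustrative, not from any particular caching library.

```python
# Hypothetical cache-key construction for a response/prefix cache.
import hashlib
import json


def tokenize(prompt: str) -> list[str]:
    """Whitespace-splitting placeholder; use the serving model's real tokenizer."""
    return prompt.split()


def cache_key(prompt: str, params: dict, model_version: str,
              user_id: str | None = None) -> str:
    payload = {
        "tokens": tokenize(prompt),              # normalizes whitespace/formatting differences
        "params": dict(sorted(params.items())),  # temperature, top_p, max_tokens, ...
        "model": model_version,                  # version tag invalidates entries on model updates
        "user": user_id,                         # per-user scope avoids cross-user leakage
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Version tagging trades hit rate for correctness, which matches the takeaway below: every model rollout effectively empties the cache, but it guarantees users never receive outputs generated by the old model.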
💡 Key Takeaways
• Memory exhaustion from KV cache growth causes out-of-memory (OOM) crashes during traffic bursts; admission control at 80% memory utilization prevents catastrophic failure by queuing or rejecting new requests
• Head-of-line blocking in static batching makes short requests wait 10× to 20× longer when batched with long sequences; continuous batching and per-request token budgets mitigate this tail latency explosion
• Quantization quality drift from outlier activations accumulates over long contexts, causing late-token failures after 1,000+ tokens that are difficult to detect with aggregate metrics like perplexity
• KV cache quantization to INT8 can cause attention distribution drift, where the model attends to slightly wrong tokens, producing subtly incorrect reasoning that passes surface-level quality checks
• Prefix caching with incorrect key construction serves wrong cached results when prompts differ only in parameters or whitespace; strict normalization and cache versioning are critical for correctness
• Response cache invalidation failures after model updates serve stale outputs from the old model; Time To Live (TTL) expiration and version tagging in cache keys prevent this but reduce hit rates
📌 Examples
Amazon observed tail latency spikes during Black Friday traffic when static batching caused 50 ms requests to wait behind 2,000 ms requests; migrating to continuous batching reduced p99 latency by 8×
Meta's internal testing found that INT8 KV cache quantization worked well for contexts under 2,000 tokens but caused a 5% to 10% quality drop beyond 6,000 tokens, requiring a fallback to FP16 for long-document tasks
Netflix's prefix cache implementation initially matched prompts without normalizing whitespace, causing incorrect responses when users added extra spaces; strict tokenization-based matching fixed correctness but reduced the hit rate by 15%
Google's serving systems use admission control that starts rejecting requests when GPU memory utilization exceeds 80%, preventing OOM crashes during traffic bursts and maintaining service for in-progress sessions