Production Failure Modes: Tail Latency, Memory Exhaustion, and Training Serving Skew
Tail Latency Blowups
The most common production failure is a tail latency blowup caused by dynamic batching under spiky traffic: p50 latency looks great at 15 milliseconds, GPU utilization hovers at 40%, yet p95 latency violates SLOs at 200 milliseconds. The blowup happens when the batch formation window waits for requests that arrive slowly during traffic valleys: requests sit in the queue burning latency budget before any computation starts. The fix is counterintuitive: shorten the batch window timeout, or disable batching entirely during low QPS periods, accepting lower device utilization to meet tail latency commitments.
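One way to implement this mitigation is a QPS-aware batching policy. The sketch below is illustrative, not from any particular serving framework; the class name, thresholds, and window sizes are all assumptions. It tracks recent arrivals and returns a zero batch-formation timeout whenever the measured request rate drops below a threshold, so requests during traffic valleys are dispatched immediately instead of waiting for batchmates that never come.

```python
from collections import deque


class AdaptiveBatchWindow:
    """Hypothetical policy: pick a batch-formation timeout from recent QPS.

    At low request rates, waiting for more requests only burns latency
    budget, so the batch window collapses to zero.
    """

    def __init__(self, low_qps_threshold=50.0, window_ms=5.0, horizon_s=10.0):
        self.low_qps_threshold = low_qps_threshold  # below this, skip batching
        self.window_ms = window_ms                  # batch wait at high QPS
        self.horizon_s = horizon_s                  # QPS measurement horizon
        self.arrivals = deque()                     # arrival timestamps (seconds)

    def record_arrival(self, now_s):
        """Log a request arrival and evict timestamps outside the horizon."""
        self.arrivals.append(now_s)
        cutoff = now_s - self.horizon_s
        while self.arrivals and self.arrivals[0] < cutoff:
            self.arrivals.popleft()

    def batch_timeout_ms(self):
        """Zero timeout (no batching wait) during low-QPS valleys."""
        qps = len(self.arrivals) / self.horizon_s
        return 0.0 if qps < self.low_qps_threshold else self.window_ms


policy = AdaptiveBatchWindow()
for i in range(20):                 # 2 QPS: a traffic valley
    policy.record_arrival(i * 0.5)
valley_timeout = policy.batch_timeout_ms()   # 0.0 -> dispatch immediately
for i in range(1000):               # 100 QPS: a traffic spike
    policy.record_arrival(10.0 + i * 0.01)
spike_timeout = policy.batch_timeout_ms()    # 5.0 -> batching pays off
```

The design choice worth noting: the policy trades utilization for tail latency only when utilization was already low, which is exactly the regime the text describes.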
Memory Exhaustion
Device memory exhaustion crashes services silently or causes unpredictable evictions. A model that fits comfortably in GPU memory during development can exceed capacity in production when batching, concurrency, and multiple model versions combine. For example, a 2 GB model with batch size 32, activation memory of 4 GB per batch, and concurrency 2 needs at least 2 GB plus 2 times 4 GB, or 10 GB, exceeding many GPU budgets. Teams enforce per model memory budgets at deploy time: model memory plus per batch activation footprint multiplied by concurrency must stay under device capacity with 20% headroom. Violating this budget causes OOM errors mid request, returning cryptic failures to clients.
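The deploy-time budget check above is simple enough to express directly. This is a minimal sketch, assuming the memory figures are known per model at deploy time; the function name and the 16 GB device in the usage example are hypothetical.

```python
def fits_memory_budget(model_gb, activation_gb_per_batch, concurrency,
                       device_gb, headroom=0.20):
    """Deploy-time check: model weights plus per-batch activations for
    every concurrent executor must fit under capacity minus headroom.

    Returns (required_gb, budget_gb, fits).
    """
    required_gb = model_gb + concurrency * activation_gb_per_batch
    budget_gb = device_gb * (1.0 - headroom)
    return required_gb, budget_gb, required_gb <= budget_gb


# The worked example from the text: 2 GB model, 4 GB activations per
# batch, concurrency 2 -> 10 GB required.
req, budget, ok = fits_memory_budget(2.0, 4.0, 2, device_gb=16.0)
# req == 10.0; on a 16 GB device the 20% headroom leaves 12.8 GB, so it fits.
# On an 11 GB device the budget is 8.8 GB and the same model is rejected.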
Training Serving Skew
Training serving skew creates silent accuracy degradation that only appears in production. Models trained on batch computed features but served with real time features can suffer 10% to 20% accuracy drops. A ranking model trained on user embeddings computed daily but served with embeddings computed on demand per request will see distribution shift if the computation differs even slightly: different aggregation windows, missing features, or ordering changes.
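To make the aggregation-window failure mode concrete, here is a minimal sketch; the feature name, event log, and numbers are all hypothetical. The same event stream yields different feature values when the offline batch pipeline aggregates over a 7 day window while the request-time path aggregates over 1 day, which is exactly the silent mismatch the model never saw during training.

```python
from datetime import datetime, timedelta


def mean_purchase_value(events, now, window_days):
    """Aggregate a user feature over a trailing window.

    `events` is a list of (timestamp, value) pairs; the window length is
    the only knob that differs between the two pipelines.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [value for ts, value in events if ts >= cutoff]
    return sum(recent) / len(recent) if recent else 0.0


now = datetime(2024, 1, 8)
events = [
    (now - timedelta(hours=12), 10.0),  # recent large purchase
    (now - timedelta(days=3), 1.0),
    (now - timedelta(days=6), 1.0),
]
offline = mean_purchase_value(events, now, window_days=7)  # batch pipeline: 4.0
online = mean_purchase_value(events, now, window_days=1)   # serving path: 10.0
```

Neither value is wrong in isolation; the skew exists only because training and serving disagree, which is why it never shows up in offline evaluation.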
Mitigation Strategy
Combat training serving skew with feature store abstractions that guarantee identical computation offline and online, and automated validation that compares training feature distributions against serving feature distributions in shadow mode before rollout.
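One common way to implement the shadow-mode distribution comparison is a distance metric such as the population stability index (PSI); the sketch below is illustrative, and the bin count and the widely used 0.2 alert threshold are assumptions, not prescribed by the text.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between training (`expected`) and
    serving (`actual`) samples of one feature. 0 means identical binned
    distributions; values above ~0.2 are a common rollout-blocking alert.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def bin_frac(samples, left, right, is_last):
        # Fraction of samples in [left, right); last bin includes `right`.
        n = sum(1 for x in samples
                if left <= x < right or (is_last and x == right))
        return max(n / len(samples), 1e-6)  # floor avoids log(0)

    total = 0.0
    for i in range(bins):
        e = bin_frac(expected, edges[i], edges[i + 1], i == bins - 1)
        a = bin_frac(actual, edges[i], edges[i + 1], i == bins - 1)
        total += (a - e) * math.log(a / e)
    return total


train_sample = [float(x) for x in range(100)]
shifted_sample = [x + 50.0 for x in train_sample]
psi(train_sample, train_sample)    # ~0: distributions match, safe to roll out
psi(train_sample, shifted_sample)  # large: block rollout, investigate skew
```

Run per feature over shadow traffic, this turns the "compare distributions before rollout" step into a single gating number per feature.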