Production Failure Modes: Tail Latency, Memory Exhaustion, and Training-Serving Skew
Model serving infrastructure fails in predictable but often counterintuitive ways. The most common production failure is a tail-latency blowup from dynamic batching under spiky traffic: your p50 latency looks great at 15 milliseconds and GPU utilization hovers at 40%, yet p95 latency violates Service Level Objectives (SLOs) at 200 milliseconds. This happens when batch-formation windows wait for requests that arrive slowly during traffic valleys: requests sit in the queue burning latency budget before any computation starts. The fix is counterintuitive: reduce batch-window timeouts or disable batching entirely during low queries-per-second (QPS) periods, accepting lower device utilization to meet tail-latency commitments.
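A minimal sketch of that tradeoff, assuming a hypothetical server whose batch-formation window can be tuned per traffic regime; the QPS threshold, latency budget, and 50% queueing split below are illustrative values, not settings from TensorFlow Serving, TorchServe, or Triton:

```python
# Illustrative sketch: pick a dynamic-batching queue delay based on recent QPS.
# The thresholds and budgets are hypothetical, not tuned values from the text.

def choose_batch_window_us(recent_qps: float,
                           p99_budget_ms: float = 100.0,
                           model_latency_ms: float = 15.0) -> int:
    """Return the max time (microseconds) a request may wait for batch formation."""
    if recent_qps < 100:
        # Traffic valley: batches rarely fill, so waiting only burns latency budget.
        # Run effectively unbatched and accept lower GPU utilization.
        return 0
    # Spend at most half of the remaining budget on queueing; the rest is headroom
    # for compute, network, and jitter.
    headroom_ms = max(p99_budget_ms - model_latency_ms, 0.0)
    return int(headroom_ms * 0.5 * 1000)

# Overnight traffic at 40 QPS -> no batching window at all.
print(choose_batch_window_us(40))      # 0
# Peak traffic at 2000 QPS with a 100 ms p99 budget -> ~42 ms window.
print(choose_batch_window_us(2000))    # 42500
```

In a real serving stack the same idea shows up as a cap on queue delay in the batching configuration; the point is that the window should shrink toward zero when traffic cannot fill batches quickly.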
Device memory exhaustion crashes services silently or causes unpredictable evictions. A model that fits comfortably in GPU memory during development can exceed capacity in production once batching, concurrency, and multiple model versions combine. For example, a 2 gigabyte (GB) model with batch size 32, activation memory of 4 GB per batch, and concurrency 2 needs 2 GB plus 2 times 4 GB equals 10 GB minimum, exceeding many GPU budgets. Teams at NVIDIA enforce per-model memory budgets at deploy time: model weights plus per-batch activation memory multiplied by concurrency must stay under device capacity with 20% headroom. Violating this causes out-of-memory (OOM) errors mid-request, returning cryptic failures to clients.
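A deploy-time gate along these lines can be as simple as the check below; the function name and the 12 GB / 16 GB device capacities are hypothetical, while the 20% headroom mirrors the number in the text:

```python
# Hypothetical deploy-time memory budget check, mirroring the formula above:
# weights + per-batch activations x concurrency must fit under capacity with 20% headroom.

def fits_on_device(model_weights_gb: float,
                   activations_per_batch_gb: float,
                   concurrency: int,
                   device_capacity_gb: float,
                   headroom_fraction: float = 0.20) -> bool:
    required = model_weights_gb + activations_per_batch_gb * concurrency
    usable = device_capacity_gb * (1.0 - headroom_fraction)
    return required <= usable

# The example from the text: 2 GB weights, 4 GB activations per batch, concurrency 2
# -> needs 10 GB, so a 12 GB device (9.6 GB usable after headroom) rejects the deploy.
print(fits_on_device(2, 4, 2, device_capacity_gb=12))  # False
print(fits_on_device(2, 4, 2, device_capacity_gb=16))  # True (10 <= 12.8)
```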
Training-serving skew creates silent accuracy degradation that only appears in production. Models trained on batch-computed features but served with real-time features can experience 10% to 20% accuracy drops. A ranking model trained on user embeddings computed daily but served with embeddings computed on demand per request will see distribution shift if the computation differs even slightly (different aggregation windows, missing features, ordering changes). Meta and Google combat this with feature store abstractions that guarantee identical computation offline and online, and automated validation that compares training feature distributions against serving feature distributions in shadow mode before rollout.
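The shadow-mode validation step can be approximated with a plain two-sample test on logged feature values; the sketch below uses SciPy's Kolmogorov–Smirnov test as a stand-in for whatever check a production feature store runs, and the feature name, threshold, and synthetic data are illustrative:

```python
# Illustrative shadow-mode skew check: compare training feature distributions
# against features logged from the serving path before shifting traffic.
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_skew(train_features: dict, serving_features: dict,
                        p_value_threshold: float = 0.01) -> list:
    """Return feature names whose offline and online distributions diverge."""
    skewed = []
    for name, train_values in train_features.items():
        serve_values = serving_features.get(name)
        if serve_values is None:
            skewed.append(name)  # a feature missing online is itself skew
            continue
        result = ks_2samp(train_values, serve_values)
        if result.pvalue < p_value_threshold:
            skewed.append(name)
    return skewed

# Toy example: the serving-side embedding norm drifts because of a shorter
# aggregation window (synthetic data, for illustration only).
rng = np.random.default_rng(0)
train = {"embedding_norm": rng.normal(1.0, 0.1, 10_000)}
serve = {"embedding_norm": rng.normal(1.3, 0.1, 10_000)}
print(detect_feature_skew(train, serve))  # ['embedding_norm']
```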
💡 Key Takeaways
•Tail latency from batching under spiky traffic: p50 at 15 milliseconds and 40% GPU utilization but p95 at 200 milliseconds, because batch windows wait for slow-arriving requests, consuming latency budget in the queue before computation starts
•GPU memory exhaustion formula: model weights plus per-batch activation memory (which grows with batch size) multiplied by concurrency must stay under device capacity with 20% headroom to avoid out-of-memory crashes
•Training-serving skew causes silent accuracy drops of 10% to 20% when models trained on batch-computed features are served with real-time features that differ in aggregation windows, missing values, or computation order
•Cold start and version thrash: loading multiple large model versions causes memory churn and long warmup times during rollouts; mitigations are keeping active versions under two on memory-constrained GPUs and prewarming before traffic shifts
•CPU-bound preprocessing masquerades as GPU underutilization: heavy decode, resize, or augmentation steps saturate CPUs while GPUs idle, so the service shows low device utilization even though it is saturated
•Noisy neighbors on multi-tenant GPUs: a hot model monopolizes the scheduler, causing head-of-line blocking for other models; isolation requires per-model queues with weighted fair sharing, or device pinning (sketched below)
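The per-model queues with weighted fair sharing mentioned in the last bullet can be pictured with a toy scheduler like the one below; the model names, weights, and credit policy are made up for illustration and do not correspond to any real Triton or TorchServe scheduler:

```python
# Toy weighted-fair-share scheduler over per-model request queues.
# Each model gets its own queue so a hot model cannot block others,
# and weights control how often each queue is drained.
from collections import deque

class FairShareScheduler:
    def __init__(self, weights: dict):
        self.weights = weights                      # e.g. {"ranker": 3, "ocr": 1}
        self.queues = {m: deque() for m in weights}
        self.credits = {m: 0.0 for m in weights}

    def submit(self, model: str, request) -> None:
        self.queues[model].append(request)

    def next_request(self):
        # Top up credits in proportion to weight, then serve the non-empty
        # queue with the most accumulated credit.
        for m, w in self.weights.items():
            if self.queues[m]:
                self.credits[m] += w
        candidates = [m for m in self.queues if self.queues[m]]
        if not candidates:
            return None
        chosen = max(candidates, key=lambda m: self.credits[m])
        self.credits[chosen] = 0.0  # reset after service (toy policy, not true deficit round-robin)
        return chosen, self.queues[chosen].popleft()

# A burst from the hot "ranker" model no longer starves "ocr" requests:
sched = FairShareScheduler({"ranker": 3, "ocr": 1})
for i in range(6):
    sched.submit("ranker", f"r{i}")
sched.submit("ocr", "o0")
print([sched.next_request()[0] for _ in range(4)])
# -> ['ranker', 'ranker', 'ranker', 'ocr']: roughly a 3:1 share
```

Device pinning is the blunter alternative: give the hot model its own GPU so its queue depth cannot affect anyone else.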
📌 Examples
Uber ride matching service disabled dynamic batching during low-QPS overnight hours (under 100 requests per second) after observing p99 latency spiking from 50 milliseconds to 180 milliseconds, accepting 30% lower GPU utilization to maintain the SLO
Medical imaging service hit OOM errors after adding a second model version for an A/B test: each version needed 3 GB of weights plus 5 GB of activations at batch size 8, so two versions consumed the full 16 GB V100 capacity with no headroom; fixed by limiting batch size to 4, roughly halving activation memory per version
Pinterest recommendation model showed a 15% precision drop in production versus offline validation due to training-serving skew: training used 7-day user embedding aggregation while serving used 1-day due to data pipeline lag; fixed with a feature store guaranteeing identical windows