Production Failure Modes: Tail Latency, Memory Exhaustion, and Training-Serving Skew
Model serving infrastructure fails in predictable but often counterintuitive ways. The most common production failure is a tail-latency blowup from dynamic batching under spiky traffic: your p50 latency looks great at 15 milliseconds and GPU utilization hovers at 40%, yet p95 latency violates Service Level Objectives (SLOs) at 200 milliseconds. This happens when batch-formation windows wait for requests that arrive slowly during traffic valleys: requests sit in the queue burning latency budget before any computation starts. The fix is counterintuitive: reduce batch-window timeouts or disable batching entirely during low queries-per-second (QPS) periods, accepting lower device utilization to meet tail-latency commitments.
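A minimal sketch of that tradeoff, assuming a hypothetical server whose batch-formation window can be tuned per traffic regime; the QPS threshold, latency budget, and 50% queueing split below are illustrative values, not settings from TensorFlow Serving, TorchServe, or Triton:

```python
# Illustrative sketch: pick a dynamic-batching queue delay based on recent QPS.
# The thresholds and budgets are hypothetical, not tuned values from the text.

def choose_batch_window_us(recent_qps: float,
                           p99_budget_ms: float = 100.0,
                           model_latency_ms: float = 15.0) -> int:
    """Return the max time (microseconds) a request may wait for batch formation."""
    if recent_qps < 100:
        # Traffic valley: batches rarely fill, so waiting only burns latency budget.
        # Run effectively unbatched and accept lower GPU utilization.
        return 0
    # Spend at most half of the remaining budget on queueing; the rest is headroom
    # for compute, network, and jitter.
    headroom_ms = max(p99_budget_ms - model_latency_ms, 0.0)
    return int(headroom_ms * 0.5 * 1000)

# Overnight traffic at 40 QPS -> no batching window at all.
print(choose_batch_window_us(40))      # 0
# Peak traffic at 2000 QPS with a 100 ms p99 budget -> ~42 ms window.
print(choose_batch_window_us(2000))    # 42500
```

In a real serving stack the same idea shows up as a cap on queue delay in the batching configuration; the point is that the window should shrink toward zero when traffic cannot fill batches quickly.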
Device memory exhaustion crashes services silently or causes unpredictable evictions. A model that fits comfortably in GPU memory during development can exceed capacity in production once batching, concurrency, and multiple model versions combine. For example, a 2 gigabyte (GB) model with batch size 32, activation memory of 4 GB per batch, and concurrency 2 needs 2 GB plus 2 times 4 GB equals 10 GB minimum, exceeding many GPU budgets. Teams at NVIDIA enforce per-model memory budgets at deploy time: model weights plus per-batch activation memory multiplied by concurrency must stay under device capacity with 20% headroom. Violating this causes out-of-memory (OOM) errors mid-request, returning cryptic failures to clients.
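A deploy-time gate along these lines can be as simple as the check below; the function name and the 12 GB / 16 GB device capacities are hypothetical, while the 20% headroom mirrors the number in the text:

```python
# Hypothetical deploy-time memory budget check, mirroring the formula above:
# weights + per-batch activations x concurrency must fit under capacity with 20% headroom.

def fits_on_device(model_weights_gb: float,
                   activations_per_batch_gb: float,
                   concurrency: int,
                   device_capacity_gb: float,
                   headroom_fraction: float = 0.20) -> bool:
    required = model_weights_gb + activations_per_batch_gb * concurrency
    usable = device_capacity_gb * (1.0 - headroom_fraction)
    return required <= usable

# The example from the text: 2 GB weights, 4 GB activations per batch, concurrency 2
# -> needs 10 GB, so a 12 GB device (9.6 GB usable after headroom) rejects the deploy.
print(fits_on_device(2, 4, 2, device_capacity_gb=12))  # False
print(fits_on_device(2, 4, 2, device_capacity_gb=16))  # True (10 <= 12.8)
```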
Training-serving skew creates silent accuracy degradation that only appears in production. Models trained on batch-computed features but served with real-time features can experience 10% to 20% accuracy drops. A ranking model trained on user embeddings computed daily but served with embeddings computed on demand per request will see distribution shift if the computation differs even slightly (different aggregation windows, missing features, ordering changes). Meta and Google combat this with feature store abstractions that guarantee identical computation offline and online, and automated validation that compares training feature distributions against serving feature distributions in shadow mode before rollout.
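The shadow-mode validation step can be approximated with a plain two-sample test on logged feature values; the sketch below uses SciPy's Kolmogorov–Smirnov test as a stand-in for whatever check a production feature store runs, and the feature name, threshold, and synthetic data are illustrative:

```python
# Illustrative shadow-mode skew check: compare training feature distributions
# against features logged from the serving path before shifting traffic.
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_skew(train_features: dict, serving_features: dict,
                        p_value_threshold: float = 0.01) -> list:
    """Return feature names whose offline and online distributions diverge."""
    skewed = []
    for name, train_values in train_features.items():
        serve_values = serving_features.get(name)
        if serve_values is None:
            skewed.append(name)  # a feature missing online is itself skew
            continue
        result = ks_2samp(train_values, serve_values)
        if result.pvalue < p_value_threshold:
            skewed.append(name)
    return skewed

# Toy example: the serving-side embedding norm drifts because of a shorter
# aggregation window (synthetic data, for illustration only).
rng = np.random.default_rng(0)
train = {"embedding_norm": rng.normal(1.0, 0.1, 10_000)}
serve = {"embedding_norm": rng.normal(1.3, 0.1, 10_000)}
print(detect_feature_skew(train, serve))  # ['embedding_norm']
```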
💡 Key Takeaways
•Tail latency from batching under spiky traffic: p50 at 15 milliseconds and 40% GPU utilization but p95 at 200 milliseconds, because batch windows wait for slow-arriving requests, consuming latency budget in the queue before computation starts
•GPU memory exhaustion formula: model weights plus per-batch activation memory (which grows with batch size) multiplied by concurrency must stay under device capacity with 20% headroom to avoid out-of-memory crashes
•Training-serving skew causes silent accuracy drops of 10% to 20% when models trained on batch-computed features are served with real-time features that differ in aggregation windows, missing values, or computation order
•Cold start and version thrash: loading multiple large model versions causes memory churn and long warmup times during rollouts; mitigations are keeping active versions under two on memory-constrained GPUs and prewarming before traffic shifts
•CPU-bound preprocessing masquerades as GPU underutilization: heavy decode, resize, or augmentation steps saturate CPUs while GPUs idle, so the service shows low device utilization even though it is saturated
•Noisy neighbors on multi-tenant GPUs: a hot model monopolizes the scheduler, causing head-of-line blocking for other models; isolation requires per-model queues with weighted fair sharing, or device pinning (sketched below)
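The per-model queues with weighted fair sharing mentioned in the last bullet can be pictured with a toy scheduler like the one below; the model names, weights, and credit policy are made up for illustration and do not correspond to any real Triton or TorchServe scheduler:

```python
# Toy weighted-fair-share scheduler over per-model request queues.
# Each model gets its own queue so a hot model cannot block others,
# and weights control how often each queue is drained.
from collections import deque

class FairShareScheduler:
    def __init__(self, weights: dict):
        self.weights = weights                      # e.g. {"ranker": 3, "ocr": 1}
        self.queues = {m: deque() for m in weights}
        self.credits = {m: 0.0 for m in weights}

    def submit(self, model: str, request) -> None:
        self.queues[model].append(request)

    def next_request(self):
        # Top up credits in proportion to weight, then serve the non-empty
        # queue with the most accumulated credit.
        for m, w in self.weights.items():
            if self.queues[m]:
                self.credits[m] += w
        candidates = [m for m in self.queues if self.queues[m]]
        if not candidates:
            return None
        chosen = max(candidates, key=lambda m: self.credits[m])
        self.credits[chosen] = 0.0  # reset after service (toy policy, not true deficit round-robin)
        return chosen, self.queues[chosen].popleft()

# A burst from the hot "ranker" model no longer starves "ocr" requests:
sched = FairShareScheduler({"ranker": 3, "ocr": 1})
for i in range(6):
    sched.submit("ranker", f"r{i}")
sched.submit("ocr", "o0")
print([sched.next_request()[0] for _ in range(4)])
# -> ['ranker', 'ranker', 'ranker', 'ocr']: roughly a 3:1 share
```

Device pinning is the blunter alternative: give the hot model its own GPU so its queue depth cannot affect anyone else.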
📌 Examples
Uber ride matching service disabled dynamic batching during low-QPS overnight hours (under 100 requests per second) after observing p99 latency spiking from 50 milliseconds to 180 milliseconds, accepting 30% lower GPU utilization to maintain the SLO
Medical imaging service hit OOM errors after adding a second model version for an A/B test: each version needed 3 GB of weights plus 5 GB of activations at batch size 8, so two versions consumed the full 16 GB V100 capacity with no headroom; fixed by limiting batch size to 4, roughly halving activation memory per version
Pinterest recommendation model showed a 15% precision drop in production versus offline validation due to training-serving skew: training used 7-day user embedding aggregation while serving used 1-day due to data pipeline lag; fixed with a feature store guaranteeing identical windows