Deployment, Observability, and Capacity Planning for Production ML Serving
Latency Observability
Track latency at every stage: feature retrieval P50/P99, inference P50/P99, total request P50/P99. Histogram metrics reveal distribution shape—a bimodal distribution suggests two different code paths. Alert on P99 exceeding SLA, not just the average. Plot tail-latency trends on dashboards to catch degradation before it impacts users.
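The P99-versus-SLA check above can be sketched with a nearest-rank percentile calculation; `percentile` and `check_sla` are illustrative names, not part of any monitoring library, and the nearest-rank method is one of several percentile definitions.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

def check_sla(latencies_ms, sla_p99_ms):
    """Alert on P99 exceeding the SLA, not on the average."""
    p99 = percentile(latencies_ms, 99)
    return {"p50": percentile(latencies_ms, 50),
            "p99": p99,
            "breach": p99 > sla_p99_ms}
```

In production you would pull these percentiles from histogram metrics rather than raw samples, but the alerting rule (compare P99 to the SLA) is the same.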
Key Metrics: Request rate, error rate, latency percentiles (P50, P95, P99), model prediction distribution, feature store hit rate. These five metric families expose most production issues.
Canary Deployments
New model versions may have different latency characteristics. Deploy to 1% of traffic first, compare latency metrics against baseline. If P99 regresses, roll back immediately. Gradual rollout (1% → 10% → 50% → 100%) catches issues before full impact. Shadow mode runs the new model alongside production without affecting responses—useful for validating accuracy before latency testing.
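A minimal sketch of the rollback decision and staged rollout, assuming a simple relative-tolerance rule on P99 (the 10% tolerance, stage fractions, and function names are illustrative choices, not a standard):

```python
def canary_decision(baseline_p99_ms, canary_p99_ms, tolerance=0.10):
    """Roll back if the canary's P99 regresses beyond tolerance vs baseline."""
    limit = baseline_p99_ms * (1 + tolerance)
    return "rollback" if canary_p99_ms > limit else "promote"

ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]  # 1% -> 10% -> 50% -> 100%

def next_stage(current_fraction):
    """Advance to the next traffic fraction once metrics pass at this stage."""
    for stage in ROLLOUT_STAGES:
        if stage > current_fraction:
            return stage
    return current_fraction  # already at full rollout
```

A statistical comparison (e.g. requiring a minimum sample count per stage) is more robust than a single-point P99 check, but the control flow is the same.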
Capacity Planning
Estimate peak QPS from historical traffic patterns plus growth projections. Add 2-3x headroom for traffic spikes (promotions, viral events). Each inference server handles a fixed QPS at target latency—divide peak QPS by per-server capacity to get fleet size. Autoscaling helps but has spin-up latency; pre-scale before known traffic events.
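The sizing arithmetic above is a one-liner; this sketch just makes the headroom multiplier and ceiling division explicit (the example numbers are hypothetical):

```python
import math

def fleet_size(peak_qps, per_server_qps, headroom=2.5):
    """Servers needed: peak QPS times headroom, divided by per-server capacity."""
    return math.ceil(peak_qps * headroom / per_server_qps)

# Hypothetical example: 4,000 peak QPS, 200 QPS/server at target latency,
# 2.5x headroom -> ceil(10,000 / 200) = 50 servers.
```

Remember that per-server capacity must be measured at the target latency percentile, not at saturation throughput, or the fleet will meet QPS while blowing the SLA.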
Cost Optimization: GPUs are expensive. Use CPUs for simple models (linear, small trees). Reserve GPUs for neural networks where the throughput gain justifies cost. Profile actual inference time—many GPU deployments are CPU-bound on feature preprocessing.
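One way to check whether a deployment is CPU-bound on preprocessing, sketched with stdlib timing; `preprocess` and `infer` are placeholders for your own callables, and the function name is illustrative:

```python
import time

def profile_request(preprocess, infer, raw_input, runs=100):
    """Split wall time between feature preprocessing and model inference
    to spot GPU deployments that are actually CPU-bound on preprocessing."""
    pre_s = inf_s = 0.0
    for _ in range(runs):
        t0 = time.perf_counter()
        features = preprocess(raw_input)
        t1 = time.perf_counter()
        infer(features)
        t2 = time.perf_counter()
        pre_s += t1 - t0
        inf_s += t2 - t1
    total = pre_s + inf_s
    return {"preprocess_pct": 100 * pre_s / total,
            "inference_pct": 100 * inf_s / total}
```

If preprocessing dominates, a faster GPU buys nothing; optimize the feature pipeline or move the model to CPU.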
Graceful Degradation
Design fallback modes: if the model times out, return a default score. If the feature store is down, run the model with available features only. Document each degradation mode and its accuracy impact before incidents occur.
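The timeout fallback can be sketched with a thread-based deadline; the default score of 0.5 and the function name are illustrative, and a real server would get the deadline from its RPC framework instead:

```python
import concurrent.futures

DEFAULT_SCORE = 0.5  # documented fallback; measure its accuracy impact offline

def score_with_fallback(model_fn, features, timeout_s=0.1):
    """Return the model's score, or a default score if the model times out."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(model_fn, features)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return DEFAULT_SCORE
    finally:
        # Don't block on the in-flight call; note it still runs to completion
        # in the background, so model_fn must be side-effect safe.
        pool.shutdown(wait=False)
```

The same shape works for the feature-store fallback: catch the store's error, fill missing features with defaults, and tag the response so the degraded path is visible in metrics.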