
Deployment, Observability, and Capacity Planning for Production ML Serving

Latency Observability

Track latency at every stage: feature retrieval P50/P99, inference P50/P99, total request P50/P99. Histogram metrics reveal distribution shape; a bimodal distribution suggests two different code paths. Alert on P99 exceeding the SLA, not just the average. Graph tail-latency trends on a dashboard to catch degradation before it impacts users.
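As a sketch of the percentile alerting above (nearest-rank percentiles over a window of samples; the 50 ms SLA and the latency values are hypothetical):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

SLA_P99_MS = 50  # hypothetical SLA

latencies_ms = [12, 14, 15, 15, 16, 18, 22, 95, 14, 13]  # hypothetical window
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
if p99 > SLA_P99_MS:  # alert on tail latency, not the mean
    print(f"ALERT: P99 {p99}ms exceeds {SLA_P99_MS}ms SLA")
```

Note how the single 95 ms outlier breaches the P99 alert while P50 stays healthy at 15 ms, which is exactly why averaging hides tail problems.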

Key Metrics: Request rate, error rate, latency percentiles (P50, P95, P99), model prediction distribution, feature store hit rate. These five metrics expose most production issues.

Canary Deployments

New model versions may have different latency characteristics. Deploy to 1% of traffic first, compare latency metrics against baseline. If P99 regresses, roll back immediately. Gradual rollout (1% → 10% → 50% → 100%) catches issues before full impact. Shadow mode runs the new model alongside production without affecting responses—useful for validating accuracy before latency testing.
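A minimal sketch of the canary split described above, assuming hash-based traffic bucketing and a hypothetical 10% P99 regression threshold:

```python
import hashlib

CANARY_FRACTION = 0.01  # start the rollout at 1% of traffic

def route_to_canary(request_id: str, fraction: float = CANARY_FRACTION) -> bool:
    """Deterministically assign a stable slice of traffic to the canary.
    Hashing the request/user id keeps each caller pinned to one version."""
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < fraction * 10_000

def check_canary(baseline_p99_ms: float, canary_p99_ms: float,
                 max_regression: float = 1.10) -> str:
    """Compare canary latency to baseline; roll back on a >10% P99 regression."""
    if canary_p99_ms > baseline_p99_ms * max_regression:
        return "rollback"
    return "promote"
```

To continue the rollout, raise `fraction` through the 1% → 10% → 50% → 100% stages, re-running the regression check at each stage before promoting.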

Capacity Planning

Estimate peak QPS from historical traffic patterns plus growth projections. Add 2-3x headroom for traffic spikes (promotions, viral events). Each inference server handles a fixed QPS at target latency—divide peak QPS by per-server capacity to get fleet size. Autoscaling helps but has spin-up latency; pre-scale before known traffic events.
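The fleet-sizing arithmetic above can be written directly (all numbers hypothetical):

```python
import math

def fleet_size(peak_qps: float, per_server_qps: float,
               headroom: float = 2.5) -> int:
    """Servers needed to absorb peak traffic with spike headroom.
    per_server_qps is the QPS one server sustains at target latency."""
    return math.ceil(peak_qps * headroom / per_server_qps)

# e.g. 4,000 QPS peak, 200 QPS per server, 2.5x headroom -> 50 servers
```

The `ceil` matters: rounding down leaves the fleet short at exactly the moment headroom is needed.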

Cost Optimization: GPUs are expensive. Use CPUs for simple models (linear, small trees). Reserve GPUs for neural networks where the throughput gain justifies cost. Profile actual inference time—many GPU deployments are CPU-bound on feature preprocessing.

Graceful Degradation

Design fallback modes: if the model times out, return a default score; if the feature store is down, run the model with the features that are available. Document degradation modes and their accuracy impact before incidents occur.
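A sketch of the timeout fallback, assuming a 50 ms inference budget and a hypothetical neutral default score (`model_fn` stands in for whatever callable wraps the model):

```python
import concurrent.futures

DEFAULT_SCORE = 0.5  # hypothetical neutral score served when degraded
TIMEOUT_S = 0.050    # 50 ms inference budget (hypothetical SLA slice)

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def score_with_fallback(model_fn, features):
    """Return the model's score, or the default if inference blows its budget."""
    future = _pool.submit(model_fn, features)
    try:
        return future.result(timeout=TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        # Degrade gracefully: serve a neutral score instead of failing the request
        return DEFAULT_SCORE
```

Emit a counter each time the fallback fires so the degraded-mode rate shows up on the same dashboards as latency.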

💡 Key Takeaways
Track five key metrics: request rate, error rate, latency percentiles, prediction distribution, feature store hit rate
Canary deployments (1% → 10% → 50% → 100%) catch latency regressions before full impact
Add 2-3x capacity headroom for traffic spikes; autoscaling has spin-up latency, so pre-scale for known events
📌 Interview Tips
1. Alert on P99 latency exceeding the SLA, not the average; bimodal distributions indicate different code paths worth investigating
2. Profile GPU deployments: many are actually CPU-bound on feature preprocessing, wasting expensive GPU resources