
Deployment, Observability, and Capacity Planning for Production ML Serving

Getting a real-time scoring system into production and keeping it running reliably requires rigorous deployment practices, deep observability, and proactive capacity planning. These operational concerns often determine whether the system hits its SLOs under real-world traffic and failure conditions. Deployment starts with regional canaries: route 1 to 5 percent of production traffic to the new model or code version while the rest continues on the stable version, and roll back automatically if p99 latency regresses by more than 10 percent or the error rate exceeds a threshold such as 0.1 percent. Shadow traffic is invaluable for evaluating new models: send production requests to both the old and new models, serve the old model's response to users, and log both predictions for offline comparison. This validates numerical parity and catches issues such as input preprocessing differences that would cause training-serving skew.

Observability must cover latency, errors, and saturation. Track p50, p95, and p99 latencies for the entire request and for each subspan, such as feature fetch, model compute, and post-processing. Breaking latency down by stage lets you see immediately which component regressed after a deploy. For LLMs, track time to first token and tokens per second separately, since they affect perceived responsiveness differently. Monitor feature freshness by logging the timestamp of the most recent feature update and alerting when staleness exceeds the SLA. Track cache hit rates: a drop from 95 to 80 percent can double p99 latency.

Saturation metrics reveal when you are running out of headroom. Monitor CPU utilization, memory usage, and queue depth. A scoring service consistently above 70 percent CPU will struggle during traffic spikes, and queue depth above 10 percent of capacity indicates you are close to saturation and latency will soon spike. Set SLO burn alerts that trigger before users feel the impact: if the SLO is 99.9 percent of requests under 100 ms, alert when you have consumed 10 percent of the error budget in an hour, which leaves time to investigate before the monthly budget is exhausted.

Capacity planning requires estimating cycles per request and provisioning for 2 to 3 times peak traffic. A CPU server running a gradient-boosted tree might handle 1,000 requests per second at a p99 under 20 milliseconds; a GPU running a transformer encoder with dynamic batching might handle 5,000 to 20,000 inferences per second depending on batch size and sequence length. Keep buffer capacity for failover scenarios where an entire availability zone goes down and its traffic shifts to the remaining zones. Test with synthetic burst loads and fault injection, for example dropping a feature dependency and measuring how tail latency and error rates change.

Multi-region strategy places serving close to users to reduce network latency. Stripe, PayPal, Amazon, and Uber operate active-active deployments across multiple regions, with model artifacts and feature stores replicated. Health checks and session affinity ensure sticky users land on healthy instances, and fast failover, detecting failure and redirecting traffic within seconds, is critical. Cross-region feature reads are avoided when possible because they add 50 to 200 milliseconds of unpredictable tail latency; instead, each region maintains a full copy of the online feature store.
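To make the canary gate concrete, here is a minimal sketch of the rollback rule described above, assuming latency and error metrics for the stable and canary versions are already aggregated per evaluation window; the `CanaryWindow` structure and function names are illustrative, not a specific platform's API.

```python
# Hypothetical rollback gate for a canary deploy: roll back if the canary's p99
# latency regresses by more than 10 percent over the stable baseline or its
# error rate exceeds 0.1 percent. The thresholds come from the text above; the
# data structures are illustrative.
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    p99_ms: float       # p99 latency observed for this version in the window
    error_rate: float   # fraction of requests in the window that errored


def should_rollback(stable: CanaryWindow, canary: CanaryWindow,
                    max_p99_regression: float = 0.10,
                    max_error_rate: float = 0.001) -> bool:
    """Return True if the canary violates the latency or error-rate gate."""
    p99_regressed = canary.p99_ms > stable.p99_ms * (1.0 + max_p99_regression)
    errors_exceeded = canary.error_rate > max_error_rate
    return p99_regressed or errors_exceeded


# Example: stable p99 is 60 ms; a canary at 68 ms (a ~13 percent regression) fails.
stable = CanaryWindow(p99_ms=60.0, error_rate=0.0004)
canary = CanaryWindow(p99_ms=68.0, error_rate=0.0006)
print(should_rollback(stable, canary))  # True -> shift canary traffic back to stable
```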
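Shadow scoring can be sketched as follows, assuming both models expose a `predict(features)` call and that `shadow_log` is any append-only sink read by an offline comparison job; in practice the candidate would usually be scored asynchronously so it never sits on the user-facing latency path.

```python
# Shadow scoring sketch: the stable model's prediction is served to the caller,
# the candidate is scored on the same features, and both results are logged for
# offline comparison. The model and logging interfaces are assumptions.
import time
import uuid


def score_with_shadow(features, stable_model, candidate_model, shadow_log):
    request_id = str(uuid.uuid4())

    # Serve the stable model's prediction; this is the only result users see.
    t0 = time.perf_counter()
    stable_score = stable_model.predict(features)
    stable_ms = (time.perf_counter() - t0) * 1000

    # Score the candidate on the identical features. A candidate failure must
    # never affect the caller, so it is swallowed and logged.
    candidate_score, candidate_ms, candidate_error = None, None, None
    try:
        t1 = time.perf_counter()
        candidate_score = candidate_model.predict(features)
        candidate_ms = (time.perf_counter() - t1) * 1000
    except Exception as exc:
        candidate_error = repr(exc)

    shadow_log.append({
        "request_id": request_id,
        "stable_score": stable_score, "stable_ms": stable_ms,
        "candidate_score": candidate_score, "candidate_ms": candidate_ms,
        "candidate_error": candidate_error,
    })
    return stable_score
```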
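One simple way to get the per-stage latency breakdown is to wrap each stage in a timing context manager, as in the sketch below; the in-memory `TIMINGS` dict stands in for a real histogram-based metrics backend, and the feature store and model interfaces are assumptions.

```python
# Per-stage latency instrumentation: time feature fetch, model compute, and
# post-processing separately so a regression can be attributed to a stage.
import time
from contextlib import contextmanager

TIMINGS = {}  # stage name -> list of observed latencies in milliseconds


def record_histogram(name: str, value_ms: float) -> None:
    TIMINGS.setdefault(name, []).append(value_ms)


@contextmanager
def timed_stage(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        record_histogram(f"scoring.{stage}.latency_ms",
                         (time.perf_counter() - start) * 1000)


def score(request, feature_store, model):
    with timed_stage("feature_fetch"):
        features = feature_store.get(request["entity_id"])
    with timed_stage("model_compute"):
        raw_score = model.predict(features)
    with timed_stage("post_processing"):
        return {"score": float(raw_score), "model_version": model.version}
```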
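The error-budget arithmetic behind the burn alert is straightforward; the sketch below assumes a 30-day, 99.9 percent SLO and pages when a single hour consumes more than 10 percent of the monthly budget, with request counts supplied by whatever metrics store you use.

```python
# Error-budget burn alert for an SLO of 99.9 percent of requests under 100 ms
# over a 30-day window: page when one hour consumes more than 10 percent of
# the monthly error budget.
SLO_TARGET = 0.999                  # 99.9% of requests must be good (<100 ms, no error)
BUDGET_FRACTION = 1 - SLO_TARGET    # 0.1% of monthly requests may be bad


def should_page(bad_requests_last_hour: int,
                expected_monthly_requests: int,
                max_hourly_budget_share: float = 0.10) -> bool:
    monthly_error_budget = expected_monthly_requests * BUDGET_FRACTION
    hourly_allowance = monthly_error_budget * max_hourly_budget_share
    return bad_requests_last_hour > hourly_allowance


# Example: at 100M requests/month the budget is 100k bad requests, so more than
# 10k bad requests in one hour would exhaust the month's budget in about 10 hours.
print(should_page(bad_requests_last_hour=12_000,
                  expected_monthly_requests=100_000_000))  # True
```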
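For capacity planning, a back-of-the-envelope calculation like the one below captures both the 2-to-3x peak headroom rule and the zone-failure constraint; the per-server throughput, zone count, and 70 percent utilization ceiling are illustrative assumptions drawn from the figures above.

```python
# Back-of-the-envelope capacity sizing: provision for 2-3x peak traffic and
# keep enough headroom that losing an entire availability zone leaves the
# surviving zones below a target utilization at peak.
import math


def servers_needed(peak_rps: float,
                   per_server_rps: float,
                   headroom_multiplier: float = 2.5,
                   zones: int = 3,
                   max_utilization: float = 0.70) -> int:
    # (a) raw headroom: cover headroom_multiplier x peak traffic
    for_headroom = math.ceil(headroom_multiplier * peak_rps / per_server_rps)
    # (b) zone failure: the surviving (zones - 1)/zones of servers must carry
    #     the full peak while staying under max_utilization
    for_failover = math.ceil(peak_rps / (per_server_rps * max_utilization)
                             * zones / (zones - 1))
    return max(for_headroom, for_failover)


# Example: gradient-boosted-tree servers at 1,000 rps each with a 20,000 rps peak:
# 50 servers satisfy the 2.5x headroom rule vs. 43 for zone failover, so provision 50.
print(servers_needed(peak_rps=20_000, per_server_rps=1_000))  # 50
```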
💡 Key Takeaways
Regional canaries route 1 to 5 percent of traffic to new versions, with automated rollback if p99 latency regresses by more than 10 percent or the error rate exceeds 0.1 percent
Shadow traffic sends production requests to both the old and new models, serving the old response and logging both predictions for offline comparison to catch training-serving skew
Latency breakdown by stage, including feature fetch, model compute, and post-processing, lets you attribute regressions immediately after a deployment or traffic change
Saturation metrics such as CPU above 70 percent or queue depth above 10 percent of capacity signal impending latency spikes, prompting capacity additions before an SLO breach
Capacity planning estimates cycles per request and provisions for 2 to 3 times peak traffic, testing with synthetic bursts and fault injection such as dropping feature dependencies
Multi-region deployments replicate model artifacts and feature stores, avoiding cross-region reads that add 50 to 200 ms of tail latency and relying on fast failover within seconds
📌 Examples
Stripe uses canary deploys with 2 percent traffic for 30 minutes, automatically rolling back if p99 latency exceeds 80ms or error rate rises above 0.05 percent
PayPal shadows new fraud models against production traffic for 24 hours, comparing AUC and precision at operating thresholds before promoting to serve live traffic
Uber monitors feature freshness, alerting when driver location features exceed 90 seconds of staleness, which indicates streaming pipeline lag during traffic surges
Amazon provisions CPU autoscaling to handle Black Friday peaks at 3 times normal traffic, with pre-warming that spins up capacity 1 hour before the surge