
Shadow and Canary Deployment for Models

Definition
Shadow deployment runs a candidate model on live traffic without affecting user responses—logging predictions and latency for comparison. Canary deployment sends a small percentage of real traffic to the candidate while monitoring metrics in real time.

SHADOW DEPLOYMENT

Run the candidate in shadow for 10M requests over 2 hours, recording both models' outputs, feature fetch times, and inference latencies. This surfaces training-serving skew, numerical differences, feature-availability issues, and tail-latency problems before any user sees the new model. Cost: doubled inference load. Benefit: it catches environment-specific issues such as cache behavior, load-balancing artifacts, and upstream service variability that offline replay misses.
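The traffic-mirroring step can be sketched as below; `primary_model`, `candidate_model`, and `shadow_log` are hypothetical names, and a production system would invoke the candidate asynchronously rather than inline so it never adds user-facing latency:

```python
import time

def shadow_predict(primary_model, candidate_model, features, shadow_log):
    """Serve the primary model's prediction; run the candidate in shadow.

    Only the primary result is returned to the user. The candidate's
    output and latency are logged for offline comparison (skew,
    numerical differences, tail latency).
    """
    start = time.perf_counter()
    primary_out = primary_model(features)
    primary_ms = (time.perf_counter() - start) * 1000

    # Assumption: run inline here for simplicity; real systems fire
    # the shadow call asynchronously off the request path.
    start = time.perf_counter()
    candidate_out = candidate_model(features)
    candidate_ms = (time.perf_counter() - start) * 1000

    shadow_log.append({
        "primary": primary_out, "candidate": candidate_out,
        "primary_ms": primary_ms, "candidate_ms": candidate_ms,
    })
    return primary_out  # users only ever see the primary model's answer
```

Comparing the logged pairs afterwards is what surfaces prediction skew and latency deltas between the two models on identical live inputs.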

CANARY PROGRESSION

Typical progression: 1% of traffic for 30 minutes (to catch obvious regressions), then 5% for 2 hours, then 25%, then full rollout. At each stage, automated guards check that p95 latency stays under 50ms, error rate under 0.1%, and the online metric delta within 2%.
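The staged progression with automated guards can be sketched as follows; `check_guards` and `set_traffic` are hypothetical callables a serving platform would provide, and the soak windows are descriptive only:

```python
def run_canary(stages, check_guards, set_traffic):
    """Walk a canary through traffic stages, aborting on a guard breach.

    stages: list of (traffic_fraction, soak_window) tuples.
    check_guards: returns the stage's metrics after the soak window.
    set_traffic: routes the given fraction of traffic to the candidate.
    Returns 'promoted' on success, 'rolled_back' otherwise.
    """
    # Guard thresholds from the text: p95 < 50ms, errors < 0.1%,
    # online metric delta within 2%.
    guards = {"p95_ms": 50.0, "error_rate": 0.001, "metric_delta": 0.02}
    for fraction, soak in stages:
        set_traffic(fraction)
        metrics = check_guards(fraction)  # gathered over the soak window
        if (metrics["p95_ms"] > guards["p95_ms"]
                or metrics["error_rate"] > guards["error_rate"]
                or abs(metrics["metric_delta"]) > guards["metric_delta"]):
            set_traffic(0.0)  # roll back: all traffic to the old model
            return "rolled_back"
    set_traffic(1.0)  # full rollout
    return "promoted"

STAGES = [(0.01, "30 min"), (0.05, "2 h"), (0.25, "2 h")]
```

Each stage only widens exposure after the previous stage's guards pass, so an obvious regression is caught while it affects 1% of users, not 100%.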

⚠️ Warning: Watch tail latencies (p99, p999). Mean latency can look fine while a small slice of users experiences 500ms responses due to cold cache or autoscaling lag.
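A small worked example of why mean latency can hide a tail regression, using a simple nearest-rank percentile (the 50ms guard and 500ms cold-cache figures come from the text; the sample sizes are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) of a latency sample."""
    s = sorted(samples)
    rank = max(1, math.ceil(p * len(s) / 100))
    return s[rank - 1]

# 985 fast requests plus 15 cold-cache stragglers: the mean looks
# healthy while p99 exposes the 500 ms tail.
latencies = [40.0] * 985 + [500.0] * 15
mean = sum(latencies) / len(latencies)   # 46.9 ms, under the 50 ms guard
p99 = percentile(latencies, 99)          # 500.0 ms, a clear regression
```

This is why the guards above watch p95/p99 rather than the mean: 1.5% of users hitting 500ms responses barely moves the average.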

AUTOMATED ROLLBACK

If any guard breaches its SLO for a sustained period (p95 > 50ms for 15 consecutive minutes, or a CTR drop > 2% for 30 minutes), revert to the prior model in under 2 minutes. Requirements: keep both model binaries loaded or quickly loadable, maintain feature-schema compatibility during the rollout window, and pre-warm any caches the old model needs.
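The sustained-breach rule can be sketched as a small stateful guard; the class name and one-check-per-minute cadence are assumptions for illustration:

```python
class SustainedBreachGuard:
    """Fire rollback only after a metric breaches its SLO for N
    consecutive checks, filtering out transient spikes.

    With one check per minute, window=15 matches the 'p95 > 50ms
    for 15 consecutive minutes' rule from the text.
    """
    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.consecutive = 0

    def observe(self, value):
        """Record one metric sample; return True if rollback should fire."""
        if value > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any healthy sample resets the streak
        return self.consecutive >= self.window
```

The trade-off noted in the takeaways is visible here: a larger window suppresses false positives from momentary spikes, but every extra check extends user exposure during a real incident.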

💡 Key Metrics: Prediction distribution shifts (alert if p50 changes by more than 10%), feature fetch p99 latency, and business KPIs. In shadow mode, compare both models' recommendations on the same user requests, measuring offline precision, diversity, and novelty before launching an A/B test.
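The p50-shift alert can be sketched as below, assuming both models were scored on the same requests; the helper names are hypothetical:

```python
def median(xs):
    """Median of a non-empty sequence (p50)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def p50_shift_alert(baseline_preds, candidate_preds, max_shift=0.10):
    """Alert if the candidate's median prediction drifts more than
    max_shift (relative) from the baseline's on the same requests."""
    base = median(baseline_preds)
    cand = median(candidate_preds)
    return abs(cand - base) / abs(base) > max_shift
```

A systematic shift on identical inputs points at calibration or numerical skew in the candidate even when its offline ranking metrics look fine.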
💡 Key Takeaways
Shadow deployment doubles inference cost during evaluation (e.g., 10 million requests over 2 hours) but catches training-serving skew, feature-availability issues, and tail-latency problems that offline replay misses
Canary progression: 1% for 30 minutes, 5% for 2 hours, 25%, then full rollout, with automated guards watching p95 latency (under 50ms), error rate (under 0.1%), and business metric delta (within 2%)
Automated rollback must complete in under 2 minutes when guards breach SLOs, requiring both model versions to be loaded or quickly loadable and feature-schema compatibility across both versions during the rollout window
Tail latency (p99, p999) can hide regressions that mean latency misses: a model with a 45ms mean but 500ms p99 due to cold cache or autoscaling lag will degrade the experience for a visible slice of users
Guard duration matters: requiring sustained breaches (e.g., p95 > 50ms for 15 consecutive minutes) prevents false positives from transient spikes, but delays rollback during real incidents
Shadow reveals distribution shifts: if the candidate model predicts scores 20% higher on average than the baseline on the same requests, that likely indicates calibration or numerical skew even if offline metrics looked good
📌 Interview Tips
1. Uber ETA prediction canary: monitors the prediction distribution (alerts if the p50 predicted time shifts by more than 10%), feature fetch p99 latency, and driver acceptance rate, starting at 1% of traffic in a single city before geographic expansion
2. Netflix personalization shadow: runs the candidate recommendation model on replayed user requests, logs both models' top-10 recommendations, and computes offline precision at 10, diversity (intra-list distance), and novelty (popularity decay) before the A/B test
3. Google Search ranking rollout: ties canary metrics to the experiment platform and automatically halts if a statistically significant regression in query success rate, click-through rate, or time to success appears within the first 6 hours at 5% traffic