
Shadow and Canary Deployment for Models

Shadow deployment runs a candidate model on live production traffic without affecting user-facing responses, logging its predictions and latency for comparison against the deployed model. For example, a ranking model might shadow 10 million requests over 2 hours, recording both models' outputs, feature fetch times, and inference latencies. This surfaces training-serving skew, numerical differences, feature availability issues, and tail latency problems before any user sees the new model. The cost is doubled inference load during the evaluation window, but the safety gain is significant: offline replay cannot catch environment-specific issues such as cache behavior, load-balancing artifacts, or upstream service variability.

Canary deployment shifts a small percentage of live traffic to the candidate model and monitors business and operational metrics in real time. A typical progression starts at 1 percent of traffic for 30 minutes to catch obvious regressions with minimal user impact. If the automated guards pass (p95 latency under 50ms, error rate under 0.1 percent, online metric delta within 2 percent), traffic increases to 5 percent for 2 hours, then 25 percent, then full rollout. Guards must watch tail latencies (p99, p999) because mean latency can look fine while a small slice of users experiences 500ms responses due to cold caches or autoscaling lag.

Automated rollback is essential. If any guard breaches its Service Level Objective (SLO) for a sustained period (for example, p95 latency above 50ms for 15 consecutive minutes, or a click-through rate drop greater than 2 percent for 30 minutes), the system must revert to the prior model version in under 2 minutes. This requires keeping both model binaries loaded or quickly loadable, maintaining compatibility with both feature schema versions during the rollout window, and pre-warming any caches or embeddings the old model needs. Rollback without these safeguards can fail catastrophically, leaving the service in a broken state.

Uber runs canaries for fraud and ETA models, starting at 1 percent of requests and monitoring metrics such as prediction distribution shifts (alerting if the p50 predicted ETA changes by more than 10 percent), feature fetch p99 latency, and business KPIs like driver acceptance rate. Netflix uses shadow mode extensively for personalization models, comparing both models' recommendations for the same user request and computing offline precision, diversity, and novelty metrics before launching an A/B test with live traffic. Google's continuous evaluation framework ties canary metrics to its experiment platforms, automatically halting rollouts if statistically significant regressions appear in core metrics within the first few hours.
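A minimal sketch of the shadow pattern described above, assuming hypothetical `primary_model` and `candidate_model` objects that expose a `predict(features)` method; the candidate runs off the request path so its latency and failures never reach the user.

```python
import concurrent.futures
import json
import logging
import time

log = logging.getLogger("shadow")
executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def _shadow_call(candidate_model, features, primary_output, primary_ms, request_id):
    """Run the candidate off the request path and log both outputs for offline comparison."""
    start = time.monotonic()
    try:
        candidate_output = candidate_model.predict(features)
        candidate_ms = (time.monotonic() - start) * 1000
        log.info(json.dumps({
            "request_id": request_id,
            "primary_output": primary_output,
            "candidate_output": candidate_output,
            "primary_latency_ms": round(primary_ms, 2),
            "candidate_latency_ms": round(candidate_ms, 2),
        }, default=str))
    except Exception:
        # Candidate failures must never affect the user-facing response.
        log.exception("shadow inference failed for request %s", request_id)

def serve(primary_model, candidate_model, features, request_id):
    """Return the primary model's prediction; shadow the candidate asynchronously."""
    start = time.monotonic()
    primary_output = primary_model.predict(features)
    primary_ms = (time.monotonic() - start) * 1000
    executor.submit(_shadow_call, candidate_model, features,
                    primary_output, primary_ms, request_id)
    return primary_output
```

The logged pairs are what later feed the skew, latency, and distribution-shift comparisons.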
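The guard-and-rollback logic can likewise be sketched as a periodic check for sustained SLO breaches; the guard names, thresholds, and `rollback` callback below are illustrative assumptions rather than any specific platform's API.

```python
from dataclasses import dataclass

@dataclass
class Guard:
    """One canary guard: a breach only counts after `sustain_checks` consecutive violations."""
    name: str
    threshold: float
    higher_is_bad: bool      # True for latency/error rate, False for CTR-style deltas
    sustain_checks: int      # e.g. 15 one-minute checks ~= 15 minutes sustained
    consecutive_breaches: int = 0

    def update(self, value: float) -> bool:
        breached = value > self.threshold if self.higher_is_bad else value < self.threshold
        self.consecutive_breaches = self.consecutive_breaches + 1 if breached else 0
        return self.consecutive_breaches >= self.sustain_checks

GUARDS = [
    Guard("p95_latency_ms", threshold=50.0, higher_is_bad=True, sustain_checks=15),
    Guard("error_rate", threshold=0.001, higher_is_bad=True, sustain_checks=15),
    Guard("ctr_relative_delta", threshold=-0.02, higher_is_bad=False, sustain_checks=30),
]

def evaluate_canary(metrics: dict, rollback) -> bool:
    """Run one evaluation cycle; trigger rollback if any guard has a sustained breach."""
    for guard in GUARDS:
        if guard.update(metrics[guard.name]):
            rollback(reason=f"{guard.name} breached for {guard.sustain_checks} checks")
            return False
    return True
```

A rollout scheduler would call `evaluate_canary` once per check interval at each traffic stage (1, 5, 25, then 100 percent) and advance only after a clean window.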
💡 Key Takeaways
Shadow deployment doubles inference cost during evaluation (e.g., 10 million requests over 2 hours) but catches training-serving skew, feature availability issues, and tail latency problems that offline replay misses
Canary progression: 1 percent for 30 minutes, 5 percent for 2 hours, 25 percent, then full rollout with automated guards watching p95 latency under 50ms, error rate under 0.1 percent, business metric delta within 2 percent
Automated rollback must complete in under 2 minutes when guards breach SLOs, requiring both model versions to be loaded or quickly loadable and schema compatibility with both feature versions during the rollout window
Tail latency (p99, p999) reveals regressions that mean latency hides: a model with a 45ms mean but 500ms p99 due to cold caches or autoscaling lag will degrade the experience for a visible slice of users
Guard duration matters: requiring sustained breaches (e.g., p95 above 50ms for 15 consecutive minutes) prevents false positives from transient spikes but delays rollback during real incidents
Shadow reveals distribution shifts: if the candidate model predicts scores 20 percent higher on average than the baseline on the same requests, that likely indicates calibration or numerical skew even when offline metrics looked good (see the sketch after this list)
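A rough sketch of such a distribution-shift check over paired shadow predictions for the same requests; the function name is hypothetical, and the thresholds simply echo the 10 percent p50 shift and 20 percent mean delta figures above.

```python
import statistics

def prediction_shift(primary_scores, candidate_scores,
                     p50_shift_limit=0.10, mean_delta_limit=0.20):
    """Flag likely calibration or numerical skew from paired shadow predictions.

    Both lists hold scores for the same requests, aligned by index.
    Thresholds are illustrative; alerts warrant investigation even when
    offline metrics looked fine.
    """
    p50_primary = statistics.median(primary_scores)
    p50_candidate = statistics.median(candidate_scores)
    p50_shift = abs(p50_candidate - p50_primary) / p50_primary

    mean_primary = statistics.fmean(primary_scores)
    mean_candidate = statistics.fmean(candidate_scores)
    mean_delta = (mean_candidate - mean_primary) / mean_primary

    return {
        "p50_shift": p50_shift,
        "mean_relative_delta": mean_delta,
        "alert": p50_shift > p50_shift_limit or abs(mean_delta) > mean_delta_limit,
    }
```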
📌 Examples
Uber ETA prediction canary: Monitors the prediction distribution (alerts if the p50 predicted time shifts by more than 10 percent), feature fetch p99 latency, and driver acceptance rate, starting at 1 percent of traffic in a single city before expanding geographically
Netflix personalization shadow: Runs the candidate recommendation model on replayed user requests, logs both models' top-10 recommendations, and computes offline precision@10, diversity (intra-list distance), and novelty (popularity decay) before an A/B test (see the sketch after this list)
Google Search ranking rollout: Ties canary metrics to the experiment platform, automatically halting if a statistically significant regression in query success rate, click-through rate, or time to success appears within the first 6 hours at 5 percent traffic
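For illustration, the offline comparison metrics mentioned in the Netflix example might be computed along these lines; `precision_at_k`, `intra_list_distance`, and the `distance` callback are hypothetical names, not Netflix's actual tooling.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that appear in the relevant set."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def intra_list_distance(recommended, distance, k=10):
    """Average pairwise distance among the top-k items; higher means more diverse."""
    top_k = recommended[:k]
    pairs = [(a, b) for i, a in enumerate(top_k) for b in top_k[i + 1:]]
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```

Here `distance` could be, for example, one minus the cosine similarity between item embeddings; novelty would be computed analogously from item popularity.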