Definition
Shadow deployment runs a candidate model on live traffic without affecting user responses—logging predictions and latency for comparison. Canary deployment sends a small percentage of real traffic to the candidate while monitoring metrics in real time.
SHADOW DEPLOYMENT
Run the candidate in shadow for 10M requests over 2 hours, recording both models' outputs, feature fetch times, and inference latencies. This surfaces training-serving skew, numerical differences, feature availability issues, and tail latency problems before any user sees the new model. Cost: doubled inference load. Benefit: catches environment-specific issues like cache behavior, load balancing artifacts, and upstream service variability that offline replay misses.
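As a minimal sketch of the shadow path (all names here are illustrative; a production system would invoke the candidate asynchronously rather than inline so it cannot add user-facing latency):

```python
import time

def serve_with_shadow(request, baseline_model, candidate_model, shadow_log):
    """Serve the baseline's prediction; run the candidate only for logging.

    The candidate runs inline here for simplicity; in production it would be
    fired asynchronously so it cannot slow the user-facing path.
    """
    t0 = time.perf_counter()
    baseline_pred = baseline_model(request)    # user-facing call
    baseline_ms = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    candidate_pred = candidate_model(request)  # never shown to the user
    shadow_log.append({
        "baseline": baseline_pred,
        "candidate": candidate_pred,
        "baseline_ms": baseline_ms,
        "candidate_ms": (time.perf_counter() - t1) * 1000,
    })
    return baseline_pred                       # response is unchanged
```

The logged pairs are what later feed the skew and distribution-shift comparisons.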
CANARY PROGRESSION
Typical progression: 1% traffic for 30 min (catch obvious regressions), then 5% for 2 hours, then 25%, then full rollout. At each stage, automated guards check: p95 latency under 50ms, error rate under 0.1%, online metric delta within 2%.
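The stage gates above can be sketched as a guard function (thresholds are taken from the text; function and stage names are illustrative):

```python
# Canary stages from the text: (traffic fraction, minimum minutes at stage).
STAGES = [(0.01, 30), (0.05, 120), (0.25, 0), (1.00, 0)]

def guards_pass(p95_latency_ms, error_rate, metric_delta_pct):
    """All three guards must hold: p95 < 50 ms, errors < 0.1%, metric delta within 2%."""
    return (p95_latency_ms < 50.0
            and error_rate < 0.001
            and abs(metric_delta_pct) < 2.0)

def next_stage(stage_idx, snapshots):
    """Advance one stage only if every metrics snapshot at this stage passed."""
    if all(guards_pass(*s) for s in snapshots):
        return min(stage_idx + 1, len(STAGES) - 1)
    return stage_idx  # hold here (or trigger rollback) on any breach
```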
⚠️ Warning: Watch tail latencies (p99, p999). Mean latency can look fine while a small slice of users experiences 500ms responses due to cold cache or autoscaling lag.
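A toy calculation makes the warning concrete (the 500ms cold-cache figure is from the text; the request counts and 40ms warm latency are assumed for illustration):

```python
def percentile(values, q):
    """Nearest-rank percentile: smallest value covering at least q% of samples."""
    s = sorted(values)
    rank = max(1, round(q / 100 * len(s)))
    return s[rank - 1]

# 989 requests served warm at 40 ms, 11 cold-cache requests at 500 ms:
latencies = [40.0] * 989 + [500.0] * 11

mean_ms = sum(latencies) / len(latencies)  # ~45 ms -- passes a mean-only check
p99_ms = percentile(latencies, 99)         # 500 ms -- the regression the mean hides
```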
AUTOMATED ROLLBACK
If any guard breaches its SLO for a sustained period (p95 > 50ms for 15 consecutive minutes, or CTR drop > 2% for 30 minutes), revert to the prior model in under 2 minutes. Requirements: keep both model binaries loaded or quickly loadable, maintain feature schema compatibility during the rollout window, and pre-warm the caches the old model needs.
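The sustained-breach rule can be sketched as a small counter (class name and one-minute sampling cadence are assumptions; the 50 ms / 15 minute figures are from the text):

```python
class SustainedBreachGuard:
    """Trip only after `window` consecutive breaching samples, e.g. 15
    one-minute p95 samples above 50 ms, so transient spikes don't cause rollback."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.consecutive = 0

    def observe(self, value):
        """Record one sample; return True when rollback should fire."""
        self.consecutive = self.consecutive + 1 if value > self.threshold else 0
        return self.consecutive >= self.window

p95_guard = SustainedBreachGuard(threshold=50.0, window=15)  # 15 one-minute samples
```

Any single healthy sample resets the counter, which is what makes the guard robust to transient spikes but slower to fire during real incidents.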
💡 Key Metrics: Prediction distribution shifts (alert if p50 changes > 10%), feature fetch p99, and business KPIs. Shadow mode compares both models' recommendations on the same user requests, measuring offline precision, diversity, and novelty before launching the A/B test.
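The p50 shift alert might look like this sketch (function name is illustrative; the 10% threshold is from the text):

```python
def p50_shift_alert(baseline_scores, candidate_scores, threshold_pct=10.0):
    """Alert when the candidate's median prediction moves more than
    threshold_pct relative to the baseline's median on the same requests."""
    def median(xs):
        s = sorted(xs)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

    base, cand = median(baseline_scores), median(candidate_scores)
    shift_pct = abs(cand - base) / abs(base) * 100
    return shift_pct > threshold_pct
```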
✓ Shadow deployment doubles inference cost during evaluation (e.g., 10 million requests over 2 hours) but catches training-serving skew, feature availability issues, and tail latency problems that offline replay misses.
✓ Canary progression: 1% for 30 minutes, 5% for 2 hours, 25%, then full rollout, with automated guards watching p95 latency under 50ms, error rate under 0.1%, and business metric delta within 2%.
✓ Automated rollback must complete in under 2 minutes when guards breach SLOs, requiring both model versions loaded or quickly loadable and schema compatibility for both feature versions during the rollout window.
✓ Tail latency (p99, p999) can hide regressions that mean latency misses: a model with a 45ms mean but 500ms p99 due to cold cache or autoscaling lag will degrade user experience for a visible slice of users.
✓ Guard duration matters: sustained breaches (e.g., p95 > 50ms for 15 consecutive minutes) prevent false positives from transient spikes but delay rollback during real issues.
✓ Shadow reveals distribution shifts: if the candidate model predicts 20% higher scores on average than the baseline on the same requests, it likely indicates calibration or numerical skew even if offline metrics looked good.
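The mean-score comparison can be computed directly from shadow logs; a minimal sketch (the pair format and function name are assumptions):

```python
def mean_score_skew_pct(pairs):
    """Given (baseline_score, candidate_score) pairs logged in shadow mode for
    the same requests, return the candidate's mean shift in percent; a large
    value like +20% suggests calibration or numerical skew."""
    base_mean = sum(b for b, _ in pairs) / len(pairs)
    cand_mean = sum(c for _, c in pairs) / len(pairs)
    return (cand_mean - base_mean) / base_mean * 100
```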
1. Uber ETA prediction canary: monitors prediction distribution (alerts if p50 predicted time shifts by more than 10%), feature fetch p99 latency, and driver acceptance rate, starting at 1% traffic in a single city before geographic expansion.
2. Netflix personalization shadow: runs the candidate recommendation model on replayed user requests, logs both models' top-10 recommendations, and computes offline precision@10, diversity (intra-list distance), and novelty (popularity decay) before the A/B test.
3. Google Search ranking rollout: ties canary metrics to the experiment platform and automatically halts if a statistically significant regression in query success rate, click-through rate, or time-to-success appears within the first 6 hours at 5% traffic.