Shadow Deployment for Risk-Free Model Validation
Shadow deployment duplicates 100 percent of production requests to the new model but discards its predictions, serving only the baseline model's outputs to users. This validates inference latency, feature availability, schema compatibility, and resource consumption under real traffic without any user impact. The cost is roughly 2x compute for the inference tier during the shadow window, which typically runs for hours to days depending on risk.
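A minimal sketch of a shadow-mode request handler illustrating this pattern, assuming hypothetical `baseline_model` and `candidate_model` objects with a `.predict()` method and a `shadow_log` sink; the names are illustrative, not a specific serving framework:

```python
# Shadow-mode handler sketch: serve the baseline, mirror to the candidate,
# log both for offline comparison, never expose the candidate to users.
import time
from concurrent.futures import ThreadPoolExecutor

_shadow_pool = ThreadPoolExecutor(max_workers=8)  # keeps shadow work off the serving path

def handle_request(features, baseline_model, candidate_model, shadow_log):
    # Serve the baseline prediction to the user as usual.
    start = time.perf_counter()
    baseline_pred = baseline_model.predict(features)
    baseline_ms = (time.perf_counter() - start) * 1000

    def _shadow_call():
        # Mirror the same request to the candidate asynchronously; its output
        # is only logged, never returned to the user.
        try:
            t0 = time.perf_counter()
            shadow_pred = candidate_model.predict(features)
            shadow_ms = (time.perf_counter() - t0) * 1000
            shadow_log.write({
                "features": features,
                "baseline_pred": baseline_pred,
                "shadow_pred": shadow_pred,
                "baseline_ms": baseline_ms,
                "shadow_ms": shadow_ms,
            })
        except Exception as exc:
            # A shadow failure must never affect the user-facing response.
            shadow_log.write({"shadow_error": str(exc)})

    _shadow_pool.submit(_shadow_call)
    return baseline_pred  # only the baseline output reaches the user
```

Running the shadow call on a separate pool is what keeps the user-visible latency tied to the baseline model alone, even if the candidate is slower or failing.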
Netflix uses shadow mode to validate prediction parity and latency impact before canary. Engineers compare shadow outputs against baseline predictions and ground truth (when available) to detect distribution shifts or calibration drift. At 10,000 Queries Per Second (QPS), a 24-hour shadow generates 864 million paired predictions for offline analysis. Airbnb runs multi-day shadows for ranking and search changes to observe feature drift, infrastructure cost deltas, and edge-case behavior that stress tests might miss.
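A sketch of the kind of offline parity check such paired logs enable, assuming each record has the fields written by the handler above and numeric predictions; the tolerance and metrics are illustrative, not any company's actual pipeline:

```python
# Offline comparison of paired shadow logs: score parity, mean shift
# (a coarse calibration-drift signal), and p99 latency delta.
import numpy as np

def compare_paired_predictions(records, parity_tol=0.01):
    base = np.array([r["baseline_pred"] for r in records])
    shadow = np.array([r["shadow_pred"] for r in records])
    base_ms = np.array([r["baseline_ms"] for r in records])
    shadow_ms = np.array([r["shadow_ms"] for r in records])

    return {
        # Fraction of requests where the two models disagree beyond tolerance.
        "parity_violation_rate": float(np.mean(np.abs(base - shadow) > parity_tol)),
        # Mean score shift across the window, a coarse calibration-drift signal.
        "mean_score_delta": float(np.mean(shadow - base)),
        # Tail latency comparison: the candidate should not regress p99.
        "p99_latency_delta_ms": float(
            np.percentile(shadow_ms, 99) - np.percentile(base_ms, 99)
        ),
    }
```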
The key tradeoff is cost versus confidence. Shadow gives high confidence with zero user risk, but doubling compute at scale (Uber's millions of predictions per second) can add tens of thousands of dollars per day. Use shadow for high-risk changes like model family switches (going from gradient-boosted trees to deep neural networks), feature schema migrations, or new infrastructure. For low-risk incremental updates, skip directly to a 1 percent canary to save cost and accelerate rollout.
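One way to encode that decision rule, sketched below; the risk categories and the per-1,000-inference cost parameter are assumptions for illustration, not a standard policy:

```python
# Illustrative rollout-plan chooser: high-risk changes get a shadow window
# (roughly doubling inference compute for its duration), low-risk changes
# go straight to a 1 percent canary.
HIGH_RISK_CHANGES = {"model_family_switch", "feature_schema_migration", "new_infrastructure"}

def rollout_plan(change_type, qps, cost_per_1k_inferences_usd):
    # Extra daily cost of mirroring all traffic to the candidate model.
    daily_shadow_cost = qps * 86_400 / 1_000 * cost_per_1k_inferences_usd
    if change_type in HIGH_RISK_CHANGES:
        return {"strategy": "shadow_then_canary",
                "est_daily_shadow_cost_usd": round(daily_shadow_cost, 2)}
    return {"strategy": "direct_1pct_canary", "est_daily_shadow_cost_usd": 0.0}
```

At 10,000 QPS and an assumed $0.05 per 1,000 inferences, the shadow window would add roughly $43,200 per day, consistent with the "tens of thousands of dollars per day" figure above.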
💡 Key Takeaways
• Shadow deployment mirrors 100 percent of production traffic to the new model but serves only baseline predictions to users, enabling full validation with zero user impact
• Typical cost is 2x inference compute; at 10,000 QPS a 24-hour shadow generates 864 million paired predictions for comparing outputs, latency distributions, and feature availability
• Use shadow for high-risk changes like model family switches (gradient-boosted trees to neural networks), feature schema migrations, or infrastructure overhauls where canary blast radius is too risky
• Shadow reveals issues missed by offline tests: feature availability gaps under load, cache interactions, tail latency under contention, and real distribution drift that synthetic data cannot capture (a minimal availability check is sketched after this list)
• For low-risk incremental updates (hyperparameter tuning, minor retraining), skip shadow and go directly to a 1 percent canary to save compute cost and accelerate rollout timelines
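A minimal sketch of the feature-availability check referenced above, assuming shadow-log records carry the raw feature dict sent to the candidate and that missing features arrive as `None`; the 5 percent alert threshold is an illustrative assumption:

```python
# Flag features whose fallback (missing-value) rate in the shadow window
# exceeds a threshold, the kind of gap the Airbnb example below describes.
from collections import Counter

def feature_fallback_rates(records, alert_threshold=0.05):
    missing = Counter()
    total = len(records)
    for r in records:
        for name, value in r["features"].items():
            if value is None:  # feature unavailable -> model fell back to a default
                missing[name] += 1
    rates = {name: count / total for name, count in missing.items()}
    return {name: rate for name, rate in rates.items() if rate > alert_threshold}
```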
📌 Examples
Airbnb runs multi-day shadows for search ranking changes, observing infrastructure cost deltas and edge cases before a 5 percent canary; this caught a feature backfill gap that would have caused a 15 percent fallback rate
Netflix uses shadow to validate prediction parity between baseline and new models, comparing outputs against ground truth when available to detect calibration drift before any traffic shift