Shadow Deployment for Risk-Free Model Validation
How Shadow Mode Works
Shadow deployment duplicates 100 percent of production requests to the new model but discards its predictions, serving only the baseline model's outputs to users. This validates inference latency, feature availability, schema compatibility, and resource consumption under real traffic without any user impact. The new model runs in production conditions, receiving exactly the same requests as the baseline, but its outputs are logged and analyzed rather than returned to users.
What Shadow Validates
Netflix uses shadow mode to validate prediction parity and latency impact before canary. Engineers compare shadow outputs against baseline predictions and ground truth (when available) to detect distribution shifts or calibration drift. At 10,000 QPS, a 24-hour shadow generates 864 million paired predictions for offline analysis. Key metrics include prediction distribution divergence (KL divergence, histogram distance), calibration curve alignment, latency percentiles, memory usage, and error rate.
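One of the divergence metrics listed above can be computed directly from the logged score pairs. This is a minimal stdlib-only sketch of binned KL divergence over prediction scores in [0, 1]; the bin count and smoothing epsilon are illustrative choices, not values from any of the companies mentioned.

```python
import math

def histogram(scores, bins=10):
    """Bin scores in [0, 1] into a normalized histogram."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over two histograms, smoothed so empty bins
    in the shadow distribution don't produce infinities."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

# Toy paired scores; in practice these come from the shadow log.
baseline_scores = [0.12, 0.35, 0.41, 0.77, 0.80, 0.55]
shadow_scores = [0.14, 0.33, 0.44, 0.75, 0.82, 0.53]
drift = kl_divergence(histogram(baseline_scores),
                      histogram(shadow_scores))
```

A near-zero value indicates the shadow model's score distribution matches the baseline's; a sustained rise in this number during the shadow window is the kind of distribution shift the paired predictions are collected to catch.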
Duration and Cost
Airbnb runs multi-day shadows for ranking and search changes to observe feature drift, infrastructure cost deltas, and edge case behavior that stress tests might miss. The cost is roughly 2x compute for the inference tier during the shadow window, which typically runs for hours to days depending on risk. At Uber's scale (millions of predictions per second), shadow mode can add tens of thousands of dollars per day in compute costs.
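The 2x-compute claim makes the marginal cost easy to estimate: every production request adds exactly one extra inference. A back-of-envelope sketch, with an entirely illustrative unit price (the QPS and duration match the 10,000 QPS, 24-hour example above):

```python
def shadow_cost(qps, cost_per_million_inferences, hours):
    """Marginal shadow cost: one duplicate inference per request."""
    extra_inferences = qps * 3600 * hours
    return extra_inferences / 1e6 * cost_per_million_inferences

# 10,000 QPS for 24 hours = 864M duplicate inferences,
# priced here at a hypothetical $0.50 per million.
daily = shadow_cost(10_000, 0.50, 24)
```

At hyperscale the same formula explains the larger figures: millions of predictions per second multiplies the duplicate-inference count by two to three orders of magnitude.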
When to Use Shadow
Use shadow for high-risk changes like model family switches (going from gradient boosted trees to deep neural networks), feature schema migrations, infrastructure changes (new serving framework, new hardware), or models with critical business impact. For low-risk incremental updates, skip directly to a 1 percent canary to save cost and accelerate rollout. The key question: what is the cost of a regression reaching production versus the cost of extended shadow validation?
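That key question is an expected-cost comparison, which can be written down directly. All inputs here are rough estimates a team would supply, and the example figures are purely hypothetical:

```python
def should_shadow(p_regression, regression_cost, shadow_window_cost):
    """Shadow when the expected cost of a regression reaching
    production exceeds the cost of the shadow window."""
    return p_regression * regression_cost > shadow_window_cost

# Hypothetical model-family switch: 20% regression risk, $500k blast
# radius, $30k shadow window -> expected loss $100k, so shadow.
risky = should_shadow(0.20, 500_000, 30_000)

# Hypothetical incremental retrain: 1% risk, $50k blast radius ->
# expected loss $500, so go straight to canary.
routine = should_shadow(0.01, 50_000, 30_000)
```

The inputs are uncertain estimates, so this is a framing device rather than a precise calculator; in practice the regression probability is the hardest term to pin down.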