ML Infrastructure & MLOps • Shadow Mode Deployment • Medium • ⏱️ ~2 min
Shadow Mode Architecture and Traffic Flow
A production shadow deployment for a recommendation service handling 40,000 requests per second starts with request mirroring at the edge gateway. The gateway authenticates, applies rate limits, stamps a correlation ID and sampling decision, then duplicates the request payload. The live request flows synchronously to the current model service with a p95 latency of 80ms and p99 of 130ms. The shadow copy goes to a non-blocking queue consumed asynchronously by the candidate model service.
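A minimal sketch of the gateway's mirroring decision, assuming Python and hypothetical `live_model` and `shadow_queue` handles (a real gateway would implement this in edge infrastructure, not application code):

```python
import random
import uuid

SHADOW_SAMPLE_RATE = 0.05  # illustrative: mirror 5% of traffic

def handle_request(request, live_model, shadow_queue):
    # Stamp every request with a correlation ID so live and shadow
    # outputs can be joined later despite asynchronous processing.
    correlation_id = str(uuid.uuid4())
    mirror = random.random() < SHADOW_SAMPLE_RATE

    # Live path: synchronous call; the user-facing latency budget applies.
    response = live_model.predict(request.payload)

    # Shadow path: fire-and-forget enqueue. A non-blocking put keeps
    # gateway overhead in the low single-digit milliseconds.
    if mirror:
        shadow_queue.put_nowait({
            "correlation_id": correlation_id,
            "payload": request.payload,
        })
    return response
```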
Both models call the same feature service in read-only mode, hitting shared caches for user profiles and item embeddings. The shadow service produces rankings and pushes predictions to an evaluation stream with the correlation ID, request context, model version, and resource metrics. A difference analyzer computes per-request deltas: for ranking it calculates Kendall tau correlation or NDCG difference; for classification it logs agreement and confidence gaps. Asynchronous mirroring typically adds less than 2ms of p99 overhead at the gateway.
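One way the analyzer's per-request ranking comparison could look, as a sketch using SciPy's `kendalltau` (the function name `ranking_delta` is illustrative):

```python
from scipy.stats import kendalltau  # assumption: SciPy is available

def ranking_delta(live_ranking, shadow_ranking):
    """Kendall tau between two rankings of the same item set.

    live_ranking / shadow_ranking: lists of item IDs, best first.
    Returns tau in [-1, 1]; 1.0 means the models agree exactly.
    """
    # Express the shadow ordering as ranks keyed by the live order.
    shadow_rank = {item: pos for pos, item in enumerate(shadow_ranking)}
    live_positions = list(range(len(live_ranking)))
    shadow_positions = [shadow_rank[item] for item in live_ranking]
    tau, _ = kendalltau(live_positions, shadow_positions)
    return tau

# Example: two rankings that swap one adjacent pair.
print(ranking_delta(["a", "b", "c", "d"], ["a", "c", "b", "d"]))  # ~0.67
```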
A label joiner service runs in batch mode, merging shadow outputs with ground truth using time windows to handle delayed labels. If conversion labels arrive within 7 days, the joiner computes daily aggregates of AUC, precision at k = 10, or calibration error by segment. Service Level Objective (SLO) dashboards display p50, p95, and p99 latency by model, throughput, error rates, and CPU utilization. Alerts fire if shadow p99 exceeds 180ms or the disagreement rate spikes above a threshold for high-value segments.
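A minimal sketch of the windowed label join, assuming pandas DataFrames and scikit-learn's `roc_auc_score`; the column names are illustrative:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score  # assumption: scikit-learn available

LABEL_WINDOW = pd.Timedelta(days=7)  # conversions accepted up to 7 days late

def join_and_score(shadow_preds: pd.DataFrame, labels: pd.DataFrame) -> pd.Series:
    """Join shadow predictions to delayed labels and compute daily AUC.

    shadow_preds: columns [correlation_id, ts, score, segment]
    labels:       columns [correlation_id, label_ts, label]
    ts / label_ts are assumed to be pandas datetimes.
    """
    joined = shadow_preds.merge(labels, on="correlation_id", how="inner")
    # Keep only labels that arrived inside the attribution window.
    in_window = joined[joined.label_ts - joined.ts <= LABEL_WINDOW]
    # Daily AUC keyed by serving day; segment could be a second group key.
    # Note: roc_auc_score needs both classes present within each day.
    return in_window.groupby(in_window.ts.dt.date).apply(
        lambda g: roc_auc_score(g.label, g.score)
    )
```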
💡 Key Takeaways
•Asynchronous mirroring decouples live and shadow paths: Gateway adds under 2ms p99 overhead, shadow consumer scales independently to maintain target latency
•Strict read-only isolation: Shadow model runs in its own autoscaling group with read-only credentials and separate serving-side caches to prevent side effects or resource contention with the live path
•Correlation ID tracking: Every request stamped with unique ID enables precise per-request difference analysis and later label joins despite processing lag
•Sampled traffic strategy: Start at 1 to 5 percent for stability validation, increase to 25 to 50 percent for load characterization, saving compute cost while maximizing learning
•At 40K req/sec with 4 kB payloads, full mirroring adds roughly 1.3 Gbps of internal traffic and doubles feature store Queries Per Second (QPS); 25 percent sampling cuts that to about 325 Mbps (worked through in the sketch below)
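The bandwidth figures in the last takeaway are straightforward arithmetic; a quick check, treating 4 kB as 4,096 bytes (decimal kilobytes give 1.28 Gbps and 320 Mbps instead):

```python
REQS_PER_SEC = 40_000
PAYLOAD_BITS = 4 * 1024 * 8  # 4 kB payload, expressed in bits

# Full mirroring duplicates every payload onto the internal network.
full_gbps = REQS_PER_SEC * PAYLOAD_BITS / 1e9            # ~1.31 Gbps
# 25% sampling mirrors a quarter of the requests.
sampled_mbps = 0.25 * REQS_PER_SEC * PAYLOAD_BITS / 1e6  # ~328 Mbps
print(f"full: {full_gbps:.2f} Gbps, 25% sample: {sampled_mbps:.0f} Mbps")
```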
📌 Examples
LinkedIn's search team mirrors 10% of queries to a shadow ranker and evaluates NDCG at 10 improvements while monitoring that p99 stays under the production latency budget
Uber's ETA prediction model shadowed at 30% of traffic: the label joiner handles trips that complete hours later and computes MAPE daily by city and time of day
An e-commerce pricing engine logs 1 kB per shadow prediction; at 5,000 req/sec and 50% mirroring it generates 216 GB of shadow logs daily, about 1.5 TB over a 7-day retention window (arithmetic sketched below)
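The storage figures in the last example check out the same way (decimal units assumed):

```python
LOG_BYTES = 1_000        # 1 kB per shadow prediction
REQS_PER_SEC = 5_000
MIRROR_RATE = 0.5
SECONDS_PER_DAY = 86_400

daily_gb = REQS_PER_SEC * MIRROR_RATE * LOG_BYTES * SECONDS_PER_DAY / 1e9
print(f"{daily_gb:.0f} GB/day")                  # 216 GB/day
print(f"{daily_gb * 7 / 1000:.2f} TB retained")  # ~1.51 TB over 7 days
```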