ML Infrastructure & MLOps • Shadow Mode Deployment
Implementing Shadow Mode: Mirroring, Isolation, and Promotion Criteria
Place traffic mirroring at the earliest safe point that has full request context and minimal latency impact. Common choices are the edge proxy, ingress gateway, or service sidecar. The gateway stamps a correlation ID, model routing context, and a sampling decision, forwards the main request synchronously to the live model, and enqueues the shadow payload onto a message bus like Kafka or a bounded in-memory queue. The shadow consumer scales independently, with autoscaling policies that keep p99 latency within target, for example 150 ms. Shadow call timeouts should be stricter than live, for example an 80 ms shadow timeout against a 120 ms live budget, so long-running shadow computations are cut off before they pile up.
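A minimal Python sketch of this mirroring hook, assuming a synchronous live call via `requests` and a `confluent_kafka` producer; the endpoint, topic name, and sampling constant are illustrative, not from the text:

```python
import json
import random
import time
import uuid

import requests                       # synchronous live-path HTTP call
from confluent_kafka import Producer  # shadow payloads buffered to Kafka

LIVE_URL = "http://live-model:8080/predict"  # hypothetical endpoint
SHADOW_TOPIC = "shadow-inference"            # hypothetical topic name
LIVE_TIMEOUT_S = 0.120                       # live budget from the text
SAMPLE_RATE = 0.25                           # illustrative mirroring rate

producer = Producer({"bootstrap.servers": "kafka:9092"})

def handle(request_features: dict) -> dict:
    """Serve the live model synchronously; mirror asynchronously."""
    correlation_id = str(uuid.uuid4())
    sampled = random.random() < SAMPLE_RATE

    # Live path: synchronous and bounded by the live latency budget.
    t0 = time.monotonic()
    live_resp = requests.post(LIVE_URL, json=request_features,
                              timeout=LIVE_TIMEOUT_S)
    live_ms = (time.monotonic() - t0) * 1000.0

    if sampled:
        # Shadow path: produce() only appends to a local buffer, so the
        # live response never waits on the shadow consumer.
        payload = {
            "correlation_id": correlation_id,
            "features": request_features,
            "live_prediction": live_resp.json(),
            "live_latency_ms": live_ms,
        }
        producer.produce(SHADOW_TOPIC, key=correlation_id,
                         value=json.dumps(payload).encode())
        producer.poll(0)  # run delivery callbacks without blocking

    return live_resp.json()
```

The stricter 80 ms shadow timeout is enforced in the shadow consumer when it invokes the shadow model, not in this hook.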
Ensure strict isolation to prevent resource contention. Use separate autoscaling groups, dedicated caches, and read-only feature store credentials for the shadow model. Do not share thread pools, connection pools, or circuit breakers with the live path. Budget capacity intentionally: if peak traffic is 40,000 requests per second and the median payload is 4 kB, full mirroring adds roughly 1.3 Gbps of internal traffic. Many teams sample at 10 to 30 percent initially to validate stability, then increase to 50 percent or more for comprehensive load characterization. Stratify sampling so that high-value or high-risk segments are always mirrored at higher rates.
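The capacity arithmetic is easy to encode as a back-of-the-envelope helper; the figures reproduce the numbers above, and the segment names in the stratified-sampling stub are hypothetical:

```python
def mirrored_gbps(peak_rps: float, payload_kb: float, sample_rate: float) -> float:
    """Added internal traffic from mirroring, in decimal Gbps."""
    return peak_rps * payload_kb * 1024 * sample_rate * 8 / 1e9

print(mirrored_gbps(40_000, 4, 1.00))  # ~1.31 Gbps at full mirroring
print(mirrored_gbps(40_000, 4, 0.25))  # ~0.33 Gbps (~325 Mbps) at 25%

def shadow_sample_rate(segment: str) -> float:
    """Stratified sampling: always mirror high-value/high-risk traffic."""
    return {"high_value": 1.0, "high_risk": 1.0}.get(segment, 0.10)
```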
Log both inputs and outputs under strong schema contracts. Store the canonicalized request features after preprocessing, the baseline prediction and latency, the shadow prediction and latency, and resource snapshots such as CPU and memory. Emit records to an evaluation stream and aggregate hourly; a batch label-joiner job handles late-arriving ground-truth labels within bounded time windows. Define quantitative promotion criteria with statistical rigor: p95 and p99 latency within the Service Level Objective (SLO), accuracy improvement by segment (for example, a bootstrap confidence interval for the AUC delta that excludes zero, or an NDCG at 10 lift of at least 2 percent), and stable resource utilization. Run for a minimum exposure window, for example one week covering weekday and weekend patterns with at least 10 million requests, before moving to canary.
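The AUC promotion gate can be checked with a paired bootstrap over the joined evaluation records. A sketch assuming NumPy and scikit-learn, with synthetic scores standing in for the real logs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_delta_ci(labels, live_scores, shadow_scores,
                 n_boot=1000, alpha=0.05, seed=0):
    """Paired-bootstrap CI for AUC(shadow) - AUC(live).

    Promotion gate from the text: the interval must exclude zero
    (and sit above it) before moving to canary.
    """
    rng = np.random.default_rng(seed)
    labels, live_scores, shadow_scores = map(
        np.asarray, (labels, live_scores, shadow_scores))
    n = len(labels)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample requests with replacement
        deltas[b] = (roc_auc_score(labels[idx], shadow_scores[idx])
                     - roc_auc_score(labels[idx], live_scores[idx]))
    lo, hi = np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Synthetic stand-ins; in practice these come from the hourly aggregates.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 20_000)
live = y * 0.5 + rng.random(20_000)
shadow = y * 0.6 + rng.random(20_000)
lo, hi = auc_delta_ci(y, live, shadow)
print(f"AUC delta 95% CI: [{lo:.4f}, {hi:.4f}]  promote: {lo > 0}")
```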
💡 Key Takeaways
• Mirroring placement: an edge gateway or sidecar with async queuing adds under 2 ms of p99 overhead; synchronous inline mirroring can block for 10 to 50 ms and should be avoided
• Strict isolation (separate autoscaling groups, dedicated caches, read-only credentials) prevents resource contention; shadow should never impact live model latency
• Capacity budgeting: 40K req/sec with 4 kB payloads at 25% sampling adds roughly 325 Mbps of traffic and 10K feature store QPS; plan infrastructure before scaling mirroring
• Schema enforcement: log canonicalized features, predictions, latency, and resource metrics with versioning to enable reliable per-request difference analysis and debugging
• Statistical promotion gates: bootstrap confidence interval for the AUC delta, NDCG at 10 relative lift of at least 2%, p99 latency within SLO, and a minimum one-week exposure covering 10M requests
• Stricter shadow timeouts: if live has a 120 ms budget, set shadow to 80 ms to detect slow paths early without risking production latency degradation
📌 Examples
A search ranking team uses an Envoy sidecar for mirroring; the shadow consumer autoscales to keep p99 under 150 ms, logs to a Kafka topic with 7-day retention, and joins labels in a Spark batch job
Recommendation model promotion criteria: p99 latency under 180 ms for 1 week, NDCG at 10 improvement of at least 2% with 95% confidence, and a disagreement rate under 15% on the high-value segment, then proceed to a 1% canary
A fraud scoring service mirrors 100% of transactions, uses a read-only Postgres replica for features, enforces a 50 ms shadow timeout versus 100 ms live, and stores 1 kB per prediction for 30 days (roughly 30 GB at 1M daily txns; see the sizing sketch below)
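For the storage figure, a small sizing helper using the example's numbers:

```python
def retention_gb(daily_preds: float, bytes_each: float, days: int) -> float:
    """Log storage over the retention window, in decimal GB."""
    return daily_preds * bytes_each * days / 1e9

print(retention_gb(1_000_000, 1024, 30))  # ~30.7 GB for the fraud example
```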