A/B Testing & Experimentation • Ramp-up Strategies & Canary Analysis
Trade-offs: Canary vs. Blue-Green vs. Shadow Deployment
Canary deployment prioritizes safety and data-driven decisions at the cost of speed and operational complexity. A full canary ramp takes 24 to 48 hours versus a blue-green cutover in minutes. You operate two model versions in parallel, requiring 5 to 10 percent extra capacity once the canary reaches 25 percent of traffic. If the new model adds 20 percent CPU per request, the combined load during the transition increases infrastructure cost proportionally. Observability must be mature: you need per-version metrics, stratified cohorts, and statistical rigor to avoid false rollbacks or false confidence.
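The capacity arithmetic is easy to sanity-check. Below is a minimal sketch, assuming the 20 percent per-request CPU overhead mentioned above; the ramp percentages are illustrative, and the +5 percent at a 25 percent split is where the extra-capacity figure comes from.

```python
# Minimal sketch: combined CPU load during a canary ramp, assuming the new
# model costs 20% more CPU per request (relative_cpu=1.2, the figure above).

def combined_load(canary_fraction: float, relative_cpu: float = 1.2) -> float:
    """Return total fleet CPU relative to a baseline-only fleet."""
    return (1 - canary_fraction) * 1.0 + canary_fraction * relative_cpu

for fraction in (0.01, 0.05, 0.25, 0.50, 1.00):
    extra = (combined_load(fraction) - 1.0) * 100
    print(f"canary at {fraction:>5.0%}: +{extra:.1f}% total compute")
# canary at 25% -> +5.0% total compute
```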
Use canary for high-risk changes where offline testing cannot fully validate production behavior: new ranking models, feature pipeline changes, fraud detection rule updates, or any change with user-facing impact. ML models are especially suited to canary because they can pass offline validation with 0.82 Mean Average Precision (MAP) yet fail online due to distribution shift, feedback loops, or training-serving skew causing a 15 percent accuracy drop.
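To make the statistical-rigor point concrete, here is an illustrative gate (the function names, metric, and thresholds are assumptions, not any particular platform's API) that compares canary and baseline click-through rates with a two-proportion z-test before deciding to roll back, hold, or promote.

```python
# Illustrative canary gate: a two-sided two-proportion z-test on CTR, so that
# noise does not trigger a false rollback or grant false confidence.
from math import sqrt
from statistics import NormalDist

def ctr_z_test(clicks_a: int, views_a: int, clicks_b: int, views_b: int) -> float:
    """Two-sided p-value for H0: baseline CTR == canary CTR."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def canary_decision(baseline, canary, alpha: float = 0.05) -> str:
    """baseline/canary are (clicks, views) totals from matched, stratified cohorts."""
    p = ctr_z_test(*baseline, *canary)
    ctr_base, ctr_canary = baseline[0] / baseline[1], canary[0] / canary[1]
    if p < alpha and ctr_canary < ctr_base:
        return "rollback"          # statistically significant regression
    if p < alpha and ctr_canary > ctr_base:
        return "promote"           # statistically significant improvement
    return "hold"                  # keep ramping / collect more data

print(canary_decision(baseline=(5200, 100_000), canary=(4100, 100_000)))  # rollback
```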
Blue-green deployment suits schema-incompatible changes or situations requiring a fast, atomic cutover. You run full parallel stacks, validate the green stack with synthetic checks, then switch all traffic instantly via a load balancer reweight. This avoids mixed-version states but requires double capacity during the switch and provides no gradual assessment of user impact. Use blue-green when traffic shaping is complex or when you need instant rollback, for example database schema migrations behind a compatibility layer.
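A blue-green cutover reduces to a single atomic weight update preceded by synthetic checks. The sketch below uses an in-memory load balancer, a stubbed health check, and a fixed drain window as stand-ins for whatever infrastructure you actually run.

```python
# Hedged blue-green cutover sketch: validate green, flip weights atomically,
# keep blue warm for instant rollback, then drain it after a grace period.
import time
from dataclasses import dataclass, field

@dataclass
class LoadBalancer:
    weights: dict = field(default_factory=dict)   # stack name -> traffic share

    def set_weights(self, new_weights: dict) -> None:
        self.weights = dict(new_weights)          # single atomic reassignment

def synthetic_checks_pass(stack: str) -> bool:
    return True   # stub: replace with real smoke tests against the green stack

def blue_green_cutover(lb: LoadBalancer) -> bool:
    if not synthetic_checks_pass("green"):
        return False                              # blue keeps serving, nothing changed
    lb.set_weights({"green": 100, "blue": 0})     # atomic cutover, no gradual ramp
    drain_at = time.time() + 10 * 60              # blue stays warm for instant rollback
    print(f"blue drains at {time.ctime(drain_at)}")
    return True

lb = LoadBalancer({"blue": 100, "green": 0})
blue_green_cutover(lb)
print(lb.weights)   # {'green': 100, 'blue': 0}
```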
Shadow deployment (also called dark launch) duplicates production requests to the new version without affecting user-facing decisions. At 5 percent shadow traffic, the system validates latency (shadow P99 220 ms vs baseline 210 ms), memory usage (1.2 GB vs 1.0 GB per replica), and feature availability (null rate under 0.3 percent) at production scale. Shadow proves the system works but cannot evaluate user impact metrics like CTR or conversion. Uber uses shadow for feature validation, then follows with a canary to measure product metrics. Shadow is also useful for warming caches and indexes before live traffic arrives, reducing cold-start latency spikes.
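Request mirroring is the mechanical core of a shadow deployment. In the sketch below, the model functions and the 5 percent sample rate are stand-ins for real services: a sample of requests is duplicated to the candidate off the user path, shadow latency and feature null counts are recorded, and users only ever see the primary response.

```python
# Minimal shadow-traffic sketch: mirror a sample of requests to the candidate
# model and record its behavior, while the user-facing decision always comes
# from the primary model.
import random
import time
from concurrent.futures import ThreadPoolExecutor

SHADOW_RATE = 0.05                          # mirror 5% of production requests
executor = ThreadPoolExecutor(max_workers=4)
shadow_latencies_ms: list[float] = []
shadow_null_features = 0

def primary_model(features: dict) -> float:
    return 0.90                             # stand-in for the serving model

def shadow_model(features: dict) -> float:
    time.sleep(0.001)                       # stand-in for the candidate model
    return 0.85

def mirror(features: dict) -> None:
    global shadow_null_features
    start = time.perf_counter()
    shadow_model(features)
    shadow_latencies_ms.append((time.perf_counter() - start) * 1000)
    shadow_null_features += sum(v is None for v in features.values())

def handle_request(features: dict) -> float:
    if random.random() < SHADOW_RATE:
        executor.submit(mirror, features)   # fire-and-forget, off the user path
    return primary_model(features)          # user decision comes from primary only

print(handle_request({"user_age": 34, "last_click": None}))   # always the primary score
```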
💡 Key Takeaways
•Canary takes 24 to 48 hours with 5 to 10% extra capacity at 25% traffic; it trades speed for safety and data-driven product metric evaluation
•Blue-green enables atomic cutover in minutes but requires 2x full capacity and validates only system health via synthetic checks, not user impact
•Shadow deployment at 5% traffic validates latency, memory, and feature availability without affecting users; useful for cache warming and feature pipeline checks
•Canary is essential for ML because models can pass offline with 0.82 MAP but fail online due to distribution shift or training-serving skew causing a 15% accuracy drop
•Operational cost: a new model using 20% more CPU means a 25% canary adds 5% total compute during the ramp, versus blue-green briefly doubling cost
📌 Examples
Uber Michelangelo: Shadow at 5% for 1 hour validates feature null rate under 0.3%, then canary at 1% measures CTR impact over 2 hours
Blue-green for schema migration: run both stacks with a compatibility layer, validate green with synthetic load, switch the load balancer atomically, drain blue after 10 minutes
Netflix canary: 1% → 5% → 25% over 14 hours evaluating session length and next-day retention, whereas blue-green would miss long-term engagement effects (a staged ramp is sketched below)
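As a rough illustration of such a staged ramp, the sketch below holds each traffic stage, checks guardrail metrics, and rolls back automatically on a regression; the stage list, metric reader, and thresholds are illustrative assumptions, not Netflix's actual configuration.

```python
# Sketch of a staged canary ramp in the spirit of the 1% -> 5% -> 25% example.
import time

RAMP_STAGES = [(0.01, 2), (0.05, 4), (0.25, 8)]   # (traffic share, hours to hold)

def read_guardrails(share: float) -> dict:
    """Stub: pull per-version deltas (canary minus baseline) from your metrics store."""
    return {"session_length_delta": 0.002, "p99_latency_delta_ms": 8.0}

def guardrails_ok(m: dict) -> bool:
    return m["session_length_delta"] > -0.01 and m["p99_latency_delta_ms"] < 20

def run_ramp(set_traffic_share, dry_run: bool = True) -> str:
    for share, hold_hours in RAMP_STAGES:
        set_traffic_share(share)
        if not dry_run:
            time.sleep(hold_hours * 3600)         # wait for enough samples per stage
        if not guardrails_ok(read_guardrails(share)):
            set_traffic_share(0.0)                # automatic rollback to baseline
            return "rolled back"
    set_traffic_share(1.0)                        # full promotion after the final stage
    return "promoted"

print(run_ramp(lambda s: print(f"canary traffic -> {s:.0%}")))
```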