
Blue Green and Canary Deployment Patterns for Model Rollout

Blue green deployment runs the old and new model stacks in parallel, then switches all traffic at once by flipping a load balancer or router configuration. The old stack stays warm for instant rollback if issues emerge. Netflix uses red black (their term for blue green) extensively: the flip takes seconds, and the old server group remains ready for immediate reversion. This pattern doubles compute during the transition window but provides the fastest rollback path.

Canary deployment gradually shifts traffic from 1 percent to 5 percent to 25 percent to 50 percent to 100 percent while monitoring Service Level Objectives (SLOs) and Key Performance Indicators (KPIs) at each stage. Uber's typical path for ranking or Estimated Time of Arrival (ETA) models is a shadow deployment followed by a 1 to 5 percent canary, with session stickiness enforced so individual users see consistent predictions. Each canary stage runs for minutes to hours depending on metric confidence: infrastructure metrics like p99 latency stabilize in 5 to 30 minutes, but business KPIs like conversion rate need hours to achieve statistical significance.

The tradeoff is speed versus safety. Blue green catches regressions quickly across 100 percent of traffic but risks a larger blast radius. Canary limits impact to small cohorts but extends rollout timelines and requires statistical rigor to detect small KPI deltas. LinkedIn runs billions of predictions daily with p99 budgets of tens of milliseconds per subcall; their canaries start at 1 percent with strict guardrails on latency inflation and error rate spikes to protect aggregate page load times.
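To make the two routing modes concrete, here is a minimal Python sketch. It is not tied to any serving platform: the ModelRouter class, its flip and set_canary methods, and the hash-bucket stickiness scheme are illustrative assumptions, but they show how one router can perform an all-at-once blue green cutover or a sticky, percentage-based canary split.

```python
import hashlib

class ModelRouter:
    """Hypothetical router supporting both blue green flips and sticky canaries."""

    def __init__(self, blue_model, green_model):
        self.blue = blue_model        # current production stack, kept warm
        self.green = green_model      # candidate stack
        self.active = "blue"          # blue green: which stack owns 100% of traffic
        self.canary_percent = 0       # canary: share of traffic sent to the candidate

    def flip(self):
        """Blue green cutover: send 100% of traffic to the other stack in one step.
        The previous stack stays loaded, so rollback is simply another flip."""
        self.active = "green" if self.active == "blue" else "blue"

    def set_canary(self, percent):
        """Canary ramp, e.g. 1 -> 5 -> 25 -> 50 -> 100 (0 disables the canary)."""
        self.canary_percent = percent

    def _sticky_bucket(self, user_id):
        """Session stickiness: hash the user id into a stable 0-99 bucket so the
        same user keeps hitting the same model variant across requests."""
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return int(digest, 16) % 100

    def route(self, user_id):
        if self.canary_percent > 0:
            # Canary mode: a fixed slice of users (by bucket) sees the candidate.
            if self._sticky_bucket(user_id) < self.canary_percent:
                return self.green
            return self.blue
        # Blue green mode: everyone follows the active pointer.
        return self.green if self.active == "green" else self.blue
```

A rollout controller would then step set_canary(1), set_canary(5), and so on, holding at each stage while metrics are evaluated; a failed check drops back to set_canary(0) with the warm blue stack still serving.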
💡 Key Takeaways
Blue green flips 100 percent of traffic in seconds via load balancer switch and keeps the old stack warm for instant rollback, but doubles compute during transition and risks larger blast radius
Canary starts at 1 to 5 percent with session stickiness and ramps to 25/50/100 percent while monitoring; infrastructure SLOs stabilize in 5 to 30 minutes but business KPIs need hours for statistical confidence
Netflix uses automated canary analysis (Kayenta tool) to compare baseline and canary metrics with statistical significance tests, triggering rollback on latency or error rate deviations; a sketch of this kind of comparison follows this list
Uber enforces p99 inference latency budgets under 50 milliseconds for online services; canary stages monitor both infrastructure regressions (CPU, memory, tail latencies) and business KPIs like dispatch accuracy
Short canary windows catch infrastructure failures quickly but may miss small KPI deltas requiring hours of data; low traffic segments need broader canaries or longer windows for statistical power
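As a rough illustration of the automated canary analysis and guardrails described above (this is a sketch, not Kayenta's API; the thresholds, the judge_canary helper, and the choice of a one-sided Mann-Whitney U test are assumptions), a single evaluation window might be judged like this:

```python
import numpy as np
from scipy.stats import mannwhitneyu

P99_INFLATION_LIMIT = 1.10      # fail if canary p99 exceeds baseline p99 by >10%
ERROR_RATE_DELTA_LIMIT = 0.002  # fail if canary error rate exceeds baseline by >0.2 points
P_VALUE_LIMIT = 0.01            # significance level for the latency shift test

def judge_canary(baseline_latencies_ms, canary_latencies_ms,
                 baseline_error_rate, canary_error_rate):
    """Return ("promote" | "rollback", reasons) for one canary evaluation window."""
    reasons = []

    # Guardrail 1: tail latency inflation on the canary cohort.
    base_p99 = np.percentile(baseline_latencies_ms, 99)
    canary_p99 = np.percentile(canary_latencies_ms, 99)
    if canary_p99 > base_p99 * P99_INFLATION_LIMIT:
        reasons.append(f"p99 inflated: {canary_p99:.1f}ms vs {base_p99:.1f}ms baseline")

    # Guardrail 2: error rate spike relative to baseline.
    if canary_error_rate > baseline_error_rate + ERROR_RATE_DELTA_LIMIT:
        reasons.append(f"error rate {canary_error_rate:.4f} vs {baseline_error_rate:.4f}")

    # Statistical check: is the canary latency distribution significantly worse
    # than baseline (one-sided Mann-Whitney U test)?
    _, p_value = mannwhitneyu(canary_latencies_ms, baseline_latencies_ms,
                              alternative="greater")
    if p_value < P_VALUE_LIMIT:
        reasons.append(f"latency distribution significantly higher (p={p_value:.4f})")

    return ("rollback", reasons) if reasons else ("promote", [])
```

Infrastructure guardrails like these can be evaluated within each 5 to 30 minute window, while business KPI checks would run the same comparison over the longer horizons noted above.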
📌 Examples
Airbnb's Bighead platform runs shadow deployments for several days mirroring 100 percent of production requests to observe drift and cost (see the mirroring sketch after these examples), then graduates to 5 percent canary before blue green promotion for instant rollback capability
LinkedIn starts ranking model canaries at 1 percent with strict p99 latency guardrails (tens of milliseconds per subcall) to protect aggregate page latency, ramping over hours as business metrics converge
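The shadow stage in the Airbnb example boils down to request mirroring. A minimal asyncio sketch follows; production_model, shadow_model, and the request fields are hypothetical stand-ins rather than Bighead's interface. The key property is that the shadow call is fire and forget, so it can observe drift and serving cost without ever touching the user-facing response.

```python
import asyncio
import logging

log = logging.getLogger("shadow")

async def handle_request(request, production_model, shadow_model):
    # The user-facing response always comes from the production model.
    prod_prediction = await production_model.predict(request.features)

    # Mirror the same features to the shadow model as fire-and-forget work:
    # its prediction is logged for drift and cost analysis but never returned.
    asyncio.create_task(_shadow_call(shadow_model, request, prod_prediction))
    return prod_prediction

async def _shadow_call(shadow_model, request, prod_prediction):
    try:
        shadow_prediction = await shadow_model.predict(request.features)
        # Persisting both predictions lets offline jobs measure agreement,
        # distribution drift, and per-request serving cost before any canary traffic.
        log.info("request=%s prod=%s shadow=%s",
                 request.id, prod_prediction, shadow_prediction)
    except Exception:
        # Shadow failures must never affect the production path.
        log.exception("shadow prediction failed for request %s", request.id)
```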