
Blue Green and Canary Deployment Patterns for Model Rollout

Blue green deployment runs the old and new model stacks in parallel, then switches all traffic at once by flipping a load balancer or router configuration. The old stack stays warm for instant rollback if issues emerge. Netflix uses red black (their term for blue green) extensively: the flip takes seconds, and the old server group remains ready for immediate reversion. This pattern doubles compute during the transition window but provides the fastest rollback path.

Canary deployment gradually shifts traffic from 1 percent to 5 percent to 25 percent to 50 percent to 100 percent while monitoring Service Level Objectives (SLOs) and Key Performance Indicators (KPIs) at each stage. Uber's typical path for ranking or Estimated Time of Arrival (ETA) models is a shadow deployment followed by a 1 to 5 percent canary, with session stickiness enforced so individual users see consistent predictions. Each canary stage runs for minutes to hours depending on metric confidence: infrastructure metrics like p99 latency stabilize in 5 to 30 minutes, but business KPIs like conversion rate need hours to achieve statistical significance.

The tradeoff is speed versus safety. Blue green catches regressions quickly across 100 percent of traffic but risks a larger blast radius. Canary limits impact to small cohorts but extends rollout timelines and requires statistical rigor to detect small KPI deltas. LinkedIn runs billions of predictions daily with p99 budgets of tens of milliseconds per subcall; their canaries start at 1 percent with strict guardrails on latency inflation and error rate spikes to protect aggregate page load times.
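To make the two routing modes concrete, here is a minimal Python sketch. It is not tied to any serving platform: the ModelRouter class, its flip and set_canary methods, and the hash-bucket stickiness scheme are illustrative assumptions, but they show how one router can perform an all-at-once blue green cutover or a sticky, percentage-based canary split.

```python
import hashlib

class ModelRouter:
    """Hypothetical router supporting both blue green flips and sticky canaries."""

    def __init__(self, blue_model, green_model):
        self.blue = blue_model        # current production stack, kept warm
        self.green = green_model      # candidate stack
        self.active = "blue"          # blue green: which stack owns 100% of traffic
        self.canary_percent = 0       # canary: share of traffic sent to the candidate

    def flip(self):
        """Blue green cutover: send 100% of traffic to the other stack in one step.
        The previous stack stays loaded, so rollback is simply another flip."""
        self.active = "green" if self.active == "blue" else "blue"

    def set_canary(self, percent):
        """Canary ramp, e.g. 1 -> 5 -> 25 -> 50 -> 100 (0 disables the canary)."""
        self.canary_percent = percent

    def _sticky_bucket(self, user_id):
        """Session stickiness: hash the user id into a stable 0-99 bucket so the
        same user keeps hitting the same model variant across requests."""
        digest = hashlib.md5(user_id.encode()).hexdigest()
        return int(digest, 16) % 100

    def route(self, user_id):
        if self.canary_percent > 0:
            # Canary mode: a fixed slice of users (by bucket) sees the candidate.
            if self._sticky_bucket(user_id) < self.canary_percent:
                return self.green
            return self.blue
        # Blue green mode: everyone follows the active pointer.
        return self.green if self.active == "green" else self.blue
```

A rollout controller would then step set_canary(1), set_canary(5), and so on, holding at each stage while metrics are evaluated; a failed check drops back to set_canary(0) with the warm blue stack still serving.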
💡 Key Takeaways
Blue green flips 100 percent of traffic in seconds via load balancer switch and keeps the old stack warm for instant rollback, but doubles compute during transition and risks larger blast radius
Canary starts at 1 to 5 percent with session stickiness and ramps to 25/50/100 percent while monitoring; infrastructure SLOs stabilize in 5 to 30 minutes but business KPIs need hours for statistical confidence
Netflix uses automated canary analysis (Kayenta tool) to compare baseline and canary metrics with statistical significance tests, triggering rollback on latency or error rate deviations; a sketch of this kind of comparison follows this list
Uber enforces p99 inference latency budgets under 50 milliseconds for online services; canary stages monitor both infrastructure regressions (CPU, memory, tail latencies) and business KPIs like dispatch accuracy
Short canary windows catch infrastructure failures quickly but may miss small KPI deltas requiring hours of data; low traffic segments need broader canaries or longer windows for statistical power
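As a rough illustration of the automated canary analysis and guardrails described above (this is a sketch, not Kayenta's API; the thresholds, the judge_canary helper, and the choice of a one-sided Mann-Whitney U test are assumptions), a single evaluation window might be judged like this:

```python
import numpy as np
from scipy.stats import mannwhitneyu

P99_INFLATION_LIMIT = 1.10      # fail if canary p99 exceeds baseline p99 by >10%
ERROR_RATE_DELTA_LIMIT = 0.002  # fail if canary error rate exceeds baseline by >0.2 points
P_VALUE_LIMIT = 0.01            # significance level for the latency shift test

def judge_canary(baseline_latencies_ms, canary_latencies_ms,
                 baseline_error_rate, canary_error_rate):
    """Return ("promote" | "rollback", reasons) for one canary evaluation window."""
    reasons = []

    # Guardrail 1: tail latency inflation on the canary cohort.
    base_p99 = np.percentile(baseline_latencies_ms, 99)
    canary_p99 = np.percentile(canary_latencies_ms, 99)
    if canary_p99 > base_p99 * P99_INFLATION_LIMIT:
        reasons.append(f"p99 inflated: {canary_p99:.1f}ms vs {base_p99:.1f}ms baseline")

    # Guardrail 2: error rate spike relative to baseline.
    if canary_error_rate > baseline_error_rate + ERROR_RATE_DELTA_LIMIT:
        reasons.append(f"error rate {canary_error_rate:.4f} vs {baseline_error_rate:.4f}")

    # Statistical check: is the canary latency distribution significantly worse
    # than baseline (one-sided Mann-Whitney U test)?
    _, p_value = mannwhitneyu(canary_latencies_ms, baseline_latencies_ms,
                              alternative="greater")
    if p_value < P_VALUE_LIMIT:
        reasons.append(f"latency distribution significantly higher (p={p_value:.4f})")

    return ("rollback", reasons) if reasons else ("promote", [])
```

Infrastructure guardrails like these can be evaluated within each 5 to 30 minute window, while business KPI checks would run the same comparison over the longer horizons noted above.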
📌 Examples
Airbnb's Bighead platform runs shadow deployments for several days mirroring 100 percent of production requests to observe drift and cost (see the mirroring sketch after these examples), then graduates to 5 percent canary before blue green promotion for instant rollback capability
LinkedIn starts ranking model canaries at 1 percent with strict p99 latency guardrails (tens of milliseconds per subcall) to protect aggregate page latency, ramping over hours as business metrics converge
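The shadow stage in the Airbnb example boils down to request mirroring. A minimal asyncio sketch follows; production_model, shadow_model, and the request fields are hypothetical stand-ins rather than Bighead's interface. The key property is that the shadow call is fire and forget, so it can observe drift and serving cost without ever touching the user-facing response.

```python
import asyncio
import logging

log = logging.getLogger("shadow")

async def handle_request(request, production_model, shadow_model):
    # The user-facing response always comes from the production model.
    prod_prediction = await production_model.predict(request.features)

    # Mirror the same features to the shadow model as fire-and-forget work:
    # its prediction is logged for drift and cost analysis but never returned.
    asyncio.create_task(_shadow_call(shadow_model, request, prod_prediction))
    return prod_prediction

async def _shadow_call(shadow_model, request, prod_prediction):
    try:
        shadow_prediction = await shadow_model.predict(request.features)
        # Persisting both predictions lets offline jobs measure agreement,
        # distribution drift, and per-request serving cost before any canary traffic.
        log.info("request=%s prod=%s shadow=%s",
                 request.id, prod_prediction, shadow_prediction)
    except Exception:
        # Shadow failures must never affect the production path.
        log.exception("shadow prediction failed for request %s", request.id)
```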