Canary Failure Modes and Mitigation Strategies
LOW TRAFFIC VOLUME
Statistical validity needs several thousand requests/minute/instance. Below that, sampling variance creates noise → false rollback alarms or missed regressions. Unequal instance sets amplify this: with 99 baseline instances and 1 canary instance, a single instance outage hits canary metrics 100× harder than baseline. Fix: compare equally sized monitoring sets, ensure enough traffic per instance.
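A rough way to sanity-check "enough traffic" is a standard two-proportion sample-size calculation. The sketch below is illustrative, not from the source: it hardcodes z = 1.96 (5% two-sided significance) and z = 0.84 (80% power), and estimates the requests needed per arm to distinguish a baseline error rate from a hypothesized canary error rate.

```python
import math

def min_requests_per_arm(base_error_rate, canary_error_rate,
                         z_alpha=1.96, z_beta=0.84):
    """Approximate per-arm sample size for a two-proportion comparison.
    z_alpha: 1.96 for a 5% two-sided test; z_beta: 0.84 for 80% power."""
    p1, p2 = base_error_rate, canary_error_rate
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a doubling of a 0.1% error rate takes tens of thousands of
# requests per arm -- far more than a lightly loaded canary sees quickly.
n = min_requests_per_arm(0.001, 0.002)
```

If the canary receives only a few hundred requests per minute, this number translates directly into how long each analysis window must be before a verdict is statistically meaningful.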
WARM-UP EFFECTS
A freshly deployed instance starts with an un-JIT-compiled JVM and cold CPU/GPU caches. Latency spikes 2-5× during the first minutes, triggering rollback even though the canary would be fine once warmed. Fix: run shadow traffic for 5-10 min first, add a pause before the first analysis window, or send synthetic load to pre-warm instances.
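The pre-warm option can be sketched as a small synthetic-load loop run before analysis is enabled. This is a minimal illustration, not a prescribed implementation; `send_fn` is a hypothetical callable that issues one synthetic request, and the clock/sleep parameters exist so the loop is testable.

```python
import time

def prewarm(send_fn, duration_s=300, rps=50,
            clock=time.monotonic, sleep=time.sleep):
    """Send synthetic load at ~rps for duration_s seconds so JIT and
    caches warm up before the first canary analysis window opens."""
    deadline = clock() + duration_s
    interval = 1.0 / rps
    sent = 0
    while clock() < deadline:
        send_fn()       # hypothetical: fire one synthetic request
        sent += 1
        sleep(interval)
    return sent
```

Pairing this with a delayed first analysis window (rather than relying on pre-warm alone) covers code paths the synthetic load does not exercise.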
THRESHOLD MISCONFIGURATION
Too tight (99.95% on a service that actually runs at 99.9%): frequent false rollbacks. Too loose (allowing a 10% error increase): missed regressions. Calibrate thresholds from historical baselines collected over weeks, not guesses. Use rolling windows (require a majority of checks to pass over 3-5 intervals) to smooth transient spikes.
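The rolling-window rule can be expressed as a small decision function. A minimal sketch, assuming per-interval check results arrive as booleans; the window size and pass count are the 3-of-5 majority described above, not values mandated by the source.

```python
from collections import deque

def should_rollback(check_results, window=5, min_passes=3):
    """Trigger rollback only when fewer than min_passes of the most
    recent `window` interval checks passed. A single failing interval
    inside an otherwise healthy window is smoothed over."""
    recent = deque(maxlen=window)
    for ok in check_results:
        recent.append(ok)
        if len(recent) == window and sum(recent) < min_passes:
            return True
    return False
```

One isolated spike per window survives; a sustained failure pattern does not, which is exactly the asymmetry the rolling window is meant to buy.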
SEGMENT REGRESSIONS
Aggregate metrics hide segment damage: overall CTR can stay flat while new-user CTR drops 20%. Track critical segments separately and require all of them to pass. Deploy baseline and canary in the same availability zones with matched resources to avoid zone-specific biases.
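The "all segments must pass" requirement amounts to a per-segment gate rather than one aggregate check. A sketch under assumed inputs: metric dicts keyed by segment name, and a hypothetical 5% maximum relative CTR drop per segment.

```python
def segments_pass(baseline_ctr, canary_ctr, max_rel_drop=0.05):
    """Return (ok, failing_segment). Every tracked segment's canary CTR
    must stay within max_rel_drop of its own baseline; the aggregate
    passing is not sufficient."""
    for segment, base in baseline_ctr.items():
        if canary_ctr[segment] < base * (1 - max_rel_drop):
            return False, segment
    return True, None

# The failure mode above: overall CTR flat, new-user CTR down 20%.
ok, seg = segments_pass({"overall": 0.10, "new_users": 0.10},
                        {"overall": 0.10, "new_users": 0.08})
```

An aggregate-only check would have passed this example; the per-segment gate catches it and names the failing segment for the rollback report.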