What is Canary Analysis in ML Systems?
WHY CANARIES ARE ESSENTIAL FOR ML
ML models can pass all offline tests with excellent metrics (e.g., 0.85 MAP, 0.92 AUC) yet fail catastrophically in production. Causes include training-serving skew, feature pipeline bugs, distribution shift, and resource contention. Offline evaluation cannot catch these issues because it never touches live traffic, real feature stores, or production infrastructure.
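One way a canary surfaces training-serving skew or distribution shift is by comparing the distribution of a feature as logged at serving time against the same feature in the training data. The sketch below computes a Population Stability Index (PSI), a common rule-of-thumb drift score where values above roughly 0.2 suggest meaningful shift; the function and threshold are illustrative, not tied to any particular monitoring library.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Illustrative helper: bins both samples on the expected sample's
    range and sums (a - e) * ln(a / e) over bins. PSI > 0.2 is a
    common rule-of-thumb threshold for meaningful drift.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def frac(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        n = len(sample)
        # Smooth empty bins so the logarithm stays defined.
        return [max(c / n, 1e-6) for c in counts]
    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Identical distributions score near 0; a shifted one scores much higher.
train = [float(i % 100) for i in range(1000)]
serve_same = [float(i % 100) for i in range(1000)]
serve_shifted = [float(i % 100) + 30.0 for i in range(1000)]
```

In a real pipeline the "actual" sample would come from features logged by the canary replicas, so skew introduced by the serving path itself is caught rather than masked.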
THREE DIMENSIONS OF CANARY EVALUATION
System reliability: P95/P99 latency, error rates, memory usage. Catches infrastructure issues immediately.
Product metrics: CTR, conversion rate, engagement. Catches model quality problems over hours.
Data quality: Feature null rates, value distributions, drift detection. Catches feature pipeline issues.
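The three dimensions above can be sketched as a single automated gate that compares canary metrics to the baseline. All metric names and thresholds here are hypothetical examples, not a real monitoring API.

```python
from dataclasses import dataclass

@dataclass
class CanarySnapshot:
    # Metrics aggregated over one evaluation window (names illustrative).
    p99_latency_ms: float
    error_rate: float
    ctr: float
    feature_null_rate: float

def gate(canary: CanarySnapshot, baseline: CanarySnapshot) -> list:
    """Return the list of failed checks; an empty list means the gate passes.

    Thresholds are hypothetical examples covering the three dimensions:
    system reliability, product metrics, and data quality.
    """
    failures = []
    # System reliability: fail fast on latency or error regressions.
    if canary.p99_latency_ms > 1.2 * baseline.p99_latency_ms:
        failures.append("p99 latency regression > 20%")
    if canary.error_rate > baseline.error_rate + 0.001:
        failures.append("error rate above baseline + 0.1pp")
    # Product metrics: noisy at low traffic, so use a loose bound.
    if canary.ctr < 0.95 * baseline.ctr:
        failures.append("CTR dropped > 5% vs baseline")
    # Data quality: null-rate spikes usually mean a feature pipeline bug.
    if canary.feature_null_rate > baseline.feature_null_rate + 0.02:
        failures.append("feature null rate up > 2pp")
    return failures
```

Returning the full list of failures, rather than a single boolean, makes the gate's verdict auditable: the rollback log records exactly which dimension tripped.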
TYPICAL RAMP SCHEDULE
Start at 0.5-1% of traffic for 30-60 minutes to catch immediate failures. If healthy, increase to 5% for 2 hours, then 25% for 12 hours. Each step is guarded by automated gates that check metrics before proceeding. A full ramp from initial exposure to 100% typically takes 24-48 hours.
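The ramp loop itself can be sketched as a step schedule with a gate check after each soak period. The health callable stands in for the automated metric gates; the soak durations are compressed to seconds so the sketch runs quickly, whereas production values would be the 30 min / 2 h / 12 h from the schedule above.

```python
# Hypothetical ramp schedule mirroring the text: (traffic %, soak seconds).
# Compressed durations for illustration; production soaks are hours long.
RAMP_STEPS = [(1, 1), (5, 2), (25, 3), (100, 0)]

def run_ramp(is_healthy, steps=RAMP_STEPS, sleep=lambda s: None):
    """Walk the ramp, checking a health gate after each soak period.

    `is_healthy` is a caller-supplied callable (traffic_pct -> bool)
    standing in for the automated metric gates. Returns the final
    traffic percentage: 100 on success, 0 after a rollback.
    """
    for pct, soak in steps:
        # Route `pct` of traffic to the canary, then soak before gating.
        sleep(soak)
        if not is_healthy(pct):
            # Gate failed: roll back to 0% and stop the ramp.
            return 0
    return 100
```

A model that only degrades under load, e.g. `run_ramp(lambda pct: pct < 25)`, passes the early low-traffic gates but is rolled back at the 25% step, which is exactly why the schedule soaks longest at the higher percentages.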