
What is Canary Analysis in ML Systems?

Definition
Canary analysis gradually exposes a new model or feature to a small percentage of production traffic, monitoring for problems before full deployment. The name comes from coal miners using canaries to detect toxic gases: if the canary dies, stop digging.

WHY CANARIES ARE ESSENTIAL FOR ML

ML models can pass all offline tests with excellent metrics (0.85 MAP, 0.92 AUC) yet fail catastrophically in production. Reasons include training-serving skew, feature pipeline bugs, distribution shift, or resource contention. Offline evaluation cannot catch these issues because it does not use live traffic, real feature stores, or production infrastructure.

THREE DIMENSIONS OF CANARY EVALUATION

System reliability: P95/P99 latency, error rates, memory usage. Catches infrastructure issues immediately.
Product metrics: CTR, conversion rate, engagement. Catches model quality problems over hours.
Data quality: Feature null rates, value distributions, drift detection. Catches feature pipeline issues.
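The data-quality dimension can be gated automatically. Below is a minimal sketch of such a gate; the function names, the mean-shift heuristic, and the thresholds are illustrative assumptions (a real system would use proper drift tests such as PSI or a KS test):

```python
# Hypothetical data-quality gate for one feature column in a canary.
# Thresholds and the mean-shift heuristic are illustrative, not standard.

def null_rate(values):
    """Fraction of missing (None) entries in a feature column."""
    return sum(v is None for v in values) / len(values)

def mean_shift(canary, baseline):
    """Relative difference in means -- a crude stand-in for real
    drift statistics like PSI or a Kolmogorov-Smirnov test."""
    base = [v for v in baseline if v is not None]
    can = [v for v in canary if v is not None]
    b_mean = sum(base) / len(base)
    c_mean = sum(can) / len(can)
    return abs(c_mean - b_mean) / abs(b_mean)

def data_quality_gate(canary_col, baseline_col,
                      max_null_rate=0.02, max_shift=0.10):
    """Return (passed, reasons) for one feature column."""
    reasons = []
    if null_rate(canary_col) > max_null_rate:
        reasons.append("null rate too high")
    if mean_shift(canary_col, baseline_col) > max_shift:
        reasons.append("distribution shifted")
    return (not reasons), reasons
```

A broken feature pipeline typically shows up here first: nulls spike or the value distribution jumps, long before product metrics move.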

💡 Key Insight: System metrics fail fast (minutes), product metrics fail slow (hours). Canaries must monitor both with appropriate time windows.
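The two-speed idea can be sketched in code: system metrics are judged over a short rolling window, product metrics over a long one. `MetricWindow`, the window lengths, and the thresholds below are hypothetical, not a real monitoring API:

```python
# Sketch of canary monitoring with per-dimension time windows.
# All names and thresholds here are assumptions for illustration.
from collections import deque

class MetricWindow:
    """Keeps (timestamp, value) pairs inside a rolling time window."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.points = deque()

    def record(self, ts, value):
        self.points.append((ts, value))
        # Evict points that have fallen out of the window.
        while self.points and self.points[0][0] < ts - self.window:
            self.points.popleft()

    def mean(self):
        if not self.points:
            return None
        return sum(v for _, v in self.points) / len(self.points)

# System metric: error rate over a 5-minute window (fails fast).
error_rate = MetricWindow(window_seconds=5 * 60)
# Product metric: CTR over a 6-hour window (fails slow).
ctr = MetricWindow(window_seconds=6 * 3600)

def canary_healthy(max_error_rate=0.01, min_ctr=0.02):
    er, c = error_rate.mean(), ctr.mean()
    if er is not None and er > max_error_rate:
        return False  # system gate trips within minutes
    if c is not None and c < min_ctr:
        return False  # product gate needs hours of data to trip
    return True
```

The point is that a single alerting window gets one of the two wrong: too short and product metrics are pure noise, too long and an infrastructure failure burns for hours.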

TYPICAL RAMP SCHEDULE

Start at 0.5-1% traffic for 30-60 minutes to catch immediate failures. If healthy, increase to 5% for 2 hours, then 25% for 12 hours, then 50% before the final push to 100%. Each step has automated gates that check metrics before proceeding. A full ramp from 1% to 100% typically takes 24-48 hours.
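The schedule above can be sketched as a simple gated loop. `set_traffic` and `gate_passed` are placeholders for whatever routing control and automated metric checks a real deployment system provides:

```python
# Minimal sketch of a gated ramp controller. The stage list mirrors the
# schedule in the text; everything else is an illustrative assumption.

RAMP_SCHEDULE = [
    (0.01, 60),    # 1% traffic, hold ~30-60 minutes
    (0.05, 120),   # 5% traffic, hold 2 hours
    (0.25, 720),   # 25% traffic, hold 12 hours
    (0.50, 360),   # 50% traffic before full rollout
    (1.00, 0),     # 100%: fully deployed
]

def run_ramp(set_traffic, gate_passed):
    """Walk the schedule; roll back to 0% the moment a gate fails."""
    for fraction, hold_minutes in RAMP_SCHEDULE:
        set_traffic(fraction)
        # A real controller would sleep/poll for `hold_minutes` while
        # watching metrics; here the gate is checked once per stage.
        if not gate_passed(fraction, hold_minutes):
            set_traffic(0.0)  # automatic rollback
            return False
    return True
```

Keeping rollback automatic matters: by the time a human reads the page, a bad model at 25% has already served millions of requests.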

💡 Key Takeaways
Canary exposes new model to small traffic percentage, monitoring three dimensions: system reliability, product metrics, and data quality
System metrics fail fast (minutes); product metrics fail slow (hours); canaries must monitor both with appropriate time windows
Typical ramp: 1% for 30 min → 5% for 2 hours → 25% for 12 hours → 50% → 100%, with automated gates between steps
ML models can pass offline tests yet fail in production due to training-serving skew, feature bugs, or distribution shift
📌 Interview Tips
1. When explaining canary deployment, cover the three dimensions: system metrics (latency, errors), product metrics (CTR), and data quality (feature nulls)
2. Mention that offline evaluation cannot catch production failures because it does not use live traffic or real feature stores
3. Describe a typical ramp schedule with concrete times: 1% for 30 min, 5% for 2 hours, 25% for 12 hours