Key Concept
ML canary analysis layers model quality and business metrics on top of traditional infrastructure SLOs. Two tiers: fast infrastructure guardrails (seconds) and slower ML/business metrics (minutes to hours).
FAST GUARDRAILS
Checked every 30-60 seconds: success rate ≥99%, P99 latency <500ms, CPU <90%, memory <95%. These gate traffic increases. If latency spikes or error rate jumps, rollback immediately.
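The gate logic above can be sketched in a few lines of Python. The thresholds come from the notes; the function and metric names are illustrative, not a specific platform's API:

```python
# Fast guardrails, checked every 30-60 seconds before each traffic increase.
# Any breach means: do not increase canary traffic, roll back immediately.
FAST_GUARDRAILS = {
    "success_rate":   lambda v: v >= 0.99,   # request success >= 99%
    "p99_latency_ms": lambda v: v < 500,     # P99 latency < 500 ms
    "cpu_util":       lambda v: v < 0.90,    # CPU < 90%
    "mem_util":       lambda v: v < 0.95,    # memory < 95%
}

def guardrails_pass(metrics: dict) -> bool:
    """Return True only if every fast guardrail holds."""
    return all(check(metrics[name]) for name, check in FAST_GUARDRAILS.items())
```

A healthy snapshot like `{"success_rate": 0.995, "p99_latency_ms": 320, ...}` passes; a single breach (say, success rate 0.97) fails the whole gate.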
SLOW ML METRICS
CTR, conversion rate, or prediction error take minutes to hours to accumulate signal. Run in background analysis windows (10-30 minutes). If CTR drops >5% after several intervals, halt promotion or trigger rollback even if infrastructure metrics are healthy.
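A minimal sketch of the slow-metric check, assuming per-window CTR values have already been aggregated (the window count and names are illustrative):

```python
def ctr_regression(baseline_ctr: float,
                   canary_ctr_windows: list,
                   max_drop: float = 0.05,
                   min_windows: int = 3) -> bool:
    """Return True (halt promotion / roll back) if the canary's CTR sits
    more than 5% below baseline in several 10-30 minute analysis windows."""
    breaches = sum(1 for ctr in canary_ctr_windows
                   if (baseline_ctr - ctr) / baseline_ctr > max_drop)
    return breaches >= min_windows
```

Requiring several breaching windows (rather than one) keeps a single noisy interval from triggering a rollback while infrastructure metrics are healthy.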
💡 Pattern: Gate on request success and latency first, then watch engagement metrics (time spent, interactions) over longer periods.
SEGMENT-LEVEL ANALYSIS
Aggregate metrics mask segment problems. A model might improve overall precision 0.82→0.85 yet hurt recall for new users 0.60→0.45. Track metrics by segment (user cohort, device, geography) and require that the canary not degrade any critical segment. A ride-hailing model, for example, would track separately by city, time of day, and trip type.
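The per-segment gate can be sketched as a simple comparison over segment-keyed metrics (higher-is-better metrics assumed; names are illustrative):

```python
def segment_regressions(baseline: dict, canary: dict,
                        tolerance: float = 0.0) -> list:
    """Return the segments where the canary's metric degraded.
    baseline/canary map segment name -> metric value (higher is better)."""
    return [seg for seg, base in baseline.items()
            if canary.get(seg, 0.0) < base - tolerance]
```

With the example above, an aggregate gain can coexist with a new-user regression: baseline `{"all_users": 0.82, "new_users": 0.60}` vs canary `{"all_users": 0.85, "new_users": 0.45}` flags `new_users`, so promotion halts even though the overall number improved.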
DISTRIBUTION DRIFT
Compare input and output distributions between canary and baseline. KL divergence or PSI >0.2 signals different data distribution, invalidating comparison. Check calibration: if canary predicts 70% confidence but actual rate is 50%, that miscalibration breaks downstream decisions.
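PSI over binned distributions is a one-liner; a minimal sketch, assuming both distributions have already been bucketed into matching bins of proportions:

```python
import math

def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions
    (lists of bin proportions summing to 1). PSI > 0.2 signals a
    significant shift, invalidating the canary-vs-baseline comparison."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))
```

Identical distributions score ~0; a canary whose traffic piles into different bins (e.g. uniform baseline `[0.25, 0.25, 0.25, 0.25]` vs skewed canary `[0.5, 0.3, 0.15, 0.05]`) crosses the 0.2 threshold.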
⚠️ Calibration: Critical for probability outputs. Miscalibration can break downstream systems even when accuracy looks similar.
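The 70%-confidence-vs-50%-actual case reduces to a gap between mean predicted probability and observed positive rate. A minimal single-bucket sketch (a production check would bin by confidence, as in expected calibration error):

```python
def calibration_gap(confidences: list, outcomes: list) -> float:
    """Mean predicted probability minus observed positive rate.
    E.g. mean confidence 0.70 against a 0.50 actual rate -> gap of 0.20,
    the kind of miscalibration that breaks downstream decisions."""
    mean_conf = sum(confidences) / len(confidences)
    actual_rate = sum(outcomes) / len(outcomes)
    return mean_conf - actual_rate
```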
✓Two-tier gating: Fast infrastructure metrics (99 percent success, 500 ms P99, checked every 30 to 60 seconds) gate traffic increases; slow ML metrics (CTR drop within 5 percent, AUC drift under 0.02) run in 10 to 30 minute background windows
✓Segment-level analysis prevents hidden regressions where aggregate metrics look good but specific cohorts (new users, device types, geographies) degrade significantly
✓Distribution drift checks (KL divergence or PSI above 0.2) detect when canary sees different data than baseline, invalidating metric comparisons
✓Calibration error is critical for probability predictions: canary predicting 70 percent confidence with 50 percent actual rate breaks downstream systems even if accuracy is similar
✓Netflix gates recommendation canaries on P99 inference under 150 ms and a stable error rate, and monitors CTR and watch time over longer windows to catch business impact before full rollout