ML Infrastructure & MLOps: Automated Rollback & Canary Analysis

ML Specific Guardrails and Metrics in Canary Analysis

Key Concept
ML canary analysis layers model-quality and business metrics on top of traditional infrastructure SLOs, in two tiers: fast infrastructure guardrails (seconds) and slower ML/business metrics (minutes to hours).

FAST GUARDRAILS

Checked every 30-60 seconds: success rate ≥99%, P99 latency <500ms, CPU <90%, memory <95%. These gate traffic increases. If latency spikes or the error rate jumps, roll back immediately.
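The fast-tier gate can be sketched as a simple threshold check using the numbers above; the function name and metric keys are illustrative assumptions, not a real platform API.

```python
# Sketch of a fast-tier guardrail gate using the thresholds above
# (success rate >= 99%, P99 < 500 ms, CPU < 90%, memory < 95%).
# Metric names are illustrative; run this check every 30-60 seconds.

def fast_guardrails_pass(metrics: dict) -> bool:
    """Return True if the canary may continue receiving more traffic."""
    return (
        metrics["success_rate"] >= 0.99
        and metrics["p99_latency_ms"] < 500
        and metrics["cpu_utilization"] < 0.90
        and metrics["memory_utilization"] < 0.95
    )

healthy = {"success_rate": 0.995, "p99_latency_ms": 320,
           "cpu_utilization": 0.70, "memory_utilization": 0.60}
spiked = dict(healthy, p99_latency_ms=800)  # latency spike: roll back
```

A failed check blocks the next traffic increase and triggers immediate rollback.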

SLOW ML METRICS

CTR, conversion rate, or prediction error take minutes to hours to accumulate signal. Run in background analysis windows (10-30 minutes). If CTR drops >5% after several intervals, halt promotion or trigger rollback even if infrastructure metrics are healthy.
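A minimal sketch of the slow-tier check described above: compare canary CTR to baseline across background windows and roll back after several consecutive bad intervals. The 5 percent drop threshold comes from the text; the consecutive-interval count is a hypothetical parameter.

```python
# Slow-tier ML metric check: each element of `canary_ctr_windows` is the
# canary's CTR over one 10-30 minute analysis window. Roll back if CTR
# drops more than `max_drop` for `bad_limit` consecutive windows.

def ctr_verdict(baseline_ctr, canary_ctr_windows, max_drop=0.05, bad_limit=3):
    """Return 'rollback' or 'continue' based on windowed CTR drops."""
    bad = 0
    for ctr in canary_ctr_windows:
        drop = (baseline_ctr - ctr) / baseline_ctr
        bad = bad + 1 if drop > max_drop else 0  # reset on a healthy window
        if bad >= bad_limit:
            return "rollback"
    return "continue"
```

Note that this fires even when the fast infrastructure guardrails are all green.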

💡 Pattern: Gate on request success and latency first, then watch engagement metrics (time spent, interactions) over longer periods.

SEGMENT-LEVEL ANALYSIS

Aggregate metrics mask segment problems. A model might improve overall precision 0.82→0.85 but hurt new-user recall 0.60→0.45. Track metrics by segment (user cohort, device, geography) and require that the canary not degrade any critical segment. For an ETA model, for example, track error separately by city, time of day, and trip type.
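The per-segment rule can be expressed as a check that no tracked segment drops beyond a tolerance, even when the aggregate improves. The segment names and the 3 percent tolerance below are illustrative assumptions.

```python
# Per-segment regression check: flag any segment whose metric drops more
# than `tolerance` versus baseline, regardless of the aggregate trend.

def segment_regressions(baseline: dict, canary: dict, tolerance=0.03):
    """Return the list of segments where the canary degrades the metric."""
    return [
        seg for seg, base in baseline.items()
        if base - canary[seg] > tolerance
    ]

# Aggregate improves (0.82 -> 0.85) while new users regress (0.60 -> 0.45):
baseline = {"all_users": 0.82, "new_users": 0.60, "ios": 0.80}
canary   = {"all_users": 0.85, "new_users": 0.45, "ios": 0.81}
```

Here the aggregate gate would pass, but the segment check blocks promotion.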

DISTRIBUTION DRIFT

Compare input and output distributions between canary and baseline. A KL divergence or PSI above 0.2 signals that the canary is seeing a different data distribution, invalidating the comparison. Also check calibration: if the canary predicts 70% confidence but the actual rate is 50%, that miscalibration breaks downstream decisions.
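The PSI check can be computed with the standard binned formula, PSI = Σ (qᵢ − pᵢ)·ln(qᵢ/pᵢ) over histogram bins; the 0.2 alert threshold matches the text, and the bin counts in the example are illustrative.

```python
# Population Stability Index between two already-binned distributions
# (expected = baseline histogram, actual = canary histogram, both summing
# to 1). PSI > 0.2 conventionally signals significant drift.
import math

def psi(expected, actual, eps=1e-6):
    """PSI = sum((q - p) * ln(q / p)) over matching bins."""
    total = 0.0
    for p, q in zip(expected, actual):
        p, q = max(p, eps), max(q, eps)  # guard against empty bins
        total += (q - p) * math.log(q / p)
    return total

baseline_bins = [0.25, 0.25, 0.25, 0.25]
canary_bins   = [0.55, 0.25, 0.15, 0.05]  # mass shifted to the first bin
```

Identical distributions score 0; the shifted canary histogram above exceeds the 0.2 threshold, so the metric comparison should be treated as invalid.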

⚠️ Calibration: Critical for probability outputs. Miscalibration can break downstream systems even when accuracy looks similar.
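A minimal calibration check, assuming predictions are already grouped into a confidence bucket: compare the mean predicted probability against the observed positive rate. The 0.70-versus-0.50 example mirrors the miscalibration described above; the function name is hypothetical.

```python
# Calibration gap for one confidence bucket: |mean predicted probability
# minus observed positive rate|. A large gap breaks downstream decisions
# that consume the probability, even if accuracy looks unchanged.

def bucket_calibration_gap(preds, labels):
    """Absolute gap between mean predicted probability and observed rate."""
    mean_pred = sum(preds) / len(preds)
    observed = sum(labels) / len(labels)
    return abs(mean_pred - observed)

# Canary predicts ~70% confidence, but only 50% of outcomes are positive:
preds = [0.70] * 10
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
```

A full check would bucket predictions by confidence and aggregate the gaps (as in expected calibration error), but the per-bucket comparison is the core idea.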
💡 Key Takeaways
Two-tier gating: fast infrastructure metrics (99 percent success, 500 ms P99, checked every 30 to 60 seconds) gate traffic increases; slow ML metrics (CTR drop within 5 percent, AUC drift under 0.02) run in 10- to 30-minute background windows
Segment-level analysis prevents hidden regressions where aggregate metrics look good but specific cohorts (new users, device types, geographies) degrade significantly
Distribution drift checks (KL divergence or PSI above 0.2) detect when the canary sees different data than the baseline, invalidating metric comparisons
Calibration error is critical for probability predictions: a canary predicting 70 percent confidence with a 50 percent actual rate breaks downstream systems even if accuracy looks similar
Netflix gates recommendation canaries on P99 inference under 150 ms and a stable error rate, and monitors CTR and watch time over longer windows to catch business impact before full rollout
📌 Interview Tips
1. Meta feed ranking canary: gates on a request success rate of 99 percent and P99 latency of 200 ms every 30 seconds, monitors engagement metrics (time spent, interactions) over 20-minute windows, and requires that no segment drop more than 3 percent
2. Uber ETA prediction canary: tracks prediction error separately by city, time of day, and trip type, and ensures the canary does not increase error for high-value segments (airport trips, peak hours) by more than 5 percent even if overall error improves