A/B Testing & Experimentation › Ramp-up Strategies & Canary Analysis · Medium · ⏱️ ~3 min

Canary Metrics: System, Product, and Data Quality Signals

ML canary analysis requires monitoring three metric categories in parallel. System reliability metrics cover latency (P50, P95, P99), error rates by type (timeouts, 5xx, 4xx), and resource utilization (CPU, GPU, and memory per replica). For a ranking service with a P95 target of 120 ms and a P99 target of 250 ms, the canary passes if the P95 delta stays under 5 ms and P99 stays below 220 ms over 5 to 15 minute sliding windows. With an error rate budget of 0.1 percent, the canary's delta must stay under 0.05 percent absolute.

Product metrics measure user impact: CTR, conversion rate, session length, and next day retention. These require statistical rigor because their natural variance is high. For CTR evaluation, use variance reduction techniques such as Controlled-experiment Using Pre-Experiment Data (CUPED), which leverages pre-period CTR to reduce noise. A 2 hour window at 5 percent traffic (4,000 requests per second, or 28.8 million requests) can detect a 0.3 percent CTR change with 80 percent statistical power. Netflix treats a CTR delta within plus or minus 0.2 percent as neutral; anything beyond that requires deeper investigation.

Data quality signals catch feature pipeline failures that are invisible to system metrics. Track the feature null rate (percentage of requests with missing feature values), the out of range rate (values outside expected bounds), and distribution drift using population statistics such as mean and variance, or distance measures such as Kullback-Leibler (KL) divergence between canary and baseline feature distributions. A new model that raises the feature null rate from 0.3 percent to 0.8 percent indicates a pipeline dependency issue. Google monitors feature freshness by logging feature computation timestamps and alerting when staleness exceeds a threshold, for example features older than 10 minutes.

The decision engine combines all three categories. Layer one guardrails trigger automatic rollback: an error rate delta greater than 0.1 percent for 5 minutes, P99 latency above 300 ms for 10 minutes, or feature nulls above 0.5 percent. Layer two compares product metrics using sequential tests with multiple comparison correction. Uber Michelangelo teams configure weighted composite scores: 40 percent system reliability, 50 percent product metrics, and 10 percent data quality, with each category normalized to a 0 to 100 scale.
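A minimal sketch of the system reliability check per sliding window, assuming canary and baseline latency samples and error counts are already aggregated per window; the thresholds mirror the ranking-service numbers above, and all function and variable names are illustrative.

```python
import numpy as np

# Illustrative thresholds from the ranking-service example above.
P95_DELTA_MS_MAX = 5.0      # canary P95 may exceed baseline by at most 5 ms
P99_ABS_MS_MAX = 220.0      # canary P99 must stay below 220 ms
ERROR_DELTA_MAX = 0.0005    # 0.05% absolute error-rate delta

def window_passes(canary_latencies_ms, baseline_latencies_ms,
                  canary_errors, canary_total,
                  baseline_errors, baseline_total):
    """Evaluate one 5-15 minute sliding window of canary traffic."""
    c_p95, c_p99 = np.percentile(canary_latencies_ms, [95, 99])
    b_p95 = np.percentile(baseline_latencies_ms, 95)

    p95_ok = (c_p95 - b_p95) <= P95_DELTA_MS_MAX
    p99_ok = c_p99 <= P99_ABS_MS_MAX

    err_delta = canary_errors / canary_total - baseline_errors / baseline_total
    err_ok = err_delta <= ERROR_DELTA_MAX

    return p95_ok and p99_ok and err_ok
```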
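For the product-metric layer, a hedged sketch of the CUPED adjustment, assuming per-user metrics are available for both the pre-experiment period and the canary window; the function name and inputs are illustrative.

```python
import numpy as np

def cuped_adjust(post_metric, pre_metric):
    """CUPED: remove the variance explained by the pre-period covariate.

    post_metric: per-user CTR (or clicks/impressions) during the canary window
    pre_metric:  the same users' CTR from the pre-experiment period
    """
    theta = np.cov(pre_metric, post_metric)[0, 1] / np.var(pre_metric, ddof=1)
    return post_metric - theta * (pre_metric - np.mean(pre_metric))

# Compare the adjusted means of the canary and baseline cohorts instead of the
# raw means: the difference estimate is unchanged in expectation but has lower
# variance, which is what lets a 2-hour window reach the stated power.
```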
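The data quality signals can be computed from logged feature values; below is a rough sketch of the null-rate and KL-divergence checks, assuming raw feature values are collected for both cohorts. The bin count and epsilon smoothing are arbitrary choices, not values from the source.

```python
import numpy as np

def null_rate(values):
    """Fraction of requests with a missing (NaN) feature value."""
    v = np.asarray(values, dtype=float)
    return float(np.mean(np.isnan(v)))

def kl_divergence(canary_values, baseline_values, bins=50, eps=1e-9):
    """KL(canary || baseline) over a shared histogram of one feature."""
    c = np.asarray(canary_values, dtype=float)
    b = np.asarray(baseline_values, dtype=float)
    c, b = c[~np.isnan(c)], b[~np.isnan(b)]

    # Shared bin edges so both distributions are compared on the same support.
    edges = np.histogram_bin_edges(np.concatenate([c, b]), bins=bins)
    p, _ = np.histogram(c, bins=edges)
    q, _ = np.histogram(b, bins=edges)

    # Normalize and smooth with eps to avoid division by zero / log(0).
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))
```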
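Finally, a sketch of how the two-layer decision and the weighted composite score might be combined into a single gate. The 40/50/10 weights come from the example above; the promotion threshold of 70 is an assumption for illustration, and category scores are assumed to be pre-normalized to a 0-100 scale.

```python
# Illustrative weights from the composite-score example above.
WEIGHTS = {"system": 0.40, "product": 0.50, "data_quality": 0.10}

def composite_score(category_scores):
    """Weighted 0-100 canary score used as a single promotion gate."""
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)

def canary_decision(guardrails_violated, category_scores, promote_threshold=70.0):
    """Layer one: hard guardrails force rollback. Layer two: composite gate."""
    if guardrails_violated:          # e.g. error delta, P99, or null-rate breach
        return "rollback"
    score = composite_score(category_scores)
    return "promote" if score >= promote_threshold else "hold"
```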
💡 Key Takeaways
System metrics: P95 latency delta under 5ms, P99 under 250ms, error rate delta under 0.05% absolute over 5 to 15 minute windows
Product metrics: CTR change detection with CUPED variance reduction over 2 hour window at 5% traffic (28.8M requests) detects 0.3% change with 80% power
Data quality: Track feature null rate (target under 0.5%), out of range rate, distribution drift using KL divergence between canary and baseline
Two layer decisions: Layer one automatic rollback on guardrails (error rate, P99, nulls), layer two statistical tests on product metrics with multiple comparison correction
Composite scoring: Weight categories like 40% reliability, 50% product impact, 10% data quality, normalize to 0 to 100 scale for single gate decision
📌 Examples
Guardrail violation: Canary P99 latency spikes to 320ms for 12 minutes, exceeds 300ms threshold for 10+ minutes, triggers automatic rollback to 0%
CUPED for CTR: Pre period CTR for canary cohort is 3.1%, baseline 3.2%. Adjust post period measurements by pre period difference to reduce variance by 30%
Feature null detection: Baseline null rate 0.3%, canary 0.8%. Delta 0.5% exceeds 0.2% threshold, indicates feature pipeline dependency missing in canary environment