Canary Metrics: System, Product, and Data Quality Signals
SYSTEM METRICS
System metrics catch infrastructure problems immediately. Key signals: P95 latency delta under 5ms, P99 under 250ms total, error rate delta under 0.05% absolute. Use 5-15 minute trailing windows. If canary P99 spikes to 320ms for 10+ minutes when baseline is 200ms, trigger automatic rollback. These metrics provide fast feedback (minutes) with high confidence.
PRODUCT METRICS
Product metrics catch model quality problems but require more time. CTR, conversion rate, and engagement need hours of data to reach statistical significance. At 5% traffic (28.8M requests over 24 hours), you can detect 0.3% relative CTR change with 80% power. Use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance by 30%: adjust post-period measurements by pre-period differences between cohorts.
DATA QUALITY METRICS
Data quality metrics catch feature pipeline problems. Track: feature null rate (target under 0.5%), out-of-range values, distribution drift using KL divergence between canary and baseline. If baseline null rate is 0.3% and canary is 0.8%, that 0.5% delta indicates a missing feature dependency in the canary environment.
COMPOSITE SCORING
Combine metrics into a single gate decision: weight categories (40% reliability, 50% product impact, 10% data quality), normalize to 0-100 scale. Pass threshold might be 70+. This simplifies the "should we ramp?" decision into a single number with clear thresholds.