Failure Modes: Biased Cohorts, Cold Start, and Feedback Loops
BIASED COHORTS
Small canary percentages can create biased samples. If the 5% canary cohort happens to skew toward power users (higher engagement), you may see a 0.5% CTR lift that vanishes at 100% rollout, when casual users dominate the population. Detection: Compare pre-period metrics between cohorts before starting the experiment. If canary pre-period CTR is 3.5% vs a baseline of 3.2%, rebalance the strata. Prevention: Use stratified sampling by user segment.
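The pre-period comparison above can be run as a two-proportion z-test on CTR before the experiment starts. A minimal sketch (the function name and the 100k-views-per-cohort sample sizes are illustrative assumptions; only the 3.5% vs 3.2% CTRs come from the text):

```python
import math

def preperiod_balance_check(clicks_a, views_a, clicks_b, views_b,
                            z_threshold=1.96):
    """Two-proportion z-test on pre-period CTR between canary and baseline.

    Returns (z, balanced). balanced=False means the cohorts already differ
    before the experiment and should be re-stratified.
    """
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled proportion under the null hypothesis of no cohort difference.
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    return z, abs(z) <= z_threshold

# The 3.5% vs 3.2% pre-period CTRs from the text, assuming 100k views each:
z, balanced = preperiod_balance_check(3500, 100_000, 3200, 100_000)
# At this sample size the gap is significant (z ≈ 3.7), so rebalance.
```

At smaller sample sizes the same CTR gap may not reach significance, which is why the check should run on the full pre-period, not a short window.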
COLD START LATENCY
New replicas spike P99 latency from 210ms to 400ms for the first 5 minutes while caches warm. This can trigger a false rollback even though steady-state performance is fine. Mitigation: Pre-warm by replaying the last 60 minutes of requests at 10x speed (6 minutes of replay time). Use a 10-15 minute grace period before evaluating latency metrics. Flag the cold-start window in logs for exclusion from analysis.
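The grace-period rule can be sketched as a gate the rollback controller consults before acting on latency metrics. A minimal illustration, assuming the controller knows each replica's start time (the function name and the 12-minute choice within the 10-15 minute range are assumptions):

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=12)  # assumed value within the 10-15 min range

def should_evaluate_latency(replica_started_at: datetime, now: datetime) -> bool:
    """Return False while the replica is still in its cold-start window.

    Samples taken during this window should also be flagged in logs so
    they can be excluded from P99 analysis, not just from rollback gating.
    """
    in_cold_start = now - replica_started_at < GRACE_PERIOD
    return not in_cold_start

start = datetime(2024, 1, 1, 12, 0)
should_evaluate_latency(start, start + timedelta(minutes=5))   # cold start: skip
should_evaluate_latency(start, start + timedelta(minutes=20))  # warm: evaluate
```

The same predicate can tag individual request logs, so the 5-minute cache-warming spike never contaminates steady-state P99 comparisons.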
FEEDBACK LOOPS AND TEMPORAL EFFECTS
Novelty effect: Users engage more with new UI in the first hour, then revert to baseline. Learning effect: Users initially struggle with changes, then adapt. Both bias short-window measurements. Mitigation: Use 24-hour evaluation windows and maintain parallel holdouts for retention metrics. Compare day-1, day-7, and day-30 cohort behavior to separate novelty from true improvement.
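Comparing day-1, day-7, and day-30 cohort lifts can be reduced to a rough classification heuristic. A sketch under stated assumptions (the function name, the 10% tolerance, and the example lift values are all illustrative, not from the text):

```python
def classify_temporal_effect(lift_d1: float, lift_d7: float,
                             lift_d30: float, tol: float = 0.1) -> str:
    """Rough heuristic separating novelty from learning from a stable win.

    lift_* are relative metric lifts (canary vs holdout) measured on
    day-1, day-7, and day-30 cohorts. tol is the relative slack below
    which day-1 and day-30 lifts are treated as equal.
    """
    if lift_d1 > lift_d30 + tol * abs(lift_d1):
        return "novelty: early lift decays"
    if lift_d30 > lift_d1 + tol * abs(lift_d30):
        return "learning: lift grows as users adapt"
    return "stable: lift persists across cohorts"

classify_temporal_effect(0.08, 0.05, 0.01)    # novelty: decays toward baseline
classify_temporal_effect(-0.02, 0.01, 0.03)   # learning: users adapt over time
classify_temporal_effect(0.04, 0.04, 0.041)   # stable: a real improvement
```

Day-7 is kept as an input so the trend can be sanity-checked for monotonicity, even though this simple version only compares the endpoints.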
DEPENDENCY SATURATION
The canary model increases embedding-service QPS by 30%. At 25% traffic, you hit the embedding service's capacity limit (15k QPS), causing timeouts. The model gets blamed for latency when the real issue is downstream capacity. Prevention: Validate downstream service headroom before each ramp step.