Failure Modes: SRM, Peeking, Interference, and Heavy Tails in Production
SAMPLE RATIO MISMATCH (SRM)
Observing a 53/47 split when 50/50 was expected is a sample ratio mismatch, and at production sample sizes it invalidates all results. The observed difference may come from selection bias (e.g., a mobile client bug dropping events in treatment), not from the treatment effect. Run a chi-squared goodness-of-fit test on the assignment counts and treat p < 0.001 as an automatic alarm. An SRM alarm should pause the experiment until the root cause is identified and fixed.
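The SRM alarm can be sketched with the standard library alone. This is a minimal illustration, not a production check: the function name `srm_check` and its parameters are hypothetical, and it assumes a two-arm experiment, where the chi-squared statistic has one degree of freedom and its p-value can be computed from `math.erfc`.

```python
import math

def srm_check(n_control, n_treatment, expected_treatment_share=0.5, alarm_p=0.001):
    """Chi-squared goodness-of-fit test for a two-arm split.

    Returns (chi2, p_value, alarm). With two arms there is one degree of
    freedom, so P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alarm_p

if __name__ == "__main__":
    # 53/47 on 10,000 users trips the alarm; 53/47 on 100 users does not.
    print(srm_check(5300, 4700))
    print(srm_check(53, 47))
```

The same 53/47 ratio is alarming at 10,000 users and unremarkable at 100, which is why the alarm is defined on the p-value rather than on the raw ratio.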
PEEKING ABUSE
A team peeked daily for 10 days and stopped as soon as p = 0.04. Later analysis showed the true alpha (the false positive rate under that stopping rule) was 18%, not the nominal 5%. The result was likely a false positive. Prevention: require minimum exposure before showing results; use group sequential designs with pre-planned analysis points; restrict early stopping to automated systems that maintain proper alpha.
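The inflation from daily peeking can be reproduced with a small A/A simulation. The sketch below is illustrative and rests on simplifying assumptions that are not from the incident above: equal traffic every day, a normally distributed test statistic, and a stop-at-first-significance rule at a two-sided 5% level. The name `false_positive_rate` is hypothetical.

```python
import random

Z_CRIT = 1.96  # two-sided critical value for nominal alpha = 0.05

def false_positive_rate(n_looks, n_sims=20000, seed=0):
    """Share of A/A experiments declared significant at any of n_looks peeks.

    Each day contributes one standard-normal increment to the cumulative
    sum; the z-statistic after k days is that sum divided by sqrt(k).
    There is no true effect, so every rejection is a false positive.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / look ** 0.5
            if abs(z) > Z_CRIT:
                false_positives += 1
                break  # the team stops at the first significant peek
    return false_positives / n_sims

if __name__ == "__main__":
    print("1 look  :", false_positive_rate(1))
    print("10 looks:", false_positive_rate(10))
```

Under these assumptions a single look stays near the nominal 5%, while ten looks push the rate several times higher, in the same range as the inflated alpha quoted above.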
HEAVY TAILED METRICS
Revenue and watch time are heavy-tailed: a few users contribute a disproportionate share of the total. Confidence intervals for the mean can be 2-5x wider than for well-behaved metrics at the same sample size. Mitigations: (1) Log transform to compress the tail. (2) Winsorization: cap values at the 99th percentile. (3) Bootstrap intervals, which estimate the sampling distribution empirically instead of assuming normality. Each trades some information for stability.
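Mitigations (2) and (3) can be combined in a few lines. The following is a minimal sketch under stated assumptions: the cap is a nearest-rank 99th percentile computed on the pooled arms (one reasonable choice, so both arms share the same cap), the interval is a plain percentile bootstrap of the difference in means, and all function names are hypothetical.

```python
import random

def percentile_cap(values, pct=99):
    """Nearest-rank percentile of values, used as the winsorization cap."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceil(pct * n / 100)
    return ordered[rank - 1]

def winsorize(values, cap):
    """Replace every value above cap with cap; lower values are untouched."""
    return [min(v, cap) for v in values]

def bootstrap_mean_diff_ci(control, treatment, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        c = rng.choices(control, k=len(control))      # resample with replacement
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

if __name__ == "__main__":
    control = [1.0] * 200
    treatment = [1.0] * 199 + [10000.0]  # one whale in the treatment arm
    cap = percentile_cap(control + treatment)
    print("raw       :", bootstrap_mean_diff_ci(control, treatment))
    print("winsorized:", bootstrap_mean_diff_ci(winsorize(control, cap),
                                                winsorize(treatment, cap)))
```

In the toy data a single extreme user drives the whole raw interval; after capping, the interval collapses. That is the trade described above: the capped metric is stable, but it no longer measures what the top 1% of users contribute.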
NOVELTY AND TEMPORAL EFFECTS
Day 1 may show a positive effect from novelty (users explore the new UI), only for the confidence interval to cross zero by day 5 as the novelty wears off. Learning effects run the other way: users initially struggle, then adapt, so an early negative result can reverse. Do not ship based on early windows alone. Require 7+ days of observation with stable intervals before declaring a winner.
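A ship gate along these lines can be expressed as a simple rule. The sketch below is hypothetical throughout: the text does not define "stable", so this version assumes it means the most recent few daily intervals all exclude zero on the same side, and the name `has_stable_winner` and its default thresholds are illustrative only.

```python
def has_stable_winner(daily_intervals, min_days=7, stable_days=3):
    """Decide whether an experiment has a stable result.

    daily_intervals: list of (lower, upper) confidence intervals for the
    treatment effect, one per day, oldest first.
    Assumption: "stable" means the last stable_days intervals all exclude
    zero on the same side.
    """
    if len(daily_intervals) < min_days:
        return False  # too early, regardless of how good day 1 looks
    recent = daily_intervals[-stable_days:]
    all_positive = all(lower > 0 for lower, _ in recent)
    all_negative = all(upper < 0 for _, upper in recent)
    return all_positive or all_negative

if __name__ == "__main__":
    # Novelty pattern: positive early, interval crosses zero from day 5 on.
    novelty = [(0.5, 2.0), (0.4, 1.8), (0.2, 1.5), (0.1, 1.2),
               (-0.2, 0.9), (-0.3, 0.8), (-0.3, 0.7), (-0.4, 0.7)]
    print(has_stable_winner(novelty))
```

A gate like this rejects both failure cases above: a strong first few days without enough history, and a full week whose recent intervals have drifted back across zero.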