
Failure Modes: SRM, Peeking, Interference, and Heavy Tails in Production

SAMPLE RATIO MISMATCH (SRM)

Observing a 53/47 split when the design called for 50/50 is a sample ratio mismatch, and at any meaningful sample size it invalidates all results. The observed difference may come from selection bias (e.g., a mobile client bug dropping events in treatment), not from a treatment effect. Run a chi-squared goodness-of-fit test on the assignment counts and treat p < 0.001 as an automatic alarm. An SRM alarm should pause the experiment until the root cause is identified and fixed.
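The SRM alarm can be sketched in a few lines. This is a minimal illustration, not a production check: the function names `srm_p_value` and `has_srm` are hypothetical, and it assumes a two-arm experiment, where the chi-squared statistic has one degree of freedom and its p-value can be computed with the standard library alone.

```python
import math


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-arm traffic split."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # With 1 degree of freedom, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def has_srm(n_control, n_treatment, expected_control_share=0.5, threshold=0.001):
    """Return True when the split deviates enough to trigger the alarm."""
    return srm_p_value(n_control, n_treatment, expected_control_share) < threshold


if __name__ == "__main__":
    print(has_srm(53, 47))      # 100 users: too little data to call it a mismatch
    print(has_srm(5300, 4700))  # 10,000 users: same ratio, alarm fires
```

The same 53/47 ratio is unremarkable with 100 users but a clear alarm with 10,000, which is why the check runs on counts rather than on the ratio alone.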

PEEKING ABUSE

A team peeked daily for 10 days and stopped when p = 0.04. Later analysis showed the true alpha (false positive rate) was 18%, not the nominal 5%, because every additional look is another chance to cross the threshold by luck. The result was likely a false positive. Prevention: Require minimum exposure before showing results. Use group sequential designs with pre-planned analysis points. Restrict early stopping to automated systems that maintain the proper alpha.
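A small A/A simulation makes the inflation concrete. The sketch below is an illustration under stated assumptions, not the analysis the team ran: outcomes are unit-variance normal with no true effect, each arm gets the same number of users per day, and the function name `false_positive_rate` and all its parameters are hypothetical.

```python
import math
import random


def false_positive_rate(n_days=10, peek_daily=True, users_per_arm_per_day=1000,
                        n_sims=20000, z_crit=1.96, seed=0):
    """Share of A/A experiments (no true effect) that get declared significant."""
    rng = random.Random(seed)
    # The daily (treatment sum - control sum) of unit-variance outcomes
    # has variance 2 * users_per_arm_per_day.
    daily_sd = math.sqrt(2 * users_per_arm_per_day)
    false_positives = 0
    for _ in range(n_sims):
        diff_sum = 0.0
        for day in range(1, n_days + 1):
            diff_sum += rng.gauss(0.0, daily_sd)
            if not peek_daily and day < n_days:
                continue  # single pre-planned analysis on the last day
            # z-statistic for the cumulative difference in means
            z = diff_sum / math.sqrt(2 * users_per_arm_per_day * day)
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team stops at the first "significant" look
    return false_positives / n_sims


if __name__ == "__main__":
    print("single look:", false_positive_rate(peek_daily=False))
    print("daily peeks:", false_positive_rate(peek_daily=True))
    print("daily peeks, stricter boundary:", false_positive_rate(z_crit=2.555))
```

Under these assumptions the single-look rate stays near 5%, while ten daily peeks at the same threshold land in the high teens. The last call uses a constant boundary of about 2.555, the Pocock value for 10 equally spaced looks, as one example of a group sequential design that brings the overall rate back to roughly 5%.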

HEAVY TAILED METRICS

Revenue and watch time are heavy-tailed: a few users contribute disproportionately. Confidence intervals on the mean can be 2-5x wider than for well-behaved metrics at the same sample size. Mitigations: (1) Log transform to compress the tail. (2) Winsorization: cap values at the 99th percentile. (3) Bootstrap intervals, which estimate the sampling distribution empirically instead of assuming normality. Each trades some information for stability.

⚠️ Key Trade-off: Winsorization reduces variance but biases the mean estimate downward, toward the median. Log transform helps normality but complicates interpretation, because effects are then measured on a log (multiplicative) scale. Choose based on stakeholder needs.
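The mitigations above can be sketched on synthetic data. This is a minimal illustration with assumptions worth stating: the revenue sample is simulated from a lognormal distribution, the helper names `winsorize_upper` and `bootstrap_mean_ci` are hypothetical, and the bootstrap uses a simple nearest-rank percentile interval rather than a bias-corrected variant.

```python
import math
import random
import statistics


def winsorize_upper(values, pct=0.99):
    """Cap every value at (approximately) the pct-th percentile of the sample."""
    ordered = sorted(values)
    cap = ordered[min(len(ordered) - 1, int(pct * len(ordered)))]
    return [min(v, cap) for v in values]


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated heavy-tailed revenue per user (assumption: lognormal).
    revenue = [rng.lognormvariate(0.0, 1.5) for _ in range(2000)]
    capped = winsorize_upper(revenue)
    logged = [math.log1p(v) for v in revenue]  # log(1 + x) tolerates zero revenue

    print("raw mean:", statistics.fmean(revenue), "CI:", bootstrap_mean_ci(revenue))
    print("winsorized mean:", statistics.fmean(capped), "CI:", bootstrap_mean_ci(capped))
    print("mean of log1p:", statistics.fmean(logged), "CI:", bootstrap_mean_ci(logged))
```

The winsorized mean is lower than the raw mean, and its spread is smaller, which is the trade-off in miniature. The log-scale numbers are not in currency units, so they need to be translated back before they reach stakeholders.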

NOVELTY AND TEMPORAL EFFECTS

Day 1 may show a positive effect from novelty (users explore the new UI), but the confidence interval crosses zero by day 5 as the novelty wears off. Learning effects run the other way: users initially struggle, then adapt. Do not ship based on early windows alone. Require 7+ days of observation with stable intervals before declaring a winner.
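One way to turn this rule into a gate is sketched below. The criteria are assumptions, not a standard: "stable" is taken to mean that the 95% intervals for the most recent few days all sit above zero, and the names `daily_intervals`, `ready_to_ship`, `min_days`, and `stable_days` are hypothetical. The input is a list of per-day `(effect_estimate, standard_error)` pairs.

```python
from typing import List, Tuple

Z_95 = 1.96  # two-sided 95% normal critical value


def daily_intervals(effects: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Turn (estimate, standard_error) pairs into 95% confidence intervals."""
    return [(est - Z_95 * se, est + Z_95 * se) for est, se in effects]


def ready_to_ship(effects: List[Tuple[float, float]],
                  min_days: int = 7, stable_days: int = 3) -> bool:
    """Require enough days, and recent intervals that all exclude zero on the positive side."""
    if len(effects) < min_days:
        return False
    recent = daily_intervals(effects)[-stable_days:]
    return all(lower > 0 for lower, _upper in recent)


if __name__ == "__main__":
    # Illustrative numbers only: a lift that fades as the novelty wears off.
    novelty = [(0.050, 0.010), (0.035, 0.010), (0.025, 0.010), (0.020, 0.010),
               (0.012, 0.010), (0.008, 0.010), (0.005, 0.010)]
    steady = [(0.030, 0.010)] * 7

    print(ready_to_ship(novelty[:3]))  # too few days, regardless of the lift
    print(ready_to_ship(novelty))      # recent intervals cross zero
    print(ready_to_ship(steady))       # consistent positive effect
```

A fuller check would also look at learning effects, where the early days understate the eventual lift, so a gate like this is better used to block premature launches than to declare a loss.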

💡 Key Takeaways
SRM (53/47 instead of 50/50) invalidates all results; chi-squared p<0.001 should auto-pause experiment
Peeking daily and stopping at p<0.05 inflates true alpha to 15-20%; require minimum exposure before results
Heavy-tailed metrics (revenue) have 2-5x wider CIs; use log transform, winsorization, or bootstrap
Novelty effects bias early windows; require 7+ days with stable intervals before shipping
📌 Interview Tips
1. Describe SRM detection: chi-squared test on traffic split, p<0.001 triggers automatic pause
2. Explain the peeking trap: team stopped at p=0.04 after daily checks, true alpha was 18%
3. Mention heavy-tail mitigations: winsorize at 99th percentile or log-transform before computing intervals