Failure Modes: SRM, Peeking, Interference, and Heavy Tails in Production
Sample Ratio Mismatch (SRM) is the most common silent killer of experiment validity. If you assign traffic 50/50 but observe 53/47, do not trust any p-values or confidence intervals. Causes include client-side bucketing bugs, race conditions in assignment logic, cookie churn that affects one arm more than the other, or filtering steps (for example, bot removal) that correlate with treatment. Large companies run automatic chi-squared tests with alarm thresholds at p < 0.001. A single SRM can invalidate weeks of work and lead to wrong shipping decisions.
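A minimal sketch of such an automated SRM check, using a chi-squared goodness-of-fit test against the planned split (the counts and the 0.001 alarm threshold below are illustrative):

```python
# Minimal SRM check: chi-squared goodness-of-fit against the planned split.
# Counts are illustrative; in practice they come from assignment logs.
from scipy.stats import chisquare

control_count, treatment_count = 530_000, 470_000   # observed assignments
total = control_count + treatment_count
expected = [total * 0.5, total * 0.5]                # planned 50/50 split

stat, p_value = chisquare([control_count, treatment_count], f_exp=expected)

if p_value < 0.001:
    # Alarm threshold from the text: halt analysis, do not trust any results.
    print(f"SRM detected (chi2={stat:.1f}, p={p_value:.2e}): pause experiment")
else:
    print(f"No SRM detected (p={p_value:.3f})")
```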
Peeking and optional stopping inflate false positives even when you think you are being careful. Checking results daily and stopping the first time p drops below 0.05 can push your true alpha from 5 percent to 15 percent or higher. The fix is to commit to a sample size in advance or to use proper sequential methods such as alpha spending. Many companies disable the ship button until minimum exposure and minimum sample size are met, preventing premature decisions.
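A small Monte Carlo sketch of this inflation: simulate daily peeks under the null with a plain two-sample t-test and stop at the first nominally significant look. The 14-day horizon, sample sizes, and simulation count are all illustrative.

```python
# Monte Carlo sketch of alpha inflation from daily peeking under the null
# (no true effect). All parameters are illustrative.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, days, users_per_day = 1_000, 14, 500
false_positives = 0

for _ in range(n_sims):
    control = rng.normal(0, 1, days * users_per_day)
    treatment = rng.normal(0, 1, days * users_per_day)  # same mean: null is true
    for day in range(1, days + 1):
        n = day * users_per_day
        _, p = ttest_ind(control[:n], treatment[:n])
        if p < 0.05:              # "peek and stop at first significant result"
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_sims:.1%}")
# Typically prints well above the nominal 5 percent, illustrating the inflation.
```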
Interference and spillovers break the independence assumption. In search or feeds, a ranking change can alter cache hit rates, biasing the control group. In marketplaces, driver or inventory supply is shared across arms. Uber observed that user-level randomization in driver experiments caused 30 percent bias in estimated effects; switching to geographic cluster or switchback designs reduced this to under 10 percent. Residual interference still widens intervals, but the remaining bias is far more acceptable.
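A hedged sketch of how switchback assignment might be implemented: randomize at the (city, time window) level instead of the user level, so shared supply within a window sees only one experience. The function name, salt, and city values here are hypothetical, not any company's production API.

```python
# Hypothetical switchback assignment: hash a (city, time window) pair to an arm
# so all rides in that block share one treatment, limiting cross-arm spillover.
import hashlib

def switchback_arm(city: str, window_start_hour: int, salt: str = "exp_42") -> str:
    """Deterministically assign a city-hour block to control or treatment."""
    key = f"{salt}:{city}:{window_start_hour}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 2
    return "treatment" if bucket == 1 else "control"

print(switchback_arm("san_francisco", 14))
print(switchback_arm("san_francisco", 15))   # the next window may flip arms
```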
Heavy-tailed metrics like watch time, revenue, or p95 latency create two problems. First, mean-based confidence intervals can be extremely wide or misleading because a few extreme values dominate. Second, tail quantiles like p95 or p99 have high variance and need much larger samples or longer durations to stabilize. Solutions include log transforms, winsorization (capping extreme values), and bootstrap percentile intervals. Even with these, expect tests on tail metrics to take 2x to 3x longer than tests on means.
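A rough sketch combining two of those remedies, winsorization and a bootstrap percentile interval, for the difference in means of a heavy-tailed metric. The lognormal data and the 99th-percentile cap are illustrative choices, not a prescribed recipe.

```python
# Sketch: winsorize a heavy-tailed metric, then bootstrap a percentile CI
# for the difference in means. Data below is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(1)
control = rng.lognormal(mean=3.0, sigma=1.5, size=20_000)    # e.g. watch time
treatment = rng.lognormal(mean=3.05, sigma=1.5, size=20_000)

def winsorize(x, upper_pct=99):
    """Cap extreme values at the chosen percentile to tame the tail."""
    cap = np.percentile(x, upper_pct)
    return np.minimum(x, cap)

c, t = winsorize(control), winsorize(treatment)

# Bootstrap percentile interval for the difference in winsorized means.
boot_diffs = []
for _ in range(2_000):
    cb = rng.choice(c, size=c.size, replace=True)
    tb = rng.choice(t, size=t.size, replace=True)
    boot_diffs.append(tb.mean() - cb.mean())

lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Winsorized mean difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```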
💡 Key Takeaways
• Sample Ratio Mismatch with observed 53/47 instead of expected 50/50 invalidates all results; a chi-squared test with p < 0.001 should trigger an automatic alarm and experiment pause
• Peeking hourly at a 7-day test and stopping when p < 0.05 inflates the true false positive rate from 5 percent to 15 percent; require minimum exposure before showing results
• Uber marketplace experiments showed 30 percent bias from shared driver supply with user randomization; switchback by geography reduced bias to under 10 percent
• Heavy-tailed watch time or revenue causes mean confidence intervals to be 2x to 5x wider than for normal metrics; use a log transform or winsorization to stabilize
• Ratio metrics like revenue per session need the delta method or bootstrap; a naive independence assumption produces wrong coverage and misleading intervals
• Non-stationary metrics show novelty effects: day 1 may be positive but the confidence interval crosses zero by day 5; do not ship based on the early window alone
📌 Examples
Meta feed experiment: observed a 52/48 traffic split instead of 50/50 because a mobile client bug dropped events in the treatment arm; the SRM chi-squared test (p = 0.0001) triggered an automatic pause
Netflix A/B test: the team peeked daily for 10 days and stopped when p = 0.04; later analysis showed the true alpha under this strategy was 18 percent, not 5 percent
Uber driver experiment: user-level randomization estimated an 8 percent ETA improvement, but a switchback design showed only 5 percent with a wider CI of [3, 7] percent that reflects the true uncertainty
Google Search revenue per query: a bootstrap interval of [1.2, 1.9] cents with 10,000 resamples took 40 minutes of compute; the delta method gave [1.3, 1.8] cents in 2 seconds with similar coverage
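As a rough sketch of the delta-method approach mentioned for ratio metrics such as revenue per session or per query: it linearizes the ratio of means and accounts for the covariance between numerator and denominator at the randomization unit (here, the user). The distributions and sample sizes below are synthetic placeholders.

```python
# Sketch of a delta-method CI for a ratio metric (revenue per session) when
# randomization is at the user level. Data is synthetic and illustrative.
import numpy as np

def ratio_ci(revenue, sessions, z=1.96):
    """95% CI for mean(revenue) / mean(sessions) via the delta method."""
    n = len(revenue)
    my, mx = revenue.mean(), sessions.mean()
    ratio = my / mx
    vy, vx = revenue.var(ddof=1), sessions.var(ddof=1)
    cov = np.cov(revenue, sessions, ddof=1)[0, 1]
    # First-order Taylor expansion of y_bar / x_bar around (mu_y, mu_x).
    var_ratio = (vy - 2 * ratio * cov + ratio**2 * vx) / (n * mx**2)
    se = np.sqrt(var_ratio)
    return ratio, (ratio - z * se, ratio + z * se)

rng = np.random.default_rng(2)
sessions = rng.poisson(5, size=100_000) + 1               # sessions per user
revenue = sessions * rng.gamma(2.0, 0.8, size=100_000)    # revenue per user

r, (lo, hi) = ratio_ci(revenue, sessions)
print(f"revenue/session = {r:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Treating per-session revenue observations as independent would understate the variance here, which is why the per-user covariance term appears in the formula.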