Failure Modes: SRM, Peeking, Interference, and Heavy Tails in Production
Sample Ratio Mismatch (SRM) is the most common silent killer of experiment validity. If you assign traffic 50/50 but observe 53/47, do not trust any p-values or confidence intervals. Causes include client-side bucketing bugs, race conditions in assignment logic, cookie churn that affects one arm more than the other, or filtering steps (for example, bot removal) that correlate with treatment. Large companies run automatic chi-squared tests with alarm thresholds at p < 0.001. A single SRM can invalidate weeks of work and lead to wrong shipping decisions.
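A minimal sketch of such an automated SRM check, using a chi-squared goodness-of-fit test against the planned split (the counts and the 0.001 alarm threshold below are illustrative):

```python
# Minimal SRM check: chi-squared goodness-of-fit against the planned split.
# Counts are illustrative; in practice they come from assignment logs.
from scipy.stats import chisquare

control_count, treatment_count = 530_000, 470_000   # observed assignments
total = control_count + treatment_count
expected = [total * 0.5, total * 0.5]                # planned 50/50 split

stat, p_value = chisquare([control_count, treatment_count], f_exp=expected)

if p_value < 0.001:
    # Alarm threshold from the text: halt analysis, do not trust any results.
    print(f"SRM detected (chi2={stat:.1f}, p={p_value:.2e}): pause experiment")
else:
    print(f"No SRM detected (p={p_value:.3f})")
```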
Peeking and optional stopping inflate false positives even when you think you are being careful. Checking results daily and stopping the first time p drops below 0.05 can push your true alpha from 5 percent to 15 percent or higher. The fix is to commit to a sample size in advance or to use proper sequential methods such as alpha spending. Many companies disable the ship button until minimum exposure and minimum sample size are met, preventing premature decisions.
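A small Monte Carlo sketch of this inflation: simulate daily peeks under the null with a plain two-sample t-test and stop at the first nominally significant look. The 14-day horizon, sample sizes, and simulation count are all illustrative.

```python
# Monte Carlo sketch of alpha inflation from daily peeking under the null
# (no true effect). All parameters are illustrative.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, days, users_per_day = 1_000, 14, 500
false_positives = 0

for _ in range(n_sims):
    control = rng.normal(0, 1, days * users_per_day)
    treatment = rng.normal(0, 1, days * users_per_day)  # same mean: null is true
    for day in range(1, days + 1):
        n = day * users_per_day
        _, p = ttest_ind(control[:n], treatment[:n])
        if p < 0.05:              # "peek and stop at first significant result"
            false_positives += 1
            break

print(f"False positive rate with daily peeking: {false_positives / n_sims:.1%}")
# Typically prints well above the nominal 5 percent, illustrating the inflation.
```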
Interference and spillovers break the independence assumption. In search or feeds, a ranking change can alter cache hit rates, biasing the control group. In marketplaces, driver or inventory supply is shared across arms. Uber observed that user-level randomization in driver experiments caused 30 percent bias in estimated effects; switching to geographic cluster or switchback designs reduced this to under 10 percent. Residual interference still widens intervals, but the remaining bias is far more acceptable.
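A hedged sketch of how switchback assignment might be implemented: randomize at the (city, time window) level instead of the user level, so shared supply within a window sees only one experience. The function name, salt, and city values here are hypothetical, not any company's production API.

```python
# Hypothetical switchback assignment: hash a (city, time window) pair to an arm
# so all rides in that block share one treatment, limiting cross-arm spillover.
import hashlib

def switchback_arm(city: str, window_start_hour: int, salt: str = "exp_42") -> str:
    """Deterministically assign a city-hour block to control or treatment."""
    key = f"{salt}:{city}:{window_start_hour}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 2
    return "treatment" if bucket == 1 else "control"

print(switchback_arm("san_francisco", 14))
print(switchback_arm("san_francisco", 15))   # the next window may flip arms
```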
Heavy-tailed metrics like watch time, revenue, or p95 latency create two problems. First, mean-based confidence intervals can be extremely wide or misleading because a few extreme values dominate. Second, tail quantiles like p95 or p99 have high variance and need much larger samples or longer durations to stabilize. Solutions include log transforms, winsorization (capping extreme values), and bootstrap percentile intervals. Even with these, expect tests on tail metrics to take 2x to 3x longer than tests on means.
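A rough sketch combining two of those remedies, winsorization and a bootstrap percentile interval, for the difference in means of a heavy-tailed metric. The lognormal data and the 99th-percentile cap are illustrative choices, not a prescribed recipe.

```python
# Sketch: winsorize a heavy-tailed metric, then bootstrap a percentile CI
# for the difference in means. Data below is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(1)
control = rng.lognormal(mean=3.0, sigma=1.5, size=20_000)    # e.g. watch time
treatment = rng.lognormal(mean=3.05, sigma=1.5, size=20_000)

def winsorize(x, upper_pct=99):
    """Cap extreme values at the chosen percentile to tame the tail."""
    cap = np.percentile(x, upper_pct)
    return np.minimum(x, cap)

c, t = winsorize(control), winsorize(treatment)

# Bootstrap percentile interval for the difference in winsorized means.
boot_diffs = []
for _ in range(2_000):
    cb = rng.choice(c, size=c.size, replace=True)
    tb = rng.choice(t, size=t.size, replace=True)
    boot_diffs.append(tb.mean() - cb.mean())

lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Winsorized mean difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```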
💡 Key Takeaways
• Sample Ratio Mismatch with observed 53/47 instead of expected 50/50 invalidates all results; a chi-squared test with p < 0.001 should trigger an automatic alarm and experiment pause
• Peeking hourly at a 7-day test and stopping when p < 0.05 inflates the true false positive rate from 5 percent to 15 percent; require minimum exposure before showing results
• Uber marketplace experiments showed 30 percent bias from shared driver supply with user randomization; switchback by geography reduced bias to under 10 percent
• Heavy-tailed watch time or revenue causes mean confidence intervals to be 2x to 5x wider than for normal metrics; use a log transform or winsorization to stabilize
• Ratio metrics like revenue per session need the delta method or bootstrap; a naive independence assumption produces wrong coverage and misleading intervals
• Non-stationary metrics show novelty effects: day 1 may be positive but the confidence interval crosses zero by day 5; do not ship based on the early window alone
📌 Examples
Meta feed experiment: observed a 52/48 traffic split instead of 50/50 because a mobile client bug dropped events in the treatment arm; the SRM chi-squared test (p = 0.0001) triggered an automatic pause
Netflix A/B test: the team peeked daily for 10 days and stopped when p = 0.04; later analysis showed the true alpha under this strategy was 18 percent, not 5 percent
Uber driver experiment: user-level randomization estimated an 8 percent ETA improvement, but a switchback design showed only 5 percent with a wider CI of [3, 7] percent that reflects the true uncertainty
Google Search revenue per query: a bootstrap interval of [1.2, 1.9] cents with 10,000 resamples took 40 minutes of compute; the delta method gave [1.3, 1.8] cents in 2 seconds with similar coverage
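As a rough sketch of the delta-method approach mentioned for ratio metrics such as revenue per session or per query: it linearizes the ratio of means and accounts for the covariance between numerator and denominator at the randomization unit (here, the user). The distributions and sample sizes below are synthetic placeholders.

```python
# Sketch of a delta-method CI for a ratio metric (revenue per session) when
# randomization is at the user level. Data is synthetic and illustrative.
import numpy as np

def ratio_ci(revenue, sessions, z=1.96):
    """95% CI for mean(revenue) / mean(sessions) via the delta method."""
    n = len(revenue)
    my, mx = revenue.mean(), sessions.mean()
    ratio = my / mx
    vy, vx = revenue.var(ddof=1), sessions.var(ddof=1)
    cov = np.cov(revenue, sessions, ddof=1)[0, 1]
    # First-order Taylor expansion of y_bar / x_bar around (mu_y, mu_x).
    var_ratio = (vy - 2 * ratio * cov + ratio**2 * vx) / (n * mx**2)
    se = np.sqrt(var_ratio)
    return ratio, (ratio - z * se, ratio + z * se)

rng = np.random.default_rng(2)
sessions = rng.poisson(5, size=100_000) + 1               # sessions per user
revenue = sessions * rng.gamma(2.0, 0.8, size=100_000)    # revenue per user

r, (lo, hi) = ratio_ci(revenue, sessions)
print(f"revenue/session = {r:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Treating per-session revenue observations as independent would understate the variance here, which is why the per-user covariance term appears in the formula.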