
Trade-Offs: Sequential Monitoring, Unit of Randomization, and Interval Methods

PEEKING AND ALPHA INFLATION

Peeking hourly at a weekly test and stopping as soon as p < 0.05 inflates the false positive rate far beyond the nominal 5%. With daily checks over 10 days, the true alpha can reach 15-20%. The problem: each peek is another chance to cross the 0.05 threshold by chance alone. Solutions: use alpha spending functions (which allocate the alpha budget across planned looks) or always-valid sequential tests, which keep the false positive rate at its nominal level no matter when you stop.
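The inflation is easy to reproduce with a small A/A simulation. The sketch below assumes a z-test with known variance, so each look adds one standardized batch increment; the function name and parameters are illustrative, not from any library. It also shows a crude Bonferroni split of alpha across looks, which is a conservative stand-in for a real alpha spending function, not an implementation of one.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(n_looks, alpha_per_look=0.05, n_sims=20000, seed=0):
    """Share of A/A tests declared significant when stopping at the first
    look whose two-sided z-test crosses the per-look threshold.

    Assumption: equal-sized batches with known variance, so the z-statistic
    at look k is the sum of k standard normal increments divided by sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha_per_look / 2)
    rejections = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for k in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # one more batch of data
            if abs(cumulative / math.sqrt(k)) > z_crit:
                rejections += 1  # stopped early on a false positive
                break
    return rejections / n_sims


if __name__ == "__main__":
    print("1 look, alpha 0.05:     ", false_positive_rate(1))
    print("10 looks, alpha 0.05:   ", false_positive_rate(10))
    # Bonferroni: spend 0.05 / 10 at each look (conservative, not optimal).
    print("10 looks, alpha 0.05/10:", false_positive_rate(10, alpha_per_look=0.005))
```

With a single look the rate should land near 5%, with ten uncorrected looks it should fall in the 15-20% range quoted above, and the Bonferroni split should stay under 5% at the cost of power.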

UNIT OF RANDOMIZATION

User level: Observations are independent across users, so standard errors are valid as computed. But you get fewer units than with requests, so tests run longer.
Session/request level: More observations, faster tests. But observations from the same user are correlated, so naive standard errors come out too small. You need cluster-robust standard errors (clustering by user) to avoid inflated significance.
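To see how much correlated sessions can distort the standard error, here is a minimal sketch for the simplest case: the standard error of a mean over sessions, computed naively and with residuals summed per user before squaring (the basic cluster-robust form, without small-sample corrections). The data generator and all names are hypothetical, chosen only for illustration.

```python
import math
import random
from collections import defaultdict


def simulate_sessions(n_users, sessions_per_user, seed=0):
    """Hypothetical data: each user has a persistent effect shared by all
    of their sessions, plus independent per-session noise."""
    rng = random.Random(seed)
    user_ids, values = [], []
    for user in range(n_users):
        user_effect = rng.gauss(0.0, 1.0)
        for _ in range(sessions_per_user):
            user_ids.append(user)
            values.append(user_effect + rng.gauss(0.0, 1.0))
    return user_ids, values


def naive_and_clustered_se(user_ids, values):
    """Return (mean, naive SE, cluster-robust SE) for the session mean."""
    n = len(values)
    mean = sum(values) / n
    # Naive: treats every session as an independent observation.
    naive_var = sum((v - mean) ** 2 for v in values) / (n - 1) / n
    # Cluster-robust: sum residuals within each user first, then square,
    # so within-user correlation is counted instead of averaged away.
    resid_by_user = defaultdict(float)
    for user, v in zip(user_ids, values):
        resid_by_user[user] += v - mean
    cluster_var = sum(r ** 2 for r in resid_by_user.values()) / n ** 2
    return mean, math.sqrt(naive_var), math.sqrt(cluster_var)


if __name__ == "__main__":
    ids, vals = simulate_sessions(n_users=500, sessions_per_user=5)
    _, naive_se, clustered_se = naive_and_clustered_se(ids, vals)
    print(f"naive SE: {naive_se:.4f}  clustered SE: {clustered_se:.4f}")
```

In this setup the clustered standard error should be noticeably larger than the naive one. In practice a regression library with a cluster covariance option would usually do this work; the hand-rolled version is only meant to show where the correction comes from.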

MARKETPLACE INTERFERENCE

In two-sided marketplaces (rideshare, e-commerce with shared inventory), treatment affects control through shared resources. If treatment users book more drivers, fewer drivers remain for control, so user-level randomization gives a biased estimate of the treatment effect. Switchback experiments instead randomize entire regions by time slot (for example, 15-minute periods), trading statistical precision for reduced interference bias.

⚠️ Key Trade-off: Switchback reduces interference bias by 60-70% but widens confidence intervals by 20-30% because you have fewer independent units (time slots) than users.
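A switchback assignment can be sketched as a deterministic function of region and time slot, so every request in the same region during the same slot sees the same arm. The hashing scheme, the `salt` value, and the function names below are assumptions for illustration; production designs often add constraints such as balanced or alternating schedules and washout periods between slots.

```python
import hashlib
from datetime import datetime, timedelta, timezone

SLOT_MINUTES = 15  # slot length taken from the 15-minute example above


def switchback_arm(region, ts, salt="hypothetical-experiment"):
    """Assign a whole (region, time slot) pair to one arm.

    `ts` should be a timezone-aware datetime; slots are aligned to the
    Unix epoch, so they start at :00, :15, :30 and :45 in UTC.
    """
    slot = int(ts.timestamp()) // (SLOT_MINUTES * 60)
    digest = hashlib.sha256(f"{salt}:{region}:{slot}".encode()).digest()
    return "treatment" if digest[0] % 2 == 0 else "control"


def slot_schedule(region, start, n_slots, salt="hypothetical-experiment"):
    """Arms for `n_slots` consecutive slots beginning at `start`."""
    step = timedelta(minutes=SLOT_MINUTES)
    return [switchback_arm(region, start + i * step, salt) for i in range(n_slots)]


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for i, arm in enumerate(slot_schedule("city-a", start, 8)):
        print((start + i * timedelta(minutes=SLOT_MINUTES)).strftime("%H:%M"), arm)
```

The unit of analysis is then the (region, slot) pair rather than the user, which is why the effective sample size shrinks and the intervals widen.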

RATIO METRICS

Metrics like revenue per session or clicks per user require special handling. Naive intervals that treat the numerator and denominator as independent produce wrong coverage. Use the delta method (Taylor expansion) or the bootstrap. The delta method is 10-100x faster than the bootstrap at scale.
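For a ratio of means R = mean(x) / mean(y), the first-order delta method gives Var(R) ≈ (s_x² - 2·R·s_xy + R²·s_y²) / (n · mean(y)²), where the covariance term s_xy is exactly what the naive approach drops. A minimal sketch, assuming one (numerator, denominator) pair per randomization unit, for example total revenue and session count per user; the function name is illustrative:

```python
import math
from statistics import NormalDist, mean


def ratio_ci_delta(num, den, confidence=0.95):
    """Delta-method confidence interval for mean(num) / mean(den).

    Assumes one (num, den) pair per independent randomization unit.
    Returns (ratio, lower, upper).
    """
    n = len(num)
    mx, my = mean(num), mean(den)
    r = mx / my
    var_x = sum((x - mx) ** 2 for x in num) / (n - 1)
    var_y = sum((y - my) ** 2 for y in den) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(num, den)) / (n - 1)
    # First-order Taylor expansion of x/y around (mx, my).
    var_r = (var_x - 2 * r * cov_xy + r ** 2 * var_y) / (n * my ** 2)
    var_r = max(var_r, 0.0)  # guard against tiny negative rounding error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(var_r)
    return r, r - half_width, r + half_width


if __name__ == "__main__":
    revenue = [12.0, 0.0, 30.5, 8.0, 0.0, 21.0, 4.5, 16.0]  # per user
    sessions = [3, 1, 6, 2, 1, 5, 2, 4]                      # per user
    print(ratio_ci_delta(revenue, sessions))
```

This needs a single pass over the data and no resampling, which is the source of the speed advantage over the bootstrap.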

💡 Key Takeaways
Peeking inflates false positive rate from 5% to 15-20%; use alpha spending or always-valid sequential tests
Session-level randomization needs cluster-robust standard errors to avoid inflated significance from correlated observations
Marketplace interference: switchback experiments randomize regions by time slots, reducing bias 60-70% but widening CI 20-30%
Ratio metrics (revenue/session) need delta method or bootstrap; naive intervals produce wrong coverage
📌 Interview Tips
1. Explain the peeking problem: daily checks for 10 days inflate the true alpha from 5% to 15-20%
2. Describe the unit of randomization trade-off: user level is clean but slow, request level is fast but needs clustering
3. Mention switchback for marketplaces: randomize an entire city by 15-minute slots to avoid shared-resource contamination