Trade-Offs: Sequential Monitoring, Unit of Randomization, and Interval Methods
PEEKING AND ALPHA INFLATION
Peeking hourly at a weekly test and stopping as soon as p < 0.05 inflates the false positive rate far beyond 5%. Even with only daily checks over 10 days, the true alpha can reach 15-20%. The problem: each peek is another chance to cross 0.05 by chance, and stopping at the first crossing keeps exactly those lucky draws. Solutions: use alpha spending functions (allocate the alpha budget across planned looks) or always-valid sequential tests, which hold the Type I error at the nominal level no matter when you stop.
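The inflation from repeated looks can be checked with a small simulation. The sketch below is illustrative only: it assumes an A/A test (no true effect), equal-sized batches between looks, and known variance, so the z-statistic at each look is a scaled random walk. The function name and the constants are not from the original text; the value 2.555 is the approximate Pocock constant boundary for 10 equally spaced looks from standard tables, used here as one simple example of spreading alpha across planned looks.

```python
import numpy as np

Z_FIXED = 1.959964    # two-sided 5% critical value for a single planned look
Z_POCOCK_10 = 2.555   # approximate Pocock boundary for 10 equally spaced looks (assumed from tables)

def false_positive_rate(n_looks, z_crit=Z_FIXED, n_sims=20000, seed=0):
    """Share of simulated A/A tests that cross |z| > z_crit at ANY look."""
    rng = np.random.default_rng(seed)
    # Each column is the standardized contribution of one new batch of data.
    steps = rng.standard_normal((n_sims, n_looks))
    looks = np.arange(1, n_looks + 1)
    # z-statistic after k batches: cumulative sum rescaled to unit variance.
    z = np.cumsum(steps, axis=1) / np.sqrt(looks)
    # "Stop at the first significant peek" is equivalent to "any look crossed".
    return float((np.abs(z) > z_crit).any(axis=1).mean())

if __name__ == "__main__":
    print("1 look, fixed threshold:  ", false_positive_rate(1))
    print("10 looks, fixed threshold:", false_positive_rate(10))
    print("10 looks, Pocock boundary:", false_positive_rate(10, Z_POCOCK_10))
```

Under these assumptions the single-look rate should sit near 5%, the ten-look rate with the unadjusted threshold should land well above it, and the stricter per-look boundary should bring it back to roughly 5%.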
UNIT OF RANDOMIZATION
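User level: Observations are independent across users, so the usual standard errors are valid. But there are fewer users than requests, so tests take longer to reach the same power.
Session/request level: More observations, faster tests. But observations from the same user are correlated, which violates the independence assumption behind the usual standard errors. Use cluster-robust standard errors (clustered by user); otherwise the standard errors come out too small and significance is overstated.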
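A minimal sketch of cluster-robust standard errors, written directly in NumPy so the mechanics are visible. It fits y ~ 1 + treatment by OLS and compares the usual standard errors with a sandwich estimator that sums scores within each user before squaring. The function names and the synthetic data are assumptions for illustration: sessions are randomized independently, and each simulated user has their own baseline and their own response to treatment, which is what makes sessions from the same user correlated. No small-sample correction is applied.

```python
import numpy as np

def ols_with_cluster_se(y, X, groups):
    """Return (beta, naive_se, cluster_se) for OLS of y on X, clustering by `groups`."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    # Naive covariance: assumes every row is an independent observation.
    naive_cov = bread * (resid @ resid) / (n - k)
    # Cluster-robust "meat": sum each user's scores first, then take outer products.
    scores = X * resid[:, None]
    _, idx = np.unique(np.asarray(groups), return_inverse=True)
    summed = np.zeros((idx.max() + 1, k))
    np.add.at(summed, idx, scores)
    cluster_cov = bread @ (summed.T @ summed) @ bread
    return beta, np.sqrt(np.diag(naive_cov)), np.sqrt(np.diag(cluster_cov))

def simulate_sessions(n_users=500, sessions_per_user=20, seed=0):
    """Synthetic session-level experiment (hypothetical data-generating process)."""
    rng = np.random.default_rng(seed)
    user = np.repeat(np.arange(n_users), sessions_per_user)
    baseline = rng.normal(0.0, 1.0, n_users)[user]   # per-user baseline
    effect = rng.normal(0.5, 2.0, n_users)[user]     # per-user treatment response
    treat = rng.integers(0, 2, user.size)            # session-level coin flip
    y = baseline + effect * treat + rng.normal(0.0, 1.0, user.size)
    X = np.column_stack([np.ones(user.size), treat])
    return y, X, user

if __name__ == "__main__":
    beta, naive_se, cluster_se = ols_with_cluster_se(*simulate_sessions())
    print("effect estimate:", beta[1])
    print("naive SE:", naive_se[1], "cluster-robust SE:", cluster_se[1])
```

In this synthetic setup the clustered standard error on the treatment coefficient should come out noticeably larger than the naive one, because the effective sample size is closer to the number of users than to the number of sessions.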
MARKETPLACE INTERFERENCE
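In two-sided marketplaces (rideshare, e-commerce with shared inventory), treatment affects control through shared resources. If treatment users book more drivers, fewer drivers remain for control, so the two groups are no longer independent and user-level randomization gives a biased estimate of the effect. Switchback experiments instead randomize entire regions by time slot (for example, 15-minute periods), so everyone in a region is in the same condition at any given moment. This reduces interference bias at the cost of far fewer randomization units, and therefore lower power, plus possible carryover between adjacent slots.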
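One way switchback assignment could be implemented is sketched below, assuming 15-minute slots and a deterministic hash of (experiment salt, region, slot index) so that every service computes the same arm without shared state. The function names, the salt value, and the hashing scheme are hypothetical choices for illustration, not a description of any particular platform.

```python
import hashlib
from datetime import datetime, timezone

SLOT_MINUTES = 15  # slot length taken from the example in the text

def slot_index(ts, slot_minutes=SLOT_MINUTES):
    """Index of the time slot containing a timezone-aware timestamp."""
    return int(ts.timestamp() // (slot_minutes * 60))

def switchback_arm(region, ts, salt="exp-001", slot_minutes=SLOT_MINUTES):
    """Assign a whole (region, time slot) cell to one arm, deterministically."""
    key = f"{salt}:{region}:{slot_index(ts, slot_minutes)}".encode()
    digest = hashlib.sha256(key).digest()
    # The parity of the first hash byte acts as a reproducible coin flip.
    return "treatment" if digest[0] % 2 == 0 else "control"

if __name__ == "__main__":
    t = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(switchback_arm("san-francisco", t))
```

Every request in the same region and slot receives the same arm, and analysis should then treat the region-slot cell, not the individual request, as the unit of observation.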
RATIO METRICS
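Metrics like revenue per session or clicks per user require special handling, because both the numerator and the denominator are random and they are usually correlated. Naive intervals that treat numerator and denominator as independent have the wrong coverage. Use the delta method (a first-order Taylor expansion of the ratio) or the bootstrap. The delta method is closed-form, which makes it 10-100x faster at scale.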
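A minimal sketch of the delta-method interval for a ratio of means, assuming one row per randomization unit (for example, per-user totals of clicks and sessions). The function name and inputs are illustrative. The first-order approximation used is Var(R) ≈ (var_num - 2 R cov + R^2 var_den) / (n * mean_den^2), where R = mean_num / mean_den.

```python
import numpy as np

def ratio_delta_ci(num, den, z=1.959964):
    """Delta-method estimate, SE, and CI for mean(num) / mean(den).

    `num` and `den` hold one value per independent unit, e.g. per-user
    clicks and per-user sessions.
    """
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    n = num.size
    mean_den = den.mean()
    ratio = num.mean() / mean_den
    cov = np.cov(num, den, ddof=1)  # 2x2 sample covariance matrix
    # First-order Taylor expansion of the ratio around the two means.
    var = (cov[0, 0] - 2.0 * ratio * cov[0, 1] + ratio**2 * cov[1, 1]) / (n * mean_den**2)
    se = float(np.sqrt(var))
    return float(ratio), se, (ratio - z * se, ratio + z * se)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sessions = rng.poisson(3.0, 5000) + 1      # synthetic per-user sessions
    clicks = rng.poisson(0.4 * sessions)       # synthetic per-user clicks
    print(ratio_delta_ci(clicks, sessions))
```

The covariance term is the part the naive interval omits. Because the whole calculation needs only means, variances, and one covariance, it is a single pass over the data, whereas a bootstrap repeats the ratio calculation for every resample.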