A/B Testing & Experimentation • Statistical Significance & Confidence Intervals • Hard • ⏱️ ~3 min
Trade-Offs: Sequential Monitoring, Unit of Randomization, and Interval Methods
Fixed-horizon tests are simple and guarantee exact alpha control at the planned sample size, but they offer no valid way to stop early. If you peek hourly at a week-long test and stop as soon as p dips below 0.05, you inflate the false-positive rate far beyond 5 percent. Sequential methods like alpha spending or always-valid tests allow continuous monitoring but add conceptual complexity: dashboards must show valid interim intervals aligned with the chosen sequential method, and teams must be trained to interpret them correctly.
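The inflation from peeking is easy to demonstrate by simulation. The sketch below (plain-Python Monte Carlo; the look counts, sample sizes, and function names are hypothetical choices, not from the source) runs A/A tests where both arms draw from the same distribution, so every rejection is a false positive. Testing at every look and stopping at the first p < 0.05 pushes the realized false-positive rate well above the nominal 5 percent:

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def simulated_fpr(n_sims=2000, n_looks=10, n_per_look=50, peek=True, seed=1):
    """A/A simulation: both arms draw N(0, 1), so any rejection is a
    false positive. With peek=True we test at every look and stop at
    the first p < 0.05; with peek=False we test once at the end."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        sum_a = sum_b = sumsq_a = sumsq_b = 0.0
        rejected = False
        for look in range(1, n_looks + 1):
            for _ in range(n_per_look):
                xa, xb = rng.gauss(0, 1), rng.gauss(0, 1)
                sum_a += xa; sumsq_a += xa * xa
                sum_b += xb; sumsq_b += xb * xb
            if not peek and look < n_looks:
                continue  # fixed horizon: only test at the final look
            n = look * n_per_look
            ma, mb = sum_a / n, sum_b / n
            va = (sumsq_a - n * ma * ma) / (n - 1)
            vb = (sumsq_b - n * mb * mb) / (n - 1)
            z = (ma - mb) / math.sqrt(va / n + vb / n)
            if two_sided_p(z) < 0.05:
                rejected = True
                break
        false_positives += rejected
    return false_positives / n_sims

print("peeking FPR:", simulated_fpr(peek=True))        # well above 0.05
print("fixed-horizon FPR:", simulated_fpr(peek=False))  # close to 0.05
```

A real group-sequential design would replace the flat 0.05 threshold at each look with alpha-spending boundaries (e.g. O'Brien–Fleming), which is exactly what restores overall alpha control.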
The unit of randomization creates fundamental trade-offs. Randomizing by user gives clean independence for per-user metrics like CTR or conversion rate. Randomizing by session or request yields more samples for latency metrics and faster results, but requires cluster-robust standard errors to avoid inflated significance from correlated observations within the same user. For marketplaces or social networks, truly independent units do not exist at all: treatment spills over between units. Uber and Lyft use switchbacks (randomizing by time slot) or geographic clusters, trading unbiasedness for practicality. Residual interference can remain, widening intervals.
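One standard way to get honest uncertainty under session-level randomization is to resample whole users rather than individual sessions. This is a minimal sketch of a cluster (block) bootstrap; the data, cluster sizes, and function name are hypothetical illustrations:

```python
import random

def cluster_bootstrap_ci(obs_by_cluster, n_boot=2000, alpha=0.05, seed=2):
    """Percentile CI for the overall mean when observations within a
    cluster (e.g. sessions of one user) are correlated: resample whole
    clusters with replacement, never individual observations."""
    rng = random.Random(seed)
    clusters = list(obs_by_cluster)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(clusters) for _ in clusters]
        flat = [x for cluster in sample for x in cluster]
        stats.append(sum(flat) / len(flat))
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical data: 40 users, 25 sessions each, with a strong per-user
# effect so sessions from the same user are highly correlated.
rng = random.Random(0)
users = []
for _ in range(40):
    user_effect = rng.gauss(0.5, 1.0)
    users.append([user_effect + rng.gauss(0, 0.1) for _ in range(25)])
lo, hi = cluster_bootstrap_ci(users)
print(f"cluster-robust 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Treating the 1,000 sessions as i.i.d. would give a much narrower, overconfident interval here, because with strong within-user correlation the effective sample size is closer to the 40 users than to the 1,000 sessions.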
Interval construction method matters for heavy-tailed metrics. Normal or t-based intervals are efficient for means and proportions with large samples, but they mislead for watch time, revenue, or tail latencies like p95 or p99. Bootstrap percentile intervals are robust to distribution shape and handle complex statistics like ratios, but they are computationally expensive at petabyte scale and can be biased in small samples. For ratio metrics like revenue per session, naive intervals that treat numerator and denominator as independent are wrong; use the delta method or the bootstrap to account for their covariance.
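A delta-method interval for a ratio-of-sums metric fits in a few lines. The sketch below assumes per-user totals of revenue and sessions (the data and function name are hypothetical); the key point is that the variance of the ratio of sample means uses both variances and their covariance:

```python
import math
import random

def delta_ratio_ci(revenue, sessions, z=1.96):
    """95% CI for sum(revenue)/sum(sessions) across users. The delta
    method approximates Var(X_bar / Y_bar) using Var(X), Var(Y), and
    Cov(X, Y); dropping the covariance term is the 'naive' mistake."""
    n = len(revenue)
    mx = sum(revenue) / n
    my = sum(sessions) / n
    r = mx / my
    vx = sum((x - mx) ** 2 for x in revenue) / (n - 1)
    vy = sum((y - my) ** 2 for y in sessions) / (n - 1)
    cxy = sum((x - mx) * (y - my)
              for x, y in zip(revenue, sessions)) / (n - 1)
    # Delta-method variance of the ratio of sample means.
    var_r = (vx - 2 * r * cxy + r * r * vy) / (n * my * my)
    se = math.sqrt(var_r)
    return r - z * se, r + z * se

# Hypothetical users: revenue is roughly 0.5 per session plus noise, so
# numerator and denominator are strongly positively correlated.
rng = random.Random(3)
sessions = [rng.randint(1, 10) for _ in range(5000)]
revenue = [0.5 * s + rng.gauss(0, 0.5) for s in sessions]
lo, hi = delta_ratio_ci(revenue, sessions)
print(f"revenue/session 95% CI: [{lo:.4f}, {hi:.4f}]")
```

Note that `vx - 2*r*cxy + r*r*vy` equals the variance of the residual `X - r*Y`, which is why the covariance term cannot be dropped when numerator and denominator move together.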
Minimum detectable effect versus test duration is the final major trade-off. Targeting very small effects, such as a 0.2 percent relative CTR lift, requires massive samples or many days of traffic. Companies often choose to test only changes with predicted impact above a threshold, or accumulate evidence across multiple similar launches via meta-analysis, rather than powering each individual test to detect tiny effects.
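The sample-size arithmetic behind this trade-off is standard. A sketch using the normal-approximation formula for comparing two proportions (the baseline CTR and lift values are hypothetical):

```python
import math
from statistics import NormalDist

def users_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for detecting a relative
    lift on a baseline proportion p_base:
    n = 2 * (z_{alpha/2} + z_power)^2 * p_bar*(1 - p_bar) / delta^2"""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = p_base * rel_lift      # absolute effect size
    p_bar = p_base + delta / 2     # pooled proportion
    n = 2 * (z_alpha + z_power) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return math.ceil(n)

# A 10x smaller relative lift needs ~100x more users (inverse square).
n_2pct = users_per_arm(p_base=0.05, rel_lift=0.02)    # 2% relative lift
n_02pct = users_per_arm(p_base=0.05, rel_lift=0.002)  # 0.2% relative lift
print(n_2pct, n_02pct, round(n_02pct / n_2pct))
```

Because n scales with 1/delta², halving the minimum detectable effect quadruples the required sample, which is what drives the "only test changes above a predicted-impact threshold" policy.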
💡 Key Takeaways
• Peeking hourly at a weekly test and stopping when p < 0.05 inflates the false-positive rate far beyond 5 percent; use alpha spending or always-valid tests for continuous monitoring
• User-level randomization gives clean independence for CTR, but session-level randomization needs cluster-robust standard errors to avoid inflated significance from correlated observations
• Marketplace experiments face unavoidable interference: Uber switchbacks randomize an entire city by 15-minute slots, trading unbiasedness for practicality, with residual bias widening intervals
• Bootstrap intervals handle heavy-tailed revenue or p95 latency robustly but cost 10x to 100x more computation than parametric intervals at petabyte scale
• Ratio metrics like revenue per session require the delta method or bootstrap; naive intervals treating numerator and denominator as independent produce wrong coverage
• Detecting a 0.2 percent relative CTR lift needs 100x more users than a 2 percent lift (sample size scales as the inverse square of the effect); consider business thresholds or meta-analysis instead of testing tiny effects
📌 Examples
Meta experiment with hourly peeks: observed false-positive rate jumps to 15 percent instead of 5 percent; switching to a group-sequential design with 3 planned looks restores valid alpha control
Netflix latency test: randomizing by request gives 100x more samples than user randomization and enables a 4-hour test instead of 2 days, but requires standard errors clustered at the user level
Uber marketplace: a switchback design with 15-minute slots reduces driver-supply interference by 60 percent compared to user randomization, but residual bias widens confidence intervals by 20 percent
Google revenue per search: using the delta method for the ratio metric produces a 95 percent CI of [1.2, 1.8] cents; the naive approach incorrectly gives [1.5, 2.1] cents with overcoverage