
Experimentation at Scale: Randomization, Metrics, and Variance Reduction

At Meta and Google scale, A/B tests process tens of billions of events daily. The foundation is clean randomization using a deterministic hash of the user ID, ensuring assignment is consistent across sessions and devices. Events like impressions, clicks, and watch time stream into logging pipelines at millions of events per second. A streaming aggregator computes rolling metrics with 5- to 60-second end-to-end latency for operational dashboards, while a batch job recomputes them with stronger correctness guarantees on hourly or daily windows.

Product teams specify primary metrics (for example, Click-Through Rate) and guardrails (crash rate, p95 latency, revenue impact). Suppose baseline CTR is 2.00 percent and you expect a 5 percent relative lift to 2.10 percent. With 20 million daily active users and 1.6 million users needed per arm, you reach sufficient sample size in a few hours. For rare events like a 0.05 percent purchase rate, you may need tens of millions of users and multiple days.

Variance reduction techniques dramatically shorten test duration. CUPED (Controlled-experiment Using Pre-Experiment Data) subtracts a regression on pre-experiment behavior, reducing variance by 10 to 40 percent; this can turn a 14-day test into a 9-day test. Uber uses switchback experiments in marketplaces, randomizing entire cities by 15-minute time slots (treatment, then control, then treatment) to combat interference where driver supply affects both arms. These patterns handle real-world complexity while maintaining statistical validity.

The analysis pipeline must handle high throughput: streaming aggregators process hundreds of thousands to millions of events per second. Store per-arm, per-user aggregates to support bootstrap resampling without reprocessing raw logs, and version metric definitions and randomization seeds for reproducibility. Automated Sample Ratio Mismatch (SRM) checks using chi-squared tests alarm when observed traffic splits deviate from expected (for example, 53/47 instead of 50/50), indicating bucketing bugs or filtering issues. Minimal sketches of these building blocks follow below.
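As a concrete illustration of hash-based assignment, here is a minimal Python sketch (the function name, experiment ID, and bucket count are hypothetical): hashing the user ID together with an experiment-specific salt keeps each user in the same arm across sessions and devices, and keeps assignments independent across experiments.

```python
import hashlib

def assign_arm(user_id: str, experiment_id: str, arms=("control", "treatment")) -> str:
    """Deterministically bucket a user into an experiment arm.

    Hashing user_id together with an experiment-specific salt keeps the
    assignment stable across sessions and devices, and independent across
    experiments that use different salts.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000
    # 50/50 split; adjust the threshold for unequal allocations
    return arms[0] if bucket < 500 else arms[1]

# The same user always lands in the same arm for a given experiment
assert assign_arm("user_42", "feed_ranking_v3") == assign_arm("user_42", "feed_ranking_v3")
```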
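The per-arm sample size for the CTR example can be sanity-checked with the standard two-proportion power formula. This is a sketch using only Python's standard library; the exact figure depends on the assumed power, significance level, and metric variance, so it will not necessarily match the per-arm number quoted above.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(p_base: float, rel_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sided, two-proportion sample size per arm (normal approximation)."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    delta = p_treat - p_base
    return ceil((z_alpha + z_beta) ** 2 * var / delta ** 2)

# Baseline CTR 2.00% with a 5% relative lift to 2.10%
print(users_per_arm(0.02, 0.05))      # ~315,000 per arm at 80% power, alpha = 0.05
# Rare event: a 0.05% purchase rate pushes the requirement into the tens of millions
print(users_per_arm(0.0005, 0.05))    # ~13 million per arm
```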
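A minimal sketch of the CUPED adjustment, assuming a single pre-experiment covariate such as pre-period watch time (the simulated numbers are illustrative only): the slope theta is estimated by ordinary least squares, and the adjusted metric keeps the same expectation while shedding the variance explained by pre-period behavior.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: remove the part of the metric explained by pre-experiment behavior.

    theta is the OLS slope of y on the pre-period covariate; the adjusted metric
    y - theta * (x_pre - mean(x_pre)) has the same expectation as y but lower
    variance whenever x_pre is correlated with y.
    """
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Simulated example: pre-period behavior explains much of the in-experiment metric
rng = np.random.default_rng(0)
x_pre = rng.normal(10, 3, size=100_000)           # e.g. pre-period watch time
y = 0.8 * x_pre + rng.normal(0, 2, size=100_000)  # in-experiment metric
y_adj = cuped_adjust(y, x_pre)
print(1 - y_adj.var() / y.var())  # fraction of variance removed (~0.6 in this simulation)
```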
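For switchback designs, a sketch of slot-level assignment in the same spirit (the city name, experiment ID, and 15-minute slot length are assumptions for illustration): the whole city is assigned to one arm per time slot, so supply-side interference within a slot is contained.

```python
import hashlib
from datetime import datetime, timezone

SLOT_MINUTES = 15

def switchback_arm(city: str, ts: datetime, experiment_id: str = "marketplace_test_v1") -> str:
    """Assign an entire city to one arm for each 15-minute slot.

    Every rider and driver in the city sees the same condition during a slot,
    limiting interference from shared supply; slots are randomized by hashing
    (experiment, city, slot index).
    """
    slot_index = int(ts.timestamp()) // (SLOT_MINUTES * 60)
    key = f"{experiment_id}:{city}:{slot_index}".encode("utf-8")
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return "treatment" if digest % 2 == 0 else "control"

now = datetime(2024, 6, 1, 12, 7, tzinfo=timezone.utc)
print(switchback_arm("san_francisco", now))
```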
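A sketch of why per-arm, per-user aggregates are worth storing: a percentile bootstrap over those aggregates yields a confidence interval for the relative lift without re-reading raw event logs (the simulated per-user CTRs below are placeholders).

```python
import numpy as np

def bootstrap_lift_ci(control: np.ndarray, treatment: np.ndarray,
                      n_boot: int = 2000, ci: float = 0.95, seed: int = 0):
    """Percentile bootstrap CI for the relative lift in the mean of a per-user metric.

    Operates on per-arm, per-user aggregates (one value per user), so raw
    event logs never need to be reprocessed.
    """
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=control.size, replace=True).mean()
        t = rng.choice(treatment, size=treatment.size, replace=True).mean()
        lifts[i] = t / c - 1.0
    lo, hi = np.percentile(lifts, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return lo, hi

# Hypothetical per-user CTR aggregates for each arm (50 impressions per user)
rng = np.random.default_rng(1)
control = rng.binomial(50, 0.020, size=20_000) / 50
treatment = rng.binomial(50, 0.021, size=20_000) / 50
print(bootstrap_lift_ci(control, treatment))
```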
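Finally, a sketch of an automated SRM check, assuming SciPy is available: a chi-squared goodness-of-fit test against the designed split with a strict alarm threshold (p < 0.001 here) catches splits like 53/47 that should have been 50/50.

```python
from scipy.stats import chisquare

def check_srm(control_n: int, treatment_n: int,
              expected_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Flag Sample Ratio Mismatch: is the observed split consistent with the design?"""
    total = control_n + treatment_n
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p_value = chisquare([control_n, treatment_n], f_exp=expected)
    return p_value < alpha  # True means alarm: likely a bucketing or filtering bug

# A 53/47 split on 1M users is wildly inconsistent with a 50/50 design
print(check_srm(530_000, 470_000))  # True -> alarm
```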
💡 Key Takeaways
Use a deterministic hash of the user ID for assignment, ensuring consistency across sessions and devices to avoid selection bias
At Meta and Google scale, experiments process tens of billions of events daily with streaming latency of 5 to 60 seconds for operational visibility
CUPED variance reduction subtracts a regression on pre-period behavior, cutting variance by 10 to 40 percent and turning a 14-day test into 9 days
Uber switchback experiments randomize entire cities by 15-minute time slots to handle marketplace interference where driver supply cross-contaminates control and treatment
Sample Ratio Mismatch (SRM) checks automatically alarm when traffic splits deviate (for example, 53/47 instead of 50/50), indicating bucketing bugs or biased filtering
Store per arm per user aggregates to enable bootstrap resampling without reprocessing petabytes of raw event logs
📌 Examples
Meta feed ranking: 20M daily active users, 1.6M per arm for 5 percent CTR lift, test completes in hours with streaming metrics updated every 30 seconds
Netflix recommendation: CUPED on pre-period watch time reduces variance by 35 percent, shortening the test from 10 days to 6 days while maintaining 80 percent power
Uber marketplace: switchback design alternates entire city between control and treatment every 15 minutes to avoid driver supply contamination across arms
Google Search: chi-squared SRM check with p less than 0.001 triggers an alarm when observing a 52/48 traffic split instead of the expected 50/50, pausing the experiment until the bug is fixed