Experimentation at Scale: Randomization, Metrics, and Variance Reduction

RANDOMIZATION MECHANICS

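Assign users with a deterministic hash: compute bucket = hash(user_id + salt) mod 100, then map bucket ranges to arms. Because the bucket depends only on the user ID and the salt, a user gets the same assignment across sessions and devices, provided the same user_id is available on each. Each experiment uses its own salt so that assignments in overlapping experiments are decorrelated. Without deterministic assignment, the same user might flip between treatment and control, contaminating both arms.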
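The assignment rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the choice of SHA-256, the ":" separator between salt and user ID, and the 50% treatment share are all assumptions made for the example.

```python
import hashlib


def assign_bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    # Hypothetical key layout: salt and user_id joined with ":" so that
    # ("ab", "c") and ("a", "bc") cannot collide.
    key = f"{salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce mod buckets.
    return int.from_bytes(digest[:8], "big") % buckets


def assign_arm(user_id: str, salt: str, treatment_pct: int = 50) -> str:
    """Buckets below treatment_pct go to treatment, the rest to control."""
    return "treatment" if assign_bucket(user_id, salt) < treatment_pct else "control"


# Same user and salt always give the same arm; a different salt reshuffles users.
arm_first_visit = assign_arm("user-123", salt="exp-checkout-v2")
arm_second_visit = assign_arm("user-123", salt="exp-checkout-v2")
```

A cryptographic hash is used here only because it is uniformly distributed and available in the standard library; Python's built-in hash() is randomized per process for strings and would not give stable assignments.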

CUPED: VARIANCE REDUCTION

CUPED (Controlled-experiment Using Pre-Experiment Data) reduces metric variance by 10-40% by regressing the in-experiment metric on the same user's pre-period behavior and subtracting the predicted component. If a user had high engagement before the experiment, we expect high engagement during it, regardless of treatment. Adjusting for pre-period behavior removes this source of variance. Because required sample size scales with variance, a reduction at the upper end of that range can turn a 14-day test into a 9-10 day test with the same power.
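A minimal sketch of the adjustment, assuming one pre-period value and one in-experiment value per user. The function name and the list-based inputs are illustrative. The coefficient theta is the ordinary regression slope cov(pre, post) / var(pre), and it should be estimated on users from both arms pooled together so the adjustment itself does not depend on treatment.

```python
from statistics import fmean


def cuped_adjust(pre: list[float], post: list[float]) -> tuple[list[float], float]:
    """Return CUPED-adjusted metric values and the coefficient theta.

    pre[i] and post[i] are the pre-period and in-experiment values for user i.
    """
    mean_pre = fmean(pre)
    mean_post = fmean(post)
    cov = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    var_pre = sum((x - mean_pre) ** 2 for x in pre)
    # If the pre-period metric is constant there is nothing to adjust for.
    theta = cov / var_pre if var_pre > 0 else 0.0
    # Subtract the part of each value that the pre-period explains.
    # Centering on mean_pre keeps the overall mean of the metric unchanged.
    adjusted = [y - theta * (x - mean_pre) for x, y in zip(pre, post)]
    return adjusted, theta
```

After adjustment, compare arm means of the adjusted values exactly as you would the raw metric. The fraction of variance removed equals the squared correlation between the pre-period and in-experiment values, so the gain depends on how predictable the metric is from past behavior.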

SAMPLE RATIO MISMATCH (SRM)

SRM checks verify that the observed traffic split matches the configured one. If you expect 50/50 but observe 53/47 at production traffic volumes, something is wrong: client bugs dropping events, biased filtering, or assignment bugs. Run a chi-squared goodness-of-fit test on the per-arm user counts and alarm when p < 0.001. A confirmed SRM invalidates all results, because the observed difference may come from selection bias rather than the treatment effect.
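The check can be sketched for the two-arm case using only the standard library. With two arms the chi-squared statistic has one degree of freedom, and its tail probability can be written with the complementary error function. The function name and the example counts are illustrative; the 0.001 threshold is the one given above.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-squared goodness-of-fit p-value for a two-arm traffic split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALPHA = 0.001

# The same 53/47 ratio is unremarkable at 100 users but alarming at 10,000.
small_sample_alarm = srm_p_value(47, 53) < SRM_ALPHA
large_sample_alarm = srm_p_value(4_700, 5_300) < SRM_ALPHA
```

The strict threshold keeps false alarms rare when the check runs automatically on many experiments, while real SRM at production scale typically produces p-values many orders of magnitude below it.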

⚠️ Key Trade-off: Streaming metrics (30-60 second latency) provide operational visibility, but they should not be read as early evidence of statistical significance. For the statistical analysis, store per-user aggregates so that bootstrap resampling does not require reprocessing petabytes of raw events.
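To illustrate why per-user aggregates are enough, here is a sketch of a percentile bootstrap for the difference in CTR between arms. It assumes each user is stored as a hypothetical (clicks, impressions) pair with at least one impression; the function name, the 95% interval, and the resample count are choices made for the example.

```python
import random

UserAggregate = tuple[int, int]  # (clicks, impressions) for one user


def bootstrap_ctr_diff(control: list[UserAggregate],
                       treatment: list[UserAggregate],
                       n_resamples: int = 1000,
                       seed: int = 0) -> tuple[float, float]:
    """Approximate 95% percentile interval for CTR(treatment) - CTR(control)."""
    rng = random.Random(seed)

    def resampled_ctr(users: list[UserAggregate]) -> float:
        # Resample whole users with replacement: the user, not the event,
        # is the unit of randomization.
        sample = rng.choices(users, k=len(users))
        clicks = sum(c for c, _ in sample)
        impressions = sum(i for _, i in sample)
        return clicks / impressions

    diffs = sorted(resampled_ctr(treatment) - resampled_ctr(control)
                   for _ in range(n_resamples))
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper
```

Each resample touches one small record per user rather than every raw event, which is what makes repeated resampling affordable at this scale.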

SCALE CONSIDERATIONS

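Large-scale experiments process tens of billions of events daily. Stream aggregation pipelines compute running statistics, while per-arm, per-user aggregates are stored for efficient bootstrap resampling. With 20M daily users, an experiment that needs 1.6M users per arm to detect a 5% relative CTR lift at 80% power can collect its sample in hours. The exact requirement depends on the baseline CTR.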
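The per-arm figure can be sanity-checked with the standard two-proportion sample size formula. The sketch below treats the metric as a per-user binary outcome (clicked or not) and uses a two-sided test at alpha = 0.05. The text does not state a baseline rate, so the 0.4% used here is an assumption, chosen because it yields a requirement close to 1.6M users per arm.

```python
import math
from statistics import NormalDist


def users_per_arm(baseline_rate: float, relative_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect a relative lift in a binary rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Assumed baseline of 0.4%; a higher baseline needs far fewer users.
n_low_baseline = users_per_arm(baseline_rate=0.004, relative_lift=0.05)
n_high_baseline = users_per_arm(baseline_rate=0.04, relative_lift=0.05)
```

Because the requirement scales roughly with 1 / baseline_rate for small rates, the time to reach 80% power can range from hours to weeks for the same relative lift.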

💡 Key Takeaways
Deterministic hashing: hash(user_id + salt) mod 100 ensures consistent assignment across sessions and devices
CUPED reduces variance 10-40% by adjusting for pre-period behavior, which can shorten a test from 14 days to 9-10 days at the same power
SRM (Sample Ratio Mismatch) checks verify traffic splits; a statistically significant mismatch, such as 53/47 instead of 50/50 at scale, invalidates all results
Store per-user aggregates for efficient bootstrap without reprocessing petabytes of raw events
📌 Interview Tips
1. Explain why deterministic hashing matters: without it, users flip between arms, contaminating both
2. Describe CUPED: regressing the metric on pre-period behavior and subtracting the predicted part removes variance, shortening experiments by up to roughly 30-40%
3. Mention SRM as an automatic quality check: a chi-squared test with p < 0.001 triggers an alarm and pauses the experiment