
Experimentation at Scale: Randomization, Metrics, and Variance Reduction

At Meta and Google scale, A/B tests process tens of billions of events daily. The foundation is clean randomization using a deterministic hash of the user ID, ensuring assignment is consistent across sessions and devices. Events like impressions, clicks, and watch time stream into logging pipelines at millions of events per second. A streaming aggregator computes rolling metrics with 5- to 60-second end-to-end latency for operational dashboards, while a batch job recomputes them with stronger correctness guarantees on hourly or daily windows.

Product teams specify primary metrics (for example, Click-Through Rate) and guardrails (crash rate, p95 latency, revenue impact). Suppose baseline CTR is 2.00 percent and you expect a 5 percent relative lift to 2.10 percent. With 20 million daily active users and 1.6 million users needed per arm, you reach sufficient sample size in a few hours. For rare events like a 0.05 percent purchase rate, you may need tens of millions of users and multiple days.

Variance reduction techniques dramatically shorten test duration. CUPED (Controlled-experiment Using Pre-Experiment Data) subtracts a regression on pre-experiment behavior, reducing variance by 10 to 40 percent; this can turn a 14-day test into a 9-day test. Uber uses switchback experiments in marketplaces, randomizing entire cities by 15-minute time slots (treatment, then control, then treatment) to combat interference where driver supply affects both arms. These patterns handle real-world complexity while maintaining statistical validity.

The analysis pipeline must handle high throughput: streaming aggregators process hundreds of thousands to millions of events per second. Store per-arm, per-user aggregates to support bootstrap resampling without reprocessing raw logs, and version metric definitions and randomization seeds for reproducibility. Automated Sample Ratio Mismatch (SRM) checks using chi-squared tests alarm when observed traffic splits deviate from expected (for example, 53/47 instead of 50/50), indicating bucketing bugs or filtering issues. Minimal sketches of these building blocks follow below.
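As a concrete illustration of hash-based assignment, here is a minimal Python sketch (the function name, experiment ID, and bucket count are hypothetical): hashing the user ID together with an experiment-specific salt keeps each user in the same arm across sessions and devices, and keeps assignments independent across experiments.

```python
import hashlib

def assign_arm(user_id: str, experiment_id: str, arms=("control", "treatment")) -> str:
    """Deterministically bucket a user into an experiment arm.

    Hashing user_id together with an experiment-specific salt keeps the
    assignment stable across sessions and devices, and independent across
    experiments that use different salts.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000
    # 50/50 split; adjust the threshold for unequal allocations
    return arms[0] if bucket < 500 else arms[1]

# The same user always lands in the same arm for a given experiment
assert assign_arm("user_42", "feed_ranking_v3") == assign_arm("user_42", "feed_ranking_v3")
```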
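The per-arm sample size for the CTR example can be sanity-checked with the standard two-proportion power formula. This is a sketch using only Python's standard library; the exact figure depends on the assumed power, significance level, and metric variance, so it will not necessarily match the per-arm number quoted above.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(p_base: float, rel_lift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sided, two-proportion sample size per arm (normal approximation)."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    delta = p_treat - p_base
    return ceil((z_alpha + z_beta) ** 2 * var / delta ** 2)

# Baseline CTR 2.00% with a 5% relative lift to 2.10%
print(users_per_arm(0.02, 0.05))      # ~315,000 per arm at 80% power, alpha = 0.05
# Rare event: a 0.05% purchase rate pushes the requirement into the tens of millions
print(users_per_arm(0.0005, 0.05))    # ~13 million per arm
```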
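A minimal sketch of the CUPED adjustment, assuming a single pre-experiment covariate such as pre-period watch time (the simulated numbers are illustrative only): the slope theta is estimated by ordinary least squares, and the adjusted metric keeps the same expectation while shedding the variance explained by pre-period behavior.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: remove the part of the metric explained by pre-experiment behavior.

    theta is the OLS slope of y on the pre-period covariate; the adjusted metric
    y - theta * (x_pre - mean(x_pre)) has the same expectation as y but lower
    variance whenever x_pre is correlated with y.
    """
    theta = np.cov(x_pre, y, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Simulated example: pre-period behavior explains much of the in-experiment metric
rng = np.random.default_rng(0)
x_pre = rng.normal(10, 3, size=100_000)           # e.g. pre-period watch time
y = 0.8 * x_pre + rng.normal(0, 2, size=100_000)  # in-experiment metric
y_adj = cuped_adjust(y, x_pre)
print(1 - y_adj.var() / y.var())  # fraction of variance removed (~0.6 in this simulation)
```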
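For switchback designs, a sketch of slot-level assignment in the same spirit (the city name, experiment ID, and 15-minute slot length are assumptions for illustration): the whole city is assigned to one arm per time slot, so supply-side interference within a slot is contained.

```python
import hashlib
from datetime import datetime, timezone

SLOT_MINUTES = 15

def switchback_arm(city: str, ts: datetime, experiment_id: str = "marketplace_test_v1") -> str:
    """Assign an entire city to one arm for each 15-minute slot.

    Every rider and driver in the city sees the same condition during a slot,
    limiting interference from shared supply; slots are randomized by hashing
    (experiment, city, slot index).
    """
    slot_index = int(ts.timestamp()) // (SLOT_MINUTES * 60)
    key = f"{experiment_id}:{city}:{slot_index}".encode("utf-8")
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return "treatment" if digest % 2 == 0 else "control"

now = datetime(2024, 6, 1, 12, 7, tzinfo=timezone.utc)
print(switchback_arm("san_francisco", now))
```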
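A sketch of why per-arm, per-user aggregates are worth storing: a percentile bootstrap over those aggregates yields a confidence interval for the relative lift without re-reading raw event logs (the simulated per-user CTRs below are placeholders).

```python
import numpy as np

def bootstrap_lift_ci(control: np.ndarray, treatment: np.ndarray,
                      n_boot: int = 2000, ci: float = 0.95, seed: int = 0):
    """Percentile bootstrap CI for the relative lift in the mean of a per-user metric.

    Operates on per-arm, per-user aggregates (one value per user), so raw
    event logs never need to be reprocessed.
    """
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=control.size, replace=True).mean()
        t = rng.choice(treatment, size=treatment.size, replace=True).mean()
        lifts[i] = t / c - 1.0
    lo, hi = np.percentile(lifts, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return lo, hi

# Hypothetical per-user CTR aggregates for each arm (50 impressions per user)
rng = np.random.default_rng(1)
control = rng.binomial(50, 0.020, size=20_000) / 50
treatment = rng.binomial(50, 0.021, size=20_000) / 50
print(bootstrap_lift_ci(control, treatment))
```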
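Finally, a sketch of an automated SRM check, assuming SciPy is available: a chi-squared goodness-of-fit test against the designed split with a strict alarm threshold (p < 0.001 here) catches splits like 53/47 that should have been 50/50.

```python
from scipy.stats import chisquare

def check_srm(control_n: int, treatment_n: int,
              expected_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Flag Sample Ratio Mismatch: is the observed split consistent with the design?"""
    total = control_n + treatment_n
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p_value = chisquare([control_n, treatment_n], f_exp=expected)
    return p_value < alpha  # True means alarm: likely a bucketing or filtering bug

# A 53/47 split on 1M users is wildly inconsistent with a 50/50 design
print(check_srm(530_000, 470_000))  # True -> alarm
```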
💡 Key Takeaways
Use a deterministic hash of the user ID for assignment, ensuring consistency across sessions and devices to avoid selection bias
At Meta and Google scale, experiments process tens of billions of events daily with streaming latency of 5 to 60 seconds for operational visibility
CUPED variance reduction subtracts a regression on pre-period behavior, cutting variance by 10 to 40 percent and turning a 14-day test into 9 days
Uber switchback experiments randomize entire cities by 15-minute time slots to handle marketplace interference where driver supply cross-contaminates control and treatment
Sample Ratio Mismatch (SRM) checks automatically alarm when traffic splits deviate (for example, 53/47 instead of 50/50), indicating bucketing bugs or biased filtering
Store per arm per user aggregates to enable bootstrap resampling without reprocessing petabytes of raw event logs
📌 Examples
Meta feed ranking: 20M daily active users, 1.6M per arm for 5 percent CTR lift, test completes in hours with streaming metrics updated every 30 seconds
Netflix recommendation: CUPED on pre-period watch time reduces variance by 35 percent, shortening the test from 10 days to 6 days while maintaining 80 percent power
Uber marketplace: switchback design alternates entire city between control and treatment every 15 minutes to avoid driver supply contamination across arms
Google Search: chi-squared SRM check with p less than 0.001 triggers an alarm when observing a 52/48 traffic split instead of the expected 50/50, pausing the experiment until the bug is fixed