Experimentation at Scale: Randomization, Metrics, and Variance Reduction
RANDOMIZATION MECHANICS
Use deterministic hashing for assignment: hash(user_id + salt) mod 100 places each user into one of 100 stable buckets, which are then mapped to arms. Because assignment depends only on the user ID and the salt, a user lands in the same arm on every session, and on every device that shares the same ID. The salt changes per experiment so that overlapping experiments are decorrelated: the same user falls into independent buckets in each. Without deterministic hashing, the same user might flip between treatment and control across sessions, contaminating both arms.
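A minimal sketch of the scheme above. The text does not name a hash function, so SHA-256 is assumed here; `assign_arm` and the `user_id:salt` concatenation format are illustrative choices, not a prescribed API.

```python
import hashlib

def assign_arm(user_id: str, salt: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a user via hash(user_id + salt) mod 100.

    SHA-256 is an assumption; any uniform hash works. Buckets below
    treatment_pct map to treatment, the rest to control.
    """
    digest = hashlib.sha256(f"{user_id}:{salt}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# Same user + same salt -> same arm on every call (consistent across sessions).
assert assign_arm("user_42", "exp_checkout") == assign_arm("user_42", "exp_checkout")

# A per-experiment salt re-shuffles buckets, decorrelating overlapping experiments.
arms_a = [assign_arm(f"u{i}", "exp_a") for i in range(10_000)]
arms_b = [assign_arm(f"u{i}", "exp_b") for i in range(10_000)]
```

With a uniform hash the split concentrates tightly around 50/50, and the two salts produce (near-)independent assignments for the same user population.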
CUPED: VARIANCE REDUCTION
CUPED (Controlled-experiment Using Pre-Experiment Data) typically reduces metric variance by 10-40% by subtracting the component of the in-experiment metric that is predictable from pre-period behavior: Y_adj = Y - theta * (X - mean(X)), where X is the same metric measured before the experiment and theta is the regression coefficient of Y on X. If a user had high engagement before the experiment, we expect high engagement during it, regardless of treatment; adjusting for the pre-period removes this between-user variance. Because required runtime scales with variance at fixed power, a ~30% reduction turns a 14-day test into roughly 9-10 days with the same power.
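A sketch of the CUPED adjustment on synthetic data, assuming X is per-user pre-period engagement and Y the in-experiment value of the same metric; the helper name and the synthetic numbers are illustrative only.

```python
import random
import statistics

def cuped_adjust(y, x):
    """Return y adjusted by the pre-period covariate x.

    theta = cov(x, y) / var(x), the OLS slope of y on x;
    the adjustment subtracts theta * (x_i - mean(x)) per user.
    """
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y)
                 for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

# Synthetic users whose in-experiment engagement tracks pre-period engagement.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5_000)]
during = [0.8 * p + rng.gauss(0, 1) for p in pre]

adjusted = cuped_adjust(during, pre)
# The mean is preserved, so treatment-effect estimates are unbiased,
# while variance drops sharply because pre-period behavior explains most of it.
```

In this construction most of the variance in `during` comes from persistent per-user differences, so the adjusted metric's variance falls well below the raw metric's, which is exactly the mechanism behind the 10-40% reductions cited above.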
SAMPLE RATIO MISMATCH (SRM)
SRM checks verify that observed traffic splits match the configured allocation. If you expect 50/50 but observe 53/47, something is wrong: client bugs dropping events, biased filtering, or assignment bugs. Use a chi-squared goodness-of-fit test against the expected split and alarm at p < 0.001. SRM invalidates all results from the experiment, because the observed difference may come from selection bias rather than the treatment effect.
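The check above can be sketched as follows; `srm_check` is a hypothetical helper. With two arms there is one degree of freedom, so the chi-squared survival function reduces to erfc(sqrt(chi2 / 2)) and the standard library suffices.

```python
import math

def srm_check(n_treatment: int, n_control: int,
              expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Chi-squared goodness-of-fit test for sample ratio mismatch.

    Returns True (alarm) when the observed split deviates from the
    expected split at p < alpha, matching the p < 0.001 threshold.
    """
    total = n_treatment + n_control
    exp_t = total * expected_ratio
    exp_c = total - exp_t
    chi2 = ((n_treatment - exp_t) ** 2 / exp_t
            + (n_control - exp_c) ** 2 / exp_c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function, 1 d.o.f.
    return p_value < alpha

# The 53/47 example at 100k users: wildly improbable under a true 50/50 split.
assert srm_check(53_000, 47_000) is True
# A split within ordinary sampling noise does not alarm.
assert srm_check(50_100, 49_900) is False
```

Note how sensitive the test is at scale: a 3-point imbalance over 100k users yields a chi-squared statistic in the hundreds, while a 0.2-point imbalance is indistinguishable from noise.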
SCALE CONSIDERATIONS
Large-scale experimentation platforms process tens of billions of events daily. Streaming aggregation pipelines compute running statistics, and storing per-arm, per-user aggregates keeps bootstrap resampling cheap. With 20M daily users, roughly 1.6M users per arm can reach 80% power for a 5% relative CTR lift (the exact figure depends on the baseline click-through rate), so significance can be reached within hours.
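The per-arm sample size claim depends on the baseline CTR, which the text does not state. A sketch of the standard two-proportion sample-size formula, using a hypothetical 2% baseline for illustration:

```python
import math
from statistics import NormalDist

def n_per_arm(p_base: float, rel_lift: float,
              alpha: float = 0.05, power: float = 0.8) -> int:
    """Users per arm for a two-sided two-proportion z-test.

    n = (z_{alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p2 = p_base * (1 + rel_lift)
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = z.inv_cdf(power)           # ~0.84 for 80% power
    var_sum = p_base * (1 - p_base) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var_sum / (p2 - p_base) ** 2)

# Hypothetical 2% baseline, 5% relative lift: a few hundred thousand per arm.
# Required n grows roughly as (1 - p) / p at fixed relative lift, which is
# how multi-million per-arm figures arise at very low click-through rates.
print(n_per_arm(0.02, 0.05))
```

The inverse relationship between baseline rate and required n is worth internalizing: halving the baseline CTR roughly doubles the sample needed to detect the same relative lift.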