A/B Testing & Experimentation • Experiment Design (Randomization, Stratification, Power Analysis) • Medium • ⏱️ ~3 min
How Does Stratification Reduce Variance in Experiments?
Stratification, also called blocking, partitions experiment units into homogeneous groups using pre-experiment covariates like device type, country, language, or baseline engagement level. Within each stratum, units are randomly assigned to control or treatment with fixed allocation ratios (typically 50/50). This ensures balance on important covariates and reduces variance by isolating heterogeneity between strata rather than letting it inflate within-group noise.
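The within-stratum assignment described above can be sketched as a deterministic hash over the experiment, stratum, and unit ID. This is a minimal illustration, not any specific company's assignment service; the function and parameter names are hypothetical.

```python
import hashlib

def assign(unit_id: str, stratum: str, experiment: str) -> str:
    """Deterministically assign a unit to treatment or control within its stratum.

    Hashing on (experiment, stratum, unit_id) keeps assignment stable across
    repeated requests, while including the stratum key means each stratum is
    split 50/50 independently of the others.
    """
    key = f"{experiment}:{stratum}:{unit_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"
```

Calling `assign("user_123", "mobile:US", "ranker_v2")` always returns the same arm for the same inputs, and over many units each stratum converges to the fixed 50/50 split.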
The variance reduction translates directly into a lower minimum detectable effect (MDE) at a fixed sample size, or equivalently a shorter experiment at a fixed MDE. In production ML systems at Netflix and Airbnb, stratification by device type and country commonly reduces MDE by 10 to 30 percent, depending on how much the outcome varies across strata. For example, mobile users might convert at 1.5 percent while desktop users convert at 3.2 percent. Without stratification, this heterogeneity inflates the pooled variance. With stratification, the analysis compares mobile treatment to mobile control and desktop treatment to desktop control, then aggregates the per-stratum effects using inverse-variance weighting.
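The inverse-variance aggregation step can be written out directly for conversion-rate outcomes. This is a sketch under the usual normal-approximation variance for a difference in proportions; the input numbers below are illustrative, echoing the mobile/desktop example.

```python
import numpy as np

def stratified_effect(strata):
    """Combine per-stratum treatment effects with inverse-variance weights.

    `strata` is a list of (p_treat, n_treat, p_ctrl, n_ctrl) tuples, one per
    stratum. Each stratum contributes its difference in conversion rates,
    weighted by the inverse of that difference's estimated variance.
    """
    effects, weights = [], []
    for p_t, n_t, p_c, n_c in strata:
        diff = p_t - p_c
        # Variance of a difference in proportions (normal approximation).
        var = p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c
        effects.append(diff)
        weights.append(1.0 / var)
    effects, weights = np.array(effects), np.array(weights)
    est = float(np.sum(weights * effects) / np.sum(weights))
    se = float(np.sqrt(1.0 / np.sum(weights)))
    return est, se
```

With a mobile stratum at roughly 1.5 percent conversion and a desktop stratum at roughly 3.2 percent, the lower-variance mobile stratum receives the larger weight, so the pooled estimate is pulled toward the more precisely measured effect.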
Implementation requires a fast lookup cache that maps unit IDs to stratum assignments before randomization. The assignment service checks the stratum, computes the hash within that stratum bucket, and enforces the allocation ratio. Keep stratum count manageable; splitting into more than 20 to 50 strata risks creating underpowered slices where the smallest stratum lacks sufficient traffic. A practical rule is each stratum should contain at least 1,000 to 5,000 units to maintain reasonable power.
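The stratum-count and minimum-size rules of thumb above can be enforced with a simple pre-launch check. The thresholds here mirror the heuristics in the text; they are rules of thumb, not universal constants, and the function name is illustrative.

```python
def viable_strata(stratum_counts, min_units=1000, max_strata=50):
    """Flag strata too small to power, and warn if the design is over-split.

    `stratum_counts` maps stratum name -> expected unit count over the
    experiment window. Returns which strata fall below the minimum-size
    heuristic and whether the total stratum count exceeds the cap.
    """
    too_small = {s: n for s, n in stratum_counts.items() if n < min_units}
    return {
        "over_split": len(stratum_counts) > max_strata,
        "underpowered": too_small,
    }
```

A design that fails this check is usually better served by collapsing rare strata into an "other" bucket than by running underpowered slices.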
Stratification adds operational complexity. You must instrument stratum assignment, coordinate it across services, and adjust analysis pipelines to compute stratified estimates and aggregate them correctly. The payoff is substantial when heterogeneity is large. For a conversion rate experiment where country explains 25 percent of variance, stratifying by top 10 countries can cut required duration from 8 weeks to 6 weeks, saving two weeks of opportunity cost and reducing exposure risk.
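The duration arithmetic above follows from required sample size scaling linearly with outcome variance: if a covariate explains a fraction R² of the variance, stratifying on it leaves (1 − R²) residual variance, and duration at fixed traffic shrinks by the same factor. A one-line sketch of that calculation:

```python
def stratified_duration(baseline_weeks: float, r_squared: float) -> float:
    """Required sample size, and hence duration at fixed traffic, scales with
    residual variance: duration' = duration * (1 - R^2)."""
    return baseline_weeks * (1.0 - r_squared)
```

With country explaining 25 percent of variance, an 8-week experiment drops to 8 × 0.75 = 6 weeks, matching the example in the text. Note the corresponding MDE reduction at fixed duration goes with the square root, 1 − √0.75 ≈ 13 percent, which sits inside the 10 to 30 percent range quoted earlier.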
💡 Key Takeaways
•Stratifying by device and country typically reduces MDE by 10 to 30 percent when these covariates explain substantial outcome variance
•Each stratum should contain at least 1,000 to 5,000 units to maintain power; avoid creating more than 20 to 50 strata to prevent underpowered slices
•Assignment service enforces 50/50 allocation within each stratum separately, requiring a fast lookup cache for stratum membership (sub-millisecond reads)
•Analysis computes separate treatment effects per stratum, then aggregates using inverse-variance weighting to produce the overall estimate
•Stratification can cut experiment duration from 8 weeks to 6 weeks when heterogeneity is high, saving opportunity cost and reducing risk exposure
•Operational complexity increases because stratum assignment must be coordinated across services and instrumented in logging pipelines
📌 Examples
Spotify stratifies playlist recommendation experiments by user engagement tier (low, medium, high) and country, reducing MDE by 22 percent for click-through rate
Uber stratifies driver incentive experiments by city size and historical trip volume, improving precision and ensuring balance across diverse markets
Google Search stratifies ranking experiments by query type (navigational, informational, transactional) to isolate variance from query intent