Experiment Design (Randomization, Stratification, Power Analysis)

What is Power Analysis and Why Does Sample Size Matter?

Power analysis connects four quantities: sample size, effect size (the minimum detectable effect, or MDE), significance level (alpha, typically 5 percent), and statistical power (1 minus beta, typically 80 to 90 percent). Power is the probability of detecting a true effect when one exists. Underpowered experiments fail to detect real improvements, wasting engineering effort; overpowered experiments consume traffic and time unnecessarily, delaying other tests.

The MDE is chosen from business context. For a conversion funnel, a 1 percent relative lift might be worth millions in annual revenue, so that becomes the target. For a latency improvement, a 10 millisecond reduction at p99 might be the threshold where user experience changes.

Sample size grows with outcome variance and shrinks with larger effect sizes. A binary conversion metric at a 2 percent baseline with a 10 percent relative lift target requires roughly 80,000 users per variant at 80 percent power and two-sided 5 percent alpha. Doubling the MDE to a 20 percent relative lift cuts the required sample by roughly 75 percent, because sample size scales inversely with the square of the effect size.

Historical variance and baseline rates are pulled from a metrics store using recent traffic slices. For continuous metrics, use pooled variance across control and treatment; for binary metrics, variance is p times (1 minus p). Account for clustering or autocorrelation if units are not independent.

Duration is then calculated by dividing the required sample by daily eligible traffic. Seasonality and holidays inflate variance and reduce effective sample, so many teams add a 10 to 20 percent buffer to duration estimates.

Underpowered tests also create exaggeration bias. At 50 percent power, significant positive results overstate the true effect by roughly 40 percent on average, because only the largest observed effects cross the significance threshold; at 80 percent power, the exaggeration is about 10 to 20 percent. Microsoft Bing applies a 20 percent haircut to early uplift estimates for business forecasting. Teams should set higher power targets (85 to 90 percent) for high variance metrics and avoid launching based on barely significant results from underpowered experiments.
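To make the arithmetic concrete, here is a minimal sketch of the standard two-proportion sample-size formula and the duration calculation, assuming a two-sided z-test with unpooled variance. The function names, the 25,000 daily eligible users, and the 15 percent buffer are illustrative assumptions, not values from this section.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(p_base, rel_mde, alpha=0.05, power=0.80):
    """Users per variant for a binary metric, two-sided two-proportion z-test."""
    p_treat = p_base * (1 + rel_mde)            # rate under the target lift
    z_alpha = norm.ppf(1 - alpha / 2)           # critical value, two-sided
    z_beta = norm.ppf(power)                    # quantile for the power target
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treat - p_base) ** 2)

def duration_days(n_per_variant, n_variants, daily_eligible, buffer=0.15):
    """Days to reach the sample, padded for seasonality and variance inflation."""
    return ceil(n_per_variant * n_variants / daily_eligible * (1 + buffer))

n = sample_size_per_variant(0.02, 0.10)       # ~80,700 at 2% baseline, 10% lift
print(n)
print(sample_size_per_variant(0.02, 0.20))    # ~21,100: doubling MDE cuts n ~74%
print(duration_days(n, n_variants=2, daily_eligible=25_000))  # illustrative traffic
```

Note that the reduction from doubling the MDE comes in slightly under 75 percent because the treatment-arm variance also changes with the target lift.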
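The exaggeration-bias claim can be checked with a small simulation, a sketch under simplified assumptions: effect estimates are normally distributed with a standard error chosen to hit the stated power, and only significant positive results are kept.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def exaggeration(power, alpha=0.05, true_lift=1.0, trials=200_000):
    """Mean observed lift among significant positive results, per unit true lift."""
    z_alpha = norm.ppf(1 - alpha / 2)
    # Standard error chosen so the test has exactly the requested power.
    se = true_lift / (z_alpha + norm.ppf(power))
    observed = rng.normal(true_lift, se, trials)  # replicated effect estimates
    winners = observed[observed > z_alpha * se]   # significant and positive
    return winners.mean() / true_lift

print(exaggeration(0.50))  # ~1.41: surviving estimates overstate truth by ~40%
print(exaggeration(0.80))  # ~1.12: at 80% power, overstatement is ~10-20%
```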
💡 Key Takeaways
Power of 80 percent means a 20 percent chance of missing a real effect; 90 percent power is better for high variance or critical metrics
MDE of 10 percent relative conversion lift at 2 percent baseline requires roughly 80,000 users per variant at 80 percent power and two-sided 5 percent alpha
Doubling MDE from 10 percent to 20 percent relative lift reduces required sample size by roughly 75 percent, since sample size scales inversely with the square of the effect size
Underpowered tests at 50 percent power exaggerate observed significant effects by roughly 40 percent; 80 percent power exaggerates by 10 to 20 percent
Microsoft Bing applies a 20 percent haircut to early uplift estimates for business planning to account for exaggeration bias
Duration calculation divides required sample by daily eligible traffic, then adds 10 to 20 percent buffer for seasonality and variance inflation
📌 Examples
Meta runs feed ranking experiments with 90 percent power targeting a 0.5 percent relative lift in time spent, requiring 2 million users and 14 days given its daily active user base
Airbnb pricing experiments target a 2 percent MDE in booking rate with 85 percent power, using 52 weeks of pre-period data to estimate baseline variance
Netflix recommendation experiments use historical CTR variance from the metrics store to compute duration, typically 4 to 6 weeks for 1 percent relative MDE at 80 percent power