
Statistical Significance: Understanding P-Values and Type I/II Errors

Statistical significance tests whether an observed effect is real or just random noise. You start with a null hypothesis (usually that there's no difference between variants) and choose a significance level called alpha, typically 0.05. You then compute a p-value from your data; if the p-value is less than alpha, you reject the null hypothesis.

The alpha level represents your Type I error rate: the probability of declaring a winner when there actually is no difference. At alpha equals 0.05, you'll falsely declare significance 5 percent of the time in the long run. Power is the flip side: the probability of correctly detecting a real effect. Google and Meta typically target 80 percent power, meaning that if a true difference exists, you'll detect it 80 percent of the time. Type II error is the other failure mode: missing a real effect (probability equals 1 minus power, so 20 percent with 80 percent power).

Sample size dramatically affects power. To detect a 5 percent relative Click-Through Rate (CTR) lift from 2.00 percent to 2.10 percent with alpha 0.05 and 80 percent power, you need roughly 1.6 million users per arm. For a rare event like a 0.05 percent purchase rate with a 10 percent relative lift, you need over 30 million users per arm, potentially requiring one to two weeks of runtime.

Critical distinction: statistical significance does not equal practical significance. An effect can be statistically significant (p less than 0.05) but too small to matter. With 50 million users, you might detect a 0.01 percent CTR increase as statistically significant, but that's probably not worth the engineering cost or the potential risks of shipping a new model.
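To make alpha and power concrete, here is a minimal simulation sketch (all parameters are illustrative) that treats each user as a single Bernoulli draw and applies a pooled two-proportion z-test. Under the null (an A/A test) the rejection rate converges to alpha, about 5 percent; with a true 2.00 to 2.10 percent lift and roughly 320,000 users per arm, the rejection rate is the power, about 80 percent. Real per-user CTR metrics are noisier than this one-draw-per-user model (users see many impressions and their behavior is correlated), which is one reason production estimates like the 1.6 million figure above come out several times larger.

```python
# Minimal simulation sketch: Type I error rate (A/A) and power (A/B).
# Assumes one Bernoulli conversion per user and a pooled two-proportion z-test;
# all parameters are illustrative, not taken from any real experiment.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
ALPHA = 0.05

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

def rejection_rate(p_a, p_b, n_per_arm, n_experiments=2000):
    """Fraction of simulated experiments where the null is rejected at ALPHA."""
    rejections = 0
    for _ in range(n_experiments):
        conv_a = rng.binomial(n_per_arm, p_a)
        conv_b = rng.binomial(n_per_arm, p_b)
        if p_value(conv_a, n_per_arm, conv_b, n_per_arm) < ALPHA:
            rejections += 1
    return rejections / n_experiments

# A/A test: no true difference, so the rejection rate is the Type I error rate (~0.05).
print("A/A false positive rate:", rejection_rate(0.02, 0.020, n_per_arm=320_000))
# A/B test with a true 5% relative lift: the rejection rate is the power (~0.80).
print("A/B detection rate (power):", rejection_rate(0.02, 0.021, n_per_arm=320_000))
```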
💡 Key Takeaways
Alpha at 0.05 means 5 percent false positive rate: you'll incorrectly declare a winner 1 in 20 times when there's no real difference
Power of 80 percent means you'll detect a true effect 80 percent of the time, requiring larger samples for smaller effects or higher-variance (noisier) metrics
Sample size scales with inverse square of effect size: detecting a 2 percent lift needs 4 times more users than detecting a 4 percent lift (see the sketch after this list)
To detect 5 percent relative CTR lift from 2.00 percent baseline requires 1.6 million users per arm with alpha 0.05 and 80 percent power
Statistical significance does not imply business value: with 50 million users you can detect tiny meaningless effects as significant
Rare events like 0.05 percent purchase rates need 30 million plus users per arm for 10 percent relative lift detection, taking one to two weeks
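The inverse-square scaling in the takeaways can be checked with the textbook normal-approximation sample-size formula for comparing two proportions. This is a sketch with assumed parameters; production power calculators that account for ratio metrics, variance inflation, and multiple looks return larger absolute numbers, but the scaling behaves the same way: halving the relative lift roughly quadruples the users per arm.

```python
# Sketch of the standard two-proportion sample-size approximation
# (normal approximation, two-sided alpha = 0.05, 80% power).
# Baseline and lifts are illustrative.
from scipy.stats import norm

def n_per_arm(p_baseline, relative_lift, alpha=0.05, power=0.80):
    """Users per arm to detect p_baseline -> p_baseline * (1 + relative_lift)."""
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

n_4pct = n_per_arm(0.02, 0.04)  # detect a 4% relative lift
n_2pct = n_per_arm(0.02, 0.02)  # detect a 2% relative lift
print(f"4% lift: {n_4pct:,.0f} users/arm, 2% lift: {n_2pct:,.0f} users/arm "
      f"(ratio ~ {n_2pct / n_4pct:.1f}x)")
```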
📌 Examples
Meta feed ranking experiment: baseline CTR 2.00 percent, new model shows 2.10 percent, need 1.6M users per arm to detect with 80 percent power (a worked readout follows this list)
Netflix thumbnail test: with 100M daily users, can detect 0.1 percent watch time lift in under 24 hours, but must ask if it matters to business
Uber pricing experiment: 0.05 percent conversion rate on premium rides requires 30M users per arm for 10 percent lift, running 10 to 14 days
Google Search quality: with billions of queries, can detect 0.01 percent CTR changes as significant but uses business thresholds to filter noise
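As a worked illustration of the Meta-style example, here is a hedged sketch of the readout you might compute once the experiment finishes. The click counts are invented to match a 2.00 percent control CTR and a 2.10 percent treatment CTR at 1.6 million users per arm; a pooled two-proportion z-test yields the p-value compared against alpha.

```python
# Illustrative readout for a feed-ranking A/B test. Counts are made up to match
# a 2.00% control CTR and 2.10% treatment CTR at 1.6M users per arm.
from math import sqrt
from scipy.stats import norm

n_control, clicks_control = 1_600_000, 32_000      # 2.00% CTR (hypothetical)
n_treatment, clicks_treatment = 1_600_000, 33_600  # 2.10% CTR (hypothetical)

p_c = clicks_control / n_control
p_t = clicks_treatment / n_treatment
p_pool = (clicks_control + clicks_treatment) / (n_control + n_treatment)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
z = (p_t - p_c) / se
p_value = 2 * norm.sf(abs(z))

print(f"lift = {(p_t - p_c) / p_c:.1%}, z = {z:.2f}, p = {p_value:.1e}")
# p < 0.05 -> statistically significant; whether a 5% relative CTR lift is worth
# shipping is still a separate, practical-significance question.
```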