Statistical Significance: Understanding P-Values and Type I/II Errors
TYPE I AND TYPE II ERRORS
Type I error (false positive): You declare a winner when there is no real difference. The alpha level (typically 0.05) caps this rate: when the null hypothesis is true, there is a 5% chance of a false positive.
Type II error (false negative): You fail to detect a real difference. Power (typically 80%) is one minus the Type II error rate: an 80% chance of detecting a true effect of the assumed size, and a 20% chance of missing it.
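The alpha guarantee can be checked empirically. The sketch below (my own illustration, not from the original text) runs many simulated A/B tests in which the null is true, i.e. both arms convert at the same rate, and counts how often a two-sided pooled z-test at alpha = 0.05 falsely declares a winner; the rate should land near 5%.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
TRIALS, N, P = 2000, 1000, 0.50   # both arms truly convert at 50%
false_positives = 0
for _ in range(TRIALS):
    conv_a = sum(random.random() < P for _ in range(N))
    conv_b = sum(random.random() < P for _ in range(N))
    if z_test_p_value(conv_a, N, conv_b, N) < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / TRIALS:.3f}")  # near 0.05
```

With more simulated trials the observed rate converges on the nominal alpha; the same harness, run with a real difference between arms, estimates power instead.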
SAMPLE SIZE REQUIREMENTS
Sample size scales with the inverse square of the effect size you want to detect: halving the minimum detectable lift quadruples the users required, so detecting a 2% lift needs 4x more users than detecting a 4% lift. Concrete example: detecting a 5% relative CTR lift from a 2.0% baseline (2.0% vs 2.1%) requires roughly 315,000 users per arm at alpha=0.05 and 80% power. Rare events are far more expensive: the same 5% relative lift on a 0.05% purchase rate needs roughly 13 million users per arm, and smaller lifts quickly push past 30 million.
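The arithmetic above can be sketched with the standard normal-approximation formula for comparing two proportions (stdlib only; dedicated calculators may differ slightly in the last digits):

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Users needed in EACH arm to detect `relative_lift` over `baseline`."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided test
    z_beta = z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% relative lift on a 2.0% CTR baseline:
print(users_per_arm(0.020, 0.05))      # roughly 315,000 per arm
# Halving the detectable lift roughly quadruples the requirement:
print(users_per_arm(0.020, 0.025))     # roughly 1.25 million per arm
# Rare event: the same 5% lift on a 0.05% purchase rate:
print(users_per_arm(0.0005, 0.05))     # roughly 13 million per arm
```

The inverse-square scaling is visible in the formula itself: the absolute difference p2 − p1 appears squared in the denominator, so shrinking it by half multiplies the result by four.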
PRACTICAL IMPLICATIONS
High-traffic systems can detect very small effects (0.1%) in hours. Low-frequency conversion events (purchases, subscriptions) need weeks. Plan experiment duration from your traffic and minimum detectable effect, not from arbitrary timelines.
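A hypothetical helper (the function name and traffic figures are my own, for illustration) turns that planning advice into arithmetic: duration is simply the users needed across all arms divided by daily eligible traffic.

```python
from math import ceil

def experiment_days(users_per_arm: int, daily_traffic: int, arms: int = 2) -> int:
    """Days needed to fill all arms, assuming traffic splits evenly."""
    return ceil(users_per_arm * arms / daily_traffic)

# ~315k per arm at 5 million daily users: filled within a day.
print(experiment_days(315_000, 5_000_000))    # 1
# ~13M per arm at 500k daily purchase-eligible users: many weeks.
print(experiment_days(13_000_000, 500_000))   # 52
```

The same required sample size thus maps to wildly different calendar durations depending on traffic, which is the point of planning from minimum detectable effect rather than picking a fixed run length.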