Trade-offs: Statistical Power, Operational Complexity, and Cost
Statistical Power
Larger holdout = more power to detect differences. With 5% holdout (50K users on 1M total), you can detect 5% relative differences in long-term metrics. With 1% holdout (10K users), you need 10%+ differences to detect reliably. Power depends on holdout size, metric variance, and observation period.
High-variance metrics (revenue, LTV) need larger holdouts. Low-variance metrics (retention) can work with smaller ones. Calculate required holdout size based on your most important long-term metric.
Opportunity Cost
If shipped features improve revenue 10%, a 5% holdout loses 0.5% of total revenue (5% × 10%). For $100M annual revenue, thats $500K/year. This cost must be justified by the value of long-term measurement and catching cumulative harm that would cost more if undetected.
Operational Complexity
Every feature must check holdout status and branch code paths. Support and ops must handle two product versions. Bug fixes must be applied to both paths. Testing must cover both paths. This complexity scales linearly with feature count and holdout duration.