A/B Testing & Experimentation • Holdout Groups & Long-term Impact
What Are Holdout Groups and Why Do They Matter?
Holdout groups are long-lived control cohorts deliberately excluded from product changes in order to measure true long-term impact. Unlike standard A/B tests that run for one to three weeks with 50/50 splits, holdouts reserve a small percentage of users (typically 1 to 10 percent) who do not receive new features for months, allowing you to compare metrics between the shipped population and the holdout to estimate durable value.
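In practice, holdout membership is usually a deterministic hash of the user ID, so the cohort stays stable for the whole measurement window. A minimal sketch of that assignment with a hypothetical salt and a 2 percent holdout (the function name and parameters are illustrative, not any particular platform's API):

```python
import hashlib

def in_holdout(user_id: str, salt: str = "feature-holdout-2024", holdout_pct: float = 2.0) -> bool:
    """Stable assignment: hash(salt + user_id) maps to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 100.0  # two decimal places of granularity
    return bucket < holdout_pct

# The same user always lands in the same bucket, so the holdout
# stays intact for the full 3-12 month measurement window.
for uid in ("user_001", "user_002", "user_003"):
    print(uid, "holdout" if in_holdout(uid) else "shipped")
```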
This approach is critical for ML-driven products, where short-term experiments miss three key dynamics. First, novelty effects inflate early gains that decay over time. Second, feedback loops cause model outputs to influence user behavior, which feeds back into training data. Third, seasonal patterns and cross-feature cannibalization only emerge over quarters. Pinterest ran a 1 percent holdout on notification badging for over a year and watched the initial 7 percent Daily Active User (DAU) lift decay to a 2.5 percent baseline, then climb back to 4 percent after a related feature shipped.
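A common way to picture the novelty dynamic is an exponential decay from the initial spike toward a durable long-run level. A toy model below is parameterized with the Pinterest figures; the decay constant tau_weeks is an assumed value for illustration, not a published number:

```python
import math

def dau_lift(week: float, initial=7.0, durable=2.5, tau_weeks=8.0) -> float:
    """Novelty-decay model: percentage lift decays exponentially
    from the initial spike toward the durable long-run level."""
    return durable + (initial - durable) * math.exp(-week / tau_weeks)

for w in (1, 2, 4, 12, 26, 52):
    print(f"week {w:>2}: {dau_lift(w):.1f}% DAU lift")
# A 1-3 week experiment reads roughly 6-6.5%; only the long-running
# holdout reveals the ~2.5% durable effect.
```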
The pattern comes in two flavors. Local, or feature-level, holdouts isolate one component such as notifications or a recommender. Universal holdouts, used at Disney Streaming, reserve a small global share of users who receive none of the changes over a fixed period. Disney runs a 3-month enrollment period followed by a 1-month evaluation period, powered to detect 1 percent changes in hours watched per subscriber, a threshold chosen because a 1 percent change in that metric is financially meaningful.
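Powering a small holdout to a 1 percent minimum detectable effect is largely a sample-size question. A rough sketch using statsmodels, where the mean and standard deviation of hours watched are illustrative assumptions (Disney's actual values are not public) and a 5 percent holdout gives a 95/5 allocation:

```python
from statsmodels.stats.power import TTestIndPower

# Assumed distribution of the metric, for illustration only.
mean_hours, sd_hours = 20.0, 15.0           # hours watched per subscriber
effect_size = 0.01 * mean_hours / sd_hours  # Cohen's d for a 1% MDE

# ratio = n_shipped / n_holdout; a 5% holdout implies 95/5 = 19.
n_holdout = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=19.0,
)
print(f"holdout users needed: {n_holdout:,.0f}")
print(f"total users needed:   {n_holdout * 20:,.0f}")
```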
Holdouts capture what standard experiments cannot: cumulative effects across many shipped features. Disney found that naive summation of per-experiment effects overestimates total impact, because cannibalization and temporal decay pull the aggregate below the sum. This feedback tells ML teams whether their roadmap delivers compounding value or creates interference.
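A back-of-the-envelope illustration of why the sum overestimates; the specific lifts and interaction term below are made up to show the mechanics:

```python
# Two features each measured at +5% lift in isolated A/B tests.
individual_lifts = [0.05, 0.05]
naive_total = sum(individual_lifts)          # 10% by naive summation

# Assumed cannibalization: the features compete for the same user
# attention, so an interaction term claws back part of the gain.
interaction = -0.02
holdout_reading = naive_total + interaction  # what the universal holdout sees

print(f"naive sum of experiments: {naive_total:.0%}")      # 10%
print(f"universal holdout reads:  {holdout_reading:.0%}")  # 8%
```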
💡 Key Takeaways
•Holdouts run for 3 to 12 months with small splits (1 to 10 percent held out) versus standard A/B tests at 1 to 3 weeks with 50/50 splits, trading power for long-term validity
•Pinterest saw notification badging DAU lift decay from 7 percent initially to a 2.5 percent baseline over a year, illustrating novelty effects that short experiments miss
•Disney Streaming powers their universal holdout to detect 1 percent changes in hours watched per subscriber, aligning minimum detectable effect (MDE) with revenue materiality
•Universal holdouts measure cumulative impact across all features and reveal that summing individual experiment effects overestimates reality due to cannibalization
•ML systems need holdouts because feedback loops (model outputs influence user behavior, which influences training data) create long-term dynamics invisible to short tests
📌 Examples
Disney Streaming universal holdout: 3 month enrollment + 1 month evaluation, 1 to 5 percent holdout, measures hours watched per subscriber across all shipped ML ranking and recommendation changes
Pinterest notification badging holdout: 1 percent of users held out for 12+ months, observed 7% → 2.5% → 4% DAU lift trajectory as novelty decayed then synergy effects emerged