
Real-World Case Study: Disney Streaming Universal Holdout

Disney Streaming implemented a universal holdout to measure the cumulative long-term impact of their ML-driven ranking and recommendation changes across the platform. They reserve a small percentage (1 to 5 percent) of subscribers in a holdout cohort that receives none of the shipped changes over a fixed period. The design uses a 3-month enrollment period during which eligible users are sampled, followed by a 1-month evaluation period during which the cohort is frozen and no new users are added. This 4-month cycle balances the need for long-term measurement against the operational cost of maintaining dual code paths.

The holdout is powered to detect a 1 percent change in hours watched per subscriber, a core engagement metric directly tied to revenue and retention. This threshold reflects the financial materiality of improvements rather than chasing statistical significance on less meaningful metrics. Choosing hours watched over secondary metrics like click-through rate or session count keeps experimentation aligned with business outcomes.

Disney found that naively summing individual A/B test effects significantly overestimates the actual aggregate impact. When multiple recommendation and ranking features ship in parallel, cannibalization and interference effects reduce the combined lift below what adding up each experiment's measured benefit would predict. The universal holdout provides the ground truth by comparing users who received all changes against those who received none, capturing the cross-feature dynamics that individual experiments miss.

Operationally, maintaining dual paths for all ML components over 4 months proved prohibitively expensive for some changes. Disney selectively applies the universal holdout to the highest-impact features and uses alternative methods, such as off-policy evaluation or extended short-term experiments, for changes where dual paths are too costly. They also observed that some initially successful experiments showed decreasing or inconsistent impact over the full 4-month window, validating the need for long-term measurement to separate durable improvements from temporary novelty effects.
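To make the power target concrete, here is a minimal back-of-the-envelope sample-size sketch for detecting a 1 percent change in mean hours watched when a small holdout is compared against a much larger shipped population. The coefficient of variation, significance level, and power below are illustrative assumptions; Disney has not published the actual variance of hours watched or its test parameters.

```python
from scipy.stats import norm


def min_total_subscribers(mde_rel=0.01, cv=1.5, holdout_frac=0.05,
                          alpha=0.05, power=0.80):
    """Rough two-sample sample-size estimate for comparing mean hours watched
    between a small holdout and the shipped population.

    mde_rel      -- relative minimum detectable effect (1% of mean hours watched)
    cv           -- assumed coefficient of variation of hours watched (illustrative)
    holdout_frac -- fraction of subscribers in the holdout (1-5% in the case study)
    """
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided test
    z_beta = norm.ppf(power)
    q = holdout_frac
    # Var(diff of means) = sigma^2 * (1/n_holdout + 1/n_shipped),
    # with n_holdout = q*N and n_shipped = (1-q)*N; solve for total N.
    allocation_factor = 1 / q + 1 / (1 - q)
    n_total = (z_alpha + z_beta) ** 2 * cv ** 2 * allocation_factor / mde_rel ** 2
    return int(n_total)


if __name__ == "__main__":
    for q in (0.01, 0.05):
        print(f"holdout {q:.0%}: ~{min_total_subscribers(holdout_frac=q):,} subscribers total")
```

Under these assumptions a 5 percent holdout needs a few million subscribers in total, while a 1 percent holdout needs roughly five times as many, which is one reason the holdout fraction and the minimum detectable effect have to be chosen together.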
💡 Key Takeaways
Disney Streaming universal holdout: 1 to 5 percent of subscribers held out for a 3-month enrollment plus a 1-month evaluation (4 months total) to measure cumulative ML ranking and recommendation impact (a cohort-assignment sketch follows this list)
Powered to detect a 1 percent change in hours watched per subscriber, aligning the minimum detectable effect with revenue materiality and business outcomes rather than vanity metrics
Found that summing individual A/B test effects overestimates aggregate impact due to cannibalization and interference; the universal holdout provides ground truth by capturing cross-feature dynamics
Dual-path maintenance for all ML components over 4 months is prohibitively expensive for some changes; the universal holdout is selectively applied to the highest-impact features, with off-policy evaluation as a fallback
Some initially successful experiments showed decreasing or inconsistent impact over 4 months, validating long-term measurement to separate durable improvements from temporary novelty effects
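As a rough sketch of the cohort mechanics in the first takeaway, the snippet below assigns subscribers to a universal holdout with a deterministic hash, enrolls only users who become eligible during the enrollment window, and freezes the cohort once that window closes. The dates, salt, fraction, and function names are illustrative assumptions, not Disney's actual implementation.

```python
import hashlib
from datetime import date

# Illustrative cycle calendar, not Disney's actual schedule.
ENROLLMENT_START = date(2024, 1, 1)
ENROLLMENT_END = date(2024, 3, 31)     # 3-month enrollment window
EVALUATION_END = date(2024, 4, 30)     # 1-month frozen evaluation window
HOLDOUT_FRACTION = 0.05                # 1-5 percent in the case study
SALT = "universal-holdout-2024-cycle"  # a fresh salt per cycle avoids carryover between holdouts


def in_holdout(subscriber_id: str, eligible_date: date, today: date) -> bool:
    """Deterministically decide whether a subscriber is in the universal holdout.

    Only users who become eligible during the enrollment window can hash into
    the cohort; after the window closes the cohort is frozen and no new users
    are added, matching the enroll-then-evaluate design described above.
    """
    if today > EVALUATION_END:
        return False                   # cycle finished: everyone gets shipped changes
    if not (ENROLLMENT_START <= eligible_date <= ENROLLMENT_END):
        return False                   # never enrolled; always gets shipped changes
    digest = hashlib.sha256(f"{SALT}:{subscriber_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
    return bucket < HOLDOUT_FRACTION


# Hypothetical call site in a ranking/recommendation service:
# if in_holdout(user.id, user.eligible_date, date.today()):
#     serve_baseline_experience(user)      # none of the shipped ML changes
# else:
#     serve_production_experience(user)    # all shipped changes
```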
📌 Examples
Disney Streaming cohort design: a 3-month enrollment window samples eligible users, a 1-month evaluation window freezes the cohort, then metrics are compared between the 1 to 5 percent holdout and the 95 to 99 percent shipped population
Disney cannibalization finding: if Experiment A measures +2% hours watched and Experiment B measures +3% hours watched, the universal holdout might show only +4% combined instead of +5%, revealing one percentage point of lift lost to interference
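The arithmetic behind that example can be written out directly; the figures below simply restate the +2%/+3% versus +4% scenario and are illustrative, not measured results.

```python
# Lifts reported by the individual A/B tests (illustrative, from the example above).
individual_lifts = {"experiment_a": 0.02, "experiment_b": 0.03}

naive_estimate = sum(individual_lifts.values())            # 0.05 -> "+5% hours watched"

# The universal holdout measures a single gap: users who received every shipped
# change versus users who received none of them.
holdout_mean_hours = 100.0    # mean hours watched per subscriber in the holdout (illustrative)
shipped_mean_hours = 104.0    # mean hours watched in the shipped population (illustrative)
measured_combined_lift = shipped_mean_hours / holdout_mean_hours - 1   # 0.04 -> "+4%"

overestimate = naive_estimate - measured_combined_lift     # ~0.01 -> one point lost to interference
print(f"naive sum of per-experiment effects: {naive_estimate:+.1%}")
print(f"holdout-measured combined lift:      {measured_combined_lift:+.1%}")
print(f"overestimate from interference:      {overestimate:.1%}")
```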