Implementation: Gating, Analytics, and Dual Path Management

Place gating early in the request path and centralize it. For application features, a top level feature flag layer should check the universal holdout first, then feature level holdouts, then normal flags. For marketing systems, apply holdout filtering after upstream optimizations such as send time optimization, so that you measure the full stack as users actually experience it. Bluecore applied holdouts after its optimization features and isolated the holdout logic so that performance metrics remain interpretable. This avoids the trap of applying holdouts too early and measuring a system configuration that does not reflect production.

Power and analysis require upfront planning. A 1 to 99 split is underpowered for small effects; increase to 5 to 95 or 10 to 90 when feasible. Plan to run for at least 3 months, and size the holdout to detect effects of 1 percent or larger on core metrics aligned with revenue. Use variance reduction methods such as Controlled-experiment Using Pre-Experiment Data (CUPED), and stratify by known confounders. For portfolio level inference, compute aggregate impact with attention to overlapping user experiences. Disney found that summing individual experiment effects overestimates the real impact because of cannibalization, so they directly compare the universal holdout with the population that received the shipped features.

Plan for exceptions and coordinate across teams. Security updates and critical bug fixes should bypass holdouts. Document an approval and override path, log exceptions with reason codes, and communicate that one team violating the protocol can invalidate months of data. Disney noted the infrastructure and code complexity of maintaining many concurrent branches. Invest in quality assurance and product management oversight to prevent regressions in holdout compliance.

If dual model maintenance for months is too expensive, consider snapshotting features, freezing certain components, or using off policy evaluation techniques to approximate long term effects.
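The gating order described above (approved exceptions, then universal holdout, then feature level holdout, then the normal flag) can be sketched as a single centralized decision function. This is a minimal illustration, not any company's actual implementation: the `Feature` fields, the `decide` and `bucket` helpers, the salts, and the 5 percent universal holdout size are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple

# Assumed size of the universal holdout, in percent (hypothetical value).
UNIVERSAL_HOLDOUT_PCT = 5.0


def bucket(user_id: str, salt: str) -> float:
    """Deterministically map (salt, user_id) to a number in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100.0


@dataclass(frozen=True)
class Feature:
    name: str
    holdout_pct: float = 0.0       # feature level holdout size
    rollout_pct: float = 100.0     # normal flag rollout
    bypass_holdouts: bool = False  # approved exception (security, critical fix)
    bypass_reason: str = ""        # reason code recorded with the exception


def decide(user_id: str, feature: Feature) -> Tuple[bool, str]:
    """Return (enabled, reason) so every decision can be logged and audited."""
    # 0. Approved exceptions bypass all holdouts and carry a reason code.
    if feature.bypass_holdouts:
        return True, f"exception:{feature.bypass_reason}"
    # 1. Universal holdout: one shared salt, so the same users are held
    #    out of every feature.
    if bucket(user_id, "universal-holdout") < UNIVERSAL_HOLDOUT_PCT:
        return False, "universal_holdout"
    # 2. Feature level holdout: salted per feature.
    if bucket(user_id, f"holdout:{feature.name}") < feature.holdout_pct:
        return False, "feature_holdout"
    # 3. Normal flag evaluation.
    if bucket(user_id, f"rollout:{feature.name}") < feature.rollout_pct:
        return True, "flag_on"
    return False, "flag_off"
```

Returning the reason alongside the decision is a design choice for this sketch: it makes holdout compliance checkable from logs, which is where a single team's violation would otherwise go unnoticed.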
Rotate long lived universal holdouts quarterly to reduce user frustration. Finally, if holdouts are not feasible, extend the original experiment or adopt causal inference methods like difference in differences or Bayesian structural time series. These require strong assumptions but avoid the operational burden of months long dual paths.
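To see why a 1 to 99 split is underpowered compared with 5 to 95 or 10 to 90, the sketch below computes the minimum detectable relative lift for a two-sided, two-sample z-test on a mean metric. The user count, metric mean, and standard deviation are made-up inputs, and the function name `min_detectable_lift` is hypothetical; substitute values from your own data.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_lift(n_total, holdout_frac, mean, std,
                        alpha=0.05, power=0.8):
    """Smallest relative lift detectable with the given split.

    Assumes equal variance in both groups and a normal approximation.
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n_holdout = n_total * holdout_frac
    n_treated = n_total * (1 - holdout_frac)
    # Standard error of the difference in means between the two groups.
    se = std * sqrt(1 / n_holdout + 1 / n_treated)
    return z_total * se / mean


# Hypothetical inputs: 2M users, revenue per user with mean 10 and std 30.
for frac in (0.01, 0.05, 0.10):
    lift = min_detectable_lift(2_000_000, frac, mean=10.0, std=30.0)
    print(f"{frac:.0%} holdout: minimum detectable lift ~ {lift:.2%}")
```

Under these assumptions, the detectable effect shrinks as the holdout grows because the standard error is dominated by the smaller group. If even the larger split cannot reach the 1 percent target, that is the signal to add variance reduction or a longer run.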
💡 Key Takeaways
Gate at top level: check universal holdout first (5 percent), then feature holdouts, then normal flags; apply holdouts after upstream optimizations to measure production stack as experienced
Run for at least 3 months with 5 to 95 or 10 to 90 splits, powered to detect 1 percent effects; use variance reduction (CUPED) and stratification, and avoid summing individual experiment effects, which overstates impact because of cannibalization
Disney found summing individual experiment effects overestimates cumulative impact; compute direct comparison between universal holdout and shipped population for portfolio level inference
Document exception paths for security updates and critical bugs with approval process and reason codes; communicate that one team violating holdouts invalidates months of data
If dual path maintenance is too expensive, snapshot features, freeze components, or use off policy evaluation; rotate universal holdouts quarterly to reduce permanent deprivation of same users
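The CUPED adjustment mentioned in the takeaways can be illustrated with a minimal sketch. It assumes each user has a pre-experiment value of the same metric (`pre`) and an in-experiment value (`post`); the data here is synthetic and the function name `cuped_adjust` is hypothetical.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x.

    theta = cov(x, y) / var(x); y_adj = y - theta * (x - mean(x)).
    The adjustment keeps the mean of y and removes the variance that
    x already explains.
    """
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y)
                 for xi, yi in zip(x, y)) / len(x)
    theta = cov_xy / pvariance(x, mean_x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Synthetic data: the in-experiment metric correlates with the
# pre-experiment metric, as revenue per user typically does.
rng = random.Random(0)
pre = [rng.gauss(10.0, 3.0) for _ in range(5000)]
post = [0.8 * p + rng.gauss(2.0, 1.0) for p in pre]

adjusted = cuped_adjust(post, pre)
print("variance before:", pvariance(post))
print("variance after: ", pvariance(adjusted))
```

In a real analysis, theta and the covariate mean would be estimated on the pooled holdout and shipped populations, and the adjusted values would then be compared between the two groups.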
📌 Examples
Bluecore holdout gating: apply after send time optimization so measurement reflects full production stack including optimization features, not a hypothetical baseline
Disney Streaming dual path management: some ML changes are too expensive to maintain for 4 months, so selectively apply holdouts to the highest impact features and use off policy evaluation for the others