Interleaving vs A/B Testing Trade-offs
WHAT INTERLEAVING GIVES YOU
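Speed: Interleaving needs roughly 50-100x less data than A/B testing to reach statistical significance. Because it usually runs on a smaller slice of traffic, the calendar gain is smaller but still large: what takes 4 weeks in A/B testing takes 2-5 days with interleaving.
Efficiency: Each user sees results from both rankers in a single list and so acts as their own control, which eliminates between-user variance.
Iteration velocity: Each experiment takes days instead of weeks, so you can test more ranking hypotheses per quarter.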
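To make the "same user as their own control" idea concrete, here is a minimal sketch of team-draft interleaving, one common interleaving scheme. The text above does not say which scheme is meant, so treat the choice as an assumption. The function names (`team_draft_interleave`, `session_winner`) and the document IDs are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Blend two rankings into one list of up to k items.

    Returns the blended list and a dict mapping each shown item to the
    team ("A" or "B") that contributed it.
    """
    rng = rng or random.Random()
    interleaved, team = [], {}
    picks = {"A": 0, "B": 0}
    sources = {"A": ranking_a, "B": ranking_b}
    while len(interleaved) < k:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            label = "A" if picks["A"] < picks["B"] else "B"
        else:
            label = "A" if rng.random() < 0.5 else "B"
        # Take that team's highest-ranked item that is not already shown.
        pick = next((d for d in sources[label] if d not in team), None)
        if pick is None:
            # This ranker has nothing new left; fall back to the other one.
            label = "B" if label == "A" else "A"
            pick = next((d for d in sources[label] if d not in team), None)
            if pick is None:
                break  # both rankings are exhausted
        team[pick] = label
        interleaved.append(pick)
        picks[label] += 1
    return interleaved, team


def session_winner(team, clicked):
    """Credit each click to the team that contributed the clicked item."""
    a = sum(1 for d in clicked if team.get(d) == "A")
    b = sum(1 for d in clicked if team.get(d) == "B")
    return "A" if a > b else "B" if b > a else "tie"


# Hypothetical rankings from two models for the same query.
shown, team = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4, rng=random.Random(0)
)
```

Each session yields one vote (A, B, or tie), and because both models are judged by the same user on the same query, differences between users cancel out.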
WHAT INTERLEAVING DOES NOT GIVE YOU
Absolute impact: You learn that Model A is preferred over Model B, but not how much CTR or revenue actually changes.
Delayed outcomes: Interleaving experiments run for only a few days, which is too short to measure delayed conversions such as subscriptions or repeat purchases.
Business metrics: Every interleaved page blends both models, so page-level and user-level outcomes cannot be attributed to either one. Guardrail metrics such as page load time, error rates, or revenue therefore require traditional A/B testing with full traffic exposure.
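The limit on absolute impact shows up in how interleaving results are usually analyzed. The sketch below assumes per-session win counts and applies an exact two-sided sign test. This is one reasonable choice of test, not necessarily the one a given team uses, and the win counts are invented for illustration.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test on per-session wins (ties are dropped).

    Null hypothesis: each non-tied session is equally likely to favor A or B.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def preference_share(wins_a, wins_b):
    """Fraction of non-tied sessions won by A."""
    n = wins_a + wins_b
    return wins_a / n if n else 0.5


# Hypothetical counts: sessions where A won vs. sessions where B won.
wins_a, wins_b = 620, 540
p_value = sign_test_p_value(wins_a, wins_b)
share = preference_share(wins_a, wins_b)
```

The output is a direction and a confidence level. The preference share is the fraction of sessions A won, which does not translate into a CTR or revenue delta.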
THE THREE STAGE FUNNEL
Production ML teams typically use a three-stage evaluation funnel:
Stage 1: Offline metrics (NDCG, MRR) filter out about 90% of candidates in hours with zero production traffic.
Stage 2: Interleaving ranks the remaining 10% in 2-5 days with 5-20% of traffic.
Stage 3: A/B testing validates the top 1-2 winners over 2-4 weeks with full traffic exposure.
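The Stage 1 metrics can be computed from logged relevance judgments alone. The sketch below uses the linear-gain form of DCG (another common variant uses 2^rel - 1 as the gain) and assumes each input list holds the relevance labels of the returned items in ranked order. The function names and example labels are illustrative.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(
        rel / log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(queries):
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    if not queries:
        return 0.0
    total = 0.0
    for rels in queries:
        rank = next((i for i, r in enumerate(rels, start=1) if r > 0), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(queries)


# Hypothetical graded labels for one query, and binary labels for two queries.
ndcg = ndcg_at_k([3, 2, 0, 1], k=4)
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
```

Because these scores need only stored judgments, many candidate models can be compared in hours without exposing any user to them.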
WHEN TO SKIP INTERLEAVING
Skip interleaving when:
(1) The models are so different that blending their results produces an incoherent list.
(2) You need diversity or fairness constraints that apply to the entire result slate, which interleaving breaks by mixing two slates.
(3) Your metric requires weeks of observation, such as subscription retention.
(4) You are testing UI changes that affect user browsing behavior independently of ranking.