Statistical Analysis and Preference Margins
Interleaving analysis estimates a preference margin, the proportion of competitive interactions won by the treatment model, and tests whether it significantly exceeds 0.5. For each query or session, compute the preference as the number of competitive clicks credited to the treatment model divided by the total number of competitive clicks for that query. A competitive click is one on an item where the two models disagreed on ranking. Neutral clicks, on items where both models agreed, are excluded from the denominator, which reduces noise and focuses the test on actual model differences.
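A minimal sketch of the per query computation, assuming a simple data model in which each model's rank for every clicked item is available; the function name and data structures are illustrative, not taken from any of the systems cited here:

```python
# Hypothetical data model: for one query we have the clicked item ids plus each
# model's rank for every item (lower rank = shown higher in the blended list).
def query_preference(clicked_items, treatment_rank, control_rank):
    """Return the per query preference score in [0, 1], or None if the query
    produced no competitive clicks (every click was on an item both models agreed on)."""
    treatment_wins = 0
    competitive_clicks = 0
    for item in clicked_items:
        t, c = treatment_rank[item], control_rank[item]
        if t == c:
            continue  # neutral click: models agreed on this item, excluded from the denominator
        competitive_clicks += 1
        if t < c:     # treatment placed the clicked item higher, so it gets the credit
            treatment_wins += 1
    if competitive_clicks == 0:
        return None   # no signal from this query
    return treatment_wins / competitive_clicks
```

Running this over every query in the experiment yields the sample of per query preference scores that the significance test described next operates on.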
Across all queries, you collect a sample of per query preference scores, each between 0 and 1. The null hypothesis is that the population mean preference equals 0.5, meaning neither model is better. Thumbtack uses a one sample Z test of proportions against 0.5. Airbnb uses a one sample t test on the preference margin. Both approaches test the same hypothesis but differ in distributional assumptions. The t test is more conservative and robust when sample sizes are modest or preference distributions are skewed. Statistical significance is typically declared at p value less than 0.05, and experiments run until they accumulate the sample size needed for the target power, often 80 percent, or until they hit a maximum duration such as 7 days.
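A sketch of both tests using SciPy; the per query scores here are simulated and the pooled click counts for the Z test are placeholder numbers, not results reported by Thumbtack or Airbnb:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated per query preference scores with a mild lean toward treatment
# (purely illustrative; real scores come from the click attribution step above).
prefs = rng.beta(2.1, 1.9, size=2000)

# One sample t test on the preference margin against the null mean of 0.5
# (the Airbnb-style variant described in the text).
t_stat, p_value_t = stats.ttest_1samp(prefs, popmean=0.5)

# One sample Z test of proportions against 0.5 on pooled competitive clicks
# (the Thumbtack-style variant); counts below are hypothetical.
wins, total = 5_150, 10_000                     # 51.5 percent of competitive clicks won by treatment
z = (wins / total - 0.5) / np.sqrt(0.25 / total)  # std error under the null p = 0.5
p_value_z = 2 * stats.norm.sf(abs(z))           # two sided p value

print(f"t test: mean preference = {prefs.mean():.3f}, p = {p_value_t:.4g}")
print(f"z test: p_hat = {wins / total:.3f}, p = {p_value_z:.4g}")
```

The t test treats each query's score as the unit of analysis, which is the more conservative choice noted above; the Z test pools competitive clicks and assumes they are independent.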
The key advantage is variance reduction through within user comparisons. Traditional A/B tests suffer from between user variance in engagement propensity. Interleaving makes each user their own control, so differences in user quality, session context, or query difficulty cancel out. This is why Thumbtack reported a 100 times gain in sample efficiency and Airbnb reported decisions 50 times faster. In practice, a few hundred to a few thousand sessions often suffice to detect a winner when the effect size is a 1 to 3 percent improvement in ranking quality. At 10,000 queries per second, this translates to results in hours or a couple of days rather than weeks.
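As a rough illustration of why so few sessions are needed, here is a normal-approximation sample size calculation; the per query standard deviation of 0.3 and the 0.02 detectable shift are assumptions made for this sketch, not figures reported by either company:

```python
import numpy as np
from scipy import stats

alpha, power = 0.05, 0.80   # two sided significance level and target power
sigma = 0.30                # assumed std dev of per query preference scores
delta = 0.02                # detectable shift in mean preference, 0.50 -> 0.52

z_alpha = stats.norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = stats.norm.ppf(power)            # ~0.84
n = ((z_alpha + z_beta) * sigma / delta) ** 2
print(f"queries needed ≈ {int(np.ceil(n)):,}")   # ≈ 1,766 under these assumptions
```

Under these assumptions the requirement lands in the low thousands of queries, consistent with the range quoted above, while a traditional A/B test on a downstream metric with much higher between user variance would need the 50 to 100 times larger samples the sources describe.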
💡 Key Takeaways
• Preference margin is the fraction of competitive clicks won by treatment, tested against a null hypothesis of 0.5 using a one sample t test or Z test
• Competitive clicks exclude neutral items where the models agreed, focusing statistical power on actual ranking differences and reducing noise
• Within user comparison eliminates between user variance in engagement propensity, enabling 50 to 100 times sample efficiency over A/B testing
• Thumbtack validated that 400 interleaving samples matched A/B outcomes that required 40,000 samples, with 90 percent agreement on ranking quality
• Airbnb uses competitive pairs attribution for downstream events like bookings, crediting only items with different ranks to reduce variance by 30 to 40 percent
• Typical experiments reach 80 percent power with a few hundred to a few thousand sessions, translating to hours or days at moderate traffic instead of weeks
📌 Examples
Netflix recommendation ranking runs interleaving with per session preference computed as treatment plays divided by total competitive plays, testing mean preference across 2,000 sessions against 0.5 baseline to detect 2 percent engagement lift in 3 days
Google Search uses one sample Z test on click preferences, declaring winner when p value drops below 0.05 or after 5 days maximum duration, whichever comes first
Airbnb evaluates first click, last click, and all clicks attribution methods for multi query sessions, selecting the method most consistent with A/B test outcomes, at 82 percent agreement