Interleaving for Ranking Models
What is Interleaving for Ranking Models?
Interleaving is an online evaluation technique that compares two ranking models by blending their outputs into a single list shown to the same user. Instead of splitting traffic between control and treatment groups like traditional A/B testing, interleaving runs both rankers on every request and constructs a combined result list through a round-based selection process. In each round, both models propose their highest-ranked item not yet shown, a coin flip determines which model places first to neutralize position bias, and user interactions are attributed back to whichever model contributed each item.
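The round-based draft is easiest to see in code. Below is a minimal Python sketch of team-draft interleaving matching the description above; the function names, the fixed list length `k`, and the dict-based click attribution are illustrative choices, not any particular company's implementation.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Blend two ranked lists into one interleaved list of length k.

    Each round, a coin flip decides which model drafts first; each
    model then contributes its highest-ranked item not already shown.
    Returns the combined list plus a map attributing every shown item
    to the model ("A" or "B") that placed it.
    """
    interleaved, team, shown = [], {}, set()

    def next_unshown(ranking):
        # The model's highest-ranked item not yet in the combined list.
        return next((item for item in ranking if item not in shown), None)

    while len(interleaved) < k:
        # Per-round coin flip keeps either model from owning the top slot.
        order = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(order)
        progressed = False
        for name, ranking in order:
            if len(interleaved) >= k:
                break
            item = next_unshown(ranking)
            if item is None:
                continue  # this model's list is exhausted
            shown.add(item)
            interleaved.append(item)
            team[item] = name
            progressed = True
        if not progressed:
            break  # both lists exhausted before reaching k
    return interleaved, team

def score_session(team, clicked_items):
    """Credit each click back to the model that contributed the item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins
```

A session then counts toward whichever model collects more credited clicks, which is what feeds the preference test described next.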
The power of interleaving comes from converting a two-sample statistical problem into a one-sample preference test. A traditional A/B test estimates two separate noisy distributions (the control click-through rate, or CTR, and the treatment CTR) and compares them. Interleaving estimates a single preference margin against a 0.5 baseline, which dramatically reduces variance. Since there is no traffic split, you use 100 percent of users and extract signal even from users who would engage under both rankings.
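To make the one-sample framing concrete: each session's credited clicks produce a winner, tied sessions are discarded, and the fraction of sessions won by model A is tested against 0.5. Here is a sketch using the normal approximation to the binomial; the function name and the example counts are made up for illustration.

```python
import math

def interleaving_preference_test(wins_a, wins_b):
    """Test the observed preference for model A against the 0.5 baseline.

    wins_a / wins_b count interleaved sessions in which model A / B
    received more credited clicks (tied sessions are dropped).
    Normal approximation to the binomial; a sketch, not a stats library.
    """
    n = wins_a + wins_b
    p_hat = wins_a / n                 # observed preference for A
    se = math.sqrt(0.25 / n)           # standard error under H0: p = 0.5
    z = (p_hat - 0.5) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_hat, z, p_value

# Illustrative: 230 of 400 decided sessions favor A -> p_hat 0.575, z 3.0
print(interleaving_preference_test(230, 170))
```

Because there is only one estimated quantity rather than two noisy group means, the standard error shrinks and far fewer sessions are needed to reach significance.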
In production, this translates into extraordinary sensitivity gains. Thumbtack validated that approximately 400 interleaving samples achieved 90 percent agreement with A/B test outcomes that required about 40,000 samples, a 100-times gain in sample efficiency. Airbnb reported about 50 times greater sensitivity, running experiments with only 6 percent of A/B traffic and one-third of the duration while maintaining 82 percent consistency with A/B conclusions. For ranking systems at companies like Google, Meta, Netflix, and Airbnb, this means shipping better search and recommendation models in days instead of weeks.
💡 Key Takeaways
• Interleaving blends two ranker outputs into one list per user, with a coin flip each round determining which model places first to mitigate position bias
• Converts a two-sample comparison into a one-sample preference test against a 0.5 baseline, reducing variance by using each user as their own control
• Thumbtack achieved 90 percent agreement with A/B outcomes using 400 samples versus 40,000, a 100-times gain in sample efficiency
• Airbnb reported 50 times faster sensitivity using only 6 percent of A/B traffic and one-third the duration, with 82 percent consistency
• Best for ranking tasks with similar models and modest reorderings, where iteration speed matters more than absolute metric estimation
• Requires running both rankers per request, typically adding under 10 milliseconds of latency for dual inference plus under 1 millisecond for list merging
📌 Examples
Google Search runs interleaving to compare ranking algorithm tweaks, testing features like query intent classifiers or document scoring weights with hundreds of samples instead of weeks of A/B traffic
Netflix uses interleaving for homepage recommendation ranking, where two models propose title orderings and user engagement (clicks, plays) on competitive slots determines the winner in 2 to 3 days
Airbnb search ranking tests use team-draft interleaving with competitive-pairs attribution, crediting only items that differ in rank between models to reduce variance on booking conversions, as sketched below
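A rough sketch of the competitive-pairs idea from the Airbnb example: clicks on items both rankers place at the same position say nothing about which model is better, so they are filtered out before crediting. The function names here are hypothetical, not Airbnb's actual implementation.

```python
def competitive_items(ranking_a, ranking_b):
    """Items whose rank differs between the two models' lists.

    Excluding items ranked identically by both models removes clicks
    that carry no preference signal, cutting variance without biasing
    the comparison.
    """
    rank_a = {item: pos for pos, item in enumerate(ranking_a)}
    rank_b = {item: pos for pos, item in enumerate(ranking_b)}
    return {item for item in rank_a if rank_b.get(item) != rank_a[item]}

def credited_clicks(clicks, ranking_a, ranking_b):
    """Keep only clicks that can discriminate between the two models."""
    competitive = competitive_items(ranking_a, ranking_b)
    return [item for item in clicks if item in competitive]
```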