What is Interleaving for Ranking Models?
THE CORE PROBLEM
Traditional A/B testing for ranking models is painfully slow: you split traffic, run for 2-4 weeks, and collect tens of thousands of samples to detect small ranking improvements. The reason is that between-user variance dominates. User A clicks 10 times per session while User B clicks once, and this variation drowns out the signal from your ranking change.
With interleaving, each user serves as their own control. Both models contribute to the same result list, so you are comparing Model A against Model B within the same user session. This removes between-user variance from the comparison entirely, reducing the required sample size by roughly 50-100x.
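To see why pairing within a session helps, here is a toy simulation (the numbers are illustrative assumptions, not from any real system): each user has a personal baseline click count, and one model adds a small constant lift. In an A/B split, each user sees only one model, so the user baselines inflate the spread of the group difference; in a paired, interleaving-style comparison, the same user sees both models and the baseline cancels when you subtract.

```python
import random
import statistics

random.seed(42)

# Toy model: each user has a personal baseline click count per session
# (wide spread across users), and "Model B" adds a small constant lift.
users = [random.gauss(5, 3) for _ in range(2000)]  # high between-user variance
lift = 0.2

# A/B split: each user sees only one model, so the user's baseline
# stays inside the between-group noise.
group_a = [u + random.gauss(0, 1) for u in users[:1000]]
group_b = [u + lift + random.gauss(0, 1) for u in users[1000:]]

# Paired (interleaving-style) comparison: the SAME user sees both
# models, so subtracting cancels the user's baseline exactly.
paired_diffs = [(u + lift + random.gauss(0, 1)) - (u + random.gauss(0, 1))
                for u in users]

print(statistics.stdev(group_a))       # dominated by user spread (~3)
print(statistics.stdev(paired_diffs))  # only per-session noise remains (~1.4)
```

Under these toy numbers the paired differences have roughly half the standard deviation of a single A/B arm, which is what drives the large reduction in required sample size.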
HOW IT WORKS AT A HIGH LEVEL
For each user query, run both rankers and merge their outputs with an algorithm such as Team Draft interleaving. Each item in the blended list is tagged with the model that proposed it, and when the user clicks or engages, credit goes to that model. After collecting enough sessions (typically hundreds to a few thousand), test whether one model wins significantly more than 50% of the sessions where the two models earned unequal credit.
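The merge, credit, and test steps above can be sketched as follows. This is a minimal illustration rather than a production implementation: the function names are my own, and a real system would also handle deduplication policy, position bias, and tie-breaking conventions that the source does not specify.

```python
import random
from collections import Counter
from math import comb

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Team Draft: the model with fewer picks goes next (coin flip on
    ties); each slot remembers which model contributed it."""
    interleaved, teams, seen = [], [], set()
    counts = Counter()
    while len(interleaved) < length:
        if counts["A"] == counts["B"]:
            turn = random.choice("AB")
        else:
            turn = "A" if counts["A"] < counts["B"] else "B"
        source = ranking_a if turn == "A" else ranking_b
        item = next((x for x in source if x not in seen), None)
        if item is None:
            # This ranker is exhausted; take from the other one instead.
            turn = "B" if turn == "A" else "A"
            other = ranking_a if turn == "A" else ranking_b
            item = next((x for x in other if x not in seen), None)
            if item is None:
                break  # both rankers exhausted
        seen.add(item)
        interleaved.append(item)
        teams.append(turn)
        counts[turn] += 1
    return interleaved, teams

def session_winner(teams, clicked_positions):
    """Credit each click to the model that placed that result."""
    credit = Counter(teams[p] for p in clicked_positions)
    if credit["A"] == credit["B"]:
        return None  # tied session: excluded from the test
    return "A" if credit["A"] > credit["B"] else "B"

def sign_test_p(wins_a, wins_b):
    """Two-sided exact binomial test of the null P(A wins) = 0.5,
    computed over non-tied sessions only."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)
```

A typical flow: call `team_draft_interleave` per query to build the blended list, log `teams` alongside it, score each session with `session_winner`, and after enough sessions run `sign_test_p` on the win counts; a small p-value means one ranker is winning significantly more than half of the non-tied sessions.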