
What is Interleaving for Ranking Models?

Definition
Interleaving is an online evaluation technique that compares two ranking models by blending their outputs into a single list shown to users. Instead of splitting traffic between control and treatment groups (like A/B testing), interleaving shows both models to the same user simultaneously and measures which one wins more user engagement.

THE CORE PROBLEM

Traditional A/B testing for ranking models is painfully slow. You need to split traffic, run for 2-4 weeks, and collect tens of thousands of samples to detect small ranking improvements. The reason: between-user variance dominates. User A clicks 10 times per session while User B clicks once, and that variation drowns out the signal from your ranking change.

With interleaving, each user serves as their own control. Both models contribute to the same result list, so you are comparing Model A versus Model B within the same user session. This eliminates between-user variance entirely, reducing the required sample size by 50-100x.
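To see why pairing helps, here is a hypothetical toy simulation (numbers and setup are illustrative, not from the article): users have very different base click rates, and the new model adds only a small lift. Comparing per-session click differences across separate user groups (as in A/B testing) versus within the same user (as interleaving effectively does) exposes the variance gap.

```python
import random
import statistics

random.seed(0)

# Toy setup (hypothetical numbers): each user has a base click
# probability drawn from a wide range, and the treatment model adds
# only a small +0.02 lift per impression.
def session_clicks(base_rate, lift=0.0, impressions=20):
    return sum(random.random() < base_rate + lift for _ in range(impressions))

n = 2000
base_rates = [random.uniform(0.05, 0.5) for _ in range(n)]

# A/B split: treatment and control clicks come from DIFFERENT users,
# so between-user variation enters the difference.
ab_diffs = [session_clicks(u, lift=0.02) - session_clicks(v)
            for u, v in zip(base_rates[: n // 2], base_rates[n // 2:])]

# Paired (interleaving-style): both measurements come from the SAME
# user, so the shared base rate cancels out of the difference.
paired_diffs = [session_clicks(u, lift=0.02) - session_clicks(u)
                for u in base_rates]

print(statistics.pvariance(ab_diffs), statistics.pvariance(paired_diffs))
```

In this toy run the paired differences have markedly lower variance, which is exactly the mechanism behind the smaller sample sizes claimed above.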

HOW IT WORKS AT A HIGH LEVEL

For each user query, run both rankers and merge their outputs using an algorithm like Team Draft. Each item in the blended list is tagged with which model proposed it. When the user clicks or engages, credit goes to that model. After collecting enough sessions (typically hundreds to a few thousand), test whether one model wins significantly more than 50% of competitive engagements.
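The merge-and-credit steps above can be sketched as follows (a minimal Team-Draft implementation; function and variable names are my own):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Blend two rankings with Team-Draft interleaving.

    Each round, a coin flip decides which model drafts first; each
    model then contributes its highest-ranked item not already shown.
    Returns the blended list and a map item -> owning model.
    """
    a, b = list(ranking_a), list(ranking_b)  # don't mutate callers
    blended, team = [], {}
    while len(blended) < k and (a or b):
        first = random.choice(['A', 'B'])
        for model in (first, 'B' if first == 'A' else 'A'):
            ranking = a if model == 'A' else b
            # Skip items the other team already placed in the list.
            while ranking and ranking[0] in team:
                ranking.pop(0)
            if ranking and len(blended) < k:
                item = ranking.pop(0)
                blended.append(item)
                team[item] = model
    return blended, team

def credit_clicks(clicked_items, team):
    """Attribute each click to the model whose team placed the item."""
    wins = {'A': 0, 'B': 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins
```

A session then counts as a win for whichever model received more credited clicks; sessions with equal credit (including zero clicks) are ties and carry no signal.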

💡 Key Insight: Interleaving converts comparison from two independent samples (A/B) to paired comparison (same user, same query), dramatically increasing statistical power.
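The paired comparison can be tested with an exact sign test: drop tied sessions, then ask whether one model's win count is consistent with a fair 50/50 coin. This is one simple option, not necessarily what any particular platform uses:

```python
from math import comb

def sign_test_p_value(a_session_wins, b_session_wins):
    """Exact two-sided binomial sign test.

    Under the null, each non-tied session is a fair coin flip
    between model A and model B. Tied sessions must be dropped
    before calling this.
    """
    n = a_session_wins + b_session_wins
    k = max(a_session_wins, b_session_wins)
    # P(X >= k) for X ~ Binomial(n, 1/2), doubled for two-sidedness.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, 8 wins out of 10 non-tied sessions gives p ≈ 0.109 (not yet significant), while 70 out of 100 gives p well below 0.001.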
💡 Key Takeaways
Blends outputs from two ranking models into one list per user, eliminating the between-user variance that slows A/B testing
Each user serves as their own control, reducing required sample size by 50-100x compared to traditional A/B tests
Determines relative preference (which model is better) but not absolute impact (how much CTR improves)
Typically reaches statistical significance in 2-5 days with hundreds to thousands of sessions instead of weeks
📌 Interview Tips
1. When asked about ranking model evaluation, explain that interleaving provides 50-100x faster results than A/B testing by using paired comparison within the same user session
2. If discussing sample efficiency, mention that what takes 40,000 samples in A/B testing can be detected with 400 samples using interleaving
3. Show depth by explaining that interleaving reveals preference but not magnitude, so you still need A/B testing for absolute metric impact