A/B Testing & Experimentation • Interleaving for Ranking ModelsMedium⏱️ ~2 min
Team Draft Interleaving Algorithm
Team draft interleaving constructs the blended result list through a round-based selection process that ensures fairness and minimizes position bias. In each round, both models propose their highest-ranked item not yet included in the output. A coin flip determines which model gets to place its item first for that round. If both models propose the identical item, it is included once but marked as neutral, meaning no team gets credit for interactions with it. If the models propose different items, both items are added to the list with their respective team labels. The process continues until the desired number of results K is reached, typically 10 to 50 items for search or recommendation systems.
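To make the round-by-round construction concrete, here is a minimal Python sketch of the merge under simple assumptions: each ranker's output is a pre-sorted list of item ids, and the function name, the coin flip via the random module, and the 'A'/'B'/'N' team labels are illustrative choices rather than a reference implementation from any particular system.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two ranked lists of item ids into one list of length k.

    Returns (items, teams) where teams[i] is 'A', 'B', or 'N'
    (neutral: both models proposed the same item that round).
    """
    merged, teams, seen = [], [], set()
    pos_a = pos_b = 0  # next unconsumed index in each ranking

    def next_unseen(ranking, pos):
        # Skip items already placed in the merged list.
        while pos < len(ranking) and ranking[pos] in seen:
            pos += 1
        return pos

    while len(merged) < k:
        pos_a = next_unseen(ranking_a, pos_a)
        pos_b = next_unseen(ranking_b, pos_b)
        cand_a = ranking_a[pos_a] if pos_a < len(ranking_a) else None
        cand_b = ranking_b[pos_b] if pos_b < len(ranking_b) else None
        if cand_a is None and cand_b is None:
            break  # both rankings exhausted before reaching k

        if cand_a is not None and cand_a == cand_b:
            # Both models propose the same item: include once, no team credit.
            merged.append(cand_a); teams.append('N'); seen.add(cand_a)
            continue

        # Coin flip decides which model places its item first this round.
        order = [('A', cand_a), ('B', cand_b)]
        if random.random() < 0.5:
            order.reverse()
        for team, item in order:
            if item is not None and len(merged) < k:
                merged.append(item); teams.append(team); seen.add(item)

    return merged, teams
```

Because each ranking is only scanned forward past already-placed items, the merge works on item ids rather than copying item payloads, which is consistent with the sub-millisecond overhead described below.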
The coin flip per round is critical for eliminating position bias. Without randomization, the model that always goes first would systematically benefit from the higher click-through rates at top positions. By flipping a coin each round, both models have equal expected representation at every rank position across many queries. The neutral item handling when both models agree serves two purposes. First, it prevents double counting the same item. Second, it focuses attribution on the competitive slots where models actually differ, which is where the preference signal lives.
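As a hedged illustration of the attribution step, the sketch below counts clicks per team for one query using the per-slot team labels from the merge above, skipping neutral slots, and then aggregates per-query winners into an overall preference. The function names and the tie-handling choice (ignoring tied queries) are assumptions, not a prescribed scoring rule.

```python
def score_query(teams, clicked_slots):
    """Count clicks credited to each team for one interleaved query.

    `teams` is the per-slot label list ('A', 'B', or 'N') from the merge;
    `clicked_slots` is the set of slot indices the user clicked.
    Neutral ('N') slots earn no credit for either team.
    """
    credit = {'A': 0, 'B': 0}
    for slot in clicked_slots:
        if teams[slot] in credit:  # skip 'N'
            credit[teams[slot]] += 1
    return credit

def preference_for_b(per_query_credits):
    """Fraction of decided queries won by model B; tied queries are ignored."""
    wins_b = sum(1 for c in per_query_credits if c['B'] > c['A'])
    wins_a = sum(1 for c in per_query_credits if c['A'] > c['B'])
    decided = wins_a + wins_b
    return wins_b / decided if decided else 0.5
```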
In production systems at 10,000 queries per second with 120-millisecond median latency budgets, the team draft merge operation is remarkably efficient. It runs in linear time proportional to K and involves mostly pointer operations rather than item copying. Implementation typically adds under 1 millisecond to request latency. The real cost comes from running two rankers instead of one, which can double model inference time. Teams mitigate this by caching shared features between models, gating interleaving to a traffic subset, or optimizing model-serving infrastructure to keep added latency under 10 milliseconds total.
💡 Key Takeaways
• Each round both models propose their highest-ranked item not yet shown, a coin flip decides which places first, eliminating systematic position bias
• When both models propose the same item it is included once as neutral with no team credit, focusing attribution on competitive differences
• Algorithm runs in linear time O(K) with mostly pointer operations, adding under 1 millisecond to request latency for typical K of 10 to 50 items
• Real cost is dual ranker inference which can double compute, mitigated by feature caching to keep total added latency under 10 milliseconds
• Balanced variants ensure equal expected representation at each rank, critical when click propensity drops steeply from position 1 to 10
• Logging must capture query id, item id, slot index, team label, neutral flag, and all engagement signals for downstream attribution, as in the sketch after this list
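The logging takeaway is easiest to see as a record layout. Below is a minimal sketch of one per-slot log row; the field names, the dataclass form, and the single `clicked` engagement flag are illustrative assumptions rather than a schema from any specific system.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class InterleavingSlotLog:
    query_id: str     # request / query identifier
    item_id: str      # item shown in this slot
    slot_index: int   # 0-based position in the blended list
    team: str         # 'A', 'B', or 'N'
    is_neutral: bool  # True when both models proposed the item
    clicked: bool     # engagement signal; extend with dwell, conversion, etc.

# One row per displayed slot, emitted at impression time and joined
# with engagement events downstream for team attribution.
row = InterleavingSlotLog("q-123", "item-42", 0, "B", False, True)
print(json.dumps(asdict(row)))
```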
📌 Examples
Meta News Feed ranking runs team draft on each user request, flipping coins per round to decide whether current or candidate model places first, then attributes likes and clicks to teams
Airbnb search interleaving uses competitive pairs variant, deduplicating identical items across lists and only attributing interactions on items with different ranks between models
Google Search maintains per-query first-position balance within 2 percent across experiments, monitoring that team A and team B each start first in about 50 percent of rounds