Interleaving Failure Modes and Edge Cases
HIGH OVERLAP PROBLEM
When both models produce nearly identical rankings, most slots become neutral (no team credit). If the top-10 lists match on 80% of queries, you get competitive signal from only 20% of slots, which sharply reduces the effective sample size. Monitor competitive coverage: the fraction of slots that are non-neutral. If coverage drops below 30%, plan for 3-5x more samples or a longer experiment duration.
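The coverage check above can be sketched as a small helper. This is a hypothetical implementation, assuming rankings are lists of item ids and that a slot counts as neutral when both models place the same item at the same position:

```python
def competitive_coverage(ranking_a, ranking_b, k=10):
    """Fraction of the top-k slots where the two models disagree.

    A slot is 'neutral' when both models put the same item in the
    same position; neutral slots carry no team credit. Assumed
    sketch: ranking_a and ranking_b are lists of item ids.
    """
    disagreements = sum(
        1 for a, b in zip(ranking_a[:k], ranking_b[:k]) if a != b
    )
    return disagreements / k

# Identical top-k lists yield zero competitive coverage.
print(competitive_coverage(list(range(10)), list(range(10))))  # → 0.0
```

Averaging this value across queries gives the coverage metric to alarm on; below roughly 0.3, the experiment needs more traffic.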
POSITION BIAS RESIDUALS
Users click higher positions more often regardless of item quality: position 1 gets 5-10x more clicks than position 5. If one model systematically wins the coin flip for top positions, it will appear better even when the models are equal. Monitor first-position balance: each team should start first in approximately 50% of rounds, within a 2% tolerance.
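A minimal sketch of that balance monitor, assuming you log which team drafted first per round as counts keyed "A" and "B" (the helper name and input shape are illustrative, not from the original):

```python
def first_position_balanced(first_counts, tolerance=0.02):
    """Check that Team A starts first in ~50% of rounds.

    first_counts: assumed dict like {"A": 4980, "B": 5020} counting
    rounds where each team drafted position 1. The 2% tolerance
    mirrors the guidance above.
    """
    total = sum(first_counts.values())
    frac_a = first_counts["A"] / total
    return abs(frac_a - 0.5) <= tolerance

print(first_position_balanced({"A": 4980, "B": 5020}))  # → True
print(first_position_balanced({"A": 4000, "B": 6000}))  # → False
```

In practice you would pair this with a binomial test at large sample sizes, since a fixed 2% band is loose when rounds number in the millions.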
SLATE-LEVEL CONSTRAINTS
Models that optimize for diversity (e.g. showing 3 different content types) or fairness (balancing across item categories) break under interleaving. The blended list destroys intended slate-level properties. A diversity-optimized model might guarantee 3 genres in its top 10, but after interleaving with an engagement-optimized model you might get 5 items from one genre. The diversity model looks worse because its constraint was violated before the user ever saw the slate.
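To see why, consider how the blended list is built. The sketch below is one assumed variant of team-draft interleaving (a per-round coin flip decides which team drafts first; each team adds its highest-ranked item not already in the blend). Nothing in the draft inspects genres or categories, so neither model's slate-level constraint survives:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, rng=None):
    """Assumed team-draft interleaving sketch: each round, a coin
    flip picks the drafting order, then each team contributes its
    highest-ranked item not already in the blended list."""
    rng = rng or random.Random(0)
    blended, used = [], set()
    ia, ib = 0, 0
    while len(blended) < k and (ia < len(ranking_a) or ib < len(ranking_b)):
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            if len(blended) >= k:
                break
            src = ranking_a if team == "A" else ranking_b
            i = ia if team == "A" else ib
            while i < len(src) and src[i] in used:
                i += 1  # skip items the other team already drafted
            if i < len(src):
                blended.append((src[i], team))
                used.add(src[i])
                i += 1
            if team == "A":
                ia = i
            else:
                ib = i
    return blended
```

The draft only respects per-model rank order, item by item; any property defined over the whole top-k (genre mix, category balance) is an accident of the blend.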
SESSION ATTRIBUTION LEAKAGE
In multi-query sessions, the same item may appear under different teams across queries: the user sees item X as Team A in search, then as Team B in recommendations. If they click, which team gets credit? First-click attribution credits the initial exposure. Competitive-pairs attribution credits only items where the models disagreed on rank, reducing variance by 30-40%.
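Both attribution rules can be sketched as small helpers. These are hypothetical implementations, assuming exposures are logged in order as (item, team) pairs and that per-item ranks from each model are available at click time:

```python
def first_click_credit(exposures, clicked_item):
    """First-click attribution: credit the team under which the
    clicked item was first exposed in the session.

    exposures: assumed ordered list of (item, team) pairs."""
    for item, team in exposures:
        if item == clicked_item:
            return team
    return None

def competitive_pairs_credit(clicks, rank_a, rank_b):
    """Competitive-pairs attribution: count a click only if the two
    models disagreed on the clicked item's rank.

    clicks: assumed list of (item, team) click events;
    rank_a / rank_b: assumed dicts mapping item -> rank."""
    credits = {"A": 0, "B": 0}
    for item, team in clicks:
        if rank_a.get(item) != rank_b.get(item):
            credits[team] += 1
    return credits

# Item "x" shown as Team A first, Team B later: A gets the credit.
print(first_click_credit([("x", "A"), ("x", "B")], "x"))  # → A
```

Discarding agreed-rank clicks is what drives the variance reduction: clicks on items both models ranked identically carry no preference information, so dropping them removes pure noise.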