Interleaving for Ranking Models

Statistical Analysis and Preference Margins

THE PREFERENCE MARGIN

Interleaving analysis estimates a preference margin: the proportion of competitive engagements won by the treatment model. For each query or session, count how many clicks landed on treatment team items versus control team items. Neutral items (where both models agreed) are excluded. The preference margin is: treatment_clicks / (treatment_clicks + control_clicks).
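The per-session computation can be sketched in a few lines. This is a minimal illustration, not a production pipeline; the team labels ("treatment", "control", "neutral") are assumed to come from the interleaving step itself.

```python
def preference_margin(clicked_teams):
    """Fraction of competitive clicks won by the treatment model in one session.

    clicked_teams: team label for each clicked item, in click order.
    Neutral items (both models agreed on the item) carry no signal
    and are excluded from the denominator.
    """
    treatment = sum(1 for team in clicked_teams if team == "treatment")
    control = sum(1 for team in clicked_teams if team == "control")
    competitive = treatment + control
    if competitive == 0:
        return None  # no competitive clicks; drop this session from analysis
    return treatment / competitive

# Example session: 3 competitive clicks plus one neutral click.
print(preference_margin(["treatment", "neutral", "treatment", "control"]))  # 2/3
```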

STATISTICAL TESTING

The null hypothesis is that both models are equally good, meaning the expected preference margin is 0.5. Run a one-sample t-test or z-test comparing the observed mean preference across sessions against 0.5. A p-value below 0.05 indicates a statistically significant preference for one model.
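The test against 0.5 can be sketched with the standard library alone. Here `margins` is a hypothetical list of per-session preference margins (competitive sessions only); with hundreds of sessions the normal approximation for the p-value is reasonable.

```python
import math

def z_test_against_half(margins):
    """One-sample z-test of mean per-session preference against the 0.5 null."""
    n = len(margins)
    mean = sum(margins) / n
    var = sum((m - mean) ** 2 for m in margins) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                                # standard error of the mean
    z = (mean - 0.5) / se
    # two-sided p-value from the normal approximation; Phi(x) = 0.5*(1+erf(x/sqrt(2)))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical data: 100 sessions averaging a 60% preference for treatment.
z, p = z_test_against_half([0.55, 0.65] * 50)
print(z > 0 and p < 0.05)  # clear preference for the treatment model
```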

For a binary per-session outcome (e.g., the treatment won the majority of the session's competitive clicks), you can use a binomial test. For a continuous outcome (fraction of competitive clicks per session), use a t-test. Both approaches are valid; the t-test is more common because it handles variable engagement rates better.
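The binary variant can be sketched as an exact binomial test, stdlib only. `wins` is the assumed count of sessions the treatment won out of `n` sessions with a competitive winner; under the null, the win probability is 0.5.

```python
from math import comb

def binomial_test_half(wins, n):
    """Two-sided exact binomial test against p = 0.5.

    Sums the probability of every outcome whose likelihood under the
    null is at most that of the observed outcome.
    """
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    p_obs = pmf[wins]
    return sum(p for p in pmf if p <= p_obs + 1e-12)

print(binomial_test_half(60, 100))  # ~0.057: borderline at alpha = 0.05
```

The same test is available as `scipy.stats.binomtest` if SciPy is in the stack; the hand-rolled version just makes the definition explicit.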

SAMPLE SIZE AND POWER

Interleaving typically reaches 80% statistical power with 400-2,000 competitive sessions, compared to 20,000-50,000 for A/B testing. This 50-100x efficiency comes from eliminating between-user variance: one user who clicks 10 times and another who clicks once both contribute equally to preference estimation.
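A back-of-envelope sample-size check, under the simplifying assumption that each competitive session is a Bernoulli "treatment won" outcome: the usual normal approximation gives n ≈ (z_α + z_β)² · p(1−p) / δ². The worst-case variance p(1−p) = 0.25 makes this a conservative upper bound; continuous per-session margins typically have lower variance and need fewer sessions.

```python
import math

def sessions_needed(margin, alpha_z=1.96, power_z=0.84):
    """Sessions for 80% power at two-sided alpha = 0.05 (normal approximation).

    margin: expected preference margin under the alternative, e.g. 0.55.
    """
    delta = margin - 0.5
    return math.ceil((alpha_z + power_z) ** 2 * 0.25 / delta ** 2)

print(sessions_needed(0.55))  # 784 sessions for a 55/45 preference
```

A 55/45 preference lands in the quoted 400-2,000 range; a narrow 52/48 margin pushes this binary bound to roughly 4,900, which is still far below the tens of thousands of users an A/B test would need.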

💡 Key Insight: A 2% preference margin (52% vs 48%) is detectable in days with interleaving but would take weeks with A/B testing to distinguish from noise.

ATTRIBUTION METHODS

In multi-query sessions, the same item might appear under different teams. First-click attribution credits only the initial interaction. Last-click attribution credits the final interaction before conversion. All-clicks attribution counts every interaction. Empirically, first-click attribution tends to correlate best with A/B test outcomes, showing 80-85% agreement on winner direction.
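First-click attribution can be sketched as follows, using hypothetical time-ordered log records of `(item_id, team)` pairs for one session. The key property is that an item clicked again later under a different team earns no additional credit.

```python
def first_click_attribution(click_log):
    """Credit each item only to the team it was first clicked under.

    click_log: time-ordered (item_id, team) pairs for one session.
    Returns the list of credited team labels, one per distinct item.
    """
    credited = {}
    for item_id, team in click_log:
        credited.setdefault(item_id, team)  # later clicks on the same item are ignored
    return list(credited.values())

# doc7 is clicked first under treatment, then again under control;
# only the first interaction counts.
log = [("doc7", "treatment"), ("doc7", "control"), ("doc2", "control")]
print(first_click_attribution(log))  # ['treatment', 'control']
```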

💡 Key Takeaways
- Preference margin = treatment clicks / total competitive clicks, tested against a null hypothesis of 0.5
- Neutral items (both models agreed) are excluded from attribution to focus statistical power on actual ranking differences
- Typically 400-2,000 sessions to reach 80% power, compared to 20,000-50,000 for an equivalent A/B test
- First-click attribution typically shows 80-85% agreement with A/B test outcomes on winner direction
📌 Interview Tips
1. When discussing statistical analysis, explain the formula and mention that you are testing against a 0.5 baseline with a one-sample t-test
2. Mention the attribution trade-off: first-click works best for ranking, but last-click may be better for conversion optimization
3. Show quantitative depth by citing the 50-100x sample efficiency and 80% power with hundreds to low thousands of sessions