Statistical Analysis and Preference Margins
THE PREFERENCE MARGIN
Interleaving analysis estimates a preference margin: the proportion of competitive engagements won by the treatment model. For each query or session, count how many clicks landed on treatment-team items versus control-team items. Neutral items (where both models agreed on the placement) are excluded. The preference margin is: treatment_clicks / (treatment_clicks + control_clicks).
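A minimal sketch of this computation, assuming per-session click counts have already been credited to teams and neutral clicks dropped upstream (the session structure here is hypothetical):

```python
# Hypothetical per-session counts of competitive clicks credited to each team.
# Clicks on neutral items (results both rankers agreed on) are excluded upstream.
sessions = [
    {"treatment_clicks": 3, "control_clicks": 1},
    {"treatment_clicks": 0, "control_clicks": 2},
    {"treatment_clicks": 1, "control_clicks": 1},
]

def preference_margin(sessions):
    """Pooled preference margin: treatment share of all competitive clicks."""
    t = sum(s["treatment_clicks"] for s in sessions)
    c = sum(s["control_clicks"] for s in sessions)
    return t / (t + c)

print(preference_margin(sessions))  # 4 / 8 = 0.5
```

A margin above 0.5 favors the treatment model; exactly 0.5 means the two models split the competitive clicks evenly.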
STATISTICAL TESTING
The null hypothesis is that both models are equally good, meaning the expected preference margin is 0.5. You run a one-sample t-test or z-test comparing the observed mean preference across sessions against 0.5. A p-value below 0.05 indicates a statistically significant preference for one model over the other.
For binary outcomes (e.g., which side won a session's competitive clicks), you can use a binomial test. For continuous outcomes (fraction of clicks per session), use a t-test. Both approaches are valid; the t-test is more common because it handles variable engagement rates better.
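For the binary case, an exact two-sided binomial sign test needs only `math.comb`. A sketch, assuming tied sessions are dropped before counting (the win counts below are illustrative):

```python
from math import comb

def binomial_test_half(wins, n):
    """Exact two-sided binomial test of H0: P(treatment wins a session) = 0.5.

    wins: sessions where the treatment side got more competitive clicks;
    n counts only decided sessions (ties dropped beforehand).
    """
    # Under p = 0.5 the distribution is symmetric, so the two-sided p-value
    # doubles the tail at least as extreme as the observed count.
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(binomial_test_half(14, 20))  # treatment won 14 of 20 decided sessions
```

Here 14 wins out of 20 gives p ≈ 0.115, not significant at the 0.05 level despite a 70% win rate, which illustrates why interleaving analyses still need hundreds of competitive sessions.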
SAMPLE SIZE AND POWER
Interleaving typically reaches 80% statistical power with 400-2,000 competitive sessions, compared to 20,000-50,000 for A/B testing. This 50-100x efficiency gain comes from eliminating between-user variance: one user who clicks 10 times and another who clicks once both contribute equally to the preference estimate.
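A back-of-the-envelope session count can be derived from the usual normal-approximation power formula, using the worst-case Bernoulli variance 0.25 at the null margin of 0.5 (the detectable deltas below are illustrative, not from the source):

```python
import math
from statistics import NormalDist

def sessions_needed(delta, alpha=0.05, power=0.80):
    """Sessions needed to detect a preference margin of 0.5 + delta
    with a two-sided test, under the normal approximation."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_b = nd.inv_cdf(power)           # ~0.84 for 80% power
    # Worst-case variance of a Bernoulli outcome at p = 0.5 is 0.25.
    n = ((z_a + z_b) ** 2 * 0.25) / delta ** 2
    return math.ceil(n)

print(sessions_needed(0.05))  # detect a 55/45 split: 785 sessions
print(sessions_needed(0.03))  # detect a 53/47 split: 2181 sessions
```

Deltas of 0.03-0.05 land squarely in the 400-2,000 session range quoted above; an A/B test chasing a 1-2% metric lift against full between-user variance needs far more traffic.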
ATTRIBUTION METHODS
In multi-query sessions, the same item might appear under different teams. First-click attribution credits only the initial interaction. Last-click attribution credits the final interaction before conversion. All-clicks attribution counts every interaction. Empirically, first-click attribution tends to correlate best with A/B test outcomes, showing 80-85% agreement on winner direction.
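The three schemes can be sketched over a per-session click log. This is a simplified illustration (the log format is hypothetical, and "last" here means last click in the session rather than last before a tracked conversion):

```python
# Hypothetical click log for one multi-query session: each click records
# which team's list the item was credited to at click time, in order.
clicks = ["treatment", "control", "treatment", "treatment"]

def attribute(clicks, method="first"):
    """Credit clicks to teams under first-, last-, or all-clicks attribution."""
    if not clicks:
        return {"treatment": 0, "control": 0}
    if method == "first":
        credited = clicks[:1]    # only the initial interaction
    elif method == "last":
        credited = clicks[-1:]   # only the final interaction
    elif method == "all":
        credited = clicks        # every interaction
    else:
        raise ValueError(f"unknown attribution method: {method}")
    return {team: credited.count(team) for team in ("treatment", "control")}

print(attribute(clicks, "first"))  # {'treatment': 1, 'control': 0}
print(attribute(clicks, "all"))    # {'treatment': 3, 'control': 1}
```

Note how the choice changes the session's verdict: first-click attribution gives the whole session to the treatment side, while all-clicks attribution records a 3-1 split.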