Core Concept
Offline metrics evaluate models on historical data before deployment, enabling fast iteration with no user impact. But offline success does not guarantee online success: the gap between offline and online performance is one of the hardest problems in recommendation systems.
Why Offline and Online Diverge
Selection bias: Offline data only contains items users actually saw. If the old model never showed item X to user Y, you have no signal for that pair. Your new model might rank X highly for Y, but you cannot verify it offline.
Position bias: Users click position 1 more than position 10 regardless of relevance. Offline data does not separate "clicked because relevant" from "clicked because visible." Models trained on this data inherit the bias.
Mitigating the Gap
Randomized data collection: Reserve 5-10% of traffic for random or uniformly sampled recommendations. This exploration traffic produces unbiased data for offline evaluation and model comparison. It hurts short-term engagement but improves model quality long-term.
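A minimal sketch of the traffic split, assuming a slate size of 10 and an illustrative `choose_slate` helper (the function and parameter names are not from any specific system). The key point is that uniform sampling gives every item a known, equal propensity, which is what makes the logged data usable for unbiased evaluation:

```python
import random

def choose_slate(ranked_items, explore_rate=0.05, k=10):
    """Route a small slice of traffic to uniform-random recommendations.

    Returns (slate, propensity); propensity is known and equal for the
    exploration slice, None for the model-ranked slice (log it separately
    if the model's serving probabilities are available).
    """
    if random.random() < explore_rate:
        # Exploration: every item has propensity 1/N, so clicks from
        # this slice are unbiased by the old model's choices.
        slate = random.sample(ranked_items, k=min(k, len(ranked_items)))
        propensity = 1.0 / len(ranked_items)
    else:
        # Exploitation: serve the model's top-k as usual.
        slate = ranked_items[:k]
        propensity = None
    return slate, propensity
```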
Inverse propensity weighting: Weight each offline example by 1/(probability it was shown). Items shown rarely get high weight, correcting for selection bias. Requires logging which items were candidates, not just which were shown.
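The weighting itself can be sketched in a few lines, assuming each logged event carries the probability the logging policy showed that item (the event format here is an illustrative assumption):

```python
def ipw_estimate(events):
    """Inverse-propensity-weighted estimate of a click metric.

    events: iterable of (clicked: bool, propensity: float) pairs, where
    propensity is the probability the logging policy showed the item.
    """
    total = 0.0
    for clicked, propensity in events:
        weight = 1.0 / propensity  # rarely shown items get high weight
        total += weight * (1.0 if clicked else 0.0)
    return total / len(events)
```

In practice propensities near zero blow up the variance, so production systems typically clip the weights; that detail is omitted here.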
✅ Best Practice: Never deploy based on offline metrics alone; always A/B test. A model with 5% higher NDCG offline might show 0% improvement online, or even regress. The strength of the offline-online correlation varies by system, so track and quantify it for your specific use case.
✓ Production evaluation scale: hundreds of millions to billions of predictions per sweep; nightly distributed compute; bootstrap confidence intervals to detect sub-1% changes
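A minimal percentile-bootstrap sketch for the confidence intervals mentioned above, using only the standard library (the resample count and seed are illustrative; real sweeps would run this distributed over far larger samples):

```python
import random

def bootstrap_ci(values, n_resamples=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement and record the resample mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

A sub-1% metric delta is trustworthy only if the intervals of the two models do not overlap at the chosen alpha.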
✓ Position bias correction mandatory: clicks are biased by rank; debias with inverse propensity weighting (reweight by 1/examination probability), randomized interleaving, or unbiased logging via randomized slots
✓ Sparse ground truth handling: macro-average across users with a minimum of 3-5 positives; exclude cold-start users or report user coverage separately (fraction with at least 1 relevant item); avoid volatile per-user scores
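The macro-averaging rule above can be sketched as follows, assuming per-user metric scores and positive counts are already computed (the function name and threshold default are illustrative):

```python
def macro_average(per_user_scores, per_user_positives, min_positives=3):
    """Macro-average a metric over users with enough positives.

    Users below min_positives are excluded from the average but counted
    in coverage, so sparse users don't add noise yet aren't hidden.
    Returns (macro_score, coverage).
    """
    kept = [score for score, n_pos in zip(per_user_scores, per_user_positives)
            if n_pos >= min_positives]
    coverage = len(kept) / len(per_user_scores)
    macro = sum(kept) / len(kept) if kept else float("nan")
    return macro, coverage
```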
✓ Temporal staleness: labels older than 4 weeks overestimate performance on fresh content; stratify by content age (0-7 days, 7-30 days, 30+ days); use rolling 7-28 day windows
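The age stratification above can be sketched as a simple bucketing pass; the (score, age_days) example structure is an assumption for illustration:

```python
def stratify_by_age(examples):
    """Report a metric separately per content-age bucket.

    examples: iterable of (score, age_days) pairs. Buckets follow the
    0-7 / 7-30 / 30+ day split; empty buckets report None.
    """
    buckets = {"0-7d": [], "7-30d": [], "30d+": []}
    for score, age_days in examples:
        if age_days < 7:
            buckets["0-7d"].append(score)
        elif age_days < 30:
            buckets["7-30d"].append(score)
        else:
            buckets["30d+"].append(score)
    return {name: (sum(vals) / len(vals) if vals else None)
            for name, vals in buckets.items()}
```

A model that looks strong overall but weak in the 0-7 day bucket will underperform online wherever fresh content dominates.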
✓ Offline-to-online correlation: expect 10%-30% of offline wins to fail online due to bias, covariate shift, or seasonality; use offline metrics for filtering, then validate the top 2-3 models in A/B tests with business metrics
✓ Metric gaming: optimizing a 30-second watch threshold incentivizes clickbait; add negative signals (early exit, hide, dislike) and multiple engagement thresholds as guardrails
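A minimal guardrail check for the negative signals above; the metric names and threshold values are illustrative assumptions, not standard values:

```python
def passes_guardrails(metrics, thresholds=None):
    """Check negative-signal guardrails before trusting an engagement win.

    metrics: dict of observed rates, e.g. {"hide_rate": 0.01}.
    Returns (ok, violations) so the launch decision can surface which
    guardrail failed rather than a bare boolean.
    """
    thresholds = thresholds or {
        "early_exit_rate": 0.15,  # fraction leaving almost immediately
        "hide_rate": 0.02,
        "dislike_rate": 0.01,
    }
    violations = [name for name, limit in thresholds.items()
                  if metrics.get(name, 0.0) > limit]
    return len(violations) == 0, violations
```

A candidate model must clear every guardrail in addition to beating the primary engagement metric, which blunts the clickbait incentive.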