
Production Evaluation: Scale, Debiasing, and Failure Modes

Core Concept
Offline metrics evaluate models on historical data before deployment: fast iteration, no user impact. But offline success does not guarantee online success, and the gap between offline and online performance is one of the hardest problems in recommendation systems.

Why Offline and Online Diverge

Selection bias: Offline data only contains items users actually saw. If the old model never showed item X to user Y, you have no signal for that pair. Your new model might think X is great for Y, but you cannot verify that offline.

Position bias: Users click position 1 more than position 10 regardless of relevance. Offline data does not separate "clicked because relevant" from "clicked because visible." Models trained on this data inherit the bias.
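One standard way to separate "visible" from "relevant" is to estimate an examination probability per position from randomized logs: when items are placed at random, average relevance is the same at every rank, so the drop in click-through rate by position reflects visibility alone. A minimal sketch, assuming a hypothetical log of (position, clicked) pairs from randomized traffic:

```python
from collections import defaultdict

def examination_probabilities(logged_impressions):
    """Estimate per-position examination probability from randomized logs.

    logged_impressions: (position, clicked) pairs collected while items were
    placed at positions uniformly at random. Under randomization, CTR differences
    across positions come from examination (visibility), not relevance.
    """
    clicks, views = defaultdict(int), defaultdict(int)
    for position, clicked in logged_impressions:
        views[position] += 1
        clicks[position] += int(clicked)
    ctr = {p: clicks[p] / views[p] for p in views}
    base = ctr[min(ctr)]  # treat the top position as fully examined
    return {p: ctr[p] / base for p in sorted(ctr)}

# Hypothetical logs: position 1 gets 30 clicks per 100 views, position 5 gets 6
logs = [(1, i < 30) for i in range(100)] + [(5, i < 6) for i in range(100)]
probs = examination_probabilities(logs)  # position 5 is examined ~20% as often
```

These estimated probabilities are exactly the denominators used later for inverse propensity weighting.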

Mitigating the Gap

Randomized data collection: Reserve 5-10% of traffic for random or uniformly sampled recommendations. This exploration data provides unbiased samples for offline evaluation and model comparison. It hurts short-term engagement but improves model quality long-term.
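In practice this split is usually done with deterministic hash-based bucketing so the same request always lands in the same arm. A minimal sketch, where `EXPLORE_FRACTION`, `bucket`, and the `score` callable are all hypothetical names:

```python
import hashlib
import random

EXPLORE_FRACTION = 0.05  # hypothetical: reserve 5% of requests for exploration

def bucket(request_id: str) -> str:
    """Deterministically route a request to explore or exploit traffic."""
    h = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
    return "explore" if (h % 10_000) < EXPLORE_FRACTION * 10_000 else "exploit"

def recommend(request_id, candidates, score, k=10):
    """Serve random items on explore traffic, model-ranked items otherwise.

    On explore traffic every candidate has a known show probability, which
    must be logged for later inverse propensity weighting.
    """
    if bucket(request_id) == "explore":
        return random.sample(candidates, k)
    return sorted(candidates, key=score, reverse=True)[:k]
```

Hashing on a stable ID (rather than calling a random number generator per request) keeps bucket assignment reproducible across retries and log joins.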

Inverse propensity weighting: Weight each offline example by 1/(probability it was shown). Items shown rarely get high weight, correcting for selection bias. Requires logging which items were candidates, not just which were shown.
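The idea can be sketched as an IPS estimate of a new policy's click rate from logged impressions. This is a simplified illustration with a hypothetical log schema of (user, item, clicked, propensity) tuples, not a production implementation:

```python
def ips_value(logged, new_policy_slate):
    """IPS estimate of the new policy's click rate from logged impressions.

    logged: (user, item, clicked, propensity) tuples, where propensity is the
    probability the logging policy showed `item` to `user` -- this is why you
    must log candidates and show probabilities, not just what was shown.
    new_policy_slate: user -> set of items the new model would show.
    Rarely shown items get weight 1/propensity, correcting selection bias.
    """
    total = 0.0
    for user, item, clicked, propensity in logged:
        if item in new_policy_slate[user]:
            total += clicked / propensity
    return total / len(logged)

# Hypothetical uniform logging policy over 4 items (propensity 0.25 each);
# the new policy would always show item "a", which was the item clicked.
logged = [("u1", "a", 1, 0.25), ("u1", "b", 0, 0.25),
          ("u1", "c", 0, 0.25), ("u1", "d", 0, 0.25)]
estimate = ips_value(logged, {"u1": {"a"}})  # → 1.0
```

The single click on "a" is upweighted by 1/0.25 = 4, recovering the click rate the new policy would achieve even though the logging policy showed "a" only a quarter of the time.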

✅ Best Practice: Never deploy based on offline metrics alone. Always A/B test. A model with 5% higher NDCG offline might show 0% improvement online or even regress. The correlation between offline and online varies by system. Track and quantify this correlation for your specific use case.
💡 Key Takeaways
- Production evaluation scale: hundreds of millions to billions of predictions per sweep, nightly distributed compute, bootstrap confidence intervals to detect sub-1% changes
- Position bias correction is mandatory: clicks are biased by rank; debias with inverse propensity weighting (reweight by 1/examination probability), randomized interleaving, or unbiased logging via randomized slots
- Sparse ground truth handling: macro-average across users with a minimum of 3-5 positives, exclude cold-start users or report user coverage separately (fraction with at least 1 relevant item), avoid volatile per-user scores
- Temporal staleness: labels older than 4 weeks overestimate performance on fresh content; stratify by content age (0-7 days, 7-30 days, 30+ days) and use rolling 7-28 day windows
- Offline-to-online correlation: expect 10-30% of offline wins to fail online due to bias, covariate shift, and seasonality; use offline metrics for filtering, then validate the top 2-3 models in A/B tests with business metrics
- Metric gaming: optimizing a 30-second watch threshold incentivizes clickbait; add negative signals (early exit, hide, dislike) and multiple engagement thresholds as guardrails
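Detecting sub-1% changes reliably comes down to putting a confidence interval on the mean per-user metric delta. A minimal percentile-bootstrap sketch, with hypothetical per-user NDCG deltas standing in for real evaluation output:

```python
import random

def bootstrap_ci(per_user_deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-user metric delta."""
    rng = random.Random(seed)
    n = len(per_user_deltas)
    means = sorted(
        sum(rng.choices(per_user_deltas, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-user NDCG deltas (new model minus old), true lift ~0.5%
rng = random.Random(1)
deltas = [rng.gauss(0.005, 0.02) for _ in range(5000)]
lo, hi = bootstrap_ci(deltas)
# If the interval excludes 0, the sub-1% lift is statistically meaningful
```

Bootstrapping over users (rather than over individual predictions) respects the fact that predictions for the same user are correlated.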
📌 Interview Tips
1. When asked about offline evaluation: explain using held-out test sets with logged interactions; compute metrics on historical data before running expensive online experiments.
2. For bias correction: mention that logged data has position bias (top items are clicked more); use IPS (inverse propensity scoring) or unbiased labels from randomized traffic.
3. When discussing correlation: explain that offline gains don't always translate online; a 5% offline NDCG improvement might yield only a 0-2% online metric lift. Establish the offline-online correlation for your domain.