Production Evaluation: Scale, Debiasing, and Failure Modes
Offline evaluation at production scale means computing metrics over hundreds of millions to billions of predictions per model sweep, typically in nightly distributed jobs. The goal is to detect sub-1% changes in Precision@K or NDCG@K with statistical confidence before committing to expensive online A/B tests. A typical pipeline: sample recent user activity (last 7 to 28 days), generate predictions from each candidate model, compute per-user metrics, then aggregate with bootstrap confidence intervals to handle the skewed per-user distributions.
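A minimal sketch of the per-user metric and bootstrap aggregation step (binary relevance and in-memory NumPy assumed; a real sweep would run this as a distributed job):

```python
import numpy as np

def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k recommendations that appear in the user's relevant set."""
    return sum(1 for item in ranked_items[:k] if item in relevant) / k

def ndcg_at_k(ranked_items, relevant, k=10):
    """Binary-relevance NDCG@k: DCG of this ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def bootstrap_ci(per_user_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over the skewed per-user metric distribution."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_user_scores, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi
```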
Position bias is the biggest trap. Users click items ranked higher simply because they see them first, independent of true relevance. Training on these biased logs and then evaluating on them overestimates your model's quality and makes you think you've improved when you've just learned to mimic the old biased ranking. Solutions: collect unbiased labels via randomized slots in production (expensive, small scale), use inverse propensity weighting to reweight each click by the inverse of its examination probability (requires a propensity model), or run interleaving experiments where two models' results are mixed and compared pairwise. Google and Microsoft use interleaving for search; Netflix runs randomized A/B buckets to collect ground truth for model calibration.
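A simplified sketch of an inverse-propensity-weighted NDCG@K: each logged click is up-weighted by 1 over the examination probability at the position where it was shown. The per-rank propensities below are purely illustrative, and in practice come from a separate estimation step (e.g., result randomization); the normalization here is also a simplification of published unbiased-LTR estimators.

```python
import numpy as np

# Examination propensities per rank (0-based). Illustrative values only;
# real propensities are estimated from randomization experiments.
PROPENSITY = np.array([1.0, 0.75, 0.55, 0.42, 0.33, 0.27, 0.22, 0.19, 0.16, 0.14])

def ips_weighted_ndcg(new_ranking, logged_clicks, logged_positions, k=10):
    """
    IPW-corrected NDCG@k for a candidate ranking, evaluated on clicks
    logged under the old (biased) ranking.

    new_ranking      : item ids in the candidate model's order
    logged_clicks    : set of item ids the user clicked in the logs
    logged_positions : dict item id -> 0-based position shown in the logs
    """
    dcg, weights = 0.0, []
    for i, item in enumerate(new_ranking[:k]):
        if item in logged_clicks:
            # A click shown at a rarely examined rank gets a larger weight,
            # correcting for position bias in the logs.
            pos = min(logged_positions[item], len(PROPENSITY) - 1)
            w = 1.0 / PROPENSITY[pos]
            weights.append(w)
            dcg += w / np.log2(i + 2)
    # Normalize against the best achievable placement of the weighted clicks.
    ideal = sum(w / np.log2(i + 2) for i, w in enumerate(sorted(weights, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0
```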
Another failure mode: sparse ground truth. New users or niche-interest users have very few positive labels, making per-user Precision@K and NDCG@K volatile. A user with a single relevant item in the test set scores either 0.0 or 0.2 on Precision@5 (and either 0.0 or 1.0 on Recall@5), with no middle ground. Solutions: macro-average across users with a minimum interaction threshold (exclude users with fewer than 3 positives), report user coverage separately (fraction of users receiving at least 1 relevant item), or use smoothing techniques. Always segment metrics: new versus returning users, head versus tail content, different geographies. A global average can hide that you regressed on cold-start users while improving on heavy users.
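A sketch of this aggregation logic, using hypothetical field names: filter to users above a minimum-positives threshold for the macro average, report coverage over all users, and break the metric out by segment.

```python
from collections import defaultdict
import numpy as np

def aggregate(per_user, min_positives=3):
    """
    per_user: list of dicts with hypothetical fields, e.g.
      {"user_id": 1, "segment": "new", "n_positives": 1,
       "precision_at_5": 0.2, "hit": True}
    """
    eligible = [u for u in per_user if u["n_positives"] >= min_positives]
    report = {
        # Macro average only over users with enough labels to be stable.
        "macro_precision_at_5": float(np.mean([u["precision_at_5"] for u in eligible]))
        if eligible else float("nan"),
        # User coverage over everyone: fraction who received >= 1 relevant item.
        "user_coverage": float(np.mean([u["hit"] for u in per_user])),
        "n_eligible": len(eligible),
        "n_total": len(per_user),
    }
    # Segment breakdown so a global average can't hide a cold-start regression.
    by_segment = defaultdict(list)
    for u in eligible:
        by_segment[u["segment"]].append(u["precision_at_5"])
    report["by_segment"] = {s: float(np.mean(v)) for s, v in by_segment.items()}
    return report
```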
Finally, watch for metric gaming and temporal staleness. Optimizing for a 30-second watch threshold can incentivize clickbait intros that cross the threshold without delivering real satisfaction; use multiple thresholds or negative signals (early exits, hides, dislikes) as guardrails. Evaluating on labels more than 4 weeks old overestimates performance on fresh or trending content because user interests shift; use recent windows and stratify by content age. At scale, expect 10% to 30% of offline wins to fail to replicate online due to these biases, covariate shift, or seasonality. Offline metrics are for filtering out bad models quickly; always validate the top candidates in online A/B tests against primary business metrics (revenue, retention, time spent).
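A sketch of these two checks, with illustrative field names and thresholds: stratify an offline metric by content-age bucket, and gate promotion on negative-signal guardrails.

```python
import numpy as np

# Illustrative content-age buckets in days.
AGE_BUCKETS = [(0, 7, "0-7d"), (7, 30, "7-30d"), (30, float("inf"), "30d+")]

def ndcg_by_content_age(rows):
    """rows: dicts with hypothetical fields 'content_age_days' and 'ndcg_at_10'."""
    out = {}
    for lo, hi, name in AGE_BUCKETS:
        bucket = [r["ndcg_at_10"] for r in rows if lo <= r["content_age_days"] < hi]
        out[name] = float(np.mean(bucket)) if bucket else float("nan")
    return out

def passes_guardrails(candidate, baseline, max_negative_lift=0.002):
    """
    Return True only if the candidate wins the engagement proxy (e.g. the
    30-second watch rate) without raising negative signals (early exits,
    hides, dislikes) beyond a small budget.
    Metric dicts use hypothetical keys: 'watch_30s_rate', 'negative_rate'.
    """
    wins_engagement = candidate["watch_30s_rate"] > baseline["watch_30s_rate"]
    negative_lift = candidate["negative_rate"] - baseline["negative_rate"]
    return wins_engagement and negative_lift <= max_negative_lift
```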
💡 Key Takeaways
• Production evaluation scale: hundreds of millions to billions of predictions per sweep, nightly distributed compute, bootstrap confidence intervals to detect sub-1% changes
• Position bias correction is mandatory: clicks are biased by rank; debias with inverse propensity weighting (reweight by the inverse of the examination probability), randomized interleaving, or unbiased logging via randomized slots
• Sparse ground truth handling: macro-average across users with a minimum of 3 to 5 positives, exclude cold-start users or report user coverage separately (fraction with at least 1 relevant item), avoid volatile per-user scores
• Temporal staleness: labels older than 4 weeks overestimate performance on fresh content; stratify by content age (0 to 7 days, 7 to 30 days, 30+ days) and use rolling 7 to 28 day windows
• Offline-to-online correlation: expect 10% to 30% of offline wins to fail online due to bias, covariate shift, or seasonality; use offline metrics for filtering, then validate the top 2 to 3 models in A/B tests with business metrics
• Metric gaming: optimizing a 30-second watch threshold incentivizes clickbait; add negative signals (early exit, hide, dislike) and multiple engagement thresholds as guardrails
📌 Examples
YouTube ranking: nightly offline evaluation over 500 million user sessions, inverse propensity weighted NDCG@10, bootstrap CIs, filters top 5 models for online A/B test measuring watch time and CTR
Netflix: collects unbiased labels via 5% randomized A/B bucket, computes Precision@10 and catalog coverage with 7 day rolling window, excludes users with fewer than 3 plays in test period
Google search: interleaving experiments (Team Draft Interleaving, sketched after this list) mix results from two rankers and measure preference via clicks, avoiding position bias without a propensity model; runs at 1% traffic scale
Spotify: stratifies offline NDCG@30 by user tenure (new 0 to 7 days, casual 7 to 90 days, core 90 plus days) and track age (new releases, catalog), detects that model improves core users but regresses new users by 3%
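A rough sketch of Team Draft Interleaving as referenced in the Google search example above (credit assignment and tie handling are simplified relative to production systems):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10, seed=None):
    """
    Team Draft Interleaving: the two rankers alternately 'draft' their highest
    not-yet-picked result; each shown item is credited to the team that drafted it.
    Returns the interleaved list and the item -> team assignment.
    """
    rng = random.Random(seed)
    interleaved, team, used = [], {}, set()
    count_a = count_b = 0

    def next_unused(ranking):
        for item in ranking:
            if item not in used:
                return item
        return None

    while len(interleaved) < k:
        cand_a, cand_b = next_unused(ranking_a), next_unused(ranking_b)
        if cand_a is None and cand_b is None:
            break
        # The team with fewer picks drafts next; ties broken by a coin flip.
        pick_a = (count_a < count_b) or (count_a == count_b and rng.random() < 0.5)
        if (pick_a and cand_a is not None) or cand_b is None:
            interleaved.append(cand_a); used.add(cand_a); team[cand_a] = "A"; count_a += 1
        else:
            interleaved.append(cand_b); used.add(cand_b); team[cand_b] = "B"; count_b += 1
    return interleaved, team

def score_session(team, clicked_items):
    """The team whose drafted items attract more clicks wins the session."""
    a = sum(1 for c in clicked_items if team.get(c) == "A")
    b = sum(1 for c in clicked_items if team.get(c) == "B")
    return "A" if a > b else "B" if b > a else "tie"
```

Aggregating per-session wins across many sessions gives a pairwise preference between the two rankers without needing a propensity model.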