NDCG@K: Position-Aware Ranking Quality
Normalized Discounted Cumulative Gain (NDCG) addresses two things Precision@K ignores: position matters, and relevance isn't binary. Placing a highly relevant item at position 1 is much better than placing it at position 10, and a user watching 60 minutes of a video signals more value than one watching 2 minutes. NDCG captures both.
The mechanics: for each position i in your top K, take the relevance gain (often 2^rel - 1 for graded labels) and divide by log2(i + 1). This logarithmic discount means position 1 gets full weight, position 2 gets about 63% weight, and position 10 gets about 30% weight. Sum these discounted gains to get the Discounted Cumulative Gain (DCG), then normalize by the ideal DCG (IDCG), the score a perfect ordering would achieve. This puts scores in the 0 to 1 range.
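For reference, here is a minimal Python sketch of that computation; the function names are illustrative, not from any particular library.

```python
import math

def dcg_at_k(relevances, k):
    """DCG over the top-k graded labels: gain 2^rel - 1, discount log2(pos + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 2)  # i is 0-based, so position = i + 1
        for i, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal DCG (the same labels sorted best-first)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded 0-4 labels in the order the ranker returned the items:
print(round(ndcg_at_k([3, 2, 0, 1, 4], k=5), 2))  # 0.71: the grade-4 item is buried at position 5
```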
Google and Bing use NDCG@K with graded labels (a 0 to 4 scale) as their primary offline search ranking metric, typically reporting NDCG@1, NDCG@3, and NDCG@10. Public learning-to-rank datasets like Microsoft Learning to Rank (MSLR) and Yahoo Learning to Rank reflect this practice. YouTube and LinkedIn track NDCG@10 to NDCG@20 for feed ranking, where relative NDCG improvements as small as 0.5% to 2% can translate into measurable click-through rate (CTR) or time-spent lifts online.
The tradeoff: NDCG requires calibrated graded labels, which cost more to collect than binary clicks. Label quality matters enormously because position bias in the training data (items ranked higher get more clicks regardless of true relevance) leads to overestimated gains; use inverse propensity weighting or randomized interleaving experiments to debias. Also, the discount shape (logarithmic versus a steeper reciprocal discount; the log base alone cancels out once you normalize by IDCG) and the gain mapping (2^rel - 1 versus plain rel) affect sensitivity, so align both with your product's actual utility curve.
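As a sketch of the debiasing idea, clicks logged under the old ranker can be reweighted by the inverse of each position's estimated examination propensity; the propensity values and function signature below are assumptions for illustration, not a standard API.

```python
import math

def ips_weighted_dcg(clicks, k, propensities,
                     gain=lambda r: 2 ** r - 1,
                     discount=lambda pos: math.log2(pos + 1)):
    """DCG from logged clicks, each reweighted by inverse examination propensity."""
    total = 0.0
    for i, (rel, p) in enumerate(zip(clicks[:k], propensities)):
        total += (gain(rel) / p) / discount(i + 1)  # rarely-examined positions count more
    return total

# Clicks logged under the previous ranker; position 1 was far more likely to be
# examined than position 3 (propensities assumed, e.g. estimated via randomization).
clicks = [1, 0, 1]
propensities = [0.9, 0.5, 0.2]
print(round(ips_weighted_dcg(clicks, k=3, propensities=propensities), 2))  # 3.61
```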
💡 Key Takeaways
• Combines graded relevance (a 0 to 4 scale is common in search; watch-time buckets in video) with logarithmic position discounting: gain divided by log2(position + 1)
• Normalization by the ideal DCG (IDCG) enables comparison across queries and users, producing scores in the 0 to 1 range
• Google search standard: NDCG@1, NDCG@3, and NDCG@10 with multi-level judgments (0 to 4), evaluated offline over tens of millions of query-document pairs
• Typical production improvements: a 0.5% to 2% relative NDCG@10 gain correlates with statistically significant CTR or engagement lifts in online A/B tests
• Position-bias mitigation is required: clicks are biased by the prior ranking, so use inverse propensity weighting or randomized interleaving to get unbiased offline NDCG estimates
• Discount choice matters: log2(position + 1) is standard; a steeper discount (such as dividing by the position itself) punishes deep-rank errors harder, while the log base alone has no effect after normalization (compared concretely in the sketch below)
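To make the last takeaway concrete, here's an illustrative comparison of how the discount shape changes the penalty for burying a highly relevant item (same gain mapping as above; the reciprocal discount is one example of a steeper choice).

```python
import math

def ndcg(relevances, discount):
    def dcg(rels):
        return sum((2 ** r - 1) / discount(i + 1) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranking = [3, 2, 0, 1, 4]  # the grade-4 item is buried at position 5
print(round(ndcg(ranking, lambda pos: math.log2(pos + 1)), 2))  # 0.71 with the standard log discount
print(round(ndcg(ranking, lambda pos: float(pos)), 2))          # 0.59: the steeper 1/pos discount punishes the miss harder
```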
📌 Examples
Web search: the Microsoft Learning to Rank (MSLR) dataset uses 5-level relevance (0 to 4) and reports NDCG@1/3/10; ranking must complete in 20 to 50 milliseconds at millions of queries per second
YouTube feed ranking: NDCG@10 to NDCG@20 with watch-time buckets as graded labels (0 to 30s, 30s to 2min, 2min to 10min, 10min plus; an example grade mapping is sketched after this list), 20 to 100 millisecond ranking latency
LinkedIn feed: NDCG@15 using engagement levels (impression, click, like, comment, share) mapped to a 0 to 4 scale, macro-averaged across user segments
Spotify playlist ranking: NDCG@30 with completion-rate bands (skipped, played under 30s, 30s to 90s, full play), nightly offline evaluation over 200 million user-playlist interactions
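As one way to wire graded labels like these into the NDCG computation sketched earlier: the bucket thresholds below come from the YouTube example above, while the grade values themselves are an assumed mapping.

```python
def watch_time_grade(seconds: float) -> int:
    """Map watch time to a graded label using the buckets from the YouTube example."""
    if seconds < 30:
        return 0   # 0 to 30s
    if seconds < 120:
        return 1   # 30s to 2min
    if seconds < 600:
        return 2   # 2min to 10min
    return 3       # 10min plus

# Grades in ranked order, ready for ndcg_at_k from the first sketch:
watch_times = [45, 800, 5, 130]
print([watch_time_grade(t) for t in watch_times])  # [1, 3, 0, 2]
```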