
Choosing Precision@K vs NDCG@K: When to Use Each

The choice between Precision@K and NDCG@K depends on your label granularity, user behavior, and what you're willing to trade off. Use Precision@K when relevance is binary (clicked or not, purchased or not), users see only the first K items, and position within K is less critical. It's simple to compute, easy to explain to non-technical stakeholders, and robust to label noise. Typical use cases: compact recommendation rows ("Top Picks for You" with 5 to 8 items), binary conversion events, or when you lack resources for graded annotations.

Switch to NDCG@K when you have multiple relevance levels and users scroll or paginate, making position critical. For example, video platforms distinguish a 10-second sample from a 30-minute binge; search engines label results as perfect match, excellent, good, fair, or bad. NDCG rewards placing highly relevant items at the top and is more sensitive to ranking improvements. Google search, YouTube, and LinkedIn all use NDCG because their labels capture engagement intensity and users navigate deep into results.

In practice, many teams track both. Pinterest might optimize for NDCG@20 during model development (sensitive to ordering and graded engagement), then report Precision@10 to leadership (a simpler story). The risk: NDCG is more sensitive to labeling choices and position bias. If your training labels come from biased logs (items ranked higher got more clicks simply because they were shown first), NDCG will overestimate improvements. You must debias with inverse propensity weighting or randomized experiments.

Finally, choose K to match your UI. A homepage hero section showing 6 items should measure Precision@6 and NDCG@6, not @20. Mismatched K hides regressions in the visible region. The LinkedIn feed shows about 15 items before scroll, so they track NDCG@15. Spotify's Discover Weekly has 30 tracks, so Precision@30 and NDCG@30 are both relevant, but early positions matter more, which justifies NDCG's discount curve.
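To make the definitions concrete, here is a minimal Python sketch of both metrics (not any particular team's implementation). The DCG uses the linear-gain form rel_i / log2(i + 1); the 2^rel − 1 gain variant is also common for graded labels and would only change the `gains` passed in.

```python
import numpy as np

def precision_at_k(relevance, k):
    """Fraction of the top-k items that are relevant (binary 0/1 labels)."""
    topk = np.asarray(relevance, dtype=float)[:k]
    return float(topk.sum() / k)

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k positions (linear gains)."""
    gains = np.asarray(gains, dtype=float)[:k]
    discounts = np.log2(np.arange(2, gains.size + 2))  # position i gets 1 / log2(i + 1)
    return float(np.sum(gains / discounts))

def ndcg_at_k(gains, k):
    """DCG normalized by the ideal DCG (same gains sorted best-first)."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Binary clicks on a compact 6-item row -> Precision@6
clicks = [1, 0, 1, 0, 0, 1]
print(precision_at_k(clicks, 6))        # 0.5

# Graded labels (0..4) on a scrolling surface -> NDCG@10
grades = [3, 4, 0, 2, 1, 0, 0, 4, 0, 1]
print(round(ndcg_at_k(grades, 10), 3))  # penalized because one grade-4 item sits at position 8
```

Note how the same ranking can look fine under Precision@K but poor under NDCG@K when a highly relevant item is buried deep in the list; that gap is exactly the position sensitivity the paragraph above describes.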
💡 Key Takeaways
Binary labels and compact UI (5 to 10 items): Precision@K is simpler, robust, and aligns with stakeholder intuition; examples: conversion funnels or small recommendation tiles
Graded labels and scrolling UI (10 to 50 items): NDCG@K captures position effects and engagement intensity, used by Google search, YouTube, LinkedIn feeds
Sensitivity tradeoff: NDCG detects smaller ranking improvements (sub-1% deltas are meaningful) but requires calibrated labels and debiasing for position bias in logs; see the sketch after this list
Production pattern: optimize offline using NDCG@K during model development, report Precision@K to non-technical stakeholders for clarity
K alignment is critical: match K to the UI surface; measuring Precision@20 when showing 8 items hides regressions in the visible region
Label cost consideration: graded annotations for NDCG cost 2x to 5x more than binary labels, budget accordingly or use implicit signals like dwell time buckets
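For the debiasing point above, one common correction is inverse propensity scoring. The sketch below is a hypothetical self-normalized IPS variant of Precision@K: the examination propensities per logged position are made-up numbers for illustration, not measured values, and production estimators typically add propensity clipping and variance checks.

```python
import numpy as np

def snips_precision_at_k(clicks, logged_positions, propensity, k):
    """
    Self-normalized IPS estimate of Precision@K (illustrative sketch).
    clicks           : binary click labels for the top-k items of the new ranking
    logged_positions : position each item was shown at in the logged ranking
    propensity       : propensity[p] = estimated probability a position-p item was examined
    """
    clicks = np.asarray(clicks[:k], dtype=float)
    weights = np.array([1.0 / propensity[p] for p in logged_positions[:k]])
    # A click on an item that was logged at a rarely-examined position counts for more.
    return float(np.sum(clicks * weights) / np.sum(weights))

# Hypothetical examination propensities by logged position (higher positions seen more often)
propensity = {1: 0.9, 2: 0.7, 3: 0.5, 4: 0.3, 5: 0.2}

print(snips_precision_at_k(clicks=[1, 0, 1],
                           logged_positions=[1, 3, 5],
                           propensity=propensity,
                           k=3))
```

Without this reweighting, a model that simply reproduces the logged ranking inherits its position bias and its offline NDCG or precision gains can evaporate in an online A/B test.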
📌 Examples
YouTube homepage: optimizes NDCG@10 with watch time buckets (0s, 30s, 2min, 10min plus), serves 8 to 12 videos above fold, p99 ranking latency under 100 milliseconds
Amazon product recommendations: Precision@5 for "Customers who bought this" module with binary purchase label, compact 5 item row, conversion rate tracked online
Google search: NDCG@10 standard with 0 to 4 graded judgments, offline evaluation over millions of queries, 1% NDCG improvement tested in online A/B for CTR impact
Spotify Discover Weekly: tracks both Precision@30 (fraction played more than 30s) and NDCG@30 (completion rate bands), reports Precision to artists for transparency