Definition
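Recommendation evaluation metrics measure how well a system predicts user preferences. Precision measures the share of shown recommendations that are relevant. Recall measures the share of all relevant items that the system surfaces. NDCG (Normalized Discounted Cumulative Gain) measures ranking quality by rewarding relevant items placed near the top of the list. Each answers a different question about system performance.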
Why Multiple Metrics Matter
A system with high precision but low recall shows relevant items but misses many good options. A system with high recall but low precision surfaces most of the relevant items but buries them among irrelevant ones, wasting user attention. NDCG adds ranking: showing a relevant item at position 1 is better than showing it at position 10. Each metric captures a different failure mode.
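The difference between the three metrics can be sketched in a few lines of Python. This is a minimal illustration that assumes binary relevance and a log2 position discount for NDCG; the function names and the toy item IDs are made up for the example and do not come from any particular library.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets discount log2(2) = 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (assumed): four relevant items, each list contains exactly one of them.
relevant = {"a", "b", "c", "d"}
early = ["a", "x", "y", "z", "w"]  # the single hit is at rank 1
late = ["x", "y", "z", "w", "a"]   # the single hit is at rank 5

for name, ranked in [("early", early), ("late", late)]:
    print(
        name,
        precision_at_k(ranked, relevant, 5),
        recall_at_k(ranked, relevant, 5),
        round(ndcg_at_k(ranked, relevant, 5), 3),
    )
```

Both lists get the same precision and recall because they contain the same single hit; only NDCG separates them, since it discounts the hit that appears later in the list.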
Offline vs Online Evaluation
Offline: Use historical data and compute metrics on held-out interactions. This allows fast iteration, but it assumes past behavior predicts future behavior. Online: A/B test with live users and measure actual clicks, conversions, and session duration. This is the closest thing to ground truth, but it is slow and expensive. Use offline evaluation for development and online evaluation for final validation.
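As a rough sketch of the offline workflow, the snippet below holds out each user's most recent interaction and scores a recommender against it. Everything here is an assumption made for illustration: the `(user, item, timestamp)` tuple layout, the leave-last-out split, the helper names, and the popularity baseline standing in for a real model.

```python
from collections import Counter, defaultdict

def temporal_holdout(interactions, n_test=1):
    """Split (user, item, timestamp) tuples: each user's last n_test items become test data."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, test = {}, {}
    for user, events in by_user.items():
        events.sort()  # oldest first
        items = [item for _, item in events]
        if len(items) <= n_test:
            continue  # not enough history to evaluate this user
        train[user] = items[:-n_test]
        test[user] = set(items[-n_test:])
    return train, test

def make_popularity_recommender(train):
    """Hypothetical baseline: rank items by how often they appear in the training data."""
    counts = Counter(item for items in train.values() for item in items)
    ranked = [item for item, _ in counts.most_common()]

    def recommend(user, history, k):
        seen = set(history)
        return [item for item in ranked if item not in seen][:k]

    return recommend

def mean_precision_at_k(recommend, train, test, k):
    """Average Precision@K over users, counting held-out items as relevant."""
    scores = []
    for user, held_out in test.items():
        recs = recommend(user, train[user], k)
        hits = sum(1 for item in recs[:k] if item in held_out)
        scores.append(hits / k)
    return sum(scores) / len(scores) if scores else 0.0

# Toy interaction log (assumed data).
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
    ("u2", "c", 1), ("u2", "a", 2), ("u2", "b", 3),
    ("u3", "c", 1), ("u3", "b", 2),
    ("u4", "a", 1),  # only one interaction, so u4 is skipped
]

train, test = temporal_holdout(interactions)
recommend = make_popularity_recommender(train)
score = mean_precision_at_k(recommend, train, test, k=2)
print(f"mean Precision@2 = {score:.2f}")
```

Splitting by time rather than at random keeps future interactions out of the training data, which is closer to how the system would be used in production.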
Business Metrics vs Model Metrics
Model metrics like NDCG measure prediction quality. Business metrics like revenue, retention, and engagement measure business outcomes. The two often align, but not always: a model might maximize clicks by surfacing low-quality clickbait. Track both and investigate divergences.
💡 Key Insight: No single metric captures recommendation quality. Use a suite: Precision@K for relevance, NDCG for ranking, Coverage for catalog utilization, and online A/B tests for business impact. Optimize for the combination, not any single metric.
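Coverage is not defined in the text above. One common reading, assumed here, is the share of the catalog that appears in at least one user's top-K list. The function name and toy data below are illustrative only.

```python
def catalog_coverage(recommendations, catalog, k):
    """Share of catalog items that appear in at least one user's top-k list."""
    catalog_set = set(catalog)
    if not catalog_set:
        return 0.0
    shown = set()
    for recs in recommendations.values():
        shown.update(recs[:k])
    return len(shown & catalog_set) / len(catalog_set)

# Toy data (assumed): a 10-item catalog and top-3 lists for three users.
catalog = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
recommendations = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b", "d"],
    "u3": ["b", "a", "c"],
}

coverage = catalog_coverage(recommendations, catalog, k=3)
print(f"catalog coverage = {coverage:.1f}")
```

In this toy case every list can score well on relevance while only four of the ten catalog items are ever shown, which is the kind of gap a coverage metric is meant to expose.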
✓Precision@K is computed as (number of relevant items in top K) divided by K, producing values between 0.0 and 1.0
✓Production K values: 5 to 10 for above-the-fold tiles, 10 to 20 for a search first page or homepage rows, 30 for playlist-style surfaces
✓Binary relevance definition examples: clicked, purchased, watched more than 30 seconds, completion rate above 50%
✓Position blind: placing the best item at rank 1 versus rank 10 produces an identical Precision@K score (for any K of at least 10)
✓Typical meaningful deltas: improvements of 0.5 to 1.0 percentage points (for example, 0.215 to 0.225, a 1.0-point gain) can drive significant business impact at billions of impressions
✓Always align K to the actual UI surface: optimizing Precision@20 when users see 8 items hides regressions in the visible region
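Two of the points above, position blindness and aligning K to the UI surface, are easy to check with a small sketch. The item IDs, the two "models", and the 8-item visible region are assumptions chosen for the example.

```python
def precision_at_k(ranked, relevant, k):
    """(number of relevant items in the top k) divided by k."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

relevant = {f"r{i}" for i in range(1, 7)}  # r1 .. r6

# Position blindness: one relevant item at rank 1 versus rank 10.
hit_first = ["r1"] + [f"x{i}" for i in range(1, 10)]
hit_last = [f"x{i}" for i in range(1, 10)] + ["r1"]
print(precision_at_k(hit_first, relevant, 10), precision_at_k(hit_last, relevant, 10))

# K alignment: users see 8 items, but both lists are 20 items long.
# model_a puts 4 relevant items in the visible region and 1 below it.
model_a = ["r1", "r2", "r3", "r4", "x1", "x2", "x3", "x4", "r5"] + [
    f"x{i}" for i in range(5, 16)
]
# model_b puts only 2 relevant items in the visible region and 4 below it.
model_b = ["r1", "x1", "x2", "r2", "x3", "x4", "x5", "x6", "r3", "r4", "r5", "r6"] + [
    f"x{i}" for i in range(7, 15)
]

for name, ranked in [("model_a", model_a), ("model_b", model_b)]:
    print(name, precision_at_k(ranked, relevant, 8), precision_at_k(ranked, relevant, 20))
```

Here model_b looks better on Precision@20 even though it shows fewer relevant items in the 8 slots users actually see, so a comparison at K=20 would mask the regression.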