ML-Powered Search & Ranking • Evaluation (NDCG, MRR, CTR, Dwell Time)
Offline Ranking Metrics: NDCG and MRR
Offline metrics let you evaluate ranking quality without live traffic using static relevance judgments or historical logs. They are fast, reproducible, and cheap to iterate on, making them essential for model development before expensive online testing.
Normalized Discounted Cumulative Gain (NDCG) measures how much utility users get from the top k results when each item has a graded relevance score (typically 0 to 3). The key insight is that position matters: moving a highly relevant item from rank 10 to rank 1 helps much more than moving it from rank 100 to rank 90. NDCG applies logarithmic discounting, dividing each item's relevance gain by log2(position + 1). The score is then normalized by the ideal ranking to give values between 0 and 1. At Google scale, engineers expect NDCG@10 improvements of 0.5 to 1.5 percent on head queries to be meaningful, while gains below 0.2 percent rarely translate to online improvements.
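A minimal sketch of the computation, using the linear-gain form of DCG described above (some implementations use 2^relevance − 1 as the gain instead); here `relevances` holds the graded labels in the order the model ranked the results:

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each item's gain divided by log2(position + 1)."""
    rels = np.asarray(relevances, dtype=float)[:k]
    positions = np.arange(1, rels.size + 1)            # ranks 1..k
    return float(np.sum(rels / np.log2(positions + 1)))

def ndcg_at_k(relevances, k):
    """NDCG: DCG of the model's ranking normalized by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded labels (0-3) for results in the model's ranked order.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```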
Mean Reciprocal Rank (MRR) focuses solely on the position of the first relevant item, computing 1 divided by that rank and averaging across queries. If the first relevant result appears at position 3, that query contributes 0.33 to MRR. This metric shines for single-answer tasks like question answering, knowledge panel lookups, or navigational queries where users need one good result. The tradeoff is clear: MRR ignores everything after the first hit, so if users typically browse multiple results or consume several recommendations, MRR misses valuable improvements deeper in the list.
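A corresponding sketch for MRR, assuming any label above 0 counts as relevant and queries with no relevant result contribute 0:

```python
def mrr(ranked_relevance_lists):
    """Mean reciprocal rank: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for rels in ranked_relevance_lists:
        reciprocal_rank = 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:                      # first relevant hit
                reciprocal_rank = 1.0 / rank
                break
        total += reciprocal_rank             # no relevant result -> contributes 0
    return total / len(ranked_relevance_lists)

# Three queries: first relevant hits at ranks 1, 3, and never -> (1 + 0.333 + 0) / 3
print(mrr([[1, 0, 0], [0, 0, 1, 0], [0, 0, 0]]))
```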
In practice, teams at Google label millions of query-document pairs per month with graded relevance scores from human raters. Models are evaluated on holdout time windows, often stratified by query type (informational, navigational, transactional), and teams track the correlation between offline gains and online outcomes to set launch gates.
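As a rough illustration of such a launch gate, the sketch below checks per-segment relative NDCG@10 gains against minimum thresholds; the segment names and cutoffs are hypothetical, loosely based on the 0.2 to 1.5 percent range mentioned above, not actual Google values:

```python
# Hypothetical launch gate: minimum relative NDCG@10 gain (%) required per query segment,
# derived from historical correlation between offline deltas and online outcomes.
GATE_THRESHOLDS_PCT = {
    "informational": 0.5,
    "navigational": 0.2,
    "transactional": 0.5,
}

def passes_launch_gate(offline_gains_pct: dict[str, float]) -> bool:
    """Return True only if every segment clears its offline threshold."""
    return all(
        offline_gains_pct.get(segment, 0.0) >= threshold
        for segment, threshold in GATE_THRESHOLDS_PCT.items()
    )

print(passes_launch_gate({"informational": 0.7, "navigational": 0.3, "transactional": 0.6}))
```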
💡 Key Takeaways
• NDCG uses graded relevance (0 to 3) and logarithmic position discounting, ideal for multi-intent queries where users consume multiple results, like recommendation carousels or broad searches
• MRR cares only about the first relevant item's position, best for single-answer tasks like navigational queries or question answering where users stop after one good result
• Offline metrics require human labeling at scale: Google labels millions of query-document pairs monthly, with graded relevance judgments from trained raters
• Meaningful offline gains are typically NDCG@10 improvements of 0.5 to 1.5 percent on head queries; gains below 0.2 percent rarely correlate with online improvements
• Teams maintain correlation maps between offline metric deltas and online outcomes per query segment, using these as launch gates before expensive A/B testing
📌 Examples
Google Search uses NDCG@3, NDCG@5, and NDCG@10 to match typical viewport sizes, stratifying by query intent (informational, navigational, transactional) since label distributions vary
Amazon Search applies MRR for product lookup queries where users want a specific item, but uses NDCG for browse queries where users compare multiple products
LinkedIn stratifies offline evaluation by member segment and computes bootstrap confidence intervals over queries (not documents) since queries are the unit of analysis
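A small sketch of the bootstrap-over-queries idea from the LinkedIn example: resample per-query metric values (here synthetic NDCG@10 deltas between a candidate model and a baseline) to get a percentile confidence interval on the mean:

```python
import numpy as np

def bootstrap_ci(per_query_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a mean metric, resampling queries (the unit of analysis)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_query_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Synthetic per-query NDCG@10 deltas for illustration only.
deltas = np.random.default_rng(1).normal(loc=0.004, scale=0.02, size=500)
mean, (lo, hi) = bootstrap_ci(deltas)
print(f"mean delta = {mean:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```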