ML-Powered Search & Ranking: Evaluation (NDCG, MRR, CTR, Dwell Time)

What is Ranking Evaluation and Why Simple Accuracy Fails

Definition
Ranking evaluation measures how well a system orders items by relevance. Unlike classification (right or wrong), ranking cares about order: putting the best item first matters more than putting it tenth.

Why Simple Accuracy Fails for Ranking

Classification accuracy counts correct predictions: 95% accurate means 95 of 100 predictions were right. But ranking has a different goal. Consider a search for "python tutorial". If your top 10 results contain 8 relevant items, that sounds good. But if those 8 are at positions 3-10 and positions 1-2 are irrelevant, users see garbage first and leave. Accuracy says 80%, but user experience says failure. Ranking metrics must weight position: errors at the top hurt more than errors at the bottom.
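As a minimal sketch of that point (binary relevance labels and made-up result lists, not from the text above), the snippet below compares a position-blind precision@10 with NDCG@10 for the two orderings: both contain 8 relevant items, but only one puts them at the top.

```python
import math

def precision_at_k(rels, k=10):
    """Fraction of relevant items in the top k (position-blind)."""
    return sum(rels[:k]) / k

def ndcg_at_k(rels, k=10):
    """Binary NDCG@k: each gain is discounted by log2(position + 1)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Scenario from the text: 8 relevant results, but positions 1-2 are irrelevant.
bad_top  = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
good_top = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

print(precision_at_k(bad_top), precision_at_k(good_top))            # 0.8 vs 0.8 -- identical
print(round(ndcg_at_k(bad_top), 3), round(ndcg_at_k(good_top), 3))  # 0.737 vs 1.0
```

Precision@10 cannot tell the two rankings apart; the position-aware metric penalizes the irrelevant results at the top.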

Offline vs Online Evaluation

Offline evaluation uses historical data with known relevance labels. You rank items, compare against labels, compute a score. Fast (seconds), cheap (no live traffic), reproducible (same data gives same result). Run hundreds of experiments per day. The catch: labels may be stale or biased by how previous systems collected them.

Online evaluation measures real user behavior: clicks, time spent, conversions. Slow (needs traffic), expensive (affects real users), noisy (user behavior varies). But it captures what actually matters: user satisfaction. The gap between offline and online is often 10-30%: a model that wins offline may lose online.
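A minimal sketch of that offline loop is below. The labeled set, doc ids, and `first_relevant_rank` helper are hypothetical; it ranks, compares against labels, and averages a per-query score (here the reciprocal rank of the first relevant result, i.e. the MRR metric described in the next section).

```python
# Hypothetical offline evaluation set:
# query -> (model's ranked doc_ids, set of labeled-relevant doc_ids)
eval_set = {
    "python tutorial": (["d9", "d2", "d4"], {"d2", "d4"}),
    "install numpy":   (["d5", "d7"],       {"d5"}),
}

def first_relevant_rank(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none appears)."""
    for i, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / i
    return 0.0

# Offline loop: score each query against its labels, then average.
scores = [first_relevant_rank(ranked, rel) for ranked, rel in eval_set.values()]
print(f"MRR = {sum(scores) / len(scores):.3f}")   # (1/2 + 1/1) / 2 = 0.75
```

No live traffic is involved, which is why this runs in seconds and is fully reproducible; the quality of the result depends entirely on how fresh and unbiased the labels are.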

The Four Metrics You Need to Know

NDCG (Normalized Discounted Cumulative Gain): measures graded relevance with position discounting. The best item at position 1 scores much higher than at position 10.
MRR (Mean Reciprocal Rank): measures where the first relevant result appears. Good when users want one answer.
CTR (Click-Through Rate): the percentage of impressions that get clicked.
Dwell Time: how long users spend on a result after clicking. High CTR with low dwell suggests clickbait; high dwell suggests satisfaction.
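NDCG and MRR are sketched above; for the two behavioral metrics, here is a minimal sketch of how CTR and average dwell time might be computed from an impression log for two hypothetical ranking variants (the log records and field names are made up).

```python
# Hypothetical impression log: one record per result shown to a user.
log = [
    {"variant": "A", "clicked": True,  "dwell_sec": 4},    # quick bounce
    {"variant": "A", "clicked": True,  "dwell_sec": 6},
    {"variant": "A", "clicked": False, "dwell_sec": 0},
    {"variant": "B", "clicked": True,  "dwell_sec": 95},   # long, satisfied read
    {"variant": "B", "clicked": False, "dwell_sec": 0},
    {"variant": "B", "clicked": False, "dwell_sec": 0},
]

def ctr(rows):
    """Clicks divided by impressions."""
    return sum(r["clicked"] for r in rows) / len(rows)

def mean_dwell(rows):
    """Average time on the result page, over clicked impressions only."""
    clicked = [r["dwell_sec"] for r in rows if r["clicked"]]
    return sum(clicked) / len(clicked) if clicked else 0.0

for variant in ("A", "B"):
    rows = [r for r in log if r["variant"] == variant]
    print(variant, f"CTR={ctr(rows):.2f}", f"dwell={mean_dwell(rows):.0f}s")
# A: higher CTR (0.67) but ~5s dwell -> clickbait pattern
# B: lower CTR (0.33) but 95s dwell -> fewer clicks, more satisfaction
```

Reading the two together is the point: variant A wins on CTR alone, but B's dwell time suggests it is the one actually satisfying users.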

💡 Key Takeaways
Ranking evaluation measures order quality, not just correctness. Position 1 errors matter more than position 10 errors.
Simple accuracy fails because it ignores position: 80% relevant items means nothing if positions 1-2 are irrelevant.
Offline evaluation is fast and cheap but may not reflect real user behavior. Online evaluation is slow and expensive but measures what matters.
The offline/online gap is often 10-30%: models winning offline may lose online.
Four key metrics: NDCG (graded relevance with position), MRR (first correct result), CTR (clicks), Dwell Time (engagement depth).
📌 Interview Tips
1. Explain why accuracy fails with a concrete example: 8 of 10 relevant items sounds good until you realize positions 1-2 are irrelevant.
2. Distinguish offline (fast, cheap, reproducible, potentially stale) from online (slow, expensive, noisy, reflects reality).
3. Mention the 10-30% offline/online gap to show awareness of real-world evaluation challenges.