What is Ranking Evaluation and Why Simple Accuracy Fails
Why Simple Accuracy Fails for Ranking
Classification accuracy counts correct predictions: 95% accurate means 95 of 100 predictions were right. But ranking has a different goal. Consider a search for "python tutorial". If your top 10 results contain 8 relevant items, that sounds good. But if those 8 are at positions 3-10 and positions 1-2 are irrelevant, users see garbage first and leave. Accuracy says 80%, but user experience says failure. Ranking metrics must weight position: errors at the top hurt more than errors at the bottom.
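To make the position effect concrete, here is a minimal Python sketch (toy 0/1 relevance lists; the `precision_at_k` and `dcg_at_k` helpers are hypothetical names, not a specific library's API). Both rankings contain the same 8 relevant results, so a position-blind precision@10 (the "accuracy" view above) scores them identically, while a log-discounted gain penalizes the ranking that buries relevance below two irrelevant top slots.

```python
import math

def precision_at_k(relevances, k=10):
    """Position-blind: fraction of the top-k results that are relevant."""
    return sum(relevances[:k]) / k

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance at 1-indexed rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

# Two rankings for "python tutorial": 1 = relevant, 0 = irrelevant.
good_top = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # the 8 relevant items come first
bad_top  = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # positions 1-2 are irrelevant

print(precision_at_k(good_top), precision_at_k(bad_top))  # 0.8 vs 0.8 -- no difference
print(dcg_at_k(good_top), dcg_at_k(bad_top))              # ~3.95 vs ~2.91 -- big difference
```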
Offline vs Online Evaluation
Offline evaluation uses historical data with known relevance labels. You rank items, compare against the labels, and compute a score. Fast (seconds), cheap (no live traffic), reproducible (same data gives same result), so you can run hundreds of experiments per day. The catch: labels may be stale or biased by how previous systems collected them.
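As a rough illustration of that loop, the sketch below assumes a toy data layout (a list of queries, each with model scores and stored 0/1 labels): rank by score, read off the labels, average a metric. The helper names and data are made up for the example.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked items whose stored label is relevant (1)."""
    return sum(ranked_labels[:k]) / k

def evaluate_offline(queries, metric, k=10):
    """Average a ranking metric over historical queries with known labels."""
    totals = []
    for q in queries:
        # Rank items by descending model score, then read off their stored labels.
        ranked_labels = [label for _, label in
                         sorted(zip(q["scores"], q["labels"]), key=lambda p: -p[0])]
        totals.append(metric(ranked_labels, k))
    return sum(totals) / len(totals)

# Toy labeled data: two queries with model scores and 0/1 relevance labels from past judgments.
queries = [
    {"scores": [0.9, 0.7, 0.3], "labels": [1, 0, 1]},
    {"scores": [0.2, 0.8, 0.5], "labels": [0, 1, 0]},
]
print(evaluate_offline(queries, precision_at_k, k=3))  # (2/3 + 1/3) / 2 = 0.5
```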
Online evaluation measures real user behavior: clicks, time spent, conversions. Slow (needs traffic), expensive (affects real users), noisy (user behavior varies). But it captures what actually matters: user satisfaction. The gap between offline and online is often 10-30%: a model that wins offline may lose online.
The Four Metrics You Need to Know
NDCG (Normalized Discounted Cumulative Gain): measures graded relevance with position discounting. The best item at position 1 scores much higher than at position 10.
MRR (Mean Reciprocal Rank): measures where the first relevant result appears. Good when users want one answer.
CTR (Click-Through Rate): the percentage of impressions that get clicked.
Dwell Time: how long users spend after clicking. High CTR with low dwell suggests clickbait; high dwell suggests satisfaction.
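The sketch below gives minimal reference implementations of NDCG and MRR, plus the trivial log aggregates behind CTR and dwell time. The function names and toy inputs are assumptions for illustration, not a specific library's API.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG: DCG of the actual ranking divided by DCG of the ideal (sorted) ranking."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mean_reciprocal_rank(queries):
    """Average of 1 / rank of the first relevant result; 0 when nothing relevant is returned."""
    total = 0.0
    for flags in queries:  # each entry: ranked 0/1 relevance flags for one query
        rank = next((i + 1 for i, rel in enumerate(flags) if rel), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(queries)

# Graded relevance (0-3) for one ranking; a near-ideal order gives NDCG close to 1.
print(round(ndcg_at_k([3, 2, 0, 1, 0]), 3))                               # ~0.985

# First relevant result at rank 1, rank 3, and never found.
print(round(mean_reciprocal_rank([[1, 0, 0], [0, 0, 1], [0, 0, 0]]), 3))  # 0.444

# CTR and dwell time are simple aggregates over interaction logs.
clicks, impressions = 42, 1000
ctr = clicks / impressions                       # 0.042
dwell_seconds = [12.0, 95.0, 7.5]                # time on page after each click
avg_dwell = sum(dwell_seconds) / len(dwell_seconds)
```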