
Candidate Retrieval vs Final Ranking Metrics

Modern recommendation systems use two-stage architectures: retrieval generates hundreds to thousands of candidates quickly (5 to 30 milliseconds), then ranking scores and orders the top 10 to 50 items for the UI (20 to 100 milliseconds). Each stage needs different metrics because each solves a different problem. Retrieval optimizes for recall: did we surface the relevant items somewhere in the candidate set? Ranking optimizes for precision and order: are the top K items we show the user actually the best ones, in the right order?

For retrieval, measure Recall@K at candidate-set sizes. If you retrieve 500 candidates, what fraction of the items the user will eventually engage with are in that set? A common target is 80% to 95% recall at K = 200 to 1000. You can't rank items you didn't retrieve, so high recall is a prerequisite. Retrieval often uses Approximate Nearest Neighbor (ANN) search (FAISS, ScaNN, HNSW) on embeddings, trading perfect recall for speed. Track the recall-versus-latency tradeoff: retrieving 1000 candidates might give 92% recall in 30 milliseconds, while 500 candidates gives 85% recall in 10 milliseconds. Choose based on your end-to-end latency budget.

For ranking, switch to Precision@K and NDCG@K at UI-facing K values (typically 5 to 20). The ranker sees only the candidates the retrieval stage passed along, so its job is to put the best ones at the top. YouTube might retrieve 1000 video candidates in 10 milliseconds via ANN, then rank them with a neural network in 50 milliseconds and show the top 12, measuring NDCG@12. LinkedIn retrieves 500 posts via multiple retrieval strategies (connections, viral content, ads), then ranks down to 15 items for the feed, tracking both Precision@15 and NDCG@15.

The failure mode: optimizing ranking metrics without monitoring retrieval recall. If recall drops from 90% to 70% (say you tightened ANN parameters to save latency), your ranker can't fix it. You'll see flat or declining NDCG@10 even if the ranker improved, because the good items are simply missing. Always track both: retrieval Recall@K as a prerequisite, ranking Precision@K and NDCG@K as final quality. Additionally, monitor end-to-end latency percentiles (p50, p95, p99), because a 95th percentile spike in ranking latency can time out requests and hurt online metrics despite strong offline accuracy.
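To make the stage-specific metrics concrete, here is a minimal sketch of Recall@K, Precision@K, and NDCG@K with binary relevance, operating on plain lists of item IDs. The function names are illustrative, not from any particular library.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Retrieval metric: fraction of relevant items that appear in the top-k candidates."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    """Ranking metric: fraction of the top-k ranked items that are relevant."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / k

def ndcg_at_k(ranked, relevant, k):
    """Ranking metric: DCG of the shown order divided by the ideal DCG (binary relevance)."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in rel)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Retrieval question: did the 500-candidate set surface the items at all?
#   recall_at_k(candidates, engaged_items, 500)
# Ranking question: did the top 12 put them first, in the right order?
#   precision_at_k(top_12, engaged_items, 12); ndcg_at_k(top_12, engaged_items, 12)
```

Note the asymmetry: Recall@K divides by the number of relevant items (can we find them anywhere in the pool?), while Precision@K divides by K (is what we show worth showing?). NDCG@K adds position discounting, so a relevant item at rank 1 counts more than the same item at rank 12.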
💡 Key Takeaways
Retrieval stage: optimize Recall@K at candidate size (K = 200 to 1000), target 80% to 95% recall, latency 5 to 30 ms via Approximate Nearest Neighbor (ANN) search
Ranking stage: optimize Precision@K and NDCG@K at UI size (K = 5 to 20), latency 20 to 100 ms, uses heavier models (neural networks, gradient boosting)
Recall-versus-latency tradeoff: 1000 candidates gives 92% recall in 30 ms, 500 candidates gives 85% recall in 10 ms; choose based on end-to-end p99 budget (typically under 150 to 250 ms)
Failure mode: ranking metrics look good but retrieval recall dropped from 90% to 70%; the ranker cannot recover missing good items, so always monitor both stages
End-to-end latency: track p50, p95, p99 separately for retrieval and ranking; a p99 spike in ranking can time out requests and hurt online CTR despite strong offline NDCG
Production pattern: multiple retrieval strategies (collaborative filtering, content-based, trending) merged, then a single ranker, with each retrieval path tracked for its Recall@K contribution (see the sketch after this list)
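A hedged sketch of that last production pattern: several retrieval paths run in parallel, each path's Recall@K contribution is logged against held-out engagements, and the de-duplicated union feeds a single ranker. It reuses recall_at_k from the earlier sketch; held_out_engagements and log_metrics are hypothetical placeholders for whatever offline label source and metrics sink your system has.

```python
def merged_retrieval(user, paths, k_per_path=200):
    """Run each retrieval path, log its per-path Recall@K, and return the
    de-duplicated union as one candidate pool for the single ranker.
    `paths` maps a path name to a retrieve(user, k) callable (hypothetical)."""
    relevant = held_out_engagements(user)  # hypothetical: items the user later engaged with
    pool, seen, per_path = [], set(), {}
    for name, retrieve in paths.items():
        candidates = retrieve(user, k_per_path)
        per_path[name] = recall_at_k(candidates, relevant, k_per_path)
        for item in candidates:           # preserve order, drop duplicates across paths
            if item not in seen:
                seen.add(item)
                pool.append(item)
    log_metrics({f"recall@{k_per_path}/{name}": r  # hypothetical metrics sink
                 for name, r in per_path.items()})
    return pool  # the single ranker scores this merged pool next

# paths = {"collaborative": cf_retrieve, "content_based": content_retrieve,
#          "trending": trending_retrieve}
```

Tracking recall per path is what tells you, when overall recall drops, whether the collaborative-filtering path regressed or the trending path stopped contributing, rather than leaving you to debug the merged pool as a black box.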
📌 Examples
YouTube: retrieves 1000 video candidates via ANN on user and video embeddings in 10 ms (Recall@1000 target 90%), ranks with a deep neural network in 50 ms, shows top 12 (NDCG@12), end-to-end p99 under 150 ms
LinkedIn feed: 3 retrieval paths (connections, viral content, ads) return 500 total candidates in 20 ms (Recall@500 = 0.85), ranking neural network scores in 60 ms, displays top 15 (Precision@15 = 0.30, NDCG@15 = 0.45)
Pinterest: ANN retrieval of 800 pins in 15 ms (ScaNN library, Recall@800 = 0.88), lightweight ranking model (XGBoost) scores in 40 ms, shows top 20 in home feed (NDCG@20 = 0.40)
Spotify: retrieves 1000 track candidates via multiple embeddings (collaborative, acoustic, lyrics) in 25 ms, ranker produces a 50-track shortlist in 80 ms, final 30 tracks for Discover Weekly (Precision@30 = 0.35, NDCG@30 = 0.48)
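The recall-versus-latency numbers in these examples come from sweeping ANN parameters and measuring both axes. Below is a minimal sketch with FAISS on synthetic embeddings, assuming faiss and numpy are installed: an exact index provides ground-truth neighbors, and HNSW's efSearch knob is swept to trace the tradeoff. Dataset sizes and parameter values here are arbitrary stand-ins, not tuned recommendations.

```python
import time
import numpy as np
import faiss

d = 64                                            # embedding dimension (arbitrary)
xb = np.random.rand(100_000, d).astype("float32") # synthetic item embeddings
xq = np.random.rand(1_000, d).astype("float32")   # synthetic user/query embeddings
k = 100                                           # candidates per query

# Exact search gives the ground-truth top-k for measuring ANN recall.
exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, gt = exact.search(xq, k)

# HNSW graph index; M controls links per node (build-time quality/memory).
ann = faiss.IndexHNSWFlat(d, 32)
ann.add(xb)

for ef in (16, 64, 256):                          # search-time quality knob
    ann.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, approx = ann.search(xq, k)
    ms_per_query = (time.perf_counter() - t0) / len(xq) * 1e3
    recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
    print(f"efSearch={ef}: recall@{k}={recall:.3f}, {ms_per_query:.2f} ms/query")
```

Raising efSearch moves you up the recall curve at the cost of latency; the right operating point is whichever setting fits inside the end-to-end p99 budget after the ranker's time is accounted for.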