Recommendation Systems › Evaluation Metrics (Precision@K, NDCG, Coverage) · Medium · ~3 min

Candidate Retrieval vs Final Ranking Metrics

Core Concept
A/B testing is the gold standard for recommendation evaluation. Split traffic, show different models to different user groups, measure business outcomes. But recommendation A/B tests have unique challenges that make them harder than typical feature tests.

Network Effects

Recommendations create spillover effects. If group A sees trending items, those items get more engagement, which makes them trend more, affecting what group B sees. User-level randomization may not isolate treatment effects. Consider time-based splitting or geo-based splitting for cleaner measurement.
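A common way to implement time- or geo-based splitting is deterministic hashing of the coarser randomization unit, so every user in the same region (or time bucket) lands in the same arm. A minimal sketch; the `assign_arm` helper and the experiment name are illustrative, not from any particular framework:

```python
import hashlib

def assign_arm(unit_id: str, experiment: str,
               arms: tuple = ("control", "treatment")) -> str:
    """Deterministically assign a randomization unit (a geo region or a
    time bucket, rather than an individual user) to an experiment arm."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# Geo-based splitting: all users in one region see the same model, so
# trending-item spillover stays contained inside each arm.
print(assign_arm("region:us-west", "ranker-v2"))
```

Hashing the experiment name together with the unit ID keeps assignments independent across experiments while staying stable across requests.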

Long-term vs Short-term Metrics

Click-through rate responds fast (days). Retention responds slowly (weeks). Revenue effects may take months. A model that maximizes clicks with clickbait might hurt retention. Run experiments long enough to capture delayed effects. For major model changes, consider 4-8 week experiments with retention as primary metric.
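Capturing those delayed effects means computing cohort retention per arm, not just day-one clicks. A toy sketch of day-N retention (the data layout and `day_n_retention` helper are hypothetical):

```python
def day_n_retention(cohort_start: dict, activity: set, n: int) -> float:
    """Fraction of a cohort active n days after joining the experiment.
    cohort_start: user -> day index when assigned; activity: {(user, day)}."""
    if not cohort_start:
        return 0.0
    retained = sum((u, d + n) in activity for u, d in cohort_start.items())
    return retained / len(cohort_start)

# Toy cohort: only u1 returns on day 28, so day-28 retention is 0.5.
control = {"u1": 0, "u2": 0}
activity = {("u1", 28)}
print(day_n_retention(control, activity, 28))
```

The point of the sketch: day-28 retention for a user assigned on the experiment's last day is only observable four weeks later, which is why major model changes need multi-week experiments.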

Sample Size Challenges

Recommendation effects are often small (1-3% improvement). Detecting 1% lift with 95% confidence requires millions of impressions. Power your experiments properly. Use metrics with lower variance (clicks) for initial validation, then confirm with higher-variance metrics (revenue, retention).
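The required sample size follows from a standard two-proportion power calculation. A sketch using only the standard library (the function name and its defaults are illustrative):

```python
from statistics import NormalDist

def samples_per_arm(base_rate: float, rel_lift: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-proportion z-test detecting a
    relative lift over a baseline conversion/click rate."""
    p1 = base_rate
    p2 = base_rate * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1

# Detecting a 1% relative lift on a 5% baseline CTR:
print(samples_per_arm(0.05, 0.01))
```

With a 5% baseline CTR, a 1% relative lift at α = 0.05 and 80% power already requires roughly 3 million impressions per arm, which is why small effects need high-traffic surfaces or long runtimes.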

❗ Interview Deep-Dive: "How do you A/B test recommendations?" is a common follow-up. Cover: (1) network effects and why user-level randomization may not work, (2) long-term metrics versus short-term proxies, (3) sample size requirements for detecting small effects. This demonstrates production experience beyond model building.
💡 Key Takeaways
- Retrieval stage: optimize Recall@K at candidate size (K = 200–1000); target 80–95% recall at 5–30 ms latency via Approximate Nearest Neighbor (ANN) search
- Ranking stage: optimize Precision@K and NDCG@K at UI size (K = 5–20); latency 20–100 ms; uses heavier models (neural networks, gradient boosting)
- Recall vs latency tradeoff: 1000 candidates gives ~92% recall in 30 ms, 500 candidates gives ~85% recall in 10 ms; choose based on the end-to-end p99 budget (typically under 150–250 ms)
- Failure mode: ranking metrics look healthy while retrieval recall quietly drops from 90% to 70%; the ranker cannot recover items retrieval never surfaced, so always monitor both stages
- End-to-end latency: track p50, p95, p99 separately for retrieval and ranking; a p99 spike in ranking can time out requests and hurt online CTR despite strong offline NDCG
- Production pattern: multiple retrieval strategies (collaborative filtering, content-based, trending) are merged and fed to a single ranker, with each retrieval path tracked for its Recall@K contribution
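Both stage-level metrics above can be computed offline in a few lines. A toy sketch assuming binary relevance for recall and graded relevance (item → gain) for NDCG:

```python
import math

def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of relevant items present in the top-k retrieved candidates."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked: list, relevance: dict, k: int) -> float:
    """NDCG@K: DCG of the produced ranking over the ideal DCG."""
    dcg = sum(relevance.get(item, 0) / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Monitor both stages: item "z" was relevant but never retrieved,
# so no ranker downstream can surface it.
candidates = ["a", "b", "c", "d"]
print(recall_at_k(candidates, ["a", "c", "z"], k=4))
print(ndcg_at_k(["a", "c", "b"], {"a": 3, "c": 2}, k=3))
```

In production these would run over logged sessions per retrieval path, feeding the stage-level dashboards described above.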
📌 Interview Tips
1. For system design: describe the metrics pipeline - retrieval Recall@K (ANN quality), ranking NDCG (model quality), coverage (ecosystem health), and business metrics (CTR, conversion).
2. When asked about monitoring: mention daily offline evaluation on 500M+ sessions, stratified by user segments and content types, with automated regression detection.
3. For interview depth: explain the latency-recall trade-off - faster retrieval (fewer candidates) hurts Recall@K; establish SLOs balancing latency (p99 < 50 ms) and recall (>90%).