Coverage Metrics: Ecosystem Health Beyond Accuracy
Coverage metrics answer two questions: who gets value, and how broadly the system uses the catalog. Precision and NDCG tell you whether recommendations are accurate, but a model that only recommends the top 100 popular items can have high accuracy while destroying ecosystem health. Coverage ensures you're not leaving users, creators, or inventory behind.
Item coverage (also called catalog coverage) measures the fraction of your catalog that actually gets recommended. Compute it as the number of unique items recommended across all users over a time window (say, 7 days) divided by total catalog size. Netflix might have 10,000 titles but only recommend 3,000 of them, yielding 30% item coverage. User coverage asks what fraction of users receive at least one relevant recommendation; it catches cold-start problems where new or niche users get poor results. Long-tail coverage tracks what share of impressions goes to less popular items, often measured by the exposure Gini coefficient or by bucketing items into head (top 10%), torso (next 40%), and tail (bottom 50%) by historical popularity.
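A minimal sketch of how these three coverage numbers might be computed from a window of recommendation logs. The data structures (recs, relevant, item_popularity) are hypothetical stand-ins for whatever your logging pipeline produces, not a specific system's API.

```python
from collections import Counter

def coverage_metrics(recs, relevant, item_popularity, catalog_size):
    """Compute item, user, and long-tail coverage over one time window.

    recs: dict of user_id -> list of recommended item_ids (top-K in the window)
    relevant: dict of user_id -> set of item_ids the user actually engaged with
    item_popularity: dict of item_id -> historical popularity rank (1 = most popular)
    catalog_size: total number of items in the catalog
    """
    # Item (catalog) coverage: unique items shown anywhere / catalog size.
    shown = {i for items in recs.values() for i in items}
    item_coverage = len(shown) / catalog_size

    # User coverage: fraction of users with at least one relevant item in their top-K.
    users_served = sum(
        1 for user, items in recs.items() if relevant.get(user, set()) & set(items)
    )
    user_coverage = users_served / len(recs)

    # Long-tail coverage: share of impressions going to the bottom 50% of items
    # by historical popularity (rank worse than catalog_size / 2).
    impressions = Counter(i for items in recs.values() for i in items)
    tail_impressions = sum(
        count for item, count in impressions.items()
        if item_popularity.get(item, catalog_size) > catalog_size / 2
    )
    tail_coverage = tail_impressions / sum(impressions.values())

    return item_coverage, user_coverage, tail_coverage
```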
Spotify and Netflix actively monitor creator and artist coverage: how many distinct artists or creators receive meaningful impressions each week. Small absolute changes (1 to 3 percentage-point shifts in tail exposure) can materially impact creator ecosystems and long-term content supply. LinkedIn tracks coverage by content type and creator segment to avoid over-concentrating distribution.
The fundamental tradeoff: pushing high-propensity popular items maximizes short-term Precision@K and NDCG@K but collapses coverage. Introducing diversity constraints or minimum exposure floors typically reduces headline accuracy by a small amount (1 to 3% relative) but improves retention, discovery, and creator satisfaction. Production systems use multi-objective optimization or re-ranking with diversity constraints, and track coverage as a guardrail alongside accuracy metrics. If accuracy improves but tail coverage drops 5 percentage points, that is often a failed experiment.
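A minimal illustration of the re-ranking idea with a minimum exposure floor, assuming the scorer's candidate list arrives sorted by relevance; rerank_with_exposure_floor, is_tail, and min_tail_slots are hypothetical names, not any particular production system's implementation. It reserves a couple of top-K slots for the best-scoring tail items and fills the rest purely by relevance.

```python
def rerank_with_exposure_floor(candidates, is_tail, k=10, min_tail_slots=2):
    """Greedy re-ranking sketch with a simple minimum-exposure floor.

    candidates: list of (item_id, relevance_score), sorted descending by score
    is_tail: callable item_id -> bool, True if the item belongs to the long tail
    k: slate size; min_tail_slots: slots guaranteed to tail items
    """
    head_picks, tail_picks = [], []
    for item, score in candidates:
        (tail_picks if is_tail(item) else head_picks).append((item, score))

    # Reserve the best-scoring tail items to satisfy the floor...
    reserved = tail_picks[:min_tail_slots]
    # ...then fill the remaining slots purely by relevance score.
    remaining = sorted(head_picks + tail_picks[min_tail_slots:],
                       key=lambda x: x[1], reverse=True)
    slate = (reserved + remaining)[:k]

    # Re-sort the final slate by score so reserved items sit where their
    # relevance places them rather than always at the top.
    return sorted(slate, key=lambda x: x[1], reverse=True)
```

The floor here is per-request; production systems more often enforce exposure floors in aggregate across traffic, but the per-slate version is the easiest way to see the accuracy-versus-coverage lever.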
💡 Key Takeaways
• Item (catalog) coverage: unique items recommended divided by catalog size, computed over 7- to 28-day windows; typical values 20% to 60% depending on catalog size and diversity policy
• User coverage: fraction of users receiving at least one relevant item in the top K; critical for detecting cold-start failures in new-user or niche-interest segments
• Long-tail coverage: share of impressions to the bottom 50% of items by popularity, or exposure Gini coefficient (0 = perfect equality, 1 = one item gets everything); typical Gini in the 0.6 to 0.9 range (see the sketch after this list)
• Creator/artist coverage: number or percentage of distinct creators receiving impressions, monitored weekly; 1 to 3 percentage-point shifts in tail exposure materially impact creator ecosystems
• Accuracy versus coverage tradeoff: maximizing Precision@K collapses coverage; diversity constraints reduce accuracy 1% to 3% relative but improve long-term retention and supply health
• Popularity-collapse symptoms: rising exposure Gini, declining tail impressions, stagnant discovery metrics; fixed by re-ranking with diversity constraints or minimum exposure floors
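For concreteness, a sketch of the exposure Gini referenced above, using the standard sorted-index formula over per-item impression counts (numpy assumed; the function name is illustrative). Include zero counts for items that were never shown, or the concentration will be understated.

```python
import numpy as np

def exposure_gini(impressions):
    """Gini coefficient over per-item impression counts.

    impressions: array-like of impression counts, one entry per catalog item.
    Returns 0.0 when exposure is perfectly equal, values near 1.0 when
    exposure is concentrated on a handful of items.
    """
    x = np.sort(np.asarray(impressions, dtype=float))  # ascending order
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    # Standard formula for sorted data with 1-based index i:
    # Gini = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n
    index = np.arange(1, n + 1)
    return (2.0 * np.sum(index * x)) / (n * x.sum()) - (n + 1.0) / n
```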
📌 Examples
Netflix: tracks catalog coverage over 28 days, monitors that at least 40% of the catalog receives impressions, and balances this against per-row Precision@10 targets of 0.25 to 0.35
Spotify artist coverage: ensures 70% of artists in the catalog receive at least 100 impressions per week, with tail exposure (bottom 50% of artists) maintained above 15% of total plays
Pinterest: long-tail coverage measured as impressions to pins outside the top 10% by historical engagement, with a target of 25% to 30% of impressions going to the tail; prevents winner-take-all dynamics
YouTube: creator coverage guardrail ensures new creators (fewer than 1,000 subscribers) receive at least 5% of total video impressions; cold-start user coverage target is that 60% of new users receive a relevant video in their first session