Multi-Source Retrieval: Combining Multiple Candidate Generators
Production systems rarely rely on a single retrieval method because no single approach covers all failure modes. Sparse lexical search (BM25) excels at exact term matching but misses paraphrases. Dense semantic search captures meaning but can retrieve semantically similar yet irrelevant items (polysemy, entity confusion). Graph-based retrieval surfaces related items through connections but requires explicit edges. The solution is multi-source retrieval: run multiple complementary candidate generators in parallel and fuse their results.
Each generator targets a different signal with strict per-generator latency budgets. Sparse lexical might return 500 candidates in 5 milliseconds. Dense ANN over user and item embeddings returns 1000 candidates in 8 milliseconds. Graph neighbors (users you follow, items co-engaged) contribute 300 more in 3 milliseconds. Trending and popularity pools add another 200 as fallback. After deduplication, you might have 1500 to 2000 unique candidates ready for ranking.
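As a minimal sketch of that fan-out, the Python below calls stand-in generators in parallel, enforces a per-source budget, and deduplicates into a single candidate pool. All function names, counts, and budgets are illustrative, not a specific production API; the stubs return far fewer items than the numbers above so the demo runs comfortably inside its budgets.

```python
import random
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Stand-in generators: in production these would hit a BM25 index, an ANN index,
# a graph store, and a trending cache. Here they return fake (item_id, raw_score)
# pairs drawn from a shared ID space so that cross-source duplicates actually occur.
def lexical_search(user_id, query):  return [(f"item_{random.randrange(500)}", random.random() * 10) for _ in range(50)]
def ann_search(user_id, query):      return [(f"item_{random.randrange(500)}", random.random()) for _ in range(100)]
def graph_neighbors(user_id, query): return [(f"item_{random.randrange(500)}", random.random()) for _ in range(30)]
def trending_pool(user_id, query):   return [(f"item_{random.randrange(500)}", random.random()) for _ in range(20)]

# Per-source latency budgets in seconds, mirroring the numbers above.
SOURCES = {
    "lexical":  (lexical_search,  0.005),
    "dense":    (ann_search,      0.008),
    "graph":    (graph_neighbors, 0.003),
    "trending": (trending_pool,   0.003),
}

def retrieve(user_id, query):
    """Fan out to every generator in parallel, enforce budgets, deduplicate."""
    candidates = {}  # item_id -> {source_name: raw_score}
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = {name: pool.submit(fn, user_id, query) for name, (fn, _) in SOURCES.items()}
        for name, future in futures.items():
            try:
                # Simplification: the timeout is applied per result() call rather
                # than from request start, but it illustrates the budget idea.
                results = future.result(timeout=SOURCES[name][1])
            except TimeoutError:
                continue  # drop a slow source instead of blocking the whole request
            for item_id, score in results:
                candidates.setdefault(item_id, {})[name] = score
    return candidates

candidate_pool = retrieve(user_id=42, query="wireless headphones")
print(len(candidate_pool), "unique candidates after deduplication")
```

A source that blows its budget is simply dropped for that request, which keeps tail latency bounded at the cost of an occasionally thinner candidate pool.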
The critical challenge is fusion and calibration. Scores from different generators are not comparable: a BM25 score of 8.5 means nothing next to a cosine similarity of 0.72. You must normalize scores (z-score standardization or isotonic calibration per source) before merging. You also need per-source quotas to prevent one generator from dominating. Pinterest applies this pattern with PinSage graph embeddings, personalized embeddings, and lexical search all contributing candidates. LinkedIn Galene similarly fuses inverted index results, ANN semantic search, and business rule pools.
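Continuing the sketch above, fusion can be expressed as per-source z-score normalization followed by per-source quotas. The quota values are made up for illustration, and isotonic calibration (mentioned above as an alternative) is omitted for brevity.

```python
from statistics import mean, pstdev

# Illustrative per-source caps on how many candidates may enter the fused set.
QUOTAS = {"lexical": 400, "dense": 600, "graph": 250, "trending": 150}

def fuse(candidates, quotas=QUOTAS):
    """candidates: item_id -> {source: raw_score}, e.g. the output of retrieve() above."""
    # Regroup scores by source so each source can be normalized independently.
    by_source = {}
    for item_id, scores in candidates.items():
        for source, raw in scores.items():
            by_source.setdefault(source, []).append((item_id, raw))

    fused = {}  # item_id -> best normalized score across the sources that returned it
    for source, pairs in by_source.items():
        raw = [s for _, s in pairs]
        mu, sigma = mean(raw), pstdev(raw) or 1.0   # guard against zero variance
        normalized = sorted(((item_id, (s - mu) / sigma) for item_id, s in pairs),
                            key=lambda x: x[1], reverse=True)
        # Per-source quota: only the top-k from this source may enter the fused set.
        for item_id, z in normalized[: quotas.get(source, 0)]:
            fused[item_id] = max(fused.get(item_id, float("-inf")), z)

    # Merged candidate list ordered by normalized score, ready for the ranking stage.
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Keeping the maximum normalized score for items retrieved by several sources is one simple choice; summing per-source scores or learning a small fusion model are common alternatives.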
The payoff is robustness and coverage. If dense retrieval fails on a tail query, lexical can save it. If a new item has no embeddings yet, popularity or content features still surface it. The cost is complexity: you now maintain multiple indexes, tune multiple systems, and debug cross-generator interactions. Multi-source retrieval is standard at scale because the coverage and quality gains outweigh the operational overhead.
💡 Key Takeaways
• Each generator optimizes for different signals: sparse lexical for exact terms (BM25), dense semantic for meaning (embeddings), graph for relationships (follow graph, co-engagement), and heuristics for business rules (trending, subscriptions)
• Per-generator budgets are strict: each must return candidates in 1 to 10 milliseconds. Pinterest reported running multiple generators in parallel with aggregate retrieval producing 500 to 10,000 candidates before ranking.
• Score normalization is mandatory because raw scores are incomparable across methods. Use z-score normalization or isotonic calibration per source, then apply per-source caps (quotas) to prevent one generator from monopolizing the candidate set.
• Multi-source retrieval increases recall by 15 to 30% in production systems by hedging single-method failure modes, but adds indexing cost and fusion complexity
• Cold-start mitigation: new items lack collaborative signals, so content-based and graph proximity generators provide coverage until engagement data accumulates (see the sketch after this list)
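A toy illustration of that fallback, with illustrative source names: a brand-new item has no engagement-derived embedding yet, so the collaborative ANN source cannot see it, but lexical, content-embedding, and creator/category graph-proximity sources can, and the trending pool needs nothing item-specific at all.

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    has_content_features: bool = True        # derivable from metadata at upload time
    has_engagement_embedding: bool = False   # only exists once interactions accumulate

# Source names and their requirements are illustrative.
def eligible_sources(item: Item) -> list[str]:
    sources = ["trending"]                                         # popularity fallback
    if item.has_content_features:
        sources += ["lexical", "content_ann", "graph_proximity"]   # work from metadata and creator/category edges
    if item.has_engagement_embedding:
        sources += ["collaborative_ann"]                           # needs accumulated interaction data
    return sources

new_item = Item("item_999")                                   # just uploaded, no interactions yet
warm_item = Item("item_123", has_engagement_embedding=True)   # established item
print(eligible_sources(new_item))    # ['trending', 'lexical', 'content_ann', 'graph_proximity']
print(eligible_sources(warm_item))   # adds collaborative_ann coverage
```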
📌 Examples
Pinterest homefeed: PinSage graph embeddings contribute 800 candidates (related pins via graph walks), personalized two-tower model adds 700 (user-pin similarity), and lexical search adds 300 (keyword matches). After deduplication, 1500 unique candidates are re-ranked by a deep neural network optimizing engagement.
LinkedIn job recommendations: Inverted index returns 400 jobs matching skills and titles (5ms), ANN over job and member embeddings returns 600 (8ms), and application graph (jobs applied to by similar members) adds 200 (3ms). Scores are z-normalized per source before fusion.
RAG hybrid retrieval: BM25 retrieves the top 50 chunks by keyword overlap (12ms), a dense retriever (Sentence-BERT) fetches the top 50 by semantic similarity (18ms). The union gives 80 unique chunks after deduplication. A cross-encoder re-ranks all 80 and selects the top 5 for the language model, improving answer accuracy by 22% over dense-only retrieval.
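A condensed sketch of this hybrid pattern, assuming the rank_bm25 and sentence-transformers packages are available; the model names are illustrative, and the toy corpus uses small per-leg cutoffs where the production numbers above would be 50 per leg with a top-5 cut.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

chunks = [
    "The warranty covers manufacturing defects for two years.",
    "Battery replacement is free within the first 12 months.",
    "Contact support to initiate a return within 30 days.",
]   # a real corpus would hold thousands of chunks
query = "how long is the warranty"

# Sparse leg: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
bm25_scores = bm25.get_scores(query.lower().split())
bm25_top = sorted(range(len(chunks)), key=lambda i: bm25_scores[i], reverse=True)[:2]

# Dense leg: bi-encoder embeddings plus cosine similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")             # illustrative model choice
chunk_emb = encoder.encode(chunks, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
dense_top = util.cos_sim(query_emb, chunk_emb)[0].argsort(descending=True)[:2].tolist()

# Union of both legs (deduplicated), then cross-encoder re-ranking of the merged pool.
pool = sorted(set(bm25_top) | set(dense_top))
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative model
rerank_scores = reranker.predict([(query, chunks[i]) for i in pool])
best = [chunks[i] for _, i in sorted(zip(rerank_scores, pool), reverse=True)[:1]]
print(best)   # the chunk(s) passed to the language model as context
```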