ML-Powered Search & Ranking › Dense Retrieval (BERT-based Embeddings) | Hard | ⏱️ ~3 min

Hybrid Retrieval: Combining Dense and Sparse Methods

Core Concept
Hybrid retrieval combines dense semantic search with sparse keyword matching. The intuition: dense excels at synonyms and paraphrases; sparse excels at rare terms and exact matches. Together they cover more relevant documents than either alone.

Fusion Strategies

Score fusion: Normalize the scores from each retriever, then combine them with a weighted sum: score = α × dense_score + (1 − α) × sparse_score. Typical α: 0.5-0.7.
Rank fusion: Merge results by rank position (RRF, Reciprocal Rank Fusion). More robust to differences in score scales, since raw scores are ignored.
Cascade: Use sparse retrieval for initial recall, then dense for re-ranking. Reduces dense inference cost.
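The first two strategies can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function names, the min-max normalization choice, and the default α = 0.6 and k = 60 are assumptions for the sketch (k = 60 is the constant commonly used in the RRF literature).

```python
from collections import defaultdict

def min_max_normalize(scores):
    """Rescale a {doc_id: score} map to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def score_fusion(dense, sparse, alpha=0.6):
    """Score fusion: alpha * dense_score + (1 - alpha) * sparse_score, after normalization."""
    dense_n, sparse_n = min_max_normalize(dense), min_max_normalize(sparse)
    fused = defaultdict(float)
    for doc, s in dense_n.items():
        fused[doc] += alpha * s
    for doc, s in sparse_n.items():
        fused[doc] += (1 - alpha) * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over each ranked list; raw scores are ignored."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Note that `rrf` takes only ordered document lists, which is exactly why it is robust when the two retrievers produce scores on wildly different scales.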

Why Hybrid Outperforms

Dense models miss queries with rare terms (product IDs, technical jargon, proper nouns not in training data). Sparse models miss semantic matches ("inexpensive" for "cheap"). Hybrid catches both. Empirically: hybrid improves recall@100 by 5-15% over either method alone. The improvement is largest on mixed query types; homogeneous query sets may see smaller gains.

Implementation Considerations

Run both retrievers in parallel to minimize latency. Normalize scores before fusion (dense and sparse scores are on different scales). Tune fusion weights on held-out data; optimal α varies by domain. For cascade, dense re-ranks sparse top-100 or top-200; larger candidate sets improve recall but increase cost.

⚠️ Trade-off: Hybrid doubles infrastructure complexity and cost. Justify with measured recall gains on your specific query distribution.
💡 Key Takeaways
Hybrid combines dense (synonyms) and sparse (rare terms); together covers more than either alone
Fusion methods: score fusion (α weighting), rank fusion (RRF), cascade (sparse → dense re-rank)
Typical α: 0.5-0.7 for score fusion; tune on held-out data per domain
Hybrid improves recall@100 by 5-15% over single methods; largest gains on mixed query types
Run retrievers in parallel; normalize scores before fusion (different scales)
📌 Interview Tips
1. Explain the score fusion formula with typical α values (0.5-0.7)
2. Describe why hybrid outperforms with specific examples (rare terms, synonyms)
3. Mention RRF as a robust alternative to score fusion when scales differ