
BM25 vs Dense Retrieval: When to Use Each

BM25 and dense retrieval (embedding-based semantic search) represent fundamentally different approaches to finding relevant documents, each with distinct strengths that make them complementary rather than competitive. BM25 excels at exact term matching, long-tail queries, and scenarios requiring frequent index updates. Dense retrieval shines with short queries, vocabulary mismatch, and semantic understanding. Understanding when to use each, or when to combine them, is critical for production search systems.

BM25's advantage is speed and simplicity. An inverted index with WAND pruning can scan 100 million documents in 5 to 10 milliseconds on a single shard, using 40 to 100 GB of compressed postings without positions. Updates are incremental: new documents append to the index with minimal overhead. A query like "macbook pro 16 inch 2023" precisely matches products containing those exact terms. The downside: BM25 cannot match "physician" to "doctor" or "NYC" to "New York City", nor understand that "how to fix broken pipe" is semantically similar to "repairing plumbing leaks".

Dense retrieval encodes queries and documents into fixed-dimensional vectors (typically 768 or 1024 dimensions, using models like BERT or Sentence Transformers) and retrieves via Approximate Nearest Neighbor (ANN) search. This captures semantic similarity: "dog" and "puppy" have similar embeddings even without lexical overlap. Latency is higher: ANN search with HNSW indexes typically takes 10 to 50 milliseconds for top-100 retrieval on 10 million vectors, and the index is larger (4 to 8 bytes per dimension × 768 dimensions × 10M docs = 30 to 60 GB for the vectors alone, plus graph structure). Updating requires re-embedding documents and rebuilding the index, which is expensive at scale.

Hybrid systems combine both: retrieve the top 1,000 from BM25 and the top 1,000 from dense retrieval, merge using learned weights or reciprocal rank fusion, then rerank. This captures both exact-match precision and semantic recall. Google's production search, top systems on Microsoft's MS MARCO leaderboard, and modern RAG systems increasingly use this pattern. The tradeoff is complexity and latency: running two retrieval systems doubles infrastructure and adds 20 to 50ms to the pipeline.

Choose pure BM25 when term precision matters more than semantics (product search, legal/medical retrieval). Choose dense retrieval when queries are short and paraphrased (question answering, conversational search). Choose hybrid when quality justifies the cost.
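The merge step in the hybrid pattern is often plain reciprocal rank fusion (RRF). Below is a minimal, self-contained Python sketch; the toy bm25_hits and dense_hits lists stand in for a real inverted-index backend and a real ANN backend, and k=60 is the constant used in the original RRF paper (Cormack et al., 2009).

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs with reciprocal rank fusion.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so documents ranked highly by several
    retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy candidate lists standing in for real backends: bm25_hits from an
# inverted index, dense_hits from an ANN index over embeddings.
bm25_hits = ["d42", "d7", "d3", "d18"]
dense_hits = ["d7", "d99", "d42", "d5"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused[:3])  # "d7" and "d42" lead -- both retrievers found them
```

In production the fused head (for example, the top 200 of the combined 2,000 candidates) is then handed to a reranker, as in the pipeline described above.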
💡 Key Takeaways
Latency comparison: BM25 with WAND achieves 5 to 10ms p95 for top 1,000 on 100M documents per shard; dense ANN search takes 10 to 50ms for top 100 on 10M vectors per shard, scaling worse with corpus size due to graph traversal costs
Index size: BM25 postings compress to roughly 2 to 5 bytes per term occurrence (40 to 100 GB for 100M docs with 200 terms each); dense requires 768 floats × 4 bytes × 100M docs = 307 GB uncompressed, or 80 to 150 GB with quantization, plus HNSW graph overhead (see the arithmetic sketch after this list)
Update cost asymmetry: BM25 handles incremental updates via append-only segments with periodic merges (write amplification 2 to 5×); dense requires re-embedding changed documents (GPU inference cost) and index rebuilds that take minutes to hours at scale
Query length sensitivity: BM25 improves with longer queries (more terms = more signal); dense retrieval degrades because averaging embeddings of many terms dilutes semantic focus; crossover point typically 3 to 5 terms
Hybrid retrieval gains: combining BM25 and dense improves NDCG by 10 to 25 percent over either alone on benchmarks like MS MARCO, but increases p95 latency by 30 to 60ms and requires maintaining two indexes with 2× storage and infrastructure
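A quick back-of-envelope check of the index-size bullet above; the corpus size, dimensionality, and float width come straight from the takeaways, while the 1 to 2 bytes per dimension after quantization is an illustrative assumption that lands near the quoted 80 to 150 GB range.

```python
# Storage estimate for a dense index over a 100M-document corpus,
# using the figures quoted in the takeaways above.
docs = 100_000_000   # corpus size
dim = 768            # embedding dimensionality
fp32 = 4             # bytes per float32 component

raw_gb = docs * dim * fp32 / 1e9
print(f"raw float32 vectors: {raw_gb:.0f} GB")  # ~307 GB

# Assumed quantization widths (scalar or product quantization):
for bytes_per_dim in (1, 2):
    gb = docs * dim * bytes_per_dim / 1e9
    print(f"{bytes_per_dim} byte(s)/dim quantized: {gb:.0f} GB")
# ~77 GB and ~154 GB -- consistent with the 80 to 150 GB range
# once HNSW graph links are stacked on top.
```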
📌 Examples
Product search at Amazon: pure BM25F preferred because queries like "samsung 65 inch 4k tv 2023" need exact term matching on specs; dense embeddings would incorrectly retrieve "LG 55 inch" as semantically similar despite wrong brand and size
Question answering for customer support: dense retrieval excels because a user asks "why is my payment failing" and the relevant doc says "transaction declined reasons"; BM25 misses this due to vocabulary mismatch, while embeddings capture the semantic equivalence
MS MARCO document ranking: BM25 baseline achieves MRR@10 of 0.187; pure dense retrieval achieves 0.32; hybrid (BM25 + dense fusion) achieves 0.38; a neural reranker on hybrid retrieval reaches 0.44, showing complementary strengths
Retrieval Augmented Generation at scale: BM25 retrieves top 500 in 8ms, dense retrieves top 500 in 25ms, reciprocal rank fusion merges to top 200 in 2ms, and a cross-encoder reranks to the final 20 in 30ms; total 65ms vs 38ms for BM25 only, justified by 20% better answer quality (see the latency-budget sketch below)
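A sketch of that last latency budget, assuming the stage timings quoted in the example and sequential execution (which is what the 65ms total implies):

```python
# Hybrid RAG pipeline latency budget; stage timings from the example.
hybrid_stages_ms = {
    "BM25 retrieve top 500": 8,
    "dense retrieve top 500": 25,
    "RRF merge to top 200": 2,
    "cross-encoder rerank to top 20": 30,
}
print(sum(hybrid_stages_ms.values()), "ms hybrid")  # 65 ms

# BM25-only baseline: skip the dense retrieval and the merge step.
bm25_only_ms = (hybrid_stages_ms["BM25 retrieve top 500"]
                + hybrid_stages_ms["cross-encoder rerank to top 20"])
print(bm25_only_ms, "ms BM25 only")  # 38 ms
```

Issuing the two retrieval calls concurrently would shrink the retrieval portion from 8 + 25 = 33ms to max(8, 25) = 25ms, bringing the hybrid total closer to 57ms at the cost of fan-out complexity.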