
Multi-Stage Retrieval: BM25 as a High-Recall First Stage

Production search systems rarely rely on a single ranking algorithm. Instead, they use a multi-stage retrieval pipeline in which BM25 serves as the fast, high-recall first stage, retrieving 500 to 10,000 candidate documents in 5 to 20 milliseconds per shard. This candidate set feeds into progressively more expensive reranking stages built on machine-learned models, with the entire pipeline completing within 150 to 300 milliseconds at the 95th percentile. The architecture balances recall, precision, and latency at scale.

The first stage must scan a potentially massive corpus using an inverted index with dynamic pruning techniques such as Weak AND (WAND) or Block-Max WAND (BMW). These algorithms maintain per-block upper bounds on term contributions, allowing the retriever to skip postings that cannot possibly enter the top-k results. For a corpus of 100 million documents and a 10-term query, naive scoring would evaluate all 100M documents, but WAND typically evaluates only 10,000 to 50,000 documents to find the top 1,000 candidates, achieving sub-10ms latency on modern hardware.

Subsequent stages add complexity and precision. Stage two might apply a gradient-boosted decision tree with 100+ features, including click-through rate, dwell time, document freshness, and domain authority, scoring the top 1,000 candidates in 10 to 30 milliseconds. Stage three could use a transformer-based cross-encoder reranker on the top 50 to 200 documents, taking 20 to 50 milliseconds on a GPU with batching. Each stage trades latency for quality, with the overall budget carefully allocated.

Google and Bing follow this pattern at web scale: their first-stage lexical retrieval fans out to hundreds or thousands of shards, each returning local top-k results in under 20ms, then aggregates and reranks centrally. Amazon product search uses BM25F for stage one across 10 to 50 shards of its 100M+ SKU catalog, returning 500 to 5,000 candidates per query in 10 to 30ms at p95, then applies business logic and learned rerankers within a 150 to 250ms total budget. The key insight: BM25 provides the necessary recall at a latency and resource cost that makes subsequent, more expensive reranking feasible.
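The cascade described above can be sketched as a generic three-stage pipeline. The scorer callables, stage sizes, and toy corpus below are illustrative stand-ins, not any particular engine's API:

```python
def multi_stage_search(query, corpus, stage1_score, stage2_score, stage3_score,
                       k1=1000, k2=100, k3=10):
    """Generic retrieve-then-rerank cascade: each stage is more expensive
    and more precise than the last, and sees only the previous survivors."""
    # Stage 1: cheap, high-recall retrieval over the whole corpus
    # (in production: BM25 over an inverted index with WAND pruning).
    scored = sorted(((stage1_score(query, d), d) for d in corpus),
                    key=lambda p: -p[0])
    candidates = [d for s, d in scored if s > 0][:k1]

    # Stage 2: feature-based reranking of the candidate set
    # (in production: a GBDT over 100+ click/content/freshness features).
    head = sorted(candidates, key=lambda d: -stage2_score(query, d))[:k2]

    # Stage 3: heavyweight reranking of a small head of the list
    # (in production: a transformer cross-encoder, batched on a GPU).
    return sorted(head, key=lambda d: -stage3_score(query, d))[:k3]


# Toy usage: word-overlap scorers stand in for BM25, the GBDT,
# and the cross-encoder.
docs = ["bm25 ranking for search", "neural reranking with cross encoders",
        "cooking pasta"]
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
print(multi_stage_search("search ranking", docs, overlap, overlap, overlap))
# -> ['bm25 ranking for search']
```

The structural point is that the expensive scorers never see the full corpus: stage one filters 100M documents down to `k1`, so stages two and three run on fixed-size candidate sets regardless of corpus growth.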
💡 Key Takeaways
Latency budget allocation is critical: first-stage BM25 typically consumes 10 to 20 percent of the total latency budget (20 to 40ms of 200ms), leaving 80 to 90 percent for reranking, where quality gains are highest
Recall vs. precision tradeoff: BM25 optimizes for recall (retrieving 95%+ of relevant documents in the top 1,000), while neural rerankers optimize for precision (ordering those 1,000 correctly); a relevant document missed in stage one is unrecoverable downstream
Dynamic pruning (WAND/BMW) is essential: without it, scoring 100M documents takes seconds; with it, evaluating 10K to 50K candidate documents via upper-bound pruning achieves sub-10ms latency on commodity hardware
Candidate set size tuning: too few candidates (top 100) risks recall loss if stage one misses relevant documents; too many (top 10,000) wastes reranking budget; the typical sweet spot is 500 to 2,000 depending on corpus size
Sharding scales linearly: 100M documents across 20 shards means each shard handles 5M documents and returns its local top 50 in ~8ms; the aggregator merges 20 × 50 = 1,000 candidates in ~2ms; end-to-end latency is determined by the slowest shard (tail latency)
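The upper-bound pruning behind WAND can be illustrated with a simplified, single-block sketch. The postings layout and scoring function here are toy stand-ins; production systems add per-block bounds (as in BMW), compressed postings, and skip lists:

```python
import heapq
from bisect import bisect_left

def wand_top_k(postings, upper_bound, score, k):
    """Simplified single-block WAND.

    postings:    term -> postings list of (doc_id, tf), sorted by doc_id
    upper_bound: term -> maximum score contribution that term can ever add
    score:       (term, tf) -> actual contribution (e.g. a BM25 term score)
    """
    cursors = {t: 0 for t in postings}   # current position in each postings list
    heap = []                            # min-heap of (score, doc_id): running top-k

    def doc_at(t):
        i = cursors[t]
        return postings[t][i][0] if i < len(postings[t]) else None

    while True:
        # Terms with postings remaining, ordered by their current doc_id.
        active = sorted((t for t in cursors if doc_at(t) is not None), key=doc_at)
        if not active:
            break
        threshold = heap[0][0] if len(heap) >= k else 0.0

        # Pivot: first term at which accumulated upper bounds could beat
        # the current k-th best score; earlier docs cannot qualify.
        acc, pivot = 0.0, None
        for t in active:
            acc += upper_bound[t]
            if acc > threshold:
                pivot = t
                break
        if pivot is None:
            break                        # no remaining doc can enter the top-k
        pivot_doc = doc_at(pivot)

        if doc_at(active[0]) == pivot_doc:
            # Every cursor at or before the pivot sits on pivot_doc: score it.
            s = 0.0
            for t in active:
                if doc_at(t) == pivot_doc:
                    s += score(t, postings[t][cursors[t]][1])
                    cursors[t] += 1
            if len(heap) < k:
                heapq.heappush(heap, (s, pivot_doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot_doc))
        else:
            # Skip the leading cursor straight to pivot_doc: the documents
            # in between are never fully scored.
            t = active[0]
            cursors[t] = bisect_left(postings[t], (pivot_doc,), cursors[t])

    return sorted(heap, key=lambda x: -x[0])
```

The skip branch is where the savings come from: once the heap is full, documents whose summed term upper bounds cannot beat the current k-th best score are jumped over without ever being scored, which is how WAND evaluates tens of thousands of documents instead of the full corpus.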
📌 Examples
Bing web search: first-stage retrieval fans out to 1,000+ shards, each returning its top 10 to 20 results in under 15ms using block-max pruning; a central aggregator merges ~15,000 candidates, reranks the top 1,000 with a learned model in 30 to 50ms, then applies a neural reranker to the top 50
Retrieval-Augmented Generation (RAG) for LLMs: BM25 over 10M passages on a single node returns the top 200 in 5 to 10ms using an in-memory inverted index; a cross-encoder reranker scores the 200 candidates on a GPU in 15 to 30ms to select the final 5 to 20 passages for context, for a total latency under 50ms
Amazon product search: 100M SKUs sharded across 50 nodes; BM25F with title weight 2.0, brand 1.5, and description 0.5 returns the top 100 per shard in 12ms at p95; the aggregator receives 5,000 candidates and applies business rules (inventory, margins) plus a LambdaMART reranker in 40ms
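A minimal sketch of the BM25F-style field weighting in the product-search example, using the illustrative weights above (title 2.0, brand 1.5, description 0.5); the `k1`, per-field `b`, and corpus statistics are assumed tuning values, not any retailer's actual configuration:

```python
import math

# Field weights from the example above; k1 and per-field b are assumed values.
FIELD_WEIGHTS = {"title": 2.0, "brand": 1.5, "description": 0.5}
K1 = 1.2
B = {"title": 0.75, "brand": 0.75, "description": 0.75}

def bm25f_score(query_terms, doc_fields, avg_len, df, n_docs):
    """BM25F: combine per-field term frequencies into one weighted,
    length-normalized pseudo-frequency, then apply BM25 saturation once.

    doc_fields: field -> list of tokens
    avg_len:    field -> average field length over the corpus
    df:         term -> document frequency; n_docs: corpus size
    """
    score = 0.0
    for t in query_terms:
        tf_tilde = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(t)
            if tf == 0:
                continue
            # Per-field length normalization, then field weighting.
            norm = 1.0 + B[field] * (len(tokens) / avg_len[field] - 1.0)
            tf_tilde += FIELD_WEIGHTS[field] * tf / norm
        if tf_tilde > 0.0:
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            score += idf * tf_tilde / (K1 + tf_tilde)
    return score
```

Because the field weight multiplies the term frequency before saturation, a title match contributes more than a description match of the same raw frequency, which is the intent of the weights in the example.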