Multi-Stage Retrieval: BM25 as a High-Recall First Stage
One Algorithm Is Not Enough
BM25 is fast: it can score millions of documents in 10 to 20 ms using an inverted index. But BM25 is simplistic: it relies purely on term matching with frequency weighting, so it cannot understand synonyms, user intent, or contextual relevance. The best search results come from combining BM25 with more sophisticated (and slower) machine learning models in a multi-stage pipeline.
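The term-matching-with-frequency-weighting idea can be sketched in a few lines. This is a minimal, unoptimized reference implementation of classic Okapi BM25 scoring (a real engine would walk an inverted index rather than scan the corpus):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document against a query.

    corpus is a list of tokenized documents; used here only to derive
    document frequencies and average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        # Saturating term-frequency weight with length normalization
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["fast", "search", "engine"],
          ["slow", "database", "scan"],
          ["search", "ranking", "bm25"]]
print(bm25_score(["search", "ranking"], corpus[2], corpus))
```

Note what the formula rewards: rare terms (via IDF) and repeated terms (via saturating TF), but nothing about meaning. A document about "films" scores zero for the query "movies".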
The Multi-Stage Pipeline Architecture
Stage 1 uses BM25 to retrieve 500 to 10,000 candidates in 10 to 20 ms. The goal is high recall: capture as many potentially relevant documents as possible, even if some irrelevant ones sneak through. BM25 excels here because it is cheap enough to score every matching document.

Stage 2 applies machine learning rerankers to the candidate set. A BERT-based cross-encoder might score 1,000 candidates in 50 to 100 ms using GPU acceleration. These models consider semantic similarity, learned relevance signals, and cross-attention between query and document.

Stage 3 applies business logic: boost promoted content, filter by access permissions, diversify results, and apply freshness decay. This stage adds 5 to 10 ms and ensures results match product requirements beyond pure relevance.
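The three stages compose naturally as a funnel. A minimal sketch, with toy stand-ins for the inverted index and the neural reranker (function names like `stage1_bm25` are illustrative, not from any library):

```python
def stage1_bm25(query, docs, limit=1000):
    """Stage 1: cheap, high-recall retrieval.

    Toy stand-in: keep any document sharing at least one query term,
    the way an inverted index only surfaces term-matching docs."""
    terms = set(query.split())
    return [d for d in docs if terms & set(d["text"].split())][:limit]

def stage2_rerank(query, candidates):
    """Stage 2: toy stand-in for a cross-encoder -- score by query-term
    coverage. A real reranker would run a neural model per pair."""
    terms = set(query.split())
    return sorted(candidates,
                  key=lambda d: len(terms & set(d["text"].split())),
                  reverse=True)

def stage3_business(ranked, top_k=10):
    """Stage 3: filter by permissions, then boost promoted content."""
    visible = [d for d in ranked if d.get("visible", True)]
    # sorted() is stable, so relevance order survives within each group
    return sorted(visible, key=lambda d: not d.get("promoted", False))[:top_k]

docs = [
    {"text": "cheap flights to paris", "promoted": True},
    {"text": "paris travel guide"},
    {"text": "internal memo", "visible": False},
    {"text": "rust programming"},
]
query = "flights to paris"
results = stage3_business(stage2_rerank(query, stage1_bm25(query, docs)))
```

Each stage sees a smaller set than the last, which is why per-document cost can rise at every step without blowing the latency budget.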
Why Not Skip BM25?
Running a neural model over 10 million documents would take minutes, not milliseconds. At 1 ms per document, scoring 10 million requires 10,000 seconds (nearly 3 hours). BM25 reduces this to scoring only 1,000 to 5,000 pre-filtered candidates, making neural reranking practical for real-time search. The inverted index provides the critical narrowing: only documents containing query terms get scored at all.
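The back-of-the-envelope arithmetic, assuming roughly 1 ms of neural scoring per document (the per-document cost is an assumption; batching and GPUs shrink it further, as the 50 to 100 ms Stage 2 figure reflects):

```python
total_docs = 10_000_000
per_doc_s = 0.001                                  # ~1 ms per document

full_scan_hours = total_docs * per_doc_s / 3600    # score everything
rerank_s = 5_000 * per_doc_s                       # score candidates only

print(f"full scan: {full_scan_hours:.1f} h, rerank: {rerank_s:.1f} s")
```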
The Recall-Precision Trade-Off
Suppose BM25 retrieves 1,000 candidates. If 50 truly relevant documents exist in your corpus, you want all 50 in those 1,000 (high recall). The reranker in Stage 2 then pushes those 50 into the top 10 results (high precision). But if BM25 retrieves only 40 of the 50 (80 percent recall), no reranker can recover the 10 missed documents, because Stage 2 only reorders the candidate set. This is why Stage 1 retrieves generously: better to include noise than to miss relevant results.
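The ceiling argument in numbers (doc ids here are made up; the point is that any reranker only permutes the candidate set, so missed documents stay missed):

```python
relevant = set(range(50))                      # 50 truly relevant doc ids
# BM25's 1,000 candidates contain 40 of them plus 960 others:
candidates = list(range(40)) + list(range(100, 1060))

stage1_recall = len(set(candidates) & relevant) / len(relevant)
print(stage1_recall)  # 0.8 -- no Stage 2 reordering can raise this
```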
Tuning Candidate Count
How many candidates should Stage 1 retrieve? Too few and relevant documents get missed. Too many and Stage 2 becomes slow. Typical systems retrieve 500 to 5,000 candidates, tuned by measuring recall at K against labeled relevance data. If recall at 1000 is 95 percent but recall at 500 drops to 80 percent, the extra 500 candidates are worth the Stage 2 latency cost.
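A minimal recall-at-K sweep against labeled judgments shows how the candidate count gets tuned (the ranking and the relevant ids below are invented for illustration):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of labeled-relevant docs found in the top K."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

ranked = list(range(2000))               # toy BM25 ranking, by doc id
relevant = [3, 40, 450, 700, 990, 1500]  # labeled relevant doc ids

for k in (500, 1000, 2000):
    print(k, round(recall_at_k(ranked, relevant, k), 2))
```

Plot (or tabulate) recall against K and pick the knee: the K beyond which extra candidates buy little recall but still cost Stage 2 latency.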