How BM25 Improves TF-IDF with Saturation and Length Normalization
Best Match 25 (BM25) solves TF-IDF's two fatal flaws: unbounded term frequency growth and unfair treatment of document length. Instead of linear TF scoring, BM25 uses a saturation function controlled by parameter k1 (typically 1.2 to 2.0). The term frequency component becomes (f × (k1 + 1)) / (f + k1), where f is the raw count. This means the first occurrence contributes heavily, the second less so, and by the 10th occurrence, additional mentions barely matter. A document with 50 mentions of "kubernetes" scores only marginally higher than one with 20 mentions.
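The diminishing returns are easy to see numerically. A minimal sketch of the saturation component (the helper name `tf_saturation` is illustrative):

```python
def tf_saturation(f: float, k1: float = 1.2) -> float:
    """BM25 term-frequency component, before length normalization.

    Bounded above by k1 + 1, so repetition has diminishing returns.
    """
    return (f * (k1 + 1)) / (f + k1)

print(round(tf_saturation(1), 3))   # → 1.0   (first occurrence counts fully)
print(round(tf_saturation(10), 3))  # → 1.964
print(round(tf_saturation(20), 3))  # → 2.075
print(round(tf_saturation(50), 3))  # → 2.148 (barely above 20 occurrences)
```

With k1 = 1.2 the curve can never exceed 2.2, which is why 50 mentions score only marginally above 20.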
The second innovation is explicit document length normalization via the parameter b (typically 0.75). BM25 adjusts the saturation threshold based on whether a document is longer or shorter than average: the denominator becomes f + k1 × ((1 − b) + b × (doc_length / avg_doc_length)). When b = 1.0 you get full length penalization; b = 0 means no length consideration. This prevents verbose documents from dominating simply by repeating terms more times in their extra words.
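The normalization factor that scales k1 can be computed directly; a small sketch (the helper name `length_norm` is illustrative):

```python
def length_norm(doc_len: float, avg_len: float, b: float = 0.75) -> float:
    """Length-normalization factor that multiplies k1 in the BM25 denominator."""
    return 1 - b + b * (doc_len / avg_len)

# A document twice the average length:
print(length_norm(2000, 1000, b=0.75))  # → 1.75  (harder to saturate)
print(length_norm(2000, 1000, b=1.0))   # → 2.0   (full penalization)
print(length_norm(2000, 1000, b=0.0))   # → 1.0   (length ignored)
```

A factor above 1.0 raises the effective saturation threshold, so long documents need proportionally more occurrences to reach the same score.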
The complete BM25 score sums across all query terms: each term t contributes IDF(t) × (f(t,d) × (k1 + 1)) / (f(t,d) + k1 × (1 − b + b × |d| / avgdl)). The IDF component is typically smoothed by adding 0.5 to the numerator and denominator, log((N − df + 0.5) / (df + 0.5)), which stabilizes weights for rare terms; many implementations (Lucene among them) additionally add 1 inside the logarithm so that very common terms cannot receive negative or zero weights. This formula has become the industry standard for lexical retrieval because it handles real-world document distributions far better than TF-IDF, with only two tunable parameters.
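Putting saturation, length normalization, and smoothed IDF together, a minimal sketch over a toy corpus (function name and corpus contents are illustrative; the Lucene-style log(1 + …) IDF is used so weights stay non-negative):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with the full BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    # Length-normalization factor for this document.
    norm = 1 - b + b * (len(doc_tokens) / avgdl)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)          # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # smoothed, non-negative
        f = tf[t]
        score += idf * (f * (k1 + 1)) / (f + k1 * norm)
    return score

corpus = [
    ["kubernetes", "cluster", "setup"],
    ["kubernetes", "kubernetes", "kubernetes", "deployment",
     "guide", "for", "kubernetes"],
    ["baking", "bread", "at", "home"],
]
for doc in corpus:
    print(round(bm25_score(["kubernetes"], doc, corpus), 3))
```

The second document scores higher than the first (four mentions versus one), but saturation and its greater length keep the gap far smaller than a linear TF would give; the third document scores zero.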
In production at scale, BM25 consistently outperforms TF-IDF by 15 to 30 percent on ranking metrics like Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP). Microsoft's MS MARCO benchmark uses BM25 as the standard baseline that neural rerankers improve upon. Amazon's product search uses BM25F (a fielded variant) to weight product titles more heavily than descriptions, with typical settings of k1 around 1.2 and b around 0.75, tuned through A/B testing on click-through rates.
💡 Key Takeaways
•Parameter k1 controls saturation: with k1 of 1.2 the term frequency contribution plateaus just above 2 (the asymptote is k1 + 1 = 2.2) regardless of repetition, while k1 of 2.0 allows slightly more growth before saturating; tune via offline metrics
•Parameter b controls length penalty: with b of 0.75 (standard), a document twice the average length has normalization factor 1 − b + 2b = 1.75, so it needs roughly 1.75× the term frequency to score equally; b of 1.0 gives full penalization, b of 0.0 ignores length entirely
•IDF smoothing adds 0.5 to numerator and denominator, but the classic log((N − df + 0.5) / (df + 0.5)) still goes negative for terms in more than half the collection; implementations like Lucene compute log(1 + (N − df + 0.5) / (df + 0.5)) instead, so even terms in 80% of documents contribute positively, avoiding ranking instability
•Tuning wins in production: moving from default k1 of 1.2, b of 0.75 to domain optimized values (e.g., k1 of 1.8, b of 0.6 for short product titles) yields 5 to 15 percent NDCG improvements in A/B tests
•BM25F extends to fielded documents: product title gets weight of 2.0, brand gets 1.5, description gets 0.5; each field has separate length normalization, critical for heterogeneous document structures like e-commerce catalogs
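The fielded scoring in the last bullet can be sketched for a single query term. This follows one common BM25F formulation (per-field length normalization first, then weighted frequencies combined and saturated once); all field weights, b values, and lengths below are illustrative, not Amazon's actual settings:

```python
def bm25f_term_score(field_tfs, field_lens, field_avg_lens,
                     field_weights, field_b, idf, k1=1.2):
    """One query term's BM25F contribution across document fields.

    Each field's term frequency is length-normalized and weighted,
    then the combined pseudo-frequency is saturated a single time.
    """
    pseudo_tf = 0.0
    for field, tf in field_tfs.items():
        b = field_b[field]
        norm = 1 - b + b * (field_lens[field] / field_avg_lens[field])
        pseudo_tf += field_weights[field] * tf / norm
    return idf * pseudo_tf / (k1 + pseudo_tf)

weights = {"title": 2.0, "brand": 1.5, "description": 0.5}
bs = {"title": 0.5, "brand": 0.5, "description": 0.75}   # low b for short fields
avg = {"title": 5, "brand": 2, "description": 80}

# "wireless" once in a 4-word title...
title_hit = bm25f_term_score({"title": 1, "brand": 0, "description": 0},
                             {"title": 4, "brand": 1, "description": 60},
                             avg, weights, bs, idf=1.0)
# ...versus 15 times in a verbose 400-word description.
desc_hit = bm25f_term_score({"title": 0, "brand": 0, "description": 15},
                            {"title": 5, "brand": 2, "description": 400},
                            avg, weights, bs, idf=1.0)
print(title_hit > desc_hit)
```

A single title match beats fifteen mentions buried in an overlong description: the title's 2.0 weight amplifies its one occurrence, while the description's low weight and heavy length penalty shrink its repeated mentions before saturation.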
📌 Examples
Query "wireless keyboard" on Amazon: BM25F scores title match "Wireless Keyboard Mechanical" with k1=1.2, b=0.5 (low b for short titles) far higher than a lengthy description mentioning "wireless" 15 times, preventing verbose listings from dominating
Microsoft MS MARCO passage ranking: BM25 baseline achieves Mean Reciprocal Rank (MRR) at 10 of 0.187 on 8.8M passages; neural rerankers improve this to 0.39+ but use BM25 top 1,000 candidates as input, processed in under 10ms per query per shard
Wikipedia search with 6M articles averaging 800 words: k1=1.2, b=0.75 ensures a 200 word stub article with 3 mentions of "quantum entanglement" can outrank a 5,000 word overview mentioning it 18 times, if the stub is more focused