Search & Ranking SystemsRanking Algorithms (TF-IDF, BM25)Medium⏱️ ~3 min

How BM25 Improves TF-IDF with Saturation and Length Normalization

TF-IDF Has Two Problems

TF-IDF works but has flaws that hurt real world search quality. BM25 (Best Match 25, from the Okapi information retrieval system in the 1990s) fixes both with elegant mathematical tweaks. Understanding these fixes is essential because BM25 is the default algorithm in virtually all production search systems today.

Problem 1 Unbounded Term Frequency

In raw TF-IDF, a document mentioning "pizza" 100 times scores roughly 100x higher than one mentioning it once. But is a 100 mention document really 100x more relevant? Probably not. After 5 to 10 mentions, additional repetition adds diminishing semantic signal. Worse, keyword stuffing would dominate results. BM25 introduces saturation through parameter k1 (typically 1.2 to 2.0). The contribution asymptotically approaches a maximum. Formula: saturated_TF = (tf times (k1 + 1)) / (tf + k1). With k1 equals 1.2, tf equals 1 scores 0.92, tf equals 5 scores 2.73, tf equals 100 scores 3.17. Jump from 1 to 5 is 3x, but 5 to 100 is only 16 percent more.

Problem 2 Document Length Unfairness

A 10,000 word academic paper naturally accumulates more term occurrences than a 100 word product description. Raw TF-IDF penalizes short documents unfairly. They cannot accumulate high term frequencies even when highly relevant. BM25 normalizes by document length relative to corpus average. Parameter b (typically 0.75) controls normalization strength. With b equals 0.75, a document at half average length gets a boost equivalent to doubling its term frequencies, while a document at twice average length gets the opposite adjustment. A 100 word product description mentioning "pizza" twice now competes fairly with a 2,000 word blog post mentioning it ten times.

The Complete BM25 Formula

For each query term: score = IDF times (TF times (k1 + 1)) / (TF + k1 times (1 - b + b times docLength/avgLength)). The IDF component remains similar to TF-IDF. The TF component saturates via k1. The denominator includes length normalization via b. Default parameters (k1 equals 1.2, b equals 0.75) work well across most corpora, though domain specific tuning can improve relevance by 5 to 15 percent.

Why BM25 Dominates Search

BM25 became default in Elasticsearch 5.0 (2016), Lucene 6.0 (2017), and most production search systems. It handles messy reality of mixed document lengths, keyword stuffed spam, and varying content quality with robust behavior. Neural rerankers may add 5 to 10 percent relevance improvement on top of BM25, but BM25 remains the workhorse first stage retriever for its speed and predictability.

💡 Key Takeaways
k1 parameter (typically 1.2 to 2.0) controls TF saturation; tf equals 100 scores only 16 percent more than tf equals 5 instead of 20x more
b parameter (typically 0.75) controls length normalization; document at half average length gets equivalent of 2x TF boost
Full formula: IDF times (TF times (k1+1)) / (TF + k1 times (1 - b + b times docLen/avgLen))
Default parameters k1 equals 1.2 and b equals 0.75 work across most corpora; domain tuning can improve 5 to 15 percent
BM25 became default in Elasticsearch 5.0 and Lucene 6.0; remains first stage retriever even with neural reranking
📌 Interview Tips
1Walk through saturation: with k1 equals 1.2, tf equals 1 scores 0.92, tf equals 5 scores 2.73 (3x), tf equals 100 scores 3.17 (only 16 percent more than 5). This prevents keyword stuffing.
2Explain length normalization: 100 word description with 2 mentions competes fairly against 2000 word post with 10 mentions because the b parameter adjusts for relative length.
← Back to Ranking Algorithms (TF-IDF, BM25) Overview