How BM25 Improves TF-IDF with Saturation and Length Normalization
TF-IDF Has Two Problems
TF-IDF works but has flaws that hurt real world search quality. BM25 (Best Match 25, from the Okapi information retrieval system in the 1990s) fixes both with elegant mathematical tweaks. Understanding these fixes is essential because BM25 is the default algorithm in virtually all production search systems today.
Problem 1 Unbounded Term Frequency
In raw TF-IDF, a document mentioning "pizza" 100 times scores roughly 100x higher than one mentioning it once. But is a 100 mention document really 100x more relevant? Probably not. After 5 to 10 mentions, additional repetition adds diminishing semantic signal. Worse, keyword stuffing would dominate results. BM25 introduces saturation through parameter k1 (typically 1.2 to 2.0). The contribution asymptotically approaches a maximum. Formula: saturated_TF = (tf times (k1 + 1)) / (tf + k1). With k1 equals 1.2, tf equals 1 scores 0.92, tf equals 5 scores 2.73, tf equals 100 scores 3.17. Jump from 1 to 5 is 3x, but 5 to 100 is only 16 percent more.
Problem 2 Document Length Unfairness
A 10,000 word academic paper naturally accumulates more term occurrences than a 100 word product description. Raw TF-IDF penalizes short documents unfairly. They cannot accumulate high term frequencies even when highly relevant. BM25 normalizes by document length relative to corpus average. Parameter b (typically 0.75) controls normalization strength. With b equals 0.75, a document at half average length gets a boost equivalent to doubling its term frequencies, while a document at twice average length gets the opposite adjustment. A 100 word product description mentioning "pizza" twice now competes fairly with a 2,000 word blog post mentioning it ten times.
The Complete BM25 Formula
For each query term: score = IDF times (TF times (k1 + 1)) / (TF + k1 times (1 - b + b times docLength/avgLength)). The IDF component remains similar to TF-IDF. The TF component saturates via k1. The denominator includes length normalization via b. Default parameters (k1 equals 1.2, b equals 0.75) work well across most corpora, though domain specific tuning can improve relevance by 5 to 15 percent.
Why BM25 Dominates Search
BM25 became default in Elasticsearch 5.0 (2016), Lucene 6.0 (2017), and most production search systems. It handles messy reality of mixed document lengths, keyword stuffed spam, and varying content quality with robust behavior. Neural rerankers may add 5 to 10 percent relevance improvement on top of BM25, but BM25 remains the workhorse first stage retriever for its speed and predictability.