What is Term Frequency Inverse Document Frequency (TF-IDF)?
Term Frequency Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates how relevant a word is to a document within a collection of documents. It multiplies two components: Term Frequency (TF), which counts how often a term appears in a document, and Inverse Document Frequency (IDF), which measures how rare that term is across the entire corpus. If a term appears frequently in one document but rarely across all documents, it gets a high TF-IDF score, signaling strong relevance.
The math is straightforward: TF-IDF(term, doc) = TF(term, doc) × IDF(term). Term frequency is simply the count of occurrences. Inverse document frequency is calculated as log(total_documents / documents_containing_term), so common words like "the" get low IDF values while rare technical terms get high values. This elegant simplicity made TF-IDF the foundation of information retrieval for decades.
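As a concrete illustration, here is a minimal sketch of that formula in Python; the toy corpus and the raw-count TF choice are assumptions for this example, and real systems precompute these statistics in an inverted index:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term against one document: raw-count TF times log IDF."""
    tf = Counter(doc_tokens)[term]                        # how often the term appears in this doc
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)          # rare across the corpus -> high IDF
    return tf * idf

# Toy corpus: each document is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "kubernetes", "operator", "pattern", "kubernetes"],
]
print(tf_idf("kubernetes", corpus[2], corpus))  # 2 * log(3/1) ≈ 2.20
print(tf_idf("the", corpus[2], corpus))         # 1 * log(3/3) = 0.0
```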
The core assumption is that relevance scales linearly with repetition. If "kubernetes" appears 5 times, the document is scored as 5× more relevant than if it appeared once. This works reasonably well for many cases, but breaks down with keyword stuffing or verbose documents. A 10,000-word document mentioning "kubernetes" 50 times will dominate a focused 200-word article mentioning it 3 times, even if the short article is more relevant.
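A quick back-of-the-envelope comparison makes the failure mode concrete (the corpus size and document counts below are invented for illustration):

```python
import math

idf = math.log(1_000 / 20)   # assume "kubernetes" appears in 20 of 1,000 documents

long_doc_score = 50 * idf    # verbose 10,000-word post, 50 mentions  -> ~195.6
short_doc_score = 3 * idf    # focused 200-word article, 3 mentions   -> ~11.7

# With raw-count TF and no length normalization, the verbose document
# scores ~17x higher regardless of which one actually answers the query.
print(round(long_doc_score, 1), round(short_doc_score, 1))
```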
In production, TF-IDF has been largely superseded by BM25 because TF-IDF handles these edge cases poorly. Google's early search combined TF-IDF-style relevance signals with PageRank's link analysis. Modern systems use TF-IDF primarily for baseline comparisons or in lightweight applications where implementation simplicity matters more than retrieval quality. It remains valuable for understanding the foundation of lexical search before moving to more sophisticated algorithms.
💡 Key Takeaways
• Linear term frequency assumption means 10 occurrences score exactly 10× as high as a single occurrence, which overweights repetition and enables keyword stuffing
• Common words like "the" or "and" get near-zero IDF values (they appear in millions of documents), while rare technical terms get high IDF (they appear in only hundreds), which naturally filters stopwords
• No document length normalization means a 10,000-word blog post with 50 mentions of "kubernetes" will outrank a focused 500-word tutorial with 5 mentions, even if the longer post is less relevant
• Computational cost is minimal: O(query_terms) lookups in the inverted index plus simple multiplication enable sub-millisecond scoring on a single machine over millions of documents (see the scoring sketch after this list)
• Production pitfall: corpus statistics (IDF values) must be recomputed as documents are added, or new documents get inconsistent scores; this is expensive at scale with billions of documents
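A rough sketch of the lookup-and-multiply scoring described in the takeaways above, assuming an in-memory inverted index of term → {doc_id: tf} and a precomputed IDF table (real engines such as Lucene keep these statistics in compressed on-disk segments):

```python
import math
from collections import Counter, defaultdict

def build_index(corpus):
    """Build an inverted index (term -> {doc_id: tf}) plus a precomputed IDF table."""
    postings = defaultdict(dict)
    for doc_id, tokens in enumerate(corpus):
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    idf = {term: math.log(len(corpus) / len(docs)) for term, docs in postings.items()}
    return postings, idf

def score(query_terms, postings, idf):
    """One postings lookup per query term, then multiply-accumulate per matching document."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += tf * idf.get(term, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

corpus = [
    ["machine", "learning", "tutorial", "for", "beginners"],
    ["deep", "learning", "is", "a", "subfield", "of", "machine", "learning"],
    ["gardening", "tutorial", "for", "beginners"],
]
postings, idf = build_index(corpus)
print(score(["machine", "learning", "tutorial"], postings, idf))

# Pitfall from the last takeaway: adding documents changes document frequencies,
# so the IDF table must be rebuilt (or refreshed periodically) to keep scores consistent.
```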
📌 Examples
Search query "machine learning tutorial": A 5,000-word article mentioning "machine" 40 times and "learning" 35 times scores a TF-IDF of ~150, while a concise 800-word tutorial mentioning each 8 times scores only ~32, despite potentially being more useful
Google's early search (pre-2000s) combined TF-IDF with PageRank: TF-IDF handled relevance matching while PageRank added authority signals, but pure TF-IDF scoring was quickly replaced by more sophisticated ranking
Simple document clustering: calculate TF-IDF vectors for 100,000 news articles, then use cosine similarity to group similar articles; this works well because IDF naturally emphasizes distinctive terms like "brexit" or "tensorflow" over common words
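A small sketch of that clustering workflow using scikit-learn, with a handful of made-up headlines standing in for the 100,000 articles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for news articles (illustrative only).
articles = [
    "Brexit talks stall as negotiators miss another deadline",
    "New Brexit deal faces a vote in parliament next week",
    "TensorFlow 2.0 release makes eager execution the default",
    "Researchers benchmark TensorFlow against PyTorch on image models",
]

# TF-IDF vectors: distinctive terms like "brexit" or "tensorflow" carry most of the weight.
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Pairwise cosine similarity; same-topic articles score close to each other.
similarity = cosine_similarity(vectors)
print(similarity.round(2))
```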