What is Term Frequency Inverse Document Frequency (TF-IDF)?
Term Frequency Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates how relevant a word is to a document within a collection of documents. It multiplies two components: Term Frequency (TF), which counts how often a term appears in a document, and Inverse Document Frequency (IDF), which measures how rare that term is across the entire corpus. If a term appears frequently in one document but rarely across all documents, it gets a high TF-IDF score, signaling strong relevance.
The math is straightforward: TF-IDF(term, doc) = TF(term, doc) × IDF(term). Term frequency is simply the count of occurrences. Inverse document frequency is calculated as log(total_documents / documents_containing_term), so common words like "the" get low IDF values while rare technical terms get high values. This elegant simplicity made TF-IDF the foundation of information retrieval for decades.
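As a concrete illustration, here is a minimal sketch of that formula in Python; the toy corpus and the raw-count TF choice are assumptions for this example, and real systems precompute these statistics in an inverted index:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term against one document: raw-count TF times log IDF."""
    tf = Counter(doc_tokens)[term]                        # how often the term appears in this doc
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)          # rare across the corpus -> high IDF
    return tf * idf

# Toy corpus: each document is a list of tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "kubernetes", "operator", "pattern", "kubernetes"],
]
print(tf_idf("kubernetes", corpus[2], corpus))  # 2 * log(3/1) ≈ 2.20
print(tf_idf("the", corpus[2], corpus))         # 1 * log(3/3) = 0.0
```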
The core assumption is that relevance scales linearly with repetition. If "kubernetes" appears 5 times, the document is scored as 5× more relevant than if it appeared once. This works reasonably well for many cases, but breaks down with keyword stuffing or verbose documents. A 10,000-word document mentioning "kubernetes" 50 times will dominate a focused 200-word article mentioning it 3 times, even if the short article is more relevant.
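A quick back-of-the-envelope comparison makes the failure mode concrete (the corpus size and document counts below are invented for illustration):

```python
import math

idf = math.log(1_000 / 20)   # assume "kubernetes" appears in 20 of 1,000 documents

long_doc_score = 50 * idf    # verbose 10,000-word post, 50 mentions  -> ~195.6
short_doc_score = 3 * idf    # focused 200-word article, 3 mentions   -> ~11.7

# With raw-count TF and no length normalization, the verbose document
# scores ~17x higher regardless of which one actually answers the query.
print(round(long_doc_score, 1), round(short_doc_score, 1))
```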
In production, TF-IDF has been largely superseded by BM25 because TF-IDF handles these edge cases poorly. Google's early search combined TF-IDF-style relevance signals with PageRank's link analysis. Modern systems use TF-IDF primarily for baseline comparisons or in lightweight applications where implementation simplicity matters more than retrieval quality. It remains valuable for understanding the foundation of lexical search before moving to more sophisticated algorithms.
💡 Key Takeaways
• Linear term frequency assumption means 10 occurrences score exactly 10× as high as a single occurrence, which overweights repetition and enables keyword stuffing
• Common words like "the" or "and" get near-zero IDF values (they appear in millions of documents), while rare technical terms get high IDF (they appear in only hundreds), which naturally filters stopwords
• No document length normalization means a 10,000-word blog post with 50 mentions of "kubernetes" will outrank a focused 500-word tutorial with 5 mentions, even if the longer post is less relevant
• Computational cost is minimal: O(query_terms) lookups in the inverted index plus simple multiplication enable sub-millisecond scoring on a single machine over millions of documents (see the scoring sketch after this list)
• Production pitfall: corpus statistics (IDF values) must be recomputed as documents are added, or new documents get inconsistent scores; this is expensive at scale with billions of documents
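A rough sketch of the lookup-and-multiply scoring described in the takeaways above, assuming an in-memory inverted index of term → {doc_id: tf} and a precomputed IDF table (real engines such as Lucene keep these statistics in compressed on-disk segments):

```python
import math
from collections import Counter, defaultdict

def build_index(corpus):
    """Build an inverted index (term -> {doc_id: tf}) plus a precomputed IDF table."""
    postings = defaultdict(dict)
    for doc_id, tokens in enumerate(corpus):
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    idf = {term: math.log(len(corpus) / len(docs)) for term, docs in postings.items()}
    return postings, idf

def score(query_terms, postings, idf):
    """One postings lookup per query term, then multiply-accumulate per matching document."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += tf * idf.get(term, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

corpus = [
    ["machine", "learning", "tutorial", "for", "beginners"],
    ["deep", "learning", "is", "a", "subfield", "of", "machine", "learning"],
    ["gardening", "tutorial", "for", "beginners"],
]
postings, idf = build_index(corpus)
print(score(["machine", "learning", "tutorial"], postings, idf))

# Pitfall from the last takeaway: adding documents changes document frequencies,
# so the IDF table must be rebuilt (or refreshed periodically) to keep scores consistent.
```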
📌 Examples
Search query "machine learning tutorial": A 5,000-word article mentioning "machine" 40 times and "learning" 35 times scores a TF-IDF of ~150, while a concise 800-word tutorial mentioning each 8 times scores only ~32, despite potentially being more useful
Google's early search (pre-2000s) combined TF-IDF with PageRank: TF-IDF handled relevance matching while PageRank added authority signals, but pure TF-IDF scoring was quickly replaced by more sophisticated ranking
Simple document clustering: calculate TF-IDF vectors for 100,000 news articles, then use cosine similarity to group similar articles; this works well because IDF naturally emphasizes distinctive terms like "brexit" or "tensorflow" over common words
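A small sketch of that clustering workflow using scikit-learn, with a handful of made-up headlines standing in for the 100,000 articles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for news articles (illustrative only).
articles = [
    "Brexit talks stall as negotiators miss another deadline",
    "New Brexit deal faces a vote in parliament next week",
    "TensorFlow 2.0 release makes eager execution the default",
    "Researchers benchmark TensorFlow against PyTorch on image models",
]

# TF-IDF vectors: distinctive terms like "brexit" or "tensorflow" carry most of the weight.
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)

# Pairwise cosine similarity; same-topic articles score close to each other.
similarity = cosine_similarity(vectors)
print(similarity.round(2))
```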