What is Term Frequency Inverse Document Frequency (TF-IDF)?
The Newspaper Analogy
Imagine rating how well an article covers "climate change". An article mentioning "climate" 20 times seems relevant. But if every article in your newspaper mentions "climate" (a climate-focused publication), that word does not distinguish one article from another. Meanwhile, "permafrost" appearing 3 times in one article and nowhere else is a strong signal that the article specifically covers permafrost thawing. TF-IDF captures this intuition mathematically.
Term Frequency (TF)
TF measures how often a term appears in a specific document. A document mentioning "pizza" 10 times is probably more about pizza than one mentioning it once. Raw TF is just the count, but log-scaled TF using 1 + log(tf) is common to reduce the impact of extreme repetition. A document with 100 mentions should not score 100x higher than one with 10 mentions. Log scaling compresses this to roughly a 1.5x difference: 1 + log(100) = 3 versus 1 + log(10) = 2 in base 10.
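The log-scaled TF above can be sketched as a small helper. This is a minimal sketch using base-10 logs to match the numbers in the text; the function name is my own.

```python
import math

def log_tf(count: int) -> float:
    """Log-scaled term frequency: 1 + log10(tf); 0 if the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0

# 10x more repetition only adds 1 to the score:
print(log_tf(10))   # 2.0
print(log_tf(100))  # 3.0
```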
Inverse Document Frequency (IDF)
IDF measures how rare a term is across the entire corpus. The formula is typically log(N/df), where N is the total number of documents and df is how many documents contain the term (base 10 in the figures here). In a 10 million document corpus, "the" appears in 9.9 million documents (IDF approximately 0.004), while "permafrost" appears in 500 documents (IDF approximately 4.3). The ratio between these IDF values is over 1000x, which is why rare terms dominate scoring.
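The IDF figures in the paragraph can be reproduced directly from log(N/df). A minimal sketch, again in base 10; the function name is my own.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / doc_freq)

# 10 million document corpus, figures from the text:
print(round(idf(10_000_000, 9_900_000), 4))  # "the": 0.0044
print(round(idf(10_000_000, 500), 1))        # "permafrost": 4.3
```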
The Multiplication
TF-IDF equals TF times IDF. A word scores high only if it appears frequently in THIS document AND is rare overall. "The" appearing 50 times: high TF (log scaled to approximately 2.7), near zero IDF (0.004), final score approximately 0.01. "Permafrost" appearing 3 times: moderate TF (approximately 1.5), high IDF (4.3), final score approximately 6.5. The rare, repeated term dominates by 650x.
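Putting the two pieces together reproduces the worked example above. A minimal sketch combining log-scaled TF (base 10) with log(N/df) IDF; the function name is my own, and the exact scores vary slightly from the rounded figures in the text.

```python
import math

def tf_idf(count: int, n_docs: int, doc_freq: int) -> float:
    """TF-IDF with log-scaled TF: (1 + log10(tf)) * log10(N / df)."""
    tf = 1 + math.log10(count) if count > 0 else 0.0
    return tf * math.log10(n_docs / doc_freq)

# "the": 50 mentions, in 9.9M of 10M docs -> TF ~2.7 x IDF ~0.004 -> ~0.01
score_the = tf_idf(50, 10_000_000, 9_900_000)

# "permafrost": 3 mentions, in 500 of 10M docs -> TF ~1.5 x IDF ~4.3 -> ~6.4
score_perm = tf_idf(3, 10_000_000, 500)

# The rare, repeated term dominates by several hundred x.
```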
TF-IDF Limitations
TF-IDF has two major limitations. First, no document length normalization: a 10,000 word blog post with 50 mentions outranks a focused 500 word tutorial with 5 mentions, even if it is less relevant. Second, unbounded TF enables keyword stuffing: with raw counts, repeating a term 100 times scores 100x higher, and even log scaling only slows the growth rather than capping it. BM25 fixes both issues and replaced TF-IDF as the default scoring function in Lucene and Elasticsearch in 2016 (Lucene 6.0, Elasticsearch 5.0).
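BM25's fix for both problems lives in its TF component, which saturates toward an asymptote of k1 + 1 and discounts documents longer than the corpus average. A minimal sketch of that component under the standard defaults k1 = 1.2, b = 0.75; the function name is my own.

```python
def bm25_tf(tf: float, doc_len: float, avg_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """BM25's saturating TF component. As tf grows, the value approaches
    k1 + 1; longer-than-average documents are penalized via b."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Repetition saturates instead of growing linearly: going from 10 to 100
# mentions barely moves the score, so keyword stuffing stops paying off.
ten = bm25_tf(10, doc_len=1000, avg_len=1000)
hundred = bm25_tf(100, doc_len=1000, avg_len=1000)
```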