What is an Inverted Index and How Does It Work?
The Book Index Analogy
Imagine a 500 page book where you need to find every mention of "database." Without an index, you read all 500 pages. With an index at the back, you look up "database" and get "pages 23, 47, 89, 156." The index inverts the relationship: instead of page to words, you get word to pages. An inverted index does the same for document collections at massive scale where the "book" might contain billions of pages.
The Two Core Components
An inverted index has two parts. First, a term dictionary containing all unique terms in your collection, stored as a hash table or B tree for O(1) or O(log n) lookups. Second, for each term, a postings list containing the document IDs where that term appears. A corpus with 10 million unique terms requires 100 to 500 MB just for the dictionary. The postings lists are much larger, often 10x to 100x the dictionary size.
Why Inverted Not Forward
Normal storage is document oriented (forward index): Document 1 contains words A, B, C. Inverted storage is term oriented: "pizza" appears in Documents 1, 5, 23. You invert the mapping from "document has terms" to "term is in documents." This transformation is what makes text search fundamentally different from relational database queries. A WHERE clause in SQL scans rows; an inverted index lookup finds relevant rows directly without scanning.
The Speed Difference
Scanning 10 million documents for a word means reading perhaps 50 GB of text, taking 30 to 60 seconds even on fast SSDs. Looking up a term in an inverted index: O(1) dictionary lookup (microseconds) plus O(k) to read k matching documents. If only 500 documents contain "pizza", you read 500 entries (perhaps 4 KB), not 50 GB. This is why search returns in 10 to 50 ms while a naive scan takes a minute.