What is an Inverted Index and How Does It Work?

Definition
An Inverted Index flips the relationship between documents and words. Instead of asking "what words are in this document?", it lets you ask "which documents contain this word?" and get an answer in milliseconds, even across billions of documents.
The Book Index Analogy
Imagine a 500 page book where you need to find every mention of "database." Without an index, you read all 500 pages. With an index at the back, you look up "database" and get "pages 23, 47, 89, 156." The index inverts the relationship: instead of page to words, you get word to pages. An inverted index does the same for document collections at massive scale where the "book" might contain billions of pages.
The Two Core Components
An inverted index has two parts. First, a term dictionary containing all unique terms in your collection, stored as a hash table or B tree for O(1) or O(log n) lookups. Second, for each term, a postings list containing the document IDs where that term appears. A corpus with 10 million unique terms requires 100 to 500 MB just for the dictionary. The postings lists are much larger, often 10x to 100x the dictionary size.
Why Inverted Not Forward
Normal storage is document oriented (forward index): Document 1 contains words A, B, C. Inverted storage is term oriented: "pizza" appears in Documents 1, 5, 23. You invert the mapping from "document has terms" to "term is in documents." This transformation is what makes text search fundamentally different from relational database queries. A WHERE clause in SQL scans rows; an inverted index lookup finds relevant rows directly without scanning.
The Speed Difference
Scanning 10 million documents for a word means reading perhaps 50 GB of text, taking 30 to 60 seconds even on fast SSDs. Looking up a term in an inverted index: O(1) dictionary lookup (microseconds) plus O(k) to read k matching documents. If only 500 documents contain "pizza", you read 500 entries (perhaps 4 KB), not 50 GB. This is why search returns in 10 to 50 ms while a naive scan takes a minute.

💡 Key Takeaways

✓Term dictionary stores vocabulary of all unique terms; postings lists map each term to sorted document IDs, enabling sub millisecond lookups without scanning

✓Dictionary lookup is O(1) with hash table or O(log n) with B tree; postings list traversal is O(k) where k is matching document count

✓A 10 million term dictionary requires 100 to 500 MB; postings lists are typically 10x to 100x larger depending on term frequencies

✓Inverted means term to documents instead of document to terms; this trades write complexity for read efficiency

✓Scanning 50 GB of text takes 30 to 60 seconds; inverted index lookup on 500 documents reads 4 KB and returns in 10 to 50 ms

📌 Interview Tips

1Draw the two components: dictionary on left with term to pointer, postings lists on right with sorted doc IDs. Show lookup for pizza finding pointer then reading list.

2Contrast with SQL: WHERE content LIKE percent pizza percent scans all rows; inverted index jumps directly to matching documents. This is the fundamental insight.

← Back to Inverted Index & Text Search Overview