
Building and Scaling Inverted Indexes: The Indexing Pipeline

The Indexing Pipeline

Raw text cannot be directly added to an inverted index. It must pass through a pipeline that normalizes and tokenizes text. For a corpus of 10 million documents averaging 1,000 words each, the pipeline processes 10 billion tokens, and small decisions affect both index size and search quality.
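The stages below can be condensed into a minimal end-to-end sketch. This is a toy illustration, not a production indexer: the function name `index_document` and the regex tokenizer are assumptions for this example, and normalization is reduced to lowercasing.

```python
import re

def index_document(doc_id, text, index):
    """Minimal pipeline sketch: tokenize -> lowercase -> append to postings."""
    for token in re.findall(r"\w+", text.lower()):
        index.setdefault(token, set()).add(doc_id)

index = {}
index_document(1, "The quick brown fox", index)
index_document(2, "The lazy dog", index)
# index["the"] now maps to both documents: {1, 2}
```

Each later pipeline stage (stemming, stopword handling) would slot in between tokenization and the postings update.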

Tokenization

Tokenization splits text into individual tokens. "The quick brown fox" becomes ["The", "quick", "brown", "fox"]. Is "New York" one token or two? Standard tokenizers split on whitespace, producing two. Phrase-aware tokenizers recognize multi-word entities. A typical English document produces 150 to 200 tokens per 1,000 characters. CJK languages, which are written without whitespace, need specialized n-gram or dictionary-based tokenizers.
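A whitespace-style tokenizer can be sketched with a word-character regex; note how it splits "New York" into two tokens, exactly the behavior described above:

```python
import re

def tokenize(text):
    """Split on word boundaries -- a minimal whitespace-style tokenizer."""
    return re.findall(r"\w+", text)

tokenize("The quick brown fox")   # ["The", "quick", "brown", "fox"]
tokenize("New York")              # ["New", "York"] -- two tokens, not one
```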

Normalization and Case Folding

Normalization converts text to a canonical form: lowercasing ("The" becomes "the"), accent removal ("café" becomes "cafe"), and Unicode normalization. A user searching "PIZZA" then finds documents containing "pizza". Case folding alone reduces dictionary size by 30 to 40 percent; without it, you need separate entries for Pizza, pizza, and PIZZA.
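Case folding and accent removal can both be done with the standard library: NFKD decomposition splits accented characters into a base letter plus combining marks, which are then dropped. The function name `normalize` is an assumption for this sketch:

```python
import unicodedata

def normalize(token):
    """Lowercase, then strip combining accent marks via NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

normalize("Café")   # "cafe"
normalize("PIZZA")  # "pizza"
```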

Stemming and Lemmatization

Stemming reduces words to their roots. The Porter stemmer maps "running" and "runs" to "run", so searching "running" finds "runs" (irregular forms like "ran" are beyond suffix stripping). Stemming shrinks the dictionary by 20 to 30 percent. But aggressive stemming causes false matches: "universe" and "university" might both collapse to "univers". Lemmatization uses vocabulary lookup to handle irregular forms ("better" becomes "good", "ran" becomes "run") but adds 5 to 10 ms per document.
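The idea of suffix stripping can be shown with a deliberately crude sketch. This is not the real Porter algorithm (which applies staged rules with measure conditions); it only demonstrates why "running" and "runs" conflate while the irregular "ran" does not:

```python
def crude_stem(word):
    """Toy suffix stripper -- NOT the Porter algorithm, just the idea."""
    for suffix in ("ning", "ing", "s"):
        # Require a stem of at least 3 chars to avoid mangling short words
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

crude_stem("running")  # "run"
crude_stem("runs")     # "run"
crude_stem("ran")      # "ran" -- irregular form, needs lemmatization
```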

Stopword Handling

Stopwords like "the", "is", "and" appear in 80 to 90 percent of documents. Removing them shrinks postings by 30 to 40 percent. But "The Who" (the band) needs "The". Modern systems keep stopwords but downweight them in scoring, enabling phrase matching while minimizing noise.
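Downweighting rather than removing stopwords can be sketched as a per-term scoring factor. The names `term_weight` and `stopword_factor`, and the 0.1 factor, are illustrative assumptions:

```python
STOPWORDS = {"the", "is", "and", "a", "of"}

def term_weight(term, base_weight=1.0, stopword_factor=0.1):
    """Keep stopwords in the index but shrink their scoring contribution."""
    return base_weight * (stopword_factor if term in STOPWORDS else 1.0)

# The phrase "The Who" still matches because "the" remains indexed,
# but "the" contributes only 0.1 to the score while "who" contributes 1.0.
term_weight("the")  # 0.1
term_weight("who")  # 1.0
```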

Segment Architecture

After processing, each token is appended to its postings list along with the document ID. Segments are immutable: new documents go into new segments. Elasticsearch, built on Lucene, refreshes every 1 second by default, creating a new segment for near-real-time search. Background merges consolidate segments when the count exceeds 10 to 20. Merging causes 5x to 10x write amplification, so it requires throttling during peak query hours.
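The immutable-segment-plus-merge pattern can be sketched in a few lines. The class and function names are assumptions; real engines also delete-and-rewrite tombstoned documents during merges, which this sketch omits:

```python
class Segment:
    """Immutable postings snapshot: term -> sorted tuple of doc IDs."""
    def __init__(self, postings):
        self.postings = {t: tuple(sorted(ids)) for t, ids in postings.items()}

def merge_segments(segments):
    """Background merge: union postings from small segments into one larger one."""
    merged = {}
    for seg in segments:
        for term, ids in seg.postings.items():
            merged.setdefault(term, set()).update(ids)
    return Segment(merged)

s1 = Segment({"fox": [1], "quick": [1]})   # segment from first refresh
s2 = Segment({"fox": [2]})                 # segment from second refresh
big = merge_segments([s1, s2])
# big.postings["fox"] == (1, 2)
```

Because the inputs are never mutated, readers can keep serving queries from the old segments until the merged one is swapped in.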

💡 Key Takeaways
- Tokenization produces 150 to 200 tokens per 1,000 characters for English; CJK languages need n-gram or dictionary tokenizers
- Case folding reduces the dictionary by 30 to 40 percent; stemming reduces another 20 to 30 percent but causes false matches
- Lemmatization uses vocabulary lookup, adding 5 to 10 ms per document; stemming is faster but less accurate
- Stopword removal shrinks postings by 30 to 40 percent; modern systems downweight instead of remove, for phrase support
- Segment merging causes 5x to 10x write amplification; throttle during peak hours to prevent IO contention
📌 Interview Tips
1. Walk through the pipeline: raw text to tokens to normalized to stemmed to indexed. Show how each step affects index size.
2. Discuss stemming tradeoffs: Porter merges "running" and "runs" for recall, but would also merge "universe" and "university" incorrectly.