Building and Scaling Inverted Indexes: The Indexing Pipeline
The Indexing Pipeline
Raw text cannot be directly added to an inverted index. It must pass through a pipeline that normalizes and tokenizes text. For a corpus of 10 million documents averaging 1,000 words each, the pipeline processes 10 billion tokens, and small decisions affect both index size and search quality.
Tokenization
Tokenization splits text into individual tokens. "The quick brown fox" becomes ["The", "quick", "brown", "fox"]. Is "New York" one token or two? Standard tokenizers split on whitespace, producing two. Phrase-aware tokenizers recognize multi-word entities. A typical English document produces 150 to 200 tokens per 1,000 characters. CJK languages, which lack whitespace between words, need specialized n-gram or dictionary tokenizers.
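A minimal sketch of such a tokenizer, using a regular expression over word characters (not phrase-aware, so "New York" still yields two tokens):

```python
import re

def tokenize(text):
    # Extract runs of word characters; splits on whitespace and punctuation.
    # A phrase-aware tokenizer would additionally merge known multi-word
    # entities, which this sketch does not attempt.
    return re.findall(r"\w+", text)

tokenize("The quick brown fox")  # ["The", "quick", "brown", "fox"]
tokenize("New York")             # ["New", "York"] -- two tokens
```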
Normalization and Case Folding
Normalization converts text to canonical form: lowercasing ("The" becomes "the"), accent removal ("café" becomes "cafe"), Unicode normalization. A user searching "PIZZA" finds documents containing "pizza". Case folding alone reduces dictionary size by 30 to 40 percent. Without it, you need separate entries for Pizza, pizza, and PIZZA.
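These steps compose naturally: lowercase first, then decompose to NFD so combining accent marks can be stripped. A minimal sketch:

```python
import unicodedata

def normalize(token):
    # Case-fold, then decompose to NFD so accents become separate
    # combining characters, which can then be filtered out.
    folded = token.lower()
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

normalize("Café")   # "cafe"
normalize("PIZZA")  # "pizza"
```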
Stemming and Lemmatization
Stemming reduces words to roots. The Porter stemmer maps "running" and "runs" to "run"; the irregular "ran" survives unchanged, since suffix stripping cannot see past inflection. Now searching "running" finds "runs". Stemming reduces the dictionary by 20 to 30 percent. But aggressive stemming causes false matches: "universe" and "university" both stem to "univers". Lemmatization uses vocabulary lookup instead ("better" becomes "good", "ran" becomes "run") but adds 5 to 10 ms per document.
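The idea can be illustrated with a toy suffix stripper, far cruder than the real Porter algorithm (the suffix list and minimum-stem length here are made up for illustration):

```python
def stem(token):
    # Toy rule-based stemmer: strip the first matching suffix, but only
    # if at least three characters of stem remain. Real stemmers like
    # Porter's apply ordered, measure-conditioned rewrite rules instead.
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

stem("running")  # "run"
stem("runs")     # "run"
stem("ran")      # "ran" -- irregular form, untouched by suffix stripping
```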
Stopword Handling
Stopwords like "the", "is", "and" appear in 80 to 90 percent of documents. Removing them shrinks postings by 30 to 40 percent. But "The Who" (the band) needs "The". Modern systems keep stopwords but downweight them in scoring, enabling phrase matching while minimizing noise.
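The keep-but-downweight approach can be sketched as follows, using a positional postings dict (the 0.1 stopword weight is an illustrative value, not a standard constant):

```python
STOPWORDS = {"the", "is", "and", "a", "of"}

def index_tokens(doc_id, tokens, postings):
    # Record every token with its position, so phrase queries like
    # "The Who" still work, but tag stopwords with a reduced scoring
    # weight rather than dropping them from the postings entirely.
    for pos, tok in enumerate(tokens):
        weight = 0.1 if tok in STOPWORDS else 1.0
        postings.setdefault(tok, []).append((doc_id, pos, weight))

postings = {}
index_tokens(1, ["the", "who", "played", "live"], postings)
# "the" is indexed at position 0 with weight 0.1; "who" at position 1
# with weight 1.0, so the phrase "the who" remains matchable.
```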
Segment Architecture
After processing, each token goes to its postings list along with its document ID. Segments are immutable: new documents create new segments. Elasticsearch, built on Lucene, refreshes every 1 second by default, creating new segments for near-real-time search. Background merges consolidate when segment count exceeds 10 to 20. Merging causes 5x to 10x write amplification, requiring throttling during peak query hours.
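Because segments are immutable, a merge simply combines their per-term postings into a new consolidated segment. A minimal sketch, assuming each segment is a dict of term to sorted postings and doc IDs are globally unique:

```python
def merge_segments(segments):
    # Combine several immutable segments (term -> list of doc IDs) into
    # one new segment. Since doc IDs never repeat across segments, the
    # lists can be concatenated and re-sorted; real engines stream-merge
    # the already-sorted lists instead of sorting from scratch.
    merged = {}
    for seg in segments:
        for term, postings in seg.items():
            merged.setdefault(term, []).extend(postings)
    for postings in merged.values():
        postings.sort()  # keep each postings list ordered by doc ID
    return merged

seg_a = {"fox": [1, 3], "run": [2]}
seg_b = {"fox": [5], "pizza": [4]}
merge_segments([seg_a, seg_b])
# {"fox": [1, 3, 5], "run": [2], "pizza": [4]}
```

Every merge rewrites all postings in the inputs, which is where the 5x to 10x write amplification comes from as documents move through successive merge generations.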