
Building and Scaling Inverted Indexes: The Indexing Pipeline

Building an inverted index involves a multi-stage pipeline that transforms raw text into searchable structures. First, text undergoes tokenization and normalization: lowercasing, removing diacritics, and handling punctuation. Language-specific analyzers then apply stopword removal (filtering common words like "the" and "is"), stemming or lemmatization (reducing "running" to "run"), and token filters. The processed tokens become postings written to immutable, append-only index segments.

At scale, systems avoid monolithic indexes by using segment-based architectures. New documents land in small, frequently committed segments (in-memory or hot segments flushed every few seconds), enabling near-real-time search with visibility latency from under one second to a few seconds. Background merge compaction continuously combines small segments into larger ones, reclaiming space from deletions and improving query performance by reducing the number of segments a query must scan.

Amazon retail search uses change data capture (CDC) from product catalogs to keep indexes fresh at seconds-to-minutes latency, ensuring price and availability updates appear quickly; versioned updates prevent out-of-order writes and partial document states. Google's indexing infrastructure handles trillions of documents through massive sharding and replication, with strict per-stage latency budgets requiring retrieval to return top candidates in tens of milliseconds per shard.
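The sketch below shows the analysis and postings stages in miniature: a regex tokenizer, a tiny stopword list, and a crude suffix-stripping stand-in for real stemming. Production analyzers (Lucene's, for example) are far richer, but the term-to-postings shape is the same.

```python
# Minimal sketch of analysis + postings. The stopword list and the
# "strip -ing" stemmer are illustrative assumptions, not a real analyzer.
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "a", "an", "and", "of"}

def analyze(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop stopwords, apply crude stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        if tok.endswith("ing") and len(tok) > 5:  # stand-in for stemming/lemmatization
            tok = tok[:-3]
        out.append(tok)
    return out

def build_segment(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Return postings: term -> sorted list of (doc_id, position)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            postings[term].append((doc_id, pos))
    # A segment is written once and never mutated; keep terms sorted for fast merging.
    return {term: sorted(plist) for term, plist in sorted(postings.items())}

segment = build_segment({1: "The dog is running", 2: "Running shoes and dog toys"})
# {'dog': [(1, 0), (2, 2)], 'runn': [(1, 1), (2, 0)], 'shoes': [(2, 1)], 'toys': [(2, 3)]}
```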
💡 Key Takeaways
Segment-based architecture uses immutable, append-only segments with background merges. Near-real-time indexing commits small segments every 1 to 5 seconds for freshness under 10 seconds, while batch indexing reduces merge overhead but increases staleness to minutes or hours (see the segment-buffer sketch after this list)
Merge compaction is write-amplification-heavy. Aggressive near-real-time settings can cause merge storms that spike CPU and input/output (IO), contending with query traffic. Throttle merges, schedule them off-peak, or isolate IO to separate volumes
Language-specific analyzers are critical for recall. Chinese, Japanese, and Korean (CJK) languages, which lack whitespace between words, need n-gram or dictionary-based tokenizers. German compound words and diacritics in European languages require specialized handling, or recall can drop by 20 to 40 percent
Deletions and updates are handled via tombstones plus re-adding documents. Deleted docs stay in segments until merge compaction, wasting space and slowing queries. Frequent updates cause high merge load and index bloat
Dictionary compression using tries or finite state transducers (FSTs) keeps vocabularies in memory for fast prefix seeks and autocomplete support, with millions of unique terms fitting in a few hundred megabytes (see the trie sketch after this list)
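To make the segment life cycle concrete, here is a hedged sketch of an in-memory buffer flushed into immutable segments, with tombstones for deletes and a simple merge that drops deleted docs. The flush interval and merge policy are illustrative assumptions, not Lucene or Elasticsearch defaults.

```python
# Sketch of segment buffering, tombstone deletes, and merge compaction.
import time

class SegmentWriter:
    def __init__(self, flush_interval_s: float = 1.0):
        self.buffer = {}          # doc_id -> analyzed doc; not searchable until flushed
        self.segments = []        # immutable dicts, oldest first
        self.tombstones = set()   # doc_ids deleted but still present in old segments
        self.flush_interval_s = flush_interval_s
        self._last_flush = time.monotonic()

    def index(self, doc_id: int, doc: dict):
        self.tombstones.discard(doc_id)   # a re-add supersedes a prior delete
        self.buffer[doc_id] = doc
        self._maybe_flush()

    def delete(self, doc_id: int):
        self.buffer.pop(doc_id, None)
        self.tombstones.add(doc_id)       # old segments keep the doc until a merge

    def _maybe_flush(self):
        if time.monotonic() - self._last_flush >= self.flush_interval_s:
            if self.buffer:
                self.segments.append(dict(self.buffer))  # freeze as an immutable segment
                self.buffer = {}
            self._last_flush = time.monotonic()

    def merge_smallest(self, k: int = 2):
        """Merge the k smallest segments, dropping tombstoned docs to reclaim space."""
        if len(self.segments) < k:
            return
        victims = sorted(self.segments, key=len)[:k]
        merged = {}
        for seg in victims:
            merged.update({d: doc for d, doc in seg.items() if d not in self.tombstones})
        self.segments = [s for s in self.segments
                         if not any(s is v for v in victims)] + [merged]
```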
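And a toy trie standing in for the compressed term dictionary. Real engines use FSTs with much tighter memory layouts, but the prefix-seek pattern that powers autocomplete looks the same.

```python
# Toy trie term dictionary with prefix seek (autocomplete-style lookup).
class TrieNode:
    __slots__ = ("children", "is_term")
    def __init__(self):
        self.children = {}
        self.is_term = False

class TermDictionary:
    def __init__(self):
        self.root = TrieNode()

    def add(self, term: str):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def seek_prefix(self, prefix: str) -> list[str]:
        """Return all indexed terms starting with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack:
            cur, path = stack.pop()
            if cur.is_term:
                results.append(path)
            for ch, child in cur.children.items():
                stack.append((child, path + ch))
        return sorted(results)

d = TermDictionary()
for t in ["run", "runner", "running", "rust"]:
    d.add(t)
print(d.seek_prefix("run"))   # ['run', 'runner', 'running']
```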
📌 Examples
Twitter Earlybird used time-partitioned shards to limit tail latency for recency-focused queries. Hot recent shards stayed in memory, giving sub-100-millisecond p99 query latencies while keeping indexing latency under 10 seconds from tweet creation
Elasticsearch deployments typically flush in-memory buffers to segments every 1 second by default, providing near-real-time visibility. Background merges run continuously, with configurable throttling to prevent IO starvation during query spikes
Amazon product catalog indexing uses CDC pipelines with idempotent replay based on document versions. Out-of-order updates are rejected to avoid exposing partial states, such as an updated price with stale availability (see the version-check sketch after this list)
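A hedged sketch of the version-gated upsert idea follows. The function name and event shape are hypothetical, not Amazon's actual pipeline; the point is that replayed or out-of-order events are dropped, so a document's fields always move together.

```python
# Version-gated upserts in a CDC consumer (illustrative only).
index: dict[str, dict] = {}   # doc_id -> {"version": int, "fields": {...}}

def apply_change(doc_id: str, version: int, fields: dict) -> bool:
    """Apply a CDC event only if it is newer than what the index already holds."""
    current = index.get(doc_id)
    if current is not None and version <= current["version"]:
        return False   # stale or replayed event: safe to drop (idempotent)
    # Whole-document replace, so price and availability are never mixed across versions.
    index[doc_id] = {"version": version, "fields": fields}
    return True

apply_change("sku-1", 5, {"price": 19.99, "available": True})
apply_change("sku-1", 4, {"price": 24.99, "available": True})   # rejected: out of order
```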