How Search Databases Use Inverted Indexes for Fast Retrieval
How Inverted Indexes Enable Fast Search
Traditional databases scan documents sequentially to find matches, requiring O(n) time where n is the total document count. With an inverted index, a search for "distributed systems" retrieves the posting list for each term, then computes the intersection to find documents containing both words. This set operation typically executes in milliseconds even on large collections because its cost is proportional to the lengths of the pre-computed sorted posting lists, not to the number of raw documents.
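The intersection step above can be sketched as a merge over two sorted posting lists; the document IDs here are illustrative, not from any real index.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted posting lists of document IDs."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])   # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1              # advance the list with the smaller ID
        else:
            j += 1
    return out

# Hypothetical postings: "distributed" in docs 1, 4, 7, 9; "systems" in 2, 4, 9, 12
print(intersect([1, 4, 7, 9], [2, 4, 9, 12]))  # -> [4, 9]
```

Because both lists are sorted, the merge advances each pointer at most once per entry, so the cost is linear in the posting list lengths rather than in the collection size.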
Boolean queries combine posting lists using set operations: AND uses intersection, OR uses union, and NOT uses set difference. A query like "python AND backend NOT junior" retrieves three posting lists and combines them with one intersection and one set difference, executing in microseconds. The index also stores term frequencies and document lengths to compute relevance scores, ranking results by how well they match the query rather than returning unordered matches.
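The boolean combination can be shown directly with set operations; the index contents below are a hypothetical toy example.

```python
# Hypothetical toy index: term -> set of document IDs
index = {
    "python":  {1, 2, 3, 5},
    "backend": {2, 3, 4},
    "junior":  {3, 6},
}

# "python AND backend NOT junior":
# AND = intersection (&), NOT = set difference (-)
result = (index["python"] & index["backend"]) - index["junior"]
print(sorted(result))  # -> [2]
```

Note that NOT is applied as a difference against the running result set rather than against the whole collection, which keeps the operation cheap.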
Analyzers and Text Processing Pipeline
Raw text passes through an analyzer pipeline before indexing: tokenization splits text into terms, lowercasing normalizes case, stemming reduces words to root forms ("running" becomes "run"), and stop word removal filters common words like "the" and "is". This preprocessing ensures queries match regardless of exact phrasing. A search for "runs" will match documents containing "running" or "runs" because both reduce to the stem "run"; matching irregular forms like "ran" additionally requires lemmatization rather than simple suffix stripping.
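A minimal sketch of such a pipeline follows. The stemmer and stop word list are deliberately toy-sized; a real system would use a full stemmer such as Porter or Snowball.

```python
import re

STOP_WORDS = {"the", "is", "a", "of"}  # illustrative subset only

def crude_stem(token):
    """Toy suffix-stripping stemmer (stand-in for Porter/Snowball)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    if len(token) >= 4 and token[-1] == token[-2]:
        token = token[:-1]  # collapse doubled final consonant: "runn" -> "run"
    return token

def analyze(text):
    """Tokenize, lowercase, drop stop words, then stem each term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The running runs"))  # -> ['run', 'run']
```

Both the indexed document and the query pass through the same analyzer, which is what makes "runs" and "running" meet at the same term in the index.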
Analyzers also enable fuzzy matching for typos ("javascirpt" matches "javascript" with edit distance 1), synonym expansion ("laptop" matches "notebook"), and language-specific processing. Different languages require different analyzers: German compounds like "Entwicklungsumgebung" need decomposition, while Chinese and Japanese need specialized tokenizers since they are written without spaces between words.
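The "edit distance 1" claim for a transposed typo holds under Damerau-Levenshtein distance, which counts swapping two adjacent characters as a single edit (plain Levenshtein would count it as 2). A sketch of the optimal-string-alignment variant:

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute,
    and adjacent transposition each cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("javascirpt", "javascript"))  # -> 1 (one "ir" -> "ri" swap)
```

In practice, fuzzy queries do not compare the query against every term; engines typically build an automaton that walks the term dictionary and visits only terms within the distance bound.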
Write Path and Near Real Time Visibility
Documents follow a specific write path: new documents land in an in-memory buffer, get written to an immutable segment file during refresh (typically every 1 second), then become searchable. Multiple small segments periodically merge into larger segments through background compaction, reducing file count and improving query performance. This near real time architecture trades immediate visibility for dramatically higher write throughput: a write becomes searchable only after the next refresh, an eventual consistency window the system explicitly accepts.
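The buffer/refresh/merge lifecycle can be sketched as follows; all class and method names here are illustrative, not a real engine's API.

```python
import itertools

class ToyIndexWriter:
    """Sketch of a segment-based write path."""
    def __init__(self):
        self.buffer = []    # in-memory, NOT yet visible to searches
        self.segments = []  # immutable, searchable segments

    def add(self, doc):
        self.buffer.append(doc)          # fast append; no segment I/O here

    def refresh(self):
        """Flush the buffer into a new immutable segment (runs ~every second)."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))  # tuple = immutable
            self.buffer = []

    def merge(self):
        """Background compaction: fold small segments into one larger segment."""
        merged = tuple(itertools.chain.from_iterable(self.segments))
        self.segments = [merged]

    def search(self, predicate):
        # Readers only ever see refreshed segments, never the live buffer.
        return [d for seg in self.segments for d in seg if predicate(d)]

w = ToyIndexWriter()
w.add("doc1"); w.add("doc2")
print(w.search(lambda d: True))  # -> [] (buffered, not yet refreshed)
w.refresh()
print(w.search(lambda d: True))  # -> ['doc1', 'doc2']
```

The key property is visible in `search`: a document added but not yet refreshed is invisible, which is exactly the eventual consistency window described above.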
The segment-based architecture provides crash recovery: committed segments persist to disk, while buffered writes that have not yet reached a segment are recovered on restart by replaying a transaction log. A system handling 50,000 documents per second can maintain sub-100ms query latency because writes and reads operate on different data structures: writers append to in-memory buffers while readers search immutable segments.
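The recovery path can be illustrated with a minimal write-ahead pattern; this is a sketch under assumed names, not a real engine's translog API, and the list standing in for the log would in practice be an fsync'd file.

```python
class ToyTranslog:
    """Stand-in for an on-disk, fsync'd transaction log."""
    def __init__(self):
        self.log = []

    def append(self, op):
        self.log.append(op)

def index_doc(translog, buffer, doc):
    translog.append(("add", doc))  # make the write durable first...
    buffer.append(doc)             # ...then update the in-memory buffer

def recover(translog):
    """On restart, rebuild the in-memory buffer by replaying the log."""
    buffer = []
    for op, doc in translog.log:
        if op == "add":
            buffer.append(doc)
    return buffer

tlog = ToyTranslog()
buf = []
for d in ("d1", "d2"):
    index_doc(tlog, buf, d)
# Simulate a crash: the in-memory buffer is lost, but the log survives.
print(recover(tlog))  # -> ['d1', 'd2']
```

Once a refresh persists those documents into a segment, the corresponding log entries can be truncated, keeping replay time bounded.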