
How Search Databases Use Inverted Indexes for Fast Retrieval

Definition
Search databases are specialized systems optimized for full-text search and analytics over large document collections. They achieve sub-100ms query latency across billions of documents using an inverted index, a data structure that maps each unique term to a posting list (a sorted array of IDs for the documents containing that term). Instead of scanning every document sequentially, the system performs one dictionary lookup per query term to retrieve all matching document IDs.

How Inverted Indexes Enable Fast Search

Traditional databases scan documents sequentially to find matches, requiring O(n) time where n is the total document count. With an inverted index, a search for "distributed systems" retrieves the posting list for each term, then computes their intersection to find documents containing both words. This set operation executes in milliseconds regardless of collection size because its cost scales with the length of the posting lists involved rather than with the number of raw documents.
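The idea can be sketched in a few lines. This is a toy model, not Elasticsearch or Lucene internals: documents, terms, and the `intersect` helper are all illustrative.

```python
# Minimal inverted index sketch: each term maps to a sorted posting
# list of document IDs. All names and documents are toy examples.
from collections import defaultdict

docs = {
    0: "distributed systems design",
    1: "operating systems internals",
    2: "distributed databases and systems",
}

index = defaultdict(list)
for doc_id in sorted(docs):                 # iterate in ID order so
    for term in set(docs[doc_id].split()):  # posting lists stay sorted
        index[term].append(doc_id)

def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# "distributed systems": documents containing BOTH terms
print(intersect(index["distributed"], index["systems"]))  # [0, 2]
```

The merge-style intersection is why sorted posting lists matter: two sorted lists can be intersected in a single linear pass with no lookups into the documents themselves.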

Boolean queries combine posting lists using set operations: AND uses intersection, OR uses union, and NOT uses set difference. A query like "python AND backend NOT junior" combines three posting lists with two set operations, executing in microseconds. The index also stores term frequencies and document lengths to compute relevance scores, ranking results by how well they match the query rather than returning unordered matches.
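A hedged sketch of that Boolean evaluation, using Python sets as stand-in posting lists (a real engine would merge sorted lists, but the set algebra is identical; the document IDs are made up):

```python
# Toy posting lists keyed by term; IDs are illustrative.
postings = {
    "python":  {1, 3, 5, 7},
    "backend": {2, 3, 5, 8},
    "junior":  {5, 9},
}

# python AND backend NOT junior  ->  intersection, then difference
result = (postings["python"] & postings["backend"]) - postings["junior"]
print(sorted(result))  # [3]

# python OR backend  ->  union
print(sorted(postings["python"] | postings["backend"]))  # [1, 2, 3, 5, 7, 8]
```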

Analyzers and Text Processing Pipeline

Raw text passes through an analyzer pipeline before indexing: tokenization splits text into terms, lowercasing normalizes case, stemming reduces words to root forms ("running" becomes "run"), and stop word removal filters common words like "the" and "is". This preprocessing ensures queries match regardless of exact phrasing. A search for "runs" will match documents containing "running" or "runs" because both reduce to the same stem; an irregular form like "ran" requires lemmatization, which maps words to their dictionary base form rather than stripping suffixes.
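The pipeline can be sketched as below. The suffix-stripping "stemmer" here is a deliberately crude stand-in for a real algorithm like Porter's; it only illustrates why "runs" and "running" collapse to the same term.

```python
# Toy analyzer: tokenize -> lowercase -> stop-word filter -> stem.
import re

STOP_WORDS = {"the", "is", "a", "an", "and"}

def stem(term):
    # Crude suffix stripping; a real stemmer (e.g. Porter) has many
    # more rules and guards against over-stemming.
    for suffix in ("ning", "ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenize + lowercase
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("The service is running"))  # ['service', 'run']
print(analyze("runs"))                    # ['run']
```

Because the same `analyze` function runs at index time and at query time, a query for "runs" and a document containing "running" meet at the shared stem "run".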

Analyzers also enable fuzzy matching for typos ("javascirpt" matches "javascript" within edit distance 1 when an adjacent transposition counts as a single edit), synonym expansion ("laptop" matches "notebook"), and language-specific processing. Different languages require different analyzers: German compounds like "Entwicklungsumgebung" need decomposition, while Chinese and Japanese need specialized tokenizers since they lack explicit word boundaries.
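A sketch of the distance computation behind fuzzy matching. This is the optimal-string-alignment variant of edit distance, which counts an adjacent transposition as one edit (the convention Elasticsearch's fuzziness uses by default); production engines evaluate it efficiently with automata over the term dictionary rather than pairwise DP like this.

```python
def osa_distance(a, b):
    """Edit distance with substitutions, insertions, deletions, and
    adjacent transpositions each counting as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("javascirpt", "javascript"))  # 1 (one transposition)
```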

Write Path and Near Real Time Visibility

Documents follow a specific write path: new documents land in an in-memory buffer, get written to an immutable segment file during refresh (typically every 1 second), then become searchable. Multiple small segments periodically merge into larger segments through background compaction, reducing file count and improving query performance. This near real time architecture means writes are not immediately visible, accepting eventual consistency within the refresh window for dramatically higher write throughput.
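The write path above can be modeled in a few lines. Everything here is illustrative (class and method names are invented, and real segments are on-disk index files, not Python tuples), but the visibility behavior is the point: a document is not searchable until a refresh seals the buffer into an immutable segment.

```python
# Toy model of the near-real-time write path.
class SearchIndex:
    def __init__(self):
        self.buffer = []    # in-memory, NOT yet searchable
        self.segments = []  # immutable, searchable

    def add(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Seal the buffer into an immutable segment (runs ~every 1s).
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def merge(self):
        # Background compaction: collapse small segments into one.
        self.segments = [tuple(d for seg in self.segments for d in seg)]

    def search(self, term):
        # Readers only ever touch immutable segments.
        return [d for seg in self.segments for d in seg if term in d]

idx = SearchIndex()
idx.add("distributed systems")
print(idx.search("systems"))  # [] (written, but not refreshed yet)
idx.refresh()
print(idx.search("systems"))  # ['distributed systems']
```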

The segment based architecture provides crash recovery: committed segments persist to disk, while in-memory data replays from a transaction log on restart. A system handling 50,000 documents per second maintains sub-100ms query latency because writes and reads operate on different data structures. Writers append to in-memory buffers while readers search immutable segments.
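The recovery logic can be sketched as below. This is a simplified model, not the actual translog format: the idea is that every write is appended to a durable log before it is acknowledged, so after a crash only the operations newer than the last committed segment need replaying.

```python
# Toy crash-recovery model: committed segments persist; the in-memory
# buffer is rebuilt by replaying the tail of a durable transaction log.
committed_segments = [("doc-1", "doc-2")]        # persisted to disk
translog = ["doc-1", "doc-2", "doc-3", "doc-4"]  # durable append-only log
last_committed = 2                               # ops covered by segments

# After a restart, the in-memory buffer is gone; replay the log tail.
recovered_buffer = translog[last_committed:]
print(recovered_buffer)  # ['doc-3', 'doc-4']
```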

💡 Key Takeaways
Inverted indexes map each term to a posting list (sorted document IDs), enabling O(1) term lookup instead of O(n) document scans for instant retrieval
Analyzers preprocess text through tokenization, stemming, and normalization so queries match regardless of exact word forms or typos
Documents write to in-memory buffers first, become searchable after refresh (typically 1 second), then merge into larger segment files
Boolean queries combine posting lists using set operations: AND for intersection, OR for union, NOT for difference, executing in microseconds
Near real time architecture accepts eventual consistency within the refresh window to achieve high write throughput with sub-100ms reads
Different languages require specialized analyzers: German needs compound decomposition, Chinese and Japanese need boundary detection
📌 Interview Tips
1. When asked about search database internals, explain the inverted index as a reverse lookup table: instead of documents pointing to terms, terms point to documents
2. Mention the analyzer pipeline (tokenize, lowercase, stem, filter stop words) to show you understand how raw text becomes searchable terms
3. Discuss the near real time trade off: 1 second refresh delay enables 50,000 plus writes per second, versus synchronous indexing which caps at hundreds