
Distributed Search: Sharding, Replication, and Fanout

At cluster scale, inverted indexes are sharded across nodes to distribute load and fit corpus size. Document-based sharding assigns each document to a shard (typically via hash or range on document ID), distributing write load evenly and simplifying operations. Queries fan out to all shards in parallel, each shard returns its local top-K results, and the coordinator merges these into a global top-K. This minimizes bytes transferred (only top-K per shard, not full postings), but it forces every query to hit all shards, which amplifies tail latency and limits total throughput whenever any single shard is slow.

Replication (typically a factor of 2 to 3) provides high availability and load balancing. Read replicas absorb query traffic, while writes go to primaries and replicate asynchronously or semi-synchronously. Caching is critical: per-shard result caches store top-K results for hot queries (hit rates of 30 to 60 percent are common for search workloads), and filter caches store bitsets (such as roaring bitmaps) for frequent filters like "in stock" or "region equals US", avoiding repeated postings scans.

Term-based sharding is rare but valuable for highly skewed workloads where a few hot terms dominate query load. Each term's postings list is assigned to a specific shard, localizing hot-term queries to fewer nodes and reducing fanout. However, this complicates multi-term and phrase queries (which now hit multiple shards) and makes updates harder (writes must be routed by term, not by document). Amazon and Google generally stick with document sharding for its simplicity, handling skew with aggressive pruning and caching instead.
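To make the scatter-gather flow concrete, here is a minimal Python sketch of the coordinator side. It assumes a hypothetical shard client exposing a local_top_k(query, k) method that returns (score, doc_id) pairs; the names and the threading model are illustrative, not any particular engine's API.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def fanout_search(shards, query, k=10):
    """Scatter the query to all shards in parallel, then merge each
    shard's local top-k into a global top-k on the coordinator."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # shard.local_top_k is an assumed API returning (score, doc_id) pairs
        per_shard = list(pool.map(lambda s: s.local_top_k(query, k), shards))
    # Only k hits per shard cross the network, never full postings lists.
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```

Note that the coordinator waits on every shard before merging, which is exactly why one straggler shard drags global latency; the hedging sketch later in this section addresses that.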
💡 Key Takeaways
Document sharding fans out queries to all shards, so the slowest shard sets global latency. If 10 shards each have a p99 of 100 milliseconds, the chance that all 10 respond within 100 milliseconds is only 0.99^10 ≈ 90 percent, so roughly 1 in 10 queries waits on a straggler and global p99 often spikes to 200 to 300 milliseconds without hedging or timeouts
Replication factor of 2 to 3 is standard. Higher replication improves read throughput and fault tolerance but increases storage cost and write amplification. Google uses replication plus hedged requests (send a duplicate after a short timeout) to mitigate tail latency; a sketch follows this list
Result caches for hot queries and filter caches (bitsets for frequent filters) achieve 30 to 60 percent hit rates, cutting query load roughly in half. Cold starts after redeployment or cache eviction cause latency spikes until caches warm
Term-based sharding can reduce fanout for skewed workloads (trending terms hit one shard), but it complicates multi-term queries, phrase matching, and updates. Rarely used in practice; document sharding with pruning and caching is simpler
Scoring inconsistencies arise from per-shard statistics: a term that is abundant on one shard has a lower local IDF there, so identical matches score differently across shards. Mitigate with background jobs that compute global IDF approximations, or with hybrid scoring that bounds per-shard deviations
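The hedged-request pattern from the takeaways above can be sketched in a few lines. This is a minimal illustration, assuming replica objects with a hypothetical search(query) method; a production version would also cancel the losing request and hedge only a small fraction of traffic.

```python
from concurrent.futures import (ThreadPoolExecutor, TimeoutError,
                                FIRST_COMPLETED, wait)

def hedged_search(replicas, query, hedge_after=0.05):
    """Query the primary replica; if it hasn't answered within
    hedge_after seconds, duplicate the request to a second replica
    and return whichever result arrives first."""
    pool = ThreadPoolExecutor(max_workers=2)
    primary = pool.submit(replicas[0].search, query)  # assumed .search API
    try:
        result = primary.result(timeout=hedge_after)
    except TimeoutError:
        backup = pool.submit(replicas[1].search, query)
        done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        result = done.pop().result()
    pool.shutdown(wait=False)  # don't block on the losing request
    return result
```

Setting hedge_after near the observed p95 keeps the duplicate-request overhead to a few percent of traffic while trimming the worst of the tail.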
📌 Examples
Amazon retail search shards by product ID across hundreds of nodes, replicates each shard 2 to 3 times, and uses aggressive result caching and filter bitsets for stock and region filters (a minimal filter-cache sketch follows these examples) to handle tens of thousands of QPS during peak shopping events
Wikipedia CirrusSearch runs multi-datacenter Elasticsearch clusters with document sharding and 2 replicas per shard. Query fanout hits all shards; p95 latencies stay under 200 milliseconds for typical keyword queries when caches are warm
Twitter Earlybird used time-partitioned document shards (recent tweets in hot shards) to localize recency-focused queries and limit fanout. Trending-token queries still hit all shards but benefited from in-memory hot segments and aggressive caching
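As referenced in the examples, filter caches avoid repeated postings scans by memoizing filter results as bitsets. Here is a minimal sketch, assuming a caller-supplied scan function and using a plain Python integer as an uncompressed stand-in for a roaring bitmap:

```python
class FilterCache:
    """Minimal filter cache: maps a filter key like 'region=US' to a
    bitset of matching document ordinals. Real engines use compressed
    roaring bitmaps; a plain int serves as the bitset here."""
    def __init__(self):
        self._bitsets = {}

    def get(self, key, scan_matching_docs):
        if key not in self._bitsets:
            bits = 0
            for doc_id in scan_matching_docs():  # one-time postings scan
                bits |= 1 << doc_id              # set bit for matching doc
            self._bitsets[key] = bits
        return self._bitsets[key]

# Combining two cached filters is a single bitwise AND, with no rescan:
# matches = cache.get("in_stock=true", scan_in_stock) & \
#           cache.get("region=US", scan_region_us)
```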