
Distributed Search: Sharding, Replication, and Fanout

At cluster scale, inverted indexes are sharded across nodes to distribute load and fit corpus size. Document-based sharding assigns each document to a shard (typically via hash or range on document ID), distributing write load evenly and simplifying operations. Queries fan out to all shards in parallel, each shard returns its local top-K results, and the coordinator merges these into a global top-K. This minimizes bytes transferred (only top-K per shard, not full postings), but it forces every query to hit all shards, which amplifies tail latency and limits total throughput whenever any single shard is slow.

Replication (typically a factor of 2 to 3) provides high availability and load balancing. Read replicas absorb query traffic, while writes go to primaries and replicate asynchronously or semi-synchronously. Caching is critical: per-shard result caches store top-K results for hot queries (hit rates of 30 to 60 percent are common for search workloads), and filter caches store bitsets (such as roaring bitmaps) for frequent filters like "in stock" or "region equals US", avoiding repeated postings scans.

Term-based sharding is rare but valuable for highly skewed workloads where a few hot terms dominate query load. Each term's postings list is assigned to a specific shard, localizing hot-term queries to fewer nodes and reducing fanout. However, this complicates multi-term and phrase queries (which now hit multiple shards) and makes updates harder (writes must be routed by term, not by document). Amazon and Google generally stick with document sharding for its simplicity, handling skew with aggressive pruning and caching instead.
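To make the scatter-gather flow concrete, here is a minimal Python sketch of the coordinator side. It assumes a hypothetical shard client exposing a local_top_k(query, k) method that returns (score, doc_id) pairs; the names and the threading model are illustrative, not any particular engine's API.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def fanout_search(shards, query, k=10):
    """Scatter the query to all shards in parallel, then merge each
    shard's local top-k into a global top-k on the coordinator."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        # shard.local_top_k is an assumed API returning (score, doc_id) pairs
        per_shard = list(pool.map(lambda s: s.local_top_k(query, k), shards))
    # Only k hits per shard cross the network, never full postings lists.
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```

Note that the coordinator waits on every shard before merging, which is exactly why one straggler shard drags global latency; the hedging sketch later in this section addresses that.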
💡 Key Takeaways
Document sharding fans out queries to all shards, so the slowest shard sets global latency. If 10 shards each have a p99 of 100 milliseconds, the chance that all 10 respond within 100 milliseconds is only 0.99^10 ≈ 90 percent, so roughly 1 in 10 queries waits on a straggler and global p99 often spikes to 200 to 300 milliseconds without hedging or timeouts
Replication factor of 2 to 3 is standard. Higher replication improves read throughput and fault tolerance but increases storage cost and write amplification. Google uses replication plus hedged requests (send a duplicate after a short timeout) to mitigate tail latency; a sketch follows this list
Result caches for hot queries and filter caches (bitsets for frequent filters) achieve 30 to 60 percent hit rates, cutting query load roughly in half. Cold starts after redeployment or cache eviction cause latency spikes until caches warm
Term-based sharding can reduce fanout for skewed workloads (trending terms hit one shard), but it complicates multi-term queries, phrase matching, and updates. Rarely used in practice; document sharding with pruning and caching is simpler
Scoring inconsistencies arise from per-shard statistics: a term that is abundant on one shard has a lower local IDF there, so identical matches score differently across shards. Mitigate with background jobs that compute global IDF approximations, or with hybrid scoring that bounds per-shard deviations
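The hedged-request pattern from the takeaways above can be sketched in a few lines. This is a minimal illustration, assuming replica objects with a hypothetical search(query) method; a production version would also cancel the losing request and hedge only a small fraction of traffic.

```python
from concurrent.futures import (ThreadPoolExecutor, TimeoutError,
                                FIRST_COMPLETED, wait)

def hedged_search(replicas, query, hedge_after=0.05):
    """Query the primary replica; if it hasn't answered within
    hedge_after seconds, duplicate the request to a second replica
    and return whichever result arrives first."""
    pool = ThreadPoolExecutor(max_workers=2)
    primary = pool.submit(replicas[0].search, query)  # assumed .search API
    try:
        result = primary.result(timeout=hedge_after)
    except TimeoutError:
        backup = pool.submit(replicas[1].search, query)
        done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        result = done.pop().result()
    pool.shutdown(wait=False)  # don't block on the losing request
    return result
```

Setting hedge_after near the observed p95 keeps the duplicate-request overhead to a few percent of traffic while trimming the worst of the tail.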
📌 Examples
Amazon retail search shards by product ID across hundreds of nodes, replicates each shard 2 to 3 times, and uses aggressive result caching and filter bitsets for stock and region filters (a minimal filter-cache sketch follows these examples) to handle tens of thousands of QPS during peak shopping events
Wikipedia CirrusSearch runs multi-datacenter Elasticsearch clusters with document sharding and 2 replicas per shard. Query fanout hits all shards; p95 latencies stay under 200 milliseconds for typical keyword queries when caches are warm
Twitter Earlybird used time-partitioned document shards (recent tweets in hot shards) to localize recency-focused queries and limit fanout. Trending-token queries still hit all shards but benefited from in-memory hot segments and aggressive caching
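As referenced in the examples, filter caches avoid repeated postings scans by memoizing filter results as bitsets. Here is a minimal sketch, assuming a caller-supplied scan function and using a plain Python integer as an uncompressed stand-in for a roaring bitmap:

```python
class FilterCache:
    """Minimal filter cache: maps a filter key like 'region=US' to a
    bitset of matching document ordinals. Real engines use compressed
    roaring bitmaps; a plain int serves as the bitset here."""
    def __init__(self):
        self._bitsets = {}

    def get(self, key, scan_matching_docs):
        if key not in self._bitsets:
            bits = 0
            for doc_id in scan_matching_docs():  # one-time postings scan
                bits |= 1 << doc_id              # set bit for matching doc
            self._bitsets[key] = bits
        return self._bitsets[key]

# Combining two cached filters is a single bitwise AND, with no rescan:
# matches = cache.get("in_stock=true", scan_in_stock) & \
#           cache.get("region=US", scan_region_us)
```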