Database Design • Search Databases (Elasticsearch, Solr) · Medium · ⏱️ ~3 min
Sharding, Replication, and Eventual Consistency in Search Systems
Search databases scale horizontally by partitioning the index into N primary shards, each with R replicas. Shards provide write distribution and data parallelism, while replicas enable read scaling and high availability. A typical production setup might use 5 primary shards with 2 replicas each, giving 15 total shards (5 primary plus 10 replica copies) that can serve reads from any copy.
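The shard arithmetic above can be sketched in a few lines. This is an illustrative helper, not part of any real client library:

```python
# Sketch: shard math for an Elasticsearch-style index (illustrative only).
def total_shard_copies(primaries: int, replicas_per_primary: int) -> int:
    """Total physical shard copies = primaries + (primaries * replicas)."""
    return primaries * (1 + replicas_per_primary)

# The 5-primary / 2-replica topology from the text:
print(total_shard_copies(5, 2))  # 15: 5 primaries + 10 replica copies
```

Note that replica count is *per primary*, which is why the total grows multiplicatively: doubling replicas doubles read capacity but also doubles storage.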
The critical tradeoff is eventual consistency at refresh granularity. When you write a document, it lands in an in-memory segment on the primary shard and gets replicated to replica shards, but it does not become searchable until the next refresh cycle (typically 1 second). This means strict read-after-write consistency is not guaranteed without special routing or forcing a refresh, which adds latency. If you write and then immediately search from a different node or replica, you might not see your document for up to 1 second.
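The visibility gap can be modeled with a toy index that buffers writes until a refresh. This is a simplified sketch of the behavior, not a real search client; the class and method names are invented for illustration:

```python
class ToyIndex:
    """Toy model of refresh-granularity visibility (not a real client)."""

    def __init__(self):
        self.buffer = []      # in-memory segment: written, not yet searchable
        self.searchable = []  # documents visible to queries

    def write(self, doc):
        # Writes land in the buffer; they are durable but not yet visible.
        self.buffer.append(doc)

    def refresh(self):
        # Normally runs on a ~1 s cycle; forcing it trades latency for
        # read-after-write visibility.
        self.searchable.extend(self.buffer)
        self.buffer.clear()

    def search(self, doc):
        return doc in self.searchable


idx = ToyIndex()
idx.write("profile-update")
print(idx.search("profile-update"))  # False: refresh has not run yet
idx.refresh()                        # force visibility, paying the latency cost
print(idx.search("profile-update"))  # True
```

Real engines expose the same lever, e.g. requesting that a write only return once it is searchable, at the cost of higher write latency.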
Shard sizing directly impacts performance and operations. Target 10 to 50 GB per hot shard as a general guideline: smaller shards waste resources through coordination overhead and poor cache locality, while larger shards slow down recovery and rebalancing and increase tail latency. GitHub's legacy code search used approximately 90 nodes with 1,300 shards managing 5+ TB of data, handling low thousands of QPS with p99 latencies in the few hundred millisecond range.
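Picking a primary shard count from a corpus size is simple arithmetic against that guideline. A minimal sketch, assuming a 30 GB target (the middle of the 10-50 GB range); the function name is invented:

```python
import math

def shard_count_for(data_gb: float, target_shard_gb: float = 30.0) -> int:
    """Choose a primary shard count so each shard lands near the target size.

    30 GB is the midpoint of the 10-50 GB per-shard guideline.
    """
    return max(1, math.ceil(data_gb / target_shard_gb))

print(shard_count_for(150))   # 5 primaries at ~30 GB each
print(shard_count_for(5000))  # 167 primaries for a ~5 TB corpus
```

Because primary shard count is usually fixed at index creation, this estimate should account for expected growth, not just current size.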
Hot shards are a common failure mode in which query or routing skew sends disproportionate traffic to a subset of shards. If a popular tenant or time bucket creates a hot shard, that shard's p99 latency can spike from 10ms to 500ms while the rest of the cluster sits idle. Mitigations include balanced routing keys, per-tenant indexes for large tenants, adaptive replica selection, and query admission control.
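Hot shards fall out of hash-based routing: every query for the same routing key lands on the same shard, so a skewed key distribution concentrates load. A small simulation of this, with invented key names and a hash chosen for determinism, not a real engine's routing function:

```python
import hashlib
from collections import Counter

def route(key: str, num_shards: int) -> int:
    """Hash-based routing: same key always maps to the same shard."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_shards

NUM_SHARDS = 5
# Skewed workload: 80% of queries target one hot tenant.
queries = ["celebrity"] * 80 + [f"tenant-{i}" for i in range(20)]
load = Counter(route(q, NUM_SHARDS) for q in queries)

# One shard absorbs at least 80 of 100 queries, because every
# "celebrity" query hashes to the same shard; the rest stay near idle.
print(load.most_common())
```

Per-tenant indexes for large tenants break this coupling by giving the hot key its own set of shards to spread across.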
💡 Key Takeaways
•Sharding partitions the index into N primary shards for write distribution, with R replicas per shard for read scaling and availability
•Eventual consistency at refresh granularity means writes may not be visible for up to 1 second without forcing a refresh, which adds latency
•Target 10 to 50 GB per hot shard to balance coordination overhead, cache locality, recovery speed, and tail latency
•Hot shard problem occurs when skewed traffic overloads specific shards, driving p99 from 10ms to 500ms while other shards remain idle
•Replicas serve reads from any copy, enabling horizontal read scaling but requiring fan-out to all relevant shards for complete results
•GitHub code search example: approximately 90 nodes, 1,300 shards, 5+ TB data, handling low thousands of QPS with few hundred millisecond p99
📌 Examples
Production topology: 5 primary shards at 30 GB each with 2 replicas gives 15 total shard copies serving 150 GB of primary data (450 GB on disk across all copies), and can survive a primary failure without downtime
Hot shard scenario: Celebrity user in a multi-tenant index causes 80% of queries to hit one shard; its latency spikes to 500ms while other shards idle at 10ms
Eventual consistency: User updates profile at time T, refresh happens at T plus 1 second, search from different node at T plus 0.5 seconds may not show update