Sharding, Replication, and Eventual Consistency in Search Systems
Sharding for Horizontal Scale
Search databases partition indexes into shards (independent index fragments) distributed across nodes. Each shard contains a subset of documents and operates as a self-contained search engine. When a query arrives, it fans out to all relevant shards in parallel, each shard computes local top-k results, then a coordinator node merges these partial results into the final response. This scatter-gather pattern enables near-linear scaling: doubling the shard count roughly doubles indexing throughput and halves per-shard query load.
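As a concrete sketch of scatter-gather, the toy code below models each shard as a dict from query term to scored postings and merges the per-shard top-k lists at a coordinator. The shard layout and function names are illustrative, not any engine's API:

```python
import heapq

def search_shard(shard, query, k):
    """Hypothetical per-shard search: return the shard's local top-k
    as (score, doc_id) pairs sorted by descending score."""
    return sorted(shard.get(query, []), reverse=True)[:k]

def scatter_gather(shards, query, k):
    """Fan the query out to every shard, then merge the partial
    top-k lists into the global top-k, as a coordinator node would."""
    partials = [search_shard(s, query, k) for s in shards]  # parallel RPCs in production
    # the partial lists are already sorted descending; merge and keep the global top-k
    return heapq.nlargest(k, heapq.merge(*partials, reverse=True))

# Toy shards: each maps a query term to scored postings.
shards = [
    {"db": [(0.9, "a"), (0.4, "c")]},
    {"db": [(0.7, "b"), (0.1, "d")]},
]
print(scatter_gather(shards, "db", 2))  # [(0.9, 'a'), (0.7, 'b')]
```

Note that each shard only ships k results to the coordinator, not its full posting list, which is what keeps the merge step cheap.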
Shard count is typically fixed at index creation because redistributing data across a different shard count requires full reindexing. The critical sizing decision is data volume per shard; a common target is 10 to 50 GB. Smaller shards waste resources through coordination overhead and poor cache locality. Larger shards slow recovery (a failed 100 GB shard can take hours to rebuild) and increase tail latency (the slowest shard determines response time). A 500 GB index might use 10 primary shards at 50 GB each.
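The sizing arithmetic can be captured in a small helper, assuming the 10 to 50 GB target above (the function name and default are illustrative):

```python
import math

def plan_shards(index_size_gb, target_gb=50):
    """Pick a primary shard count that keeps each shard at or below
    the target size (helper name and default are illustrative)."""
    count = max(1, math.ceil(index_size_gb / target_gb))
    return count, index_size_gb / count

print(plan_shards(500))  # (10, 50.0)
print(plan_shards(120))  # (3, 40.0)
```

Because shard count is fixed at creation, this calculation should use the projected index size, not the current one.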
Replication for Availability and Read Scaling
Each primary shard has R replica copies stored on different nodes. Replicas serve two purposes: high availability (if a node fails, replicas promote to primary) and read scaling (queries can execute on any replica). A typical production setup uses 1 or 2 replicas per primary. With 5 primary shards and 2 replicas, you have 15 total shard copies. Queries route to the least loaded replica using adaptive selection algorithms that track per-node latency and queue depth.
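Adaptive replica selection can be sketched as picking the copy with the best combined latency/queue score. The weighting below is illustrative, not any specific engine's formula:

```python
import dataclasses

@dataclasses.dataclass
class ReplicaStats:
    node: str
    ewma_latency_ms: float  # moving average of observed response time
    queue_depth: int        # outstanding requests on that node

def pick_replica(replicas):
    """Route to the replica with the lowest combined latency/queue
    score, penalizing nodes with deep queues."""
    return min(replicas, key=lambda r: r.ewma_latency_ms * (1 + r.queue_depth))

replicas = [
    ReplicaStats("node-1", ewma_latency_ms=12.0, queue_depth=4),
    ReplicaStats("node-2", ewma_latency_ms=15.0, queue_depth=0),
    ReplicaStats("node-3", ewma_latency_ms=9.0, queue_depth=9),
]
print(pick_replica(replicas).node)  # node-2
```

Note that node-3 has the lowest average latency but loses to node-2 because of its deep queue: tracking queue depth prevents piling more work onto an already-congested node.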
Writes always go to the primary shard first, then replicate synchronously to the replicas before the write is acknowledged. This ensures durability: a confirmed write exists on at least R + 1 copies (the primary plus its R replicas). However, synchronous replication adds write latency proportional to the slowest replica. Production systems balance this by tuning the replication factor against durability requirements and write-performance needs.
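The synchronous write path can be sketched as follows, with hypothetical Node objects standing in for cluster nodes; the acknowledgment waits on the slowest replica:

```python
import concurrent.futures

class Node:
    """Hypothetical cluster node that just stores documents."""
    def __init__(self):
        self.docs = []

    def index(self, doc):
        self.docs.append(doc)

def replicate_write(primary, replicas, doc):
    """Synchronous write path: index on the primary, fan the write out
    to the replicas, and acknowledge only after every copy applied it."""
    primary.index(doc)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(r.index, doc) for r in replicas]
        concurrent.futures.wait(futures)  # ack latency = slowest replica
    return {"acknowledged": True, "copies": 1 + len(replicas)}

primary, replicas = Node(), [Node(), Node()]
print(replicate_write(primary, replicas, {"id": 1}))
# {'acknowledged': True, 'copies': 3}
```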
Eventual Consistency at Refresh Granularity
The critical trade-off is eventual consistency at refresh granularity. When you write a document, it lands in an in-memory buffer on the primary and replicates to replica buffers, but it becomes searchable only after the next refresh (typically 1 second). A user who writes and then immediately searches from a different node might not see their own document. This is not a bug but a deliberate design choice: read-after-write consistency is traded for dramatically higher throughput.
Forcing immediate visibility requires explicit refresh operations, which flush in-memory buffers to searchable segments. This adds 50 to 200 ms of latency per write and creates many small segments that require expensive merges. Most applications accept the 1-second delay. Those requiring immediate consistency should route reads to the primary that received the write, or implement application-level read-your-writes semantics using document timestamps.
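A toy model of refresh-granularity visibility, plus the timestamp-based read-your-writes fallback described above (all class and field names here are illustrative):

```python
class SearchNode:
    """Toy node: writes land in a buffer and become searchable only
    after refresh() moves them into segments (illustrative model)."""
    def __init__(self):
        self.buffer = []
        self.segments = []
        self.last_refresh_ts = 0.0

    def index(self, doc, ts):
        self.buffer.append((ts, doc))  # written but not yet searchable

    def refresh(self, now):
        self.segments.extend(self.buffer)  # buffer becomes searchable
        self.buffer.clear()
        self.last_refresh_ts = now

    def search(self, doc_id):
        return [d for _, d in self.segments if d["id"] == doc_id]

def read_your_writes(replica, primary, doc_id, write_ts):
    """Route to the replica only if it has refreshed past the write's
    timestamp; otherwise force a refresh on the primary and read there."""
    if replica.last_refresh_ts >= write_ts:
        return replica.search(doc_id)
    primary.refresh(now=write_ts)  # explicit refresh: costs write latency
    return primary.search(doc_id)

primary, replica = SearchNode(), SearchNode()
primary.index({"id": "doc1"}, ts=1.0)
# The replica has not refreshed yet; a naive read there would miss the doc.
print(read_your_writes(replica, primary, "doc1", write_ts=1.0))
# [{'id': 'doc1'}]
```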
Hot Shard Failure Mode
Hot shards occur when query or data skew sends disproportionate traffic to specific shards. If 80% of queries target documents on one shard, that shard becomes a bottleneck while others sit idle. p99 latency (the latency at the 99th percentile, meaning 99% of requests are faster) can spike from 10ms to 500ms on the hot shard while cluster average looks healthy. Tail latency (the latency of the slowest requests) determines user experience because every query must wait for the slowest shard.
Mitigations include balanced routing keys that distribute documents evenly, per-tenant indexes for large tenants generating disproportionate traffic, and admission control (rejecting queries when queue depth exceeds thresholds). Monitoring should alert on per-shard metrics, not just cluster aggregates, to catch hot spots before they cascade into outages.
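A minimal monitoring sketch of the per-shard alerting point above: computing p99 per shard flags the hot shard even though other shards look healthy. The threshold and the nearest-rank percentile method are illustrative:

```python
def p99(latencies_ms):
    """99th-percentile latency of a sample (nearest-rank method)."""
    s = sorted(latencies_ms)
    return s[max(0, int(0.99 * len(s)) - 1)]

def hot_shards(per_shard_latencies, threshold_ms=100):
    """Alert on per-shard p99, not just cluster aggregates: a single
    hot shard can hide inside a healthy-looking average."""
    return [shard for shard, lats in per_shard_latencies.items()
            if p99(lats) > threshold_ms]

samples = {
    "shard-0": [10] * 99 + [12],        # healthy: p99 = 10 ms
    "shard-1": [10] * 80 + [500] * 20,  # hot: skewed traffic, p99 = 500 ms
}
print(hot_shards(samples))  # ['shard-1']
```

Because every scatter-gather query waits for its slowest shard, shard-1's 500 ms p99 becomes the whole cluster's tail latency for any query touching it.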