Sharding and Shard Key Selection at Scale
Why Sharding Matters
A single server has finite capacity. When data or traffic exceeds what one machine can handle, sharding distributes documents across multiple servers (shards). The shard key determines which shard stores each document. A good shard key distributes data evenly and routes queries efficiently. A bad shard key creates hotspots that bottleneck the entire system regardless of cluster size.
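The routing role of the shard key can be sketched in a few lines. This is a minimal illustration, not any particular database's implementation; `NUM_SHARDS` and `shard_for` are hypothetical names, and MD5 stands in for whatever hash the system actually uses:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(shard_key_value: str) -> int:
    """Map a document's shard key to a shard by hashing it (illustrative)."""
    digest = hashlib.md5(shard_key_value.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard; different keys
# may land on different shards:
print(shard_for("order-1001"), shard_for("order-1002"))
```

The key point is that placement is a pure function of the shard key: once chosen, the key alone decides where every document lives and which shards a query must visit.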
High Cardinality Requirement
The shard key must have high cardinality (many distinct values). A key with only 3 values (like status: pending, active, closed) limits you to 3 effective shards. Adding more shards wastes capacity since documents can only land in those 3 buckets.
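The cardinality ceiling is easy to demonstrate. In this sketch (hypothetical `bucket` helper, MD5 as a stand-in hash), a 3-value status key occupies at most 3 shards no matter how many the cluster has:

```python
import hashlib

def bucket(value: str, num_shards: int) -> int:
    """Hash a key value into one of num_shards buckets (illustrative)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % num_shards

statuses = ["pending", "active", "closed"]

# Even with 12 shards, a 3-value key can reach at most 3 of them;
# the other 9 never receive a document:
used = {bucket(s, 12) for s in statuses}
print(len(used))  # at most 3
```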
Monotonic keys like timestamps or auto-increment IDs create write hotspots. Every new document carries the latest timestamp, so all inserts route to the same shard (the one holding the newest range). The other shards sit idle while that one shard absorbs 100% of the writes, and during traffic bursts P99 latency (response time at the 99th percentile) can spike from around 10ms to 500ms.
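The hotspot behavior follows directly from range partitioning. In this sketch, `boundaries` is a hypothetical set of chunk boundaries; because a monotonic key only ever grows, every new insert falls past the last boundary and routes to the final shard:

```python
from bisect import bisect_right

# Hypothetical range partitioning: shard i holds keys below boundaries[i],
# and the last shard (index 3 here) holds everything >= 3000.
boundaries = [1000, 2000, 3000]

def shard_for_range(key: int) -> int:
    """Route a key to the shard whose range contains it."""
    return bisect_right(boundaries, key)

# A monotonically increasing key (e.g. a timestamp) always exceeds
# the highest boundary, so every new write hits the last shard:
new_keys = [5000, 5001, 5002]
print({shard_for_range(k) for k in new_keys})  # {3}
```

Splitting the hot chunk does not help: the split just creates a new "newest" range, and the hotspot moves there.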
Hashed vs Range Keys
Hashing the shard key distributes writes evenly: each shard receives approximately 1/N of traffic. The tradeoff is losing range query efficiency. Querying "all orders in the last hour" with a hashed timestamp key requires scatter-gather across all shards since those orders are distributed randomly.
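The scatter-gather cost can be seen by hashing consecutive timestamps. In this sketch (hypothetical `hashed_shard` helper, MD5 as a stand-in hash), the documents for one hour of orders end up spread across the shards, so a range query over that hour must be broadcast to all of them and the results merged:

```python
import hashlib

def hashed_shard(ts: int, num_shards: int = 4) -> int:
    """Route a timestamp key by its hash (illustrative)."""
    return int(hashlib.md5(str(ts).encode()).hexdigest(), 16) % num_shards

# Sixty consecutive second-granularity timestamps scatter across
# shards, so "all orders in the last hour" cannot target one shard:
recent = range(1_700_000_000, 1_700_000_060)
print(sorted({hashed_shard(t) for t in recent}))
```

This is the core tradeoff: the same randomness that evens out the write load destroys the locality that range queries rely on.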
Composite keys like (tenantId, hashedOrderId) balance both needs. Queries scoped to one tenant route only to that tenant's shards (efficient). Within a tenant, the hash component spreads writes evenly (no hotspot).
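A composite key can be sketched as a two-part routing function. The names here (`composite_shard`, `NUM_SHARDS_PER_TENANT_ZONE`) are hypothetical, and MD5 again stands in for the real hash; the point is that the tenant component scopes queries while the hash component spreads writes:

```python
import hashlib

NUM_SHARDS_PER_TENANT_ZONE = 4  # hypothetical shards per tenant zone

def composite_shard(tenant_id: str, order_id: str) -> tuple[str, int]:
    """Route by (tenantId, hash(orderId)): the tenant picks the zone,
    the hash picks a slot within it (illustrative sketch)."""
    h = int(hashlib.md5(order_id.encode()).hexdigest(), 16)
    return tenant_id, h % NUM_SHARDS_PER_TENANT_ZONE

# One tenant's writes spread over its zone's slots, but a query
# scoped to that tenant never leaves the zone:
shards = {composite_shard("acme", f"order-{i}") for i in range(100)}
print(len(shards))  # at most NUM_SHARDS_PER_TENANT_ZONE
```

Note the ordering matters: putting the tenant first keeps each tenant's data co-located for targeted queries, while hashing the order ID avoids the monotonic-key hotspot within that tenant.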
Resharding Challenges
Changing the shard key after deployment is expensive. Live resharding requires migrating data chunks between shards while serving traffic, consuming I/O and CPU. Plan the shard key for 3-5 years of growth. Consider future query patterns, not just current ones.