
Shard Splitting, Merging, and Capacity Aware Controllers

When hash-based assignment is insufficient because of workload skew or growth, systems resort to changing partition granularity itself. Shard splitting divides a hot or oversized partition into multiple smaller partitions, while merging combines underutilized partitions to reduce metadata overhead. Amazon DynamoDB automatically splits partitions when they exceed roughly 10 GB in size or sustain throughput above approximately 3,000 read capacity units or 1,000 write capacity units per second. When a celebrity user causes 80% of traffic to land on one partition, splitting that partition by key range or hash prefix distributes the load across multiple nodes, reducing p99 latency from 500 milliseconds back to a 10 millisecond baseline.

The split process involves several steps that must maintain availability. First, the system selects a split point (the midpoint of the key range or hash space) and provisions a new target node or allocates capacity on an existing one. Second, it copies the existing data into two child partitions while continuing to serve reads and writes from the parent. Third, it enters a dual-write phase in which writes go to both parent and children until cutover. Finally, routing is updated atomically to direct traffic to the children, and the parent is retired. In DynamoDB this typically completes within seconds to minutes, but write amplification during the dual-write phase can temporarily double write costs, and cache misses spike because the new partitions start with cold caches.

Capacity-aware controllers like LinkedIn's Apache Helix orchestrate placement across thousands of instances and tens of thousands of partitions by solving a constrained optimization problem. Helix maintains a desired state: each partition needs R replicas, replicas must be in different racks or Availability Zones for fault tolerance, no instance should exceed its CPU, memory, or disk capacity, and leaders should be evenly distributed to balance write load.
The controller computes a placement plan satisfying these constraints using greedy heuristics or local search, then executes state transitions incrementally (offline to standby to leader) with limits on concurrent transitions per instance, typically capped at 2 to 5 simultaneous transitions to avoid overwhelming any single instance.

The tradeoff between splitting/merging and pure rebalancing is latency versus complexity. Rebalancing moves existing partitions around but doesn't fix hot partitions: moving the hot partition to a bigger node still overloads that node. Splitting actually reduces load per partition, but it introduces operational complexity: more partitions mean more metadata, more open file handles (each partition often has dedicated commit logs or storage files), more cache fragmentation, and longer recovery times (a 10,000-partition system takes longer to rebalance than a 1,000-partition one). Elasticsearch and OpenSearch typically target shard sizes of 10 to 50 GB as a sweet spot: large enough to amortize per-shard overhead, small enough to move or replicate within tens of minutes at 50 to 100 MB per second of bandwidth.
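The four-step split flow above can be sketched in miniature. This is a hypothetical toy model (the `Partition`, `Router`, and `split` names are invented for illustration, not DynamoDB's API), and it collapses the dual-write window to an instant, where a real system would hold it open while in-flight writes drain:

```python
class Partition:
    """Toy range partition owning keys in [lo, hi)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.data = {}

    def owns(self, key):
        return self.lo <= key < self.hi


class Router:
    """Toy router: a list of range partitions plus a dual-write map."""
    def __init__(self, partitions):
        self.partitions = partitions
        self.dual = {}  # parent partition -> child partitions during a split

    def lookup(self, key):
        return next(p for p in self.partitions if p.owns(key))

    def write(self, key, value):
        target = self.lookup(key)
        target.data[key] = value
        # Dual-write phase: mirror the write into whichever child owns the key.
        for child in self.dual.get(target, []):
            if child.owns(key):
                child.data[key] = value


def split(router, parent, split_key):
    # Step 1: choose a split point and create two child partitions.
    left = Partition(parent.lo, split_key)
    right = Partition(split_key, parent.hi)
    # Step 2: bulk-copy existing data while the parent keeps serving traffic.
    for k, v in parent.data.items():
        (left if k < split_key else right).data[k] = v
    # Step 3: dual-write phase -- new writes hit both parent and children.
    router.dual[parent] = [left, right]
    # Step 4: atomic cutover -- children replace the parent in routing.
    i = router.partitions.index(parent)
    router.partitions[i:i + 1] = [left, right]
    del router.dual[parent]
    return left, right
```

In production the cutover in step 4 is the only metadata change clients observe, which is what keeps the split available throughout.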
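The desired-state constraints can be approximated with a greedy heuristic like the one the text describes. This is a hypothetical sketch of the constraint solve, not Helix's actual rebalancer; the `place` function and its inputs are invented for illustration:

```python
from collections import defaultdict


def place(partitions, instances, replicas=3, cap=None):
    """Greedy placement sketch (hypothetical): give each partition `replicas`
    copies on the least-loaded instances, in distinct racks/AZs, under a
    per-instance partition cap, rotating leadership to spread write load."""
    load = defaultdict(int)          # instance name -> replicas hosted
    plan = {}
    for i, part in enumerate(partitions):
        chosen, racks = [], set()
        # Least-loaded-first approximates even capacity usage.
        for inst in sorted(instances, key=lambda x: load[x["name"]]):
            if inst["rack"] in racks:            # rack/AZ anti-affinity
                continue
            if cap is not None and load[inst["name"]] >= cap:
                continue                         # instance resource limit
            chosen.append(inst["name"])
            racks.add(inst["rack"])
            load[inst["name"]] += 1
            if len(chosen) == replicas:
                break
        if len(chosen) < replicas:
            raise RuntimeError(f"cannot place {part} under anti-affinity")
        # Rotate which replica leads so leaders spread across instances.
        plan[part] = {"leader": chosen[i % replicas], "replicas": chosen}
    return plan
```

A real controller would then diff this plan against the current state and issue offline-to-standby-to-leader transitions, throttled to the 2 to 5 concurrent transitions per instance mentioned above.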
💡 Key Takeaways
DynamoDB automatically splits partitions when they exceed roughly 10 GB or sustain approximately 3,000 read capacity units (or 1,000 write capacity units) per second, spreading hot-partition load across multiple nodes within seconds to minutes
Shard splitting fixes hot-partition problems that rebalancing cannot: moving a partition receiving 80% of traffic to a new node just overloads that node, whereas splitting divides the traffic itself
The split process involves a dual-write phase in which writes go to both parent and child partitions until cutover, temporarily doubling write costs and causing cold-cache latency spikes
Capacity-aware controllers like LinkedIn's Helix manage tens of thousands of partitions across thousands of instances by solving constrained placement: replica count, rack/AZ anti-affinity, instance resource limits, and leader distribution
Partition granularity tradeoff: 10 to 50 GB shards balance operational overhead (more shards mean more metadata, file handles, and cache fragmentation) against mobility (smaller shards move faster at 50 to 100 MB per second)
Helix executes state transitions incrementally with a concurrency limit of 2 to 5 transitions per instance, preventing large-scale rebalancing from overwhelming any single instance
📌 Examples
DynamoDB partition at 12 GB and 4,500 read capacity units with a p99 latency of 500 milliseconds: an automatic split creates two 6 GB partitions each handling 2,250 read capacity units, and latency drops back to the 10 millisecond baseline
Elasticsearch cluster with a target shard size of 30 GB: moving one shard at 75 MB per second takes approximately 7 minutes; moving 100 shards with 2 concurrent moves per node across 50 nodes completes in several hours
LinkedIn Helix managing 50,000 partitions across 2,000 instances: computes a placement satisfying 3 replicas per partition, a different AZ per replica, no instance over 80% CPU, and even leader distribution, executing with 5 concurrent state transitions per instance
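The shard-move arithmetic in the Elasticsearch example can be checked directly (decimal units assumed, 1 GB = 1,000 MB; the helper name is hypothetical):

```python
def move_time_minutes(shard_gb, bandwidth_mb_s):
    # Time to stream one shard at the given bandwidth.
    return shard_gb * 1000 / bandwidth_mb_s / 60

# One 30 GB shard at 75 MB/s: 30,000 MB / 75 MB/s = 400 s.
print(round(move_time_minutes(30, 75), 1))  # → 6.7 minutes, i.e. roughly 7
```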