Partitioning & Sharding: Rebalancing Strategies

What is Rebalancing in Distributed Systems?

Rebalancing is the controlled redistribution of work, data, or leadership across nodes in a distributed system to maintain Service Level Objectives (SLOs) as conditions change. Unlike initial placement, which happens once at startup, rebalancing responds to dynamic triggers: nodes joining or leaving the cluster, capacity changes from scaling operations or Availability Zone (AZ) failures, workload skew from hot keys or diurnal traffic patterns, and administrative events such as resharding or onboarding new tenants.

The fundamental challenge is minimizing disruption while improving the system. When Amazon DynamoDB detects a partition exceeding 10 GB or 3,000 read capacity units, it splits the partition and rebalances data across nodes. This operation must complete without violating latency SLOs, typically keeping p99 latencies within milliseconds of baseline. The goal is not perfection but staying within operational bounds: reduce tail latency spikes, improve utilization from, say, 70% to 85%, and ensure no single node becomes a bottleneck.

Good rebalancing strategies share common properties: they move minimal data (ideally about 1/N of the total when adding one node to an N node cluster), they limit concurrent operations to prevent resource exhaustion (typically at most 2 moves per node), and they preserve availability throughout the process. At scale, these constraints matter enormously. Moving 10 TB across Availability Zones at AWS costs roughly $100 in cross AZ transfer fees at $0.01 per GB, making placement decisions economically significant.

The core tradeoff is speed versus stability. Fast rebalancing completes in minutes but risks saturating network links and spiking p99 latencies by 10x or more. Throttled rebalancing protects SLOs but may take hours to converge, leaving the system in a suboptimal state longer.
Production systems like Elasticsearch cap shard relocation to 50 to 100 MB per second per shard, accepting that moving a 50 GB shard takes roughly 11 minutes rather than overwhelming the cluster.
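The throttling arithmetic above is easy to sanity-check. The sketch below is a back-of-envelope estimate, not any vendor's API; the shard size and transfer rate are the figures quoted in the text:

```python
def relocation_time_s(shard_gb: float, rate_mb_s: float) -> float:
    """Seconds to move one shard at a throttled per-shard transfer rate."""
    return shard_gb * 1024 / rate_mb_s

# Moving a 50 GB shard at 75 MB/s (midpoint of the 50-100 MB/s cap):
t = relocation_time_s(50, 75)
print(f"{t / 60:.1f} minutes")  # ~11.4 minutes, matching the estimate above
```

The same formula explains the tradeoff: doubling the rate halves the relocation time but doubles the bandwidth taken from foreground traffic.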
💡 Key Takeaways
Rebalancing maintains SLOs by redistributing load across nodes in response to membership changes, capacity shifts, workload skew, or administrative events
Minimal data movement is critical: adding one node to an N node cluster should ideally move only 1/N of total data, not trigger a full reshuffle
Concurrent move limits prevent resource exhaustion: production systems cap at 2 simultaneous shard relocations per node to protect disk and network bandwidth
Cross AZ data movement has real cost impact: moving 10 TB across AWS Availability Zones costs approximately $100 at $0.01 per GB, making placement topology economically significant
Speed versus stability tradeoff is fundamental: fast rebalancing (minutes) risks 10x p99 latency spikes while throttled approaches (hours) preserve SLOs but prolong imbalance
DynamoDB triggers rebalancing when partitions exceed 10 GB size or 3,000 read capacity units per second, typically completing adaptive rebalancing within seconds to minutes under load bursts
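The "move only about 1/N" takeaway is exactly what consistent hashing delivers. The following is a minimal illustrative sketch (virtual-node ring with MD5, toy key set; not any production implementation) showing that growing a cluster from 4 to 5 nodes relocates roughly 1/5 of the keys rather than reshuffling everything:

```python
import hashlib
from bisect import bisect

def build_ring(nodes, vnodes=100):
    """Hash each node's virtual nodes onto a sorted ring of (hash, node) points."""
    points = []
    for node in nodes:
        for v in range(vnodes):
            h = int(hashlib.md5(f"{node}:{v}".encode()).hexdigest(), 16)
            points.append((h, node))
    return sorted(points)

def owner(ring, key):
    """A key belongs to the first ring point clockwise from its hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect(ring, (h,)) % len(ring)
    return ring[i][1]

keys = [f"key-{i}" for i in range(10_000)]
before = build_ring([f"n{i}" for i in range(4)])
after = build_ring([f"n{i}" for i in range(5)])  # add one node

moved = sum(owner(before, k) != owner(after, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # expected near 1/5 = 20%
```

With naive modulo placement (`hash(key) % N`), the same membership change would relocate the large majority of keys.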
📌 Examples
Amazon DynamoDB with 40 partitions serving 120,000 reads per second (3,000 reads per second per partition at threshold) automatically adds partitions and redistributes keys when traffic spikes, maintaining service availability throughout
Elasticsearch cluster moving 100 shards of 50 GB each at 75 MB per second with 2 concurrent moves per node across 50 nodes completes rebalancing in several hours while meeting p95 latency budgets
Kafka consumer group rebalancing 500 partitions across 50 consumers originally paused all consumption for tens of seconds; cooperative rebalancing reduces pause to under 3 seconds by only moving 10% of partition assignments
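The DynamoDB example reduces to a simple threshold check per partition. This sketch is a hypothetical model of the trigger logic using the limits quoted above (10 GB or 3,000 reads per second), not DynamoDB's actual internals:

```python
# Thresholds from the text; a partition exceeding either one triggers a split.
SIZE_LIMIT_GB = 10
READ_LIMIT_RPS = 3_000

def needs_split(size_gb: float, reads_per_s: float) -> bool:
    """True if a partition has outgrown its size or throughput bound."""
    return size_gb > SIZE_LIMIT_GB or reads_per_s > READ_LIMIT_RPS

# 40 partitions at the read threshold serve 40 * 3,000 = 120,000 reads/s;
# any spike beyond that pushes some partition over the limit.
print(needs_split(8, 3_500))   # True: hot partition, split on read rate
print(needs_split(12, 1_000))  # True: oversize partition, split on size
print(needs_split(8, 2_000))   # False: within both bounds
```

Real systems add hysteresis and cooldown windows around checks like this so that a brief burst does not trigger a cascade of splits.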
← Back to Rebalancing Strategies Overview