Partitioning & Sharding • Rebalancing Strategies
What is Rebalancing in Distributed Systems?
Definition
Rebalancing is the controlled redistribution of data or work across nodes when conditions change. Nodes join or leave, partitions grow too large, or traffic skews toward certain keys. The goal: restore even distribution without disrupting ongoing operations.
✓ In Practice: Good rebalancing moves only ~1/N of data when adding one node to N nodes. With consistent hashing, adding one node to a 100-node cluster moves only ~1% of data, not all of it.
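The ~1/N movement property can be checked empirically with a toy consistent-hash ring. This is a minimal sketch, not any particular database's implementation; the MD5 hash, 100 virtual nodes per physical node, and all names are illustrative choices:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 used only as a stable, well-spread hash for the sketch
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def owner(self, key):
        # A key belongs to the first ring point clockwise from its hash
        h = _hash(key)
        idx = bisect.bisect_right(self.ring, (h, chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]

keys = [f"key-{i}" for i in range(10_000)]
ring = ConsistentHashRing([f"node-{i}" for i in range(100)])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-100")  # grow the cluster from 100 to 101 nodes
moved = sum(1 for k in keys if ring.owner(k) != before[k])
print(f"moved {moved / len(keys):.1%} of keys")  # expected: roughly 1%
```

With naive `hash(key) % N` placement, the same membership change would remap nearly every key; the ring confines movement to the arc the new node takes over.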
💡 Key Takeaways
✓ Rebalancing maintains SLOs by redistributing load across nodes in response to membership changes, capacity shifts, workload skew, or administrative events
✓ Minimal data movement is critical: adding one node to an N-node cluster should ideally move only ~1/(N+1) of the total data, not trigger a full reshuffle
✓ Concurrent-move limits prevent resource exhaustion: production systems commonly cap relocations at 2 simultaneous shard moves per node to protect disk and network bandwidth
✓ Cross-AZ data movement has real cost impact: moving 10 TB across AWS Availability Zones costs roughly $100 at $0.01 per GB, making placement topology economically significant
✓ The speed-versus-stability tradeoff is fundamental: fast rebalancing (minutes) risks 10x p99 latency spikes, while throttled approaches (hours) preserve SLOs but prolong imbalance
✓ DynamoDB splits a partition when it exceeds 10 GB or sustains 3,000 read units per second, typically completing adaptive rebalancing within seconds to minutes under load bursts
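The concurrent-move cap above can be enforced by a simple scheduler that admits a move only while both its source and target node have a free relocation slot. A hypothetical sketch (the function and data shapes are illustrative, not any real system's API):

```python
from collections import defaultdict

MAX_CONCURRENT_MOVES_PER_NODE = 2  # the cap from the takeaway above

def plan_wave(pending_moves, in_flight):
    """Pick the next wave of shard moves without exceeding the per-node
    concurrency cap on either the source or the target node.

    pending_moves / in_flight: lists of (shard, src_node, dst_node).
    """
    load = defaultdict(int)
    for _, src, dst in in_flight:  # count slots already in use
        load[src] += 1
        load[dst] += 1
    wave = []
    for move in pending_moves:
        _, src, dst = move
        if (load[src] < MAX_CONCURRENT_MOVES_PER_NODE
                and load[dst] < MAX_CONCURRENT_MOVES_PER_NODE):
            wave.append(move)
            load[src] += 1
            load[dst] += 1
    return wave

pending = [("s1", "A", "B"), ("s2", "A", "C"),
           ("s3", "A", "D"), ("s4", "B", "C")]
# s3 is deferred: node A already has 2 outgoing moves in this wave
print(plan_wave(pending, in_flight=[]))
```

Deferred moves simply wait for the next wave, which is what keeps any single node's disk and network from being saturated by rebalancing traffic.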
📌 Interview Tips
1. Amazon DynamoDB with 40 partitions serving 120,000 reads per second (3,000 reads per second per partition, i.e. at the split threshold) automatically adds partitions and redistributes keys when traffic spikes, maintaining service availability throughout
2. An Elasticsearch cluster of 50 nodes relocating 100 shards of 50 GB each at 75 MB per second, throttled to 2 concurrent relocations (the cluster-wide default), completes rebalancing in roughly 9 to 10 hours while staying within its p95 latency budget
3. With eager rebalancing, a Kafka consumer group redistributing 500 partitions across 50 consumers paused all consumption for tens of seconds; cooperative (incremental) rebalancing cuts the pause to under 3 seconds by moving only the ~10% of partition assignments that actually change
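The arithmetic behind a duration estimate like the Elasticsearch one is worth doing explicitly in an interview. A back-of-envelope sketch, assuming the 2-relocation cap applies cluster-wide (as with Elasticsearch's default `cluster_concurrent_rebalance`) rather than per node:

```python
# Back-of-envelope timing for the Elasticsearch scenario above.
total_gb = 100 * 50          # 100 shards x 50 GB = 5 TB to move
aggregate_mb_s = 75 * 2      # 2 concurrent relocations at 75 MB/s each
seconds = total_gb * 1024 / aggregate_mb_s
print(f"~{seconds / 3600:.1f} hours")  # -> ~9.5 hours
```

If the cap were instead 2 moves per node across 50 nodes, the aggregate parallelism would be far higher and the same data would move in well under an hour, so knowing which level the throttle applies at changes the answer by an order of magnitude.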