Partitioning & Sharding: Rebalancing Strategies

What is Rebalancing in Distributed Systems?

Rebalancing is the controlled redistribution of work, data, or leadership across nodes in a distributed system to maintain Service Level Objectives (SLOs) as conditions change. Unlike initial placement, which happens once at startup, rebalancing responds to dynamic triggers: nodes joining or leaving the cluster, capacity changes from scaling operations or Availability Zone (AZ) failures, workload skew from hot keys or diurnal traffic patterns, and administrative events such as resharding or onboarding new tenants.

The fundamental challenge is minimizing disruption while improving the system. When Amazon DynamoDB detects a partition exceeding 10 GB or 3,000 read capacity units, it splits the partition and rebalances data across nodes. This operation must complete without violating latency SLOs, typically keeping p99 latencies within milliseconds of baseline. The goal is not perfection but staying within operational bounds: reduce tail latency spikes, improve utilization from, say, 70% to 85%, and ensure no single node becomes a bottleneck.

Good rebalancing strategies share common properties: they move minimal data (ideally about 1/N of the total when adding one node to an N node cluster), they limit concurrent operations to prevent resource exhaustion (typically at most 2 moves per node), and they preserve availability throughout the process. At scale, these constraints matter enormously. Moving 10 TB across Availability Zones at AWS costs roughly $100 in cross AZ transfer fees at $0.01 per GB, making placement decisions economically significant.

The core tradeoff is speed versus stability. Fast rebalancing completes in minutes but risks saturating network links and spiking p99 latencies by 10x or more. Throttled rebalancing protects SLOs but may take hours to converge, leaving the system in a suboptimal state longer.
Production systems like Elasticsearch cap shard relocation to 50 to 100 MB per second per shard, accepting that moving a 50 GB shard takes roughly 11 minutes rather than overwhelming the cluster.
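The throttling arithmetic above is easy to sanity-check. The sketch below is a back-of-envelope estimate, not any vendor's API; the shard size and transfer rate are the figures quoted in the text:

```python
def relocation_time_s(shard_gb: float, rate_mb_s: float) -> float:
    """Seconds to move one shard at a throttled per-shard transfer rate."""
    return shard_gb * 1024 / rate_mb_s

# Moving a 50 GB shard at 75 MB/s (midpoint of the 50-100 MB/s cap):
t = relocation_time_s(50, 75)
print(f"{t / 60:.1f} minutes")  # ~11.4 minutes, matching the estimate above
```

The same formula explains the tradeoff: doubling the rate halves the relocation time but doubles the bandwidth taken from foreground traffic.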
💡 Key Takeaways
Rebalancing maintains SLOs by redistributing load across nodes in response to membership changes, capacity shifts, workload skew, or administrative events
Minimal data movement is critical: adding one node to an N node cluster should ideally move only 1/N of total data, not trigger a full reshuffle
Concurrent move limits prevent resource exhaustion: production systems cap at 2 simultaneous shard relocations per node to protect disk and network bandwidth
Cross AZ data movement has real cost impact: moving 10 TB across AWS Availability Zones costs approximately $100 at $0.01 per GB, making placement topology economically significant
Speed versus stability tradeoff is fundamental: fast rebalancing (minutes) risks 10x p99 latency spikes while throttled approaches (hours) preserve SLOs but prolong imbalance
DynamoDB triggers rebalancing when partitions exceed 10 GB size or 3,000 read capacity units per second, typically completing adaptive rebalancing within seconds to minutes under load bursts
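The "move only about 1/N" takeaway is exactly what consistent hashing delivers. The following is a minimal illustrative sketch (virtual-node ring with MD5, toy key set; not any production implementation) showing that growing a cluster from 4 to 5 nodes relocates roughly 1/5 of the keys rather than reshuffling everything:

```python
import hashlib
from bisect import bisect

def build_ring(nodes, vnodes=100):
    """Hash each node's virtual nodes onto a sorted ring of (hash, node) points."""
    points = []
    for node in nodes:
        for v in range(vnodes):
            h = int(hashlib.md5(f"{node}:{v}".encode()).hexdigest(), 16)
            points.append((h, node))
    return sorted(points)

def owner(ring, key):
    """A key belongs to the first ring point clockwise from its hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    i = bisect(ring, (h,)) % len(ring)
    return ring[i][1]

keys = [f"key-{i}" for i in range(10_000)]
before = build_ring([f"n{i}" for i in range(4)])
after = build_ring([f"n{i}" for i in range(5)])  # add one node

moved = sum(owner(before, k) != owner(after, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # expected near 1/5 = 20%
```

With naive modulo placement (`hash(key) % N`), the same membership change would relocate the large majority of keys.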
📌 Examples
Amazon DynamoDB with 40 partitions serving 120,000 reads per second (3,000 reads per second per partition at threshold) automatically adds partitions and redistributes keys when traffic spikes, maintaining service availability throughout
Elasticsearch cluster moving 100 shards of 50 GB each at 75 MB per second with 2 concurrent moves per node across 50 nodes completes rebalancing in several hours while meeting p95 latency budgets
Kafka consumer group rebalancing 500 partitions across 50 consumers originally paused all consumption for tens of seconds; cooperative rebalancing reduces pause to under 3 seconds by only moving 10% of partition assignments
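The DynamoDB example reduces to a simple threshold check per partition. This sketch is a hypothetical model of the trigger logic using the limits quoted above (10 GB or 3,000 reads per second), not DynamoDB's actual internals:

```python
# Thresholds from the text; a partition exceeding either one triggers a split.
SIZE_LIMIT_GB = 10
READ_LIMIT_RPS = 3_000

def needs_split(size_gb: float, reads_per_s: float) -> bool:
    """True if a partition has outgrown its size or throughput bound."""
    return size_gb > SIZE_LIMIT_GB or reads_per_s > READ_LIMIT_RPS

# 40 partitions at the read threshold serve 40 * 3,000 = 120,000 reads/s;
# any spike beyond that pushes some partition over the limit.
print(needs_split(8, 3_500))   # True: hot partition, split on read rate
print(needs_split(12, 1_000))  # True: oversize partition, split on size
print(needs_split(8, 2_000))   # False: within both bounds
```

Real systems add hysteresis and cooldown windows around checks like this so that a brief burst does not trigger a cascade of splits.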
← Back to Rebalancing Strategies Overview