
Partition Stability vs Elasticity: Handling Repartitioning

Simple modulo hashing for partition assignment is fast but unstable under resize. Using hash(key) mod N to map keys to partitions means that changing N (adding or removing partitions) remaps most keys to different partitions. With 10 partitions growing to 11, roughly 9 out of 10 keys change partition assignment. This disrupts consumer cache locality, forces downstream storage to handle keys arriving from new partitions, and can create thundering-herd effects as consumers scramble to rebalance.

Consistent hashing with virtual nodes minimizes remapping when scaling. Instead of direct modulo, map keys to points on a ring (0 to 2^32) and assign partition ownership to ranges of the ring. Adding a partition claims only roughly 1/N of the ring, remapping only that fraction of keys. Virtual nodes (each physical partition owns multiple ring positions) smooth the distribution and reduce the impact of individual partition changes. LinkedIn and Amazon use consistent hashing internally for stable partition assignment in large-scale systems, though the abstraction is often hidden from users.

Consumer group rebalances are the operational pain point during partition changes. When partitions are added or consumers join or leave, the group coordinator triggers a rebalance to reassign partitions to consumers. Naive stop-the-world rebalancing pauses all consumers until reassignment completes, creating processing gaps of seconds to minutes in large groups. Cooperative rebalancing (incremental reassignment) and sticky assignment (preferring to keep existing assignments) reduce disruption but still add tail latency during transitions.

Capacity planning should minimize resize frequency by starting with headroom and scaling in large increments. Prefer doubling partition count (10 to 20 to 40) rather than incremental adds (10 to 11 to 12) to amortize the disruption cost. Monitor partition saturation (per-partition throughput, lag, consumer utilization) and trigger expansion before hitting limits. LinkedIn publishes that they aim to keep partitions under 50 percent sustained utilization to absorb spikes without emergency resizing. The sketches below illustrate each of these points in turn.
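To see the modulo instability concretely, hash a batch of synthetic keys and compare assignments before and after a resize. A minimal sketch (key names and counts are arbitrary):

```python
import hashlib

def partition(key: str, n: int) -> int:
    # Stable hash of the key, then simple modulo assignment.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"user-{i}" for i in range(100_000)]

# Count keys whose partition changes when N grows from 10 to 11.
moved = sum(partition(k, 10) != partition(k, 11) for k in keys)
print(f"moved: {moved / len(keys):.1%}")  # ~90.9% of keys remap
```

Only keys whose hash agrees mod both 10 and 11 stay put, which is 1 key in 11; everything else moves.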
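A consistent-hash ring with virtual nodes fits in a few lines. The `Ring` class and the choice of 64 vnodes below are illustrative, not any particular broker's implementation:

```python
import bisect
import hashlib

def _pos(label: str) -> int:
    # Map a label to a point on a 2^32 ring.
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:4], "big")

class Ring:
    """Minimal consistent-hash ring; each partition owns `vnodes` points."""

    def __init__(self, partitions, vnodes: int = 64):
        self.vnodes = vnodes
        self._points = []  # sorted (ring_position, partition) tuples
        for p in partitions:
            self.add(p)

    def add(self, partition: str) -> None:
        for v in range(self.vnodes):
            bisect.insort(self._points, (_pos(f"{partition}#{v}"), partition))

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's position.
        i = bisect.bisect(self._points, (_pos(key), ""))
        return self._points[i % len(self._points)][1]

ring = Ring([f"p{i}" for i in range(10)])
keys = [f"user-{i}" for i in range(100_000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("p10")  # grow 10 -> 11 partitions
moved = sum(before[k] != ring.lookup(k) for k in keys)
print(f"moved: {moved / len(keys):.1%}")  # ~1/11 of keys, vs ~91% with modulo
```

The new partition's 64 vnodes claim scattered slivers of the ring totaling roughly 1/N of its circumference, so only that fraction of keys relocates.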
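In Kafka, opting into incremental rebalancing is a client-side configuration. A sketch assuming the confluent-kafka Python client (broker address, group, and topic names are placeholders):

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "order-processors",          # placeholder group
    # cooperative-sticky revokes and assigns only the partitions that
    # actually move, instead of revoking everything and starting over.
    "partition.assignment.strategy": "cooperative-sticky",
})

def on_assign(consumer, partitions):
    # With cooperative-sticky this callback receives only the partitions
    # being added, not the consumer's full assignment.
    print("gained:", [p.partition for p in partitions])

def on_revoke(consumer, partitions):
    print("lost:", [p.partition for p in partitions])

consumer.subscribe(["orders"], on_assign=on_assign, on_revoke=on_revoke)
```

Consumers that keep their partitions never pause, which is why the per-consumer disruption drops from a full-group stall to a brief hand-off of the moved partitions.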
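The headroom rule can be automated as a simple capacity check. The per-partition capacity figure and threshold below are hypothetical inputs, not measured values:

```python
PARTITION_CAPACITY_MBPS = 10.0  # assumed per-partition throughput limit
TARGET_UTILIZATION = 0.50       # sustained-utilization ceiling cited above

def next_partition_count(current: int, per_partition_mbps: list[float]) -> int:
    # Size against the hottest partition, since skewed keys saturate
    # one partition long before the average does.
    peak = max(per_partition_mbps)
    if peak / PARTITION_CAPACITY_MBPS <= TARGET_UTILIZATION:
        return current  # enough headroom, no resize
    return current * 2  # double (rather than add one) so resizes stay rare

counts = [6.2, 4.1, 5.8, 3.9, 6.0, 4.4, 5.1, 4.7, 5.5, 4.9]
print(next_partition_count(10, counts))
# -> 20: hottest partition at 62% utilization exceeds the 50% target
```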
💡 Key Takeaways
Modulo hashing remaps most keys when partition count changes. With hash(key) mod N, changing N from 10 to 11 remaps roughly 90 percent of keys to different partitions, disrupting caches and storage locality.
Consistent hashing with virtual nodes minimizes remapping. Mapping keys to ring positions and partitions to ranges means adding one partition remaps only roughly 1/N of keys (~9 percent at 11 partitions vs ~90 percent with modulo).
Consumer rebalances pause processing during partition changes. Stop-the-world rebalancing halts all consumers for seconds to minutes; cooperative rebalancing reduces but does not eliminate disruption and tail latency spikes.
Scale in large increments to amortize disruption cost. Doubling partition count (10 to 20 to 40) spreads pain over fewer events than incremental adds (10 to 11 to 12 to 13), reducing operational toil and consumer churn.
Provision 30 to 50 percent headroom to avoid emergency resizing. LinkedIn targets under 50 percent sustained partition utilization to absorb traffic spikes without triggering frequent, disruptive repartitioning events.
📌 Examples
Amazon Kinesis stream resharding: splitting 10 shards into 20 using consistent hashing remaps only 50 percent of keys vs 95 percent with modulo; consumers using the KCL (Kinesis Client Library) handle the rebalance automatically with a cooperative protocol
LinkedIn Kafka cluster expansion: doubling a topic from 64 to 128 partitions triggers a rebalance across 100+ consumer instances; cooperative rebalancing limits the per-consumer pause to under 5 seconds vs 30+ seconds with stop-the-world
Azure Event Hubs partition increase: adding partitions to an existing hub does not remap existing partition keys (the partition key to partition ID mapping is stable), but new keys distribute across the expanded partition set