
Rebalancing Strategies: Round Robin vs Sticky vs Lag Aware Assignment

Rebalancing is the process of redistributing partitions when consumer group membership changes. The choice of assignment strategy fundamentally trades off fairness, stability, and resource utilization. Range and round robin assigners aim for equal partition counts per consumer but cause large scale partition movement on any membership change, creating stop the world pauses that can last seconds to minutes in large groups.

Sticky and cooperative incremental rebalancing minimize disruption by reassigning only the partitions that must move. When a consumer leaves, sticky assignment revokes only its partitions and distributes them to survivors while keeping all other assignments intact. Cooperative rebalancing goes further: it allows consumers to continue processing their retained partitions during the rebalancing protocol, avoiding a full stop. LinkedIn relies on sticky assignment to keep rebalance pauses from cascading across thousands of consumer instances.

In heterogeneous fleets, equal partition counts per consumer are rarely optimal. A consumer on a machine with 16 cores and fast storage can process 2000 records per second, while another on 4 cores with high latency dependencies handles only 1000 records per second. Assigning each consumer 10 partitions leaves the fast one underutilized and makes the slow one a bottleneck. Agoda faced this: 5 of 40 partitions had persistent lag spikes, requiring 50% overprovisioning to meet Service Level Agreements (SLAs).

Lag aware and capacity aware assignment solves this by measuring per consumer processing rates and per partition lag, then assigning partitions to equalize expected processing time rather than partition count. Agoda's lag aware consumer used incremental cooperative rebalancing to shift hot partitions to faster workers, which reduced resources by 50% and eliminated the chronic high lag partitions. The tradeoffs are greater algorithmic complexity, the need for robust metrics pipelines, and more frequent micro rebalances that must be carefully tuned to avoid oscillation.
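The idea of equalizing expected processing time rather than partition count can be sketched with a simple greedy heuristic. This is an illustration only, not Agoda's algorithm or a Kafka assignor: the `Consumer` class, `assign_by_capacity`, and the inflow figures are hypothetical, and the per consumer rates are assumed to come from some metrics pipeline. Only the 2000 and 1000 records per second figures come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    name: str
    rate: float        # measured processing rate in records/s (assumed metric)
    partitions: list = field(default_factory=list)
    load: float = 0.0  # total inflow (records/s) of partitions assigned so far

    def utilization_with(self, extra: float) -> float:
        """Fraction of capacity used if this consumer also took `extra` records/s."""
        return (self.load + extra) / self.rate


def assign_by_capacity(partition_inflow: dict, consumers: list) -> dict:
    """Greedy sketch: place the heaviest partition first, always on the
    consumer that would be least utilized after taking it."""
    for c in consumers:
        c.partitions, c.load = [], 0.0
    heaviest_first = sorted(partition_inflow.items(), key=lambda kv: (-kv[1], kv[0]))
    for pid, inflow in heaviest_first:
        target = min(consumers, key=lambda c: (c.utilization_with(inflow), c.name))
        target.partitions.append(pid)
        target.load += inflow
    return {c.name: sorted(c.partitions) for c in consumers}


# Hypothetical fleet: the two consumer speeds from the text above.
fleet = [Consumer("fast", rate=2000.0), Consumer("slow", rate=1000.0)]
# Made-up workload: 30 partitions receiving 50 records/s each.
inflow = {p: 50.0 for p in range(30)}
plan = assign_by_capacity(inflow, fleet)
print({name: len(parts) for name, parts in plan.items()})
```

With these inputs the fast consumer ends up with 20 partitions and the slow one with 10, so both run at the same utilization. A production assignor would also have to limit how many partitions move per rebalance, which this sketch ignores.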
💡 Key Takeaways
Eager rebalancing revokes all partitions and halts all processing; in groups of around 1000 consumers this can create 10 to 60 second stop the world pauses that violate SLAs and cause cascading backlog.
Sticky assignment minimizes partition movement by reassigning only vacated partitions; cooperative incremental rebalancing further reduces disruption by allowing consumers to keep processing retained partitions during the protocol.
Round robin achieves fairness (equal partition count) but ignores capacity: a 2000 rec/s consumer and a 1000 rec/s consumer both get 10 partitions, wasting 50% of the fast consumer's capacity while overloading the slow one.
Lag aware assignment measures per consumer throughput and per partition lag, then assigns partitions to equalize processing time; Agoda achieved 50% resource reduction and eliminated chronic lag spikes using this approach.
Capacity weighted assignment requires robust metrics pipelines, hysteresis to avoid oscillation, and outlier detection (Interquartile Range (IQR) or standard deviation thresholds) to identify truly slow partitions versus transient spikes.
Producer side lag aware partitioning can complement consumer assignment: by treating lag as queue depth, producers send less traffic to overloaded partitions so that backlog growth evens out, as Agoda demonstrated with partition throttling driven by outlier detection.
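The outlier detection and hysteresis mentioned in the takeaways could look roughly like the sketch below. It flags a partition as slow only when its lag sits above the upper Tukey fence (Q3 + 1.5 × IQR) for several consecutive samples. The names `lag_outliers` and `SlowPartitionDetector`, the streak length of 3, and the sample lag values are all assumptions made for illustration; the source does not describe Agoda's actual thresholds.

```python
import statistics


def lag_outliers(lag_by_partition: dict, k: float = 1.5) -> set:
    """Partitions whose lag exceeds the upper fence Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(list(lag_by_partition.values()), n=4)
    fence = q3 + k * (q3 - q1)
    return {p for p, lag in lag_by_partition.items() if lag > fence}


class SlowPartitionDetector:
    """Hysteresis: report a partition only after `required_streak`
    consecutive samples in which it was an outlier."""

    def __init__(self, required_streak: int = 3, k: float = 1.5):
        self.required_streak = required_streak
        self.k = k
        self.streak = {}  # partition -> consecutive outlier samples

    def observe(self, lag_by_partition: dict) -> set:
        flagged = lag_outliers(lag_by_partition, self.k)
        for p in lag_by_partition:
            # A single non-outlier sample resets the streak.
            self.streak[p] = (self.streak.get(p, 0) + 1) if p in flagged else 0
        return {p for p, n in self.streak.items() if n >= self.required_streak}


# Made-up samples: 40 partitions, 5 of them chronically lagging.
steady = {p: 100 + p for p in range(35)}
steady.update({p: 5000 for p in range(35, 40)})
# Same sample with a one-off spike on partition 0.
spike = dict(steady)
spike[0] = 5000

detector = SlowPartitionDetector(required_streak=3)
for sample in (steady, spike, steady):
    chronic = detector.observe(sample)
print(sorted(chronic))
```

One caveat on the design: an IQR fence is computed from the data itself, so it stops working once roughly a quarter of the partitions are lagging at the same time. That makes it better suited to finding a few hot partitions than to detecting a group wide slowdown.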
📌 Examples
Agoda processes hundreds of terabytes daily; one supplier generates 1.5 million price updates per minute. A lag aware consumer with cooperative rebalancing reduced resources by 50% and cleared the chronic backlog on 5 of 40 partitions.
LinkedIn uses sticky assignment across fleets with hundreds of thousands of partitions to minimize rebalance disruption and prevent stop the world pauses from cascading across thousands of consumers.
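A heterogeneous fleet of 10 messages/s and 20 messages/s nodes using round robin requires 50% overprovisioning to hit P99 latency SLAs: for an even mix of the two node types, an equal split caps every node at the slow node's 10 messages/s, so each pair delivers 20 messages/s out of 30 messages/s of capacity. Capacity aware assignment equalizes utilization and eliminates the overprovisioning.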
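The difference in partition movement between round robin and sticky assignment can be made concrete with a toy model. The functions below are simplified illustrations written for this comparison, not Kafka's `RoundRobinAssignor` or `StickyAssignor`. In particular, `sticky` here only handles a consumer leaving and does not rebalance when a consumer joins.

```python
def round_robin(partitions, consumers):
    """Deal partitions to consumers in order, recomputed from scratch."""
    members = sorted(consumers)
    return {p: members[i % len(members)] for i, p in enumerate(sorted(partitions))}


def sticky(previous, consumers):
    """Keep every assignment whose owner is still alive; give each orphaned
    partition to the currently least loaded survivor."""
    members = sorted(consumers)
    assignment = {p: c for p, c in previous.items() if c in members}
    load = {c: 0 for c in members}
    for c in assignment.values():
        load[c] += 1
    for p in sorted(previous):
        if p not in assignment:
            target = min(members, key=lambda c: (load[c], c))
            assignment[p] = target
            load[target] += 1
    return assignment


def moved(before, after):
    """Number of partitions whose owner changed."""
    return sum(1 for p in before if before[p] != after[p])


# Hypothetical group: 12 partitions, 4 consumers, then consumer c1 leaves.
partitions = range(12)
before = round_robin(partitions, ["c0", "c1", "c2", "c3"])
survivors = ["c0", "c2", "c3"]

rr_after = round_robin(partitions, survivors)
sticky_after = sticky(before, survivors)
print("round robin moved:", moved(before, rr_after))
print("sticky moved:", moved(before, sticky_after))
```

In this toy case, recomputing round robin changes the owner of 9 of the 12 partitions, while the sticky variant moves only the 3 that belonged to the departed consumer and still leaves every survivor with 4 partitions.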