Rebalancing Strategies: Round-Robin vs Sticky vs Lag-Aware Assignment
Assignment Strategy Fundamentals
When the group coordinator distributes partitions among consumers, it uses an assignment strategy that determines which consumer gets which partitions. The choice of strategy affects rebalance behavior, processing continuity, and load distribution. The most common strategies are round-robin (deal partitions out one at a time across consumers), range (assign a contiguous range of partitions to each consumer), and sticky (minimize reassignment by keeping existing assignments when possible). Cooperative incremental rebalancing is usually listed alongside them, although it is a rebalance protocol rather than a placement rule: it lets a rebalance proceed without stopping all processing.
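To make the difference between round-robin and range placement concrete, here is a minimal sketch. The function names and the list-based inputs are illustrative assumptions, not the API of any particular client library, and real assignors also handle multiple topics and subscriptions.

```python
def round_robin_assign(partitions, consumers):
    """Deal partitions out one at a time, cycling through the consumers."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


def range_assign(partitions, consumers):
    """Give each consumer one contiguous block of partitions."""
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        # The first `extra` consumers absorb the remainder, one partition each.
        size = base + (1 if i < extra else 0)
        assignment[c] = partitions[start:start + size]
        start += size
    return assignment


partitions = list(range(7))
consumers = ["A", "B", "C"]

rr = round_robin_assign(partitions, consumers)
rg = range_assign(partitions, consumers)
print(rr)  # {'A': [0, 3, 6], 'B': [1, 4], 'C': [2, 5]}
print(rg)  # {'A': [0, 1, 2], 'B': [3, 4], 'C': [5, 6]}
```

Both placements give each consumer two or three partitions; they differ only in whether a consumer's partitions are interleaved or contiguous.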
Eager vs Cooperative Rebalancing
Traditional eager rebalancing revokes all partition assignments before redistributing them. When any consumer joins or leaves, every consumer stops processing, relinquishes all of its partitions, and waits for a new assignment. In a group of 1,000 consumers, this can produce stop-the-world pauses of 10-60 seconds during which nothing is processed. Cooperative incremental rebalancing addresses this by revoking only the partitions that must move. If consumer C leaves while holding partitions 5, 6, and 7, only those three partitions are revoked and reassigned, and all other consumers keep processing the partitions they already own. The impact of a rebalance drops from a full stop to a partial slowdown.
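The contrast can be sketched by counting which partitions stop being processed under each protocol. This is a simplified model under stated assumptions: the dictionaries and function names are hypothetical, and the real protocols involve join and sync rounds that are not modeled here.

```python
def eager_revocations(old):
    """Eager: every consumer gives up everything it owns."""
    return {c: sorted(parts) for c, parts in old.items()}


def cooperative_revocations(old, new):
    """Cooperative: a consumer gives up only partitions the new plan takes away."""
    return {
        c: sorted(set(parts) - set(new.get(c, [])))
        for c, parts in old.items()
    }


# Consumer C leaves while holding partitions 5, 6 and 7.
old = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7]}
new = {"A": [0, 1, 2, 5], "B": [3, 4, 6, 7]}

eager = eager_revocations(old)
coop = cooperative_revocations(old, new)

paused_eager = sum(len(parts) for parts in eager.values())
paused_coop = sum(len(parts) for parts in coop.values())
print(coop)                       # {'A': [], 'B': [], 'C': [5, 6, 7]}
print(paused_eager, paused_coop)  # 8 3
```

In this model all eight partitions pause under the eager protocol, while only the three that belonged to C are interrupted under the cooperative one.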
Sticky Assignment Benefits
Sticky assignment preserves partition assignments across rebalances when possible. If a consumer temporarily disconnects and reconnects within the session timeout, sticky assignment aims to return its original partitions rather than redistributing them. This has two benefits: less state rebuilding (consumers often cache partition-specific state such as decompression contexts or connection pools) and less rebalance churn (fewer partitions moving means the rebalance completes faster). Without sticky assignment, even a brief network blip could shuffle all partitions and invalidate cached state across the entire consumer group.
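A minimal sketch of the sticky idea follows: keep what each surviving consumer already owns, hand out orphaned partitions to the least-loaded consumer, then even out the counts. The function names and the `previous` mapping are assumptions made for illustration; production sticky assignors are considerably more involved.

```python
def sticky_assign(partitions, consumers, previous):
    """Keep prior ownership where possible, then balance with minimal moves."""
    assignment = {
        c: [p for p in previous.get(c, []) if p in partitions]
        for c in consumers
    }
    owned = {p for parts in assignment.values() for p in parts}
    # Partitions whose previous owner is gone go to the least-loaded consumer.
    for p in (p for p in partitions if p not in owned):
        target = min(consumers, key=lambda c: len(assignment[c]))
        assignment[target].append(p)
    # Even out the counts, moving one partition at a time.
    while True:
        most = max(consumers, key=lambda c: len(assignment[c]))
        least = min(consumers, key=lambda c: len(assignment[c]))
        if len(assignment[most]) - len(assignment[least]) <= 1:
            return assignment
        assignment[least].append(assignment[most].pop())


def moved(previous, new):
    """Count partitions taken away from a consumer that is still in the group."""
    owner_now = {p: c for c, parts in new.items() for p in parts}
    return sum(
        1
        for c, parts in previous.items() if c in new
        for p in parts if owner_now.get(p) != c
    )


previous = {"A": [0, 1, 2], "B": [3, 4, 5], "C": [6, 7, 8]}
partitions = list(range(9))
survivors = ["A", "B"]  # consumer C has left the group

sticky = sticky_assign(partitions, survivors, previous)
# Plain round-robin over the survivors, ignoring previous ownership.
naive = {c: partitions[i::len(survivors)] for i, c in enumerate(survivors)}

print(sticky)                   # {'A': [0, 1, 2, 6, 8], 'B': [3, 4, 5, 7]}
print(moved(previous, sticky))  # 0
print(moved(previous, naive))   # 2
```

Here the sticky plan moves nothing between A and B, while recomputing round-robin from scratch takes partition 1 from A and partition 4 from B, so both would lose cached state for a partition they were already serving.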
Heterogeneous Fleet Challenge
Standard assignment strategies assume that all consumers have equal processing capacity. In practice, fleets are often heterogeneous: some consumers run on 16-core instances and process 2,000 records/sec, while others run on 4-core instances and handle only 500 records/sec. Giving every consumer the same number of partitions leaves the fast consumers underutilized while the slow ones become bottlenecks with growing lag. As a result, the throughput the group can sustain without accumulating lag is set by its slowest consumer.
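The bottleneck effect can be worked through with a little arithmetic. The sketch below assumes a hypothetical fleet of two fast and two slow consumers, 20 partitions split evenly, and equal traffic on every partition; the function name and fleet layout are illustrative only.

```python
def sustainable_throughput(capacity, partition_count):
    """Highest total ingest rate (records/sec) the group absorbs without lag,
    assuming every partition receives the same share of traffic."""
    # The consumer with the least capacity per partition caps the per-partition rate.
    per_partition = min(
        capacity[c] / n for c, n in partition_count.items() if n > 0
    )
    return per_partition * sum(partition_count.values())


capacity = {"fast-1": 2000, "fast-2": 2000, "slow-1": 500, "slow-2": 500}
equal_split = {c: 5 for c in capacity}  # 20 partitions, 5 per consumer

equal_total = sustainable_throughput(capacity, equal_split)
fleet_capacity = sum(capacity.values())

print(equal_total)                   # 2000.0
print(fleet_capacity)                # 5000
print(equal_total / fleet_capacity)  # 0.4
```

Under these assumptions each slow consumer can keep up with only 100 records/sec per partition, so the whole group sustains 2,000 records/sec even though the machines could collectively process 5,000.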
Lag-Aware and Capacity-Aware Assignment
Advanced assignment strategies measure per-consumer processing rates and per-partition lag, then assign partitions to equalize expected processing time rather than partition count. A fast consumer might receive 15 partitions while a slow consumer receives 5, giving each roughly the same amount of work relative to its capacity. This approach can reduce required compute resources by 30-50% compared to equal assignment while eliminating the chronic high-lag partitions that plague heterogeneous deployments. The trade-off is the added complexity of measuring capacity and lag, plus the need for custom assignment logic.
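One simple way to approximate this is a greedy heuristic: take partitions in order of decreasing lag and give each to the consumer whose backlog would drain soonest after receiving it. The sketch below is one possible approach, not a reference implementation; `lag_aware_assign`, `partition_lag`, and `consumer_rate` are hypothetical names, and the inputs are assumed to be integer measurements taken elsewhere.

```python
from fractions import Fraction


def lag_aware_assign(partition_lag, consumer_rate):
    """Greedy placement that tries to equalize drain time (lag / rate)."""
    assignment = {c: [] for c in consumer_rate}
    backlog = {c: 0 for c in consumer_rate}
    # Largest lag first, with the partition id as a deterministic tie-break.
    for p in sorted(partition_lag, key=lambda p: (-partition_lag[p], p)):
        lag = partition_lag[p]
        # Fractions keep the comparison exact, so ties resolve predictably.
        target = min(
            consumer_rate,
            key=lambda c: Fraction(backlog[c] + lag, consumer_rate[c]),
        )
        assignment[target].append(p)
        backlog[target] += lag
    return assignment


# 20 partitions with 1,000 records of lag each; rates in records/sec.
partition_lag = {p: 1000 for p in range(20)}
consumer_rate = {"fast": 2000, "slow": 500}

assignment = lag_aware_assign(partition_lag, consumer_rate)
drain_seconds = {
    c: sum(partition_lag[p] for p in parts) / consumer_rate[c]
    for c, parts in assignment.items()
}

print({c: len(parts) for c, parts in assignment.items()})  # {'fast': 16, 'slow': 4}
print(drain_seconds)                                       # {'fast': 8.0, 'slow': 8.0}
```

With a 4:1 capacity ratio and uniform lag, the greedy pass settles on a 16 to 4 split and both consumers drain their backlog in the same 8 seconds. An equal 10 to 10 split would leave the slow consumer needing 20 seconds. Real lag is rarely uniform, which is why the heuristic sorts by lag before placing.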