Failure Modes: Rebalance Storms, Hot Partitions, and Recovery Strategies
Consumer groups fail in predictable patterns under production stress. Rebalance storms occur when eager rebalancing revokes all partitions simultaneously, halting processing across the entire group. With 1000 consumers this can cause 10 to 60 second stop-the-world pauses. Triggers include flaky networks causing missed heartbeats, Java Garbage Collection (GC) pauses exceeding the session timeout (commonly 10 seconds), and thundering herds on mass restarts. Mitigate with cooperative incremental rebalancing, session timeouts tuned above P99 GC pauses (measure with GC logs), and consumer restarts staggered by 5 to 10 seconds to avoid simultaneous joins.
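A minimal configuration sketch of these mitigations, assuming the Java Kafka consumer client; the class name and timeout values are illustrative and should be derived from your own GC logs and broker limits:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RebalanceTunedConsumer {
    public static KafkaConsumer<String, String> build(String bootstrap, String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // Cooperative incremental rebalancing: only the partitions that actually move
        // are revoked, so the rest of the group keeps processing during a rebalance.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());

        // Session timeout set above the measured P99 GC pause (illustrative value),
        // with a short heartbeat interval so genuine failures are still detected quickly.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45_000);
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 5_000);

        return new KafkaConsumer<>(props);
    }
}
```

Note that the broker's group.max.session.timeout.ms must permit the chosen session timeout.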
Hot partitions from key skew are the second most common failure. A small fraction of keys dominates traffic, backing up their partitions while others sit idle. Symptoms are sustained lag isolated to specific partitions and elevated tail latencies; for example, 3 of 100 partitions carry a 10,000-message lag while the rest stay under 100. Solutions include key-space sharding (add a sequence or random suffix to hot keys to spread them across partitions, then reaggregate downstream), dynamic repartitioning (detect hot keys via metrics and route them to dedicated high-throughput partitions), or producer-side load shaping (reduce traffic to lagging partitions, treating lag as backpressure, as Agoda demonstrated).
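A sketch of producer-side key-space sharding, assuming the Java Kafka producer; the HotKeySharder class, topic, and shard count are hypothetical, and in practice the hot-key set would be fed by lag or throughput metrics:

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HotKeySharder {
    private final Producer<String, String> producer;
    private final String topic;
    private final Set<String> hotKeys; // hypothetical: populated from lag metrics
    private final int shards;          // e.g. 8 sub-keys per hot key

    public HotKeySharder(Producer<String, String> producer, String topic,
                         Set<String> hotKeys, int shards) {
        this.producer = producer;
        this.topic = topic;
        this.hotKeys = hotKeys;
        this.shards = shards;
    }

    public void send(String key, String value) {
        String effectiveKey = key;
        if (hotKeys.contains(key)) {
            // Spread a hot key across N sub-keys; consumers strip the "#n" suffix
            // and reaggregate by the original key downstream.
            effectiveKey = key + "#" + ThreadLocalRandom.current().nextInt(shards);
        }
        producer.send(new ProducerRecord<>(topic, effectiveKey, value));
    }
}
```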
Slow consumers and retention overflow cause silent data loss when consumer throughput stays below producer throughput for longer than the retention window: records expire before they are consumed. Calculate time to zero lag as backlog divided by consumer throughput: with a backlog of 1 million records and consumer throughput of 2000 records per second, clearing the backlog takes 500 seconds (about 8 minutes), comfortably inside a 7-day retention window. If retention is only 1 hour and the backlog grows faster than you consume, you lose data. Monitor time to zero lag against the retention window and alert when it exceeds 50 percent of retention; then trigger autoscaling, throttle producers, or route overflow to Dead Letter Queues (DLQs).
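A minimal sketch of the time-to-zero-lag check described above; the class name and hard-coded figures are illustrative, and a real deployment would pull backlog and throughput from lag metrics:

```java
public final class RetentionRiskCheck {
    /**
     * Returns true if the backlog risks expiring before it is consumed.
     * The 0.5 threshold mirrors the "alert at 50 percent of retention" rule.
     */
    public static boolean atRisk(long backlogRecords,
                                 double consumeRatePerSec,
                                 long retentionSeconds) {
        if (consumeRatePerSec <= 0) {
            return true; // no forward progress: data will expire
        }
        double timeToZeroLagSec = backlogRecords / consumeRatePerSec;
        return timeToZeroLagSec > 0.5 * retentionSeconds;
    }

    public static void main(String[] args) {
        // 1M-record backlog at 2,000 rec/s => 500 s, well inside 7-day retention.
        System.out.println(atRisk(1_000_000, 2_000, 7 * 24 * 3600)); // false
        // Same backlog at 200 rec/s against 1-hour retention => 5,000 s > 1,800 s.
        System.out.println(atRisk(1_000_000, 200, 3600));            // true
    }
}
```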
False failures from network partitions, GC pauses, or blocking Input/Output (I/O) cause unnecessary rebalances: a consumer misses heartbeats during a 15 second GC pause and its partitions are reassigned even though it recovers immediately. Tune the session timeout above P99 GC and I/O pauses (30 to 60 seconds in Java services with large heaps) while keeping the heartbeat interval short (3 to 5 seconds) for liveness detection. Isolate long-running processing from the heartbeat and poll path: use asynchronous processing or separate threads for heavy computation so heartbeats continue even during slow operations.
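A sketch of that isolation pattern, assuming the Java Kafka consumer: recent Java clients send heartbeats from a background thread, but a stalled poll loop still trips max.poll.interval.ms, so the slow work is moved to a worker thread while the poll loop keeps spinning. The topic name and batch handling are illustrative and omit rebalance edge cases:

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NonBlockingPollLoop {
    private final KafkaConsumer<String, String> consumer;
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    public NonBlockingPollLoop(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    public void run() {
        consumer.subscribe(List.of("events")); // hypothetical topic
        Future<?> inFlight = null;

        while (true) {
            // poll() keeps being called even while a batch is processed elsewhere,
            // so the consumer stays within max.poll.interval.ms during slow work.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));

            if (inFlight != null && inFlight.isDone()) {
                consumer.commitSync();                  // commit only once the batch finished
                consumer.resume(consumer.assignment()); // resume from the poll thread; the consumer is not thread-safe
                inFlight = null;
            }

            if (!records.isEmpty()) {
                // Hand the batch to a worker thread and pause owned partitions, so later
                // polls return empty batches while keeping group membership alive.
                inFlight = worker.submit(() -> records.forEach(NonBlockingPollLoop::process));
                consumer.pause(consumer.assignment());
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // slow, blocking work goes here, off the poll thread
    }
}
```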
💡 Key Takeaways
•Rebalance storms with the eager protocol cause 10 to 60 second stop-the-world pauses in large groups; cooperative incremental rebalancing and staggered restarts (5 to 10 second delays) reduce this to milliseconds of disruption per consumer.
•Hot partitions from key skew show sustained lag on 3 to 5 percent of partitions while others are idle; detect with Interquartile Range (IQR) outlier detection (lag greater than Q3 plus 1.5 times IQR; see the sketch after this list) and mitigate with key sharding, dynamic repartitioning, or producer throttling.
•Retention overflow causes silent data loss when time to zero lag exceeds retention window; monitor backlog divided by throughput versus retention (alert when ratio exceeds 0.5) and trigger autoscaling or DLQ routing.
•False failures from GC pauses or network jitter cause unnecessary rebalances; tune session timeout above P99 pause duration (30 to 60 seconds for Java with large heaps) and isolate heartbeat thread from blocking operations.
•Thundering herds on recovery create mass rebalances and burst I/O against brokers; stagger consumer restarts by 5 to 10 seconds, use cooperative rebalancing, and rate-limit catch-up reads to prevent broker saturation.
•Too many consumers beyond partition count waste resources and confuse autoscalers; use lag per partition and active assignment metrics for scaling decisions instead of CPU or memory alone.
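Referenced from the hot-partition takeaway above, a minimal sketch of IQR outlier detection over per-partition lag; the class name and nearest-rank percentile are illustrative, and production code would use a statistics library:

```java
import java.util.HashMap;
import java.util.Map;

public final class HotPartitionDetector {
    /** Flags partitions whose lag exceeds Q3 + 1.5 * IQR across the group's per-partition lags. */
    public static Map<Integer, Long> hotPartitions(Map<Integer, Long> lagByPartition) {
        if (lagByPartition.isEmpty()) {
            return Map.of();
        }
        long[] lags = lagByPartition.values().stream()
                .mapToLong(Long::longValue).sorted().toArray();
        double q1 = percentile(lags, 25);
        double q3 = percentile(lags, 75);
        double threshold = q3 + 1.5 * (q3 - q1);

        Map<Integer, Long> hot = new HashMap<>();
        lagByPartition.forEach((partition, lag) -> {
            if (lag > threshold) {
                hot.put(partition, lag);
            }
        });
        return hot;
    }

    // Simple nearest-rank percentile over a sorted array.
    private static double percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }
}
```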
📌 Examples
Agoda detected 5 of 40 partitions with persistent lag using IQR outlier detection; lag-aware consumer and producer-side throttling eliminated the hot partitions and cut resource usage by 50 percent.
A consumer group with 10 second session timeout experienced rebalance storms during P99 GC pauses of 15 seconds; tuning session timeout to 45 seconds and heartbeat interval to 5 seconds eliminated false failures.
A stream with 7 day retention and a 1 million record backlog at 2000 rec/s consumer throughput has a time to zero lag of 500 seconds (about 8 minutes), safely within retention; if throughput drops to 200 rec/s, time to zero lag grows to 5000 seconds (about 83 minutes), triggering autoscaling.