Failure Modes: Rebalance Storms, Hot Partitions, and Recovery Strategies
Rebalance Storms
Rebalance storms occur when eager rebalancing (the strategy that revokes all partitions before redistributing) cascades across a large consumer group. With 1,000 consumers, a single consumer crash triggers revocation of all partitions from all consumers, halting processing for 10-60 seconds while the coordinator redistributes. During this window, no records are processed and lag grows. Triggers include: flaky networks causing heartbeat failures, GC (garbage collection) pauses exceeding session timeout (commonly 10 seconds), or thundering herd effects when mass-restarting consumers during deployments. Mitigate with cooperative incremental rebalancing (which only revokes partitions that must move) and staggered consumer restarts spaced 5-10 seconds apart.
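The mitigations above boil down to two consumer settings plus restart discipline. A minimal sketch of the configuration, assuming a librdkafka-based client such as confluent-kafka (the broker address and group name are placeholders):

```python
# Hypothetical consumer configuration illustrating the mitigations above.
# Keys follow librdkafka/confluent-kafka naming; values are illustrative.
consumer_config = {
    "bootstrap.servers": "broker:9092",   # placeholder address
    "group.id": "orders-processor",       # hypothetical group name
    # Cooperative incremental rebalancing: only partitions that must move
    # are revoked, so one crashed consumer no longer halts the whole group.
    "partition.assignment.strategy": "cooperative-sticky",
    # Session timeout generous enough to ride out transient pauses;
    # heartbeats stay frequent so real failures are still detected quickly.
    "session.timeout.ms": 45_000,
    "heartbeat.interval.ms": 3_000,
}
```

For the deployment side, staggering restarts simply means sleeping 5-10 seconds between restarting each consumer instead of bouncing the whole group at once.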
Hot Partitions
Hot partitions occur when a small fraction of partition keys dominates traffic. If one user or entity generates 10% of all events, their partition backs up while others remain nearly empty. Symptoms: sustained lag isolated to specific partitions while others have zero lag; one consumer at 100% CPU while others sit idle. Solutions include: key space sharding (append a random suffix or sequence number to hot keys, spreading their load across multiple partitions at the cost of losing per-key ordering), dynamic repartitioning (manually move hot keys to dedicated partitions), or producer-side load shaping (rate-limit events from hot keys or batch them).
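Key space sharding can be sketched in a few lines. This is an illustrative helper, not a library API: `shard_hot_key`, `SHARDS`, and the example keys are all made up for the sketch, and the known-hot keys are assumed to be identified elsewhere (e.g., from partition lag metrics):

```python
import random

SHARDS = 8  # how many sub-keys to spread each hot key across (tunable)

def shard_hot_key(key: str, hot_keys: set[str]) -> str:
    """Append a random suffix to known-hot keys so their records hash to
    up to SHARDS different partitions. Cold keys pass through unchanged
    and keep strict per-key ordering; hot keys trade ordering for spread."""
    if key in hot_keys:
        return f"{key}-{random.randrange(SHARDS)}"
    return key

# Producer side: use the sharded key as the partition key.
# Consumer side: strip the suffix (key.rsplit("-", 1)[0]) before
# aggregating, and expect records for a hot key to arrive out of order.
```

A sequence-number suffix (round-robin) works equally well and distributes load more evenly than a random one; random avoids shared counter state across producers.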
Slow Consumers and Retention Overflow
When consumer throughput falls below producer throughput for longer than the retention period (how long records are kept before deletion), silent data loss occurs. Consider a 1-million-record backlog with a consumer processing 2,000 records/sec: time to clear is 500 seconds (~8 minutes). But if retention is only 1 hour and the consumer was down for 2 hours, the first hour of data has already been deleted before the consumer could read it. Calculate time-to-zero-lag (lag divided by consume rate) relative to retention and alert when the ratio approaches danger levels (e.g., time_to_clear / retention > 0.5).
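The alerting rule above is a one-line calculation. A minimal sketch, with the function name and threshold chosen for illustration:

```python
def lag_risk_ratio(lag_records: int, consume_rate_per_s: float,
                   retention_s: float) -> float:
    """Ratio of time-to-zero-lag to the retention window.
    As the ratio approaches 1.0, records are deleted before
    the consumer can process them."""
    time_to_clear_s = lag_records / consume_rate_per_s
    return time_to_clear_s / retention_s

# Worked example from the text: 1M-record backlog at 2,000 records/sec
# with 1-hour retention -> time to clear is 500 s, ratio ~0.14 (safe).
ratio = lag_risk_ratio(1_000_000, 2_000, retention_s=3_600)
alert = ratio > 0.5  # illustrative danger threshold
```

Note this assumes the consume rate is the *net* drain rate; if producers are still writing, subtract the produce rate from the consume rate first.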
False Failures from Transient Issues
Network partitions, GC pauses, or blocking I/O cause consumers to miss heartbeats, triggering unnecessary rebalances. A consumer experiencing a 15-second GC pause misses its heartbeats and is declared dead, triggering a rebalance; the consumer recovers immediately, but its partitions have already been reassigned. Tune the session timeout above your P99 GC pause duration (a timeout of 30-60 seconds is typical) while keeping the heartbeat interval short (3-5 seconds) for timely failure detection. Monitor rebalance frequency: more than 1-2 rebalances per hour suggests configuration problems or infrastructure instability.
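These tuning rules can be checked mechanically before deployment. A sketch of a validation helper (the function and its warning strings are illustrative; the one-third rule reflects the common guidance that the heartbeat interval stay at or below a third of the session timeout):

```python
def validate_timeouts(session_timeout_ms: int, heartbeat_interval_ms: int,
                      p99_gc_pause_ms: int) -> list[str]:
    """Return a list of configuration warnings; empty means the
    timeouts are consistent with the observed GC behavior."""
    warnings = []
    if session_timeout_ms <= p99_gc_pause_ms:
        warnings.append(
            "session timeout below P99 GC pause: healthy consumers "
            "will be declared dead during long pauses")
    if heartbeat_interval_ms > session_timeout_ms // 3:
        warnings.append(
            "heartbeat interval should be at most 1/3 of the session "
            "timeout so several heartbeats can be missed before eviction")
    return warnings

# 45 s session timeout, 3 s heartbeats, 15 s P99 GC pause -> no warnings.
# A 10 s session timeout against the same 15 s pause would flag the
# exact false-failure scenario described above.
```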