
Production Failure Modes: Rebalance Storms, Hotspots, and ISR Shrink

Event streaming systems fail in characteristic ways under load or misconfiguration.

Rebalance storms occur when consumer group membership churns: session timeouts trigger rebalances, pausing consumption for 30 to 90 seconds while the coordinator redistributes partitions. If rebalances cascade (slow consumers time out during a rebalance, triggering another one), the group never stabilizes and throughput craters. Use cooperative incremental rebalancing (only reassign partitions that need to move) and increase the session timeout to 30 to 60 seconds to absorb transient garbage collection pauses.

Partition hotspotting arises from poor key distribution: a celebrity user generates 10,000 events per second while typical users generate 1 per minute. The celebrity's partition lags behind, spiking consumer lag from seconds to minutes and blocking downstream joins or aggregations waiting for that partition. Mitigate by salting keys (append a random suffix to spread load) or by using composite keys that include time buckets. Monitor per-partition throughput and lag; alert on any partition exceeding 2 times the median.

ISR shrink and write unavailability occur when followers cannot keep pace with the leader. Network saturation, disk input/output contention, or garbage collection pauses cause followers to fall out of sync. Once the ISR drops below the minimum in-sync replicas setting (typically 2), quorum writes block entirely and producers time out. This cascades: blocked producers retry, increasing load and further delaying followers. The mitigation is aggressive producer throttling and backpressure: clients must adapt to broker pushback signals and back off exponentially on errors.

Large messages wreck throughput and increase tail latency. A 10 MB message takes 80 milliseconds to transfer at 1 Gbps, blocking the entire partition. Worse, if that message must be redelivered, the consumer pays the full cost again. Store large payloads (images, videos, model weights) in object storage and send pointers in events, keeping messages under 100 KB.

Schema evolution breakage is another common failure: a producer deploys a new schema version with a removed field, and old consumers crash trying to deserialize. Enforce backward compatibility with schema registries and contract tests before deploying producers.
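To make the rebalance tuning above concrete, here is a minimal consumer configuration sketch using the standard Java Apache Kafka client. The specific timeout values are illustrative assumptions to tune against your own garbage collection pauses and batch processing times, not universal recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceTunedConsumer {
    public static KafkaConsumer<String, String> build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Cooperative incremental rebalancing: only partitions that actually move are revoked,
        // so the rest of the group keeps consuming while the coordinator reassigns.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());

        // A 45 s session timeout absorbs transient GC pauses without triggering a rebalance;
        // heartbeats should be roughly a third of the session timeout.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45_000);
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15_000);

        // max.poll.interval.ms must exceed the slowest realistic poll() batch processing time;
        // 5 minutes is the client default, raise it if batches legitimately run longer.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000);

        return new KafkaConsumer<>(props);
    }
}
```

Note that switching an existing group from an eager assignor to CooperativeStickyAssignor generally requires a rolling upgrade of all group members rather than a single flag flip.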
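The key-salting and composite-key mitigations for hot partitions can be sketched roughly as below. SALT_BUCKETS, the "#" separator, and the day bucket are illustrative choices, not Kafka settings; salting also trades away strict per-key ordering, so consumers needing all of one user's events must read every salted variant of that key.

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SaltedKeyProducer {
    // Number of "shards" a single hot key is spread across; a tuning assumption.
    private static final int SALT_BUCKETS = 8;

    private final KafkaProducer<String, String> producer;

    public SaltedKeyProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Salted key: the same userId now hashes to up to SALT_BUCKETS different partitions.
    public void sendSalted(String topic, String userId, String payload) {
        String saltedKey = userId + "#" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
        producer.send(new ProducerRecord<>(topic, saltedKey, payload));
    }

    // Composite key with a time bucket: spreads a hot user across partitions by day
    // while keeping all events for one (user, day) pair together and ordered.
    public void sendComposite(String topic, String userId, String dayBucket, String payload) {
        producer.send(new ProducerRecord<>(topic, userId + "#" + dayBucket, payload));
    }
}
```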
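For the producer-side backpressure described above, a hedged configuration sketch might look like the following; the buffer size and timeouts are assumptions to tune per workload. Broker-side dynamic quotas (per-client producer byte-rate limits) complement this by throttling clients centrally while followers catch up.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class BackpressureAwareProducer {
    public static KafkaProducer<byte[], byte[]> build(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Require acknowledgement from the full in-sync replica set; combined with the broker's
        // minimum in-sync replicas setting, this surfaces ISR shrink as blocked or failed writes
        // rather than silent under-replication.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // Wait between retries instead of hammering an already struggling broker,
        // and give the retry loop a bounded total budget.
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        // A bounded in-memory buffer plus max.block.ms makes send() push backpressure
        // up to the caller instead of queueing unbounded data while brokers are slow.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 32 * 1024 * 1024L);
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10_000);

        return new KafkaProducer<>(props);
    }
}
```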
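Keeping large payloads out of the log is usually done with a pointer (claim-check) pattern, sketched below. The BlobStore interface and the inline JSON pointer are placeholders for illustration, not a real library API; in practice the pointer would be a schema-managed event referencing S3, GCS, or a similar object store.

```java
import java.net.URI;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClaimCheckPublisher {
    /** Placeholder for any object-store client (S3, GCS, ...); not a real library interface. */
    public interface BlobStore {
        URI put(String key, byte[] payload);
    }

    private final Producer<String, String> producer;
    private final BlobStore blobStore;

    public ClaimCheckPublisher(Producer<String, String> producer, BlobStore blobStore) {
        this.producer = producer;
        this.blobStore = blobStore;
    }

    // Upload the large payload out of band and publish only a small pointer event,
    // keeping the Kafka message well under the ~100 KB guideline from the text.
    public void publish(String topic, String entityId, byte[] largePayload) {
        URI location = blobStore.put(entityId, largePayload);
        String pointerEvent = "{\"entityId\":\"" + entityId + "\","
                + "\"payloadUri\":\"" + location + "\","
                + "\"sizeBytes\":" + largePayload.length + "}";
        producer.send(new ProducerRecord<>(topic, entityId, pointerEvent));
    }
}
```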
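Finally, a registry-independent sketch of a publish-time compatibility gate using Avro's built-in checker; a Confluent Schema Registry deployment would typically call the registry's compatibility check instead, but the reader/writer logic is the same. The method name safeToDeployProducer is a hypothetical helper for CI, not a library API.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaCompatibilityGate {

    // True if a consumer using readerSchema can decode records written with writerSchema.
    private static boolean canRead(Schema readerSchema, Schema writerSchema) {
        return SchemaCompatibility
                .checkReaderWriterCompatibility(readerSchema, writerSchema)
                .getType() == SchemaCompatibilityType.COMPATIBLE;
    }

    /**
     * CI gate for a producer deploy: every schema currently used by active consumers
     * (readers) must still be able to read data written with the proposed new schema.
     * This catches exactly the failure in the text: a removed field crashes old
     * consumers unless their schema declares a default for it.
     */
    public static boolean safeToDeployProducer(String proposedWriterJson, String... activeReaderJsons) {
        Schema writer = new Schema.Parser().parse(proposedWriterJson);
        for (String readerJson : activeReaderJsons) {
            Schema reader = new Schema.Parser().parse(readerJson);
            if (!canRead(reader, writer)) {
                return false;  // reject the deployment before it reaches production
            }
        }
        return true;
    }
}
```

Wiring a check like this into the deployment pipeline fails the build before an incompatible producer ever reaches production, which is the "enforce at publish time, not runtime" point below.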
💡 Key Takeaways
Rebalance storms are detected by tracking rebalance frequency: more than 1 rebalance per 5 minutes indicates instability. Increase the session timeout from the 10-second default to 30 to 60 seconds, tune max poll interval to exceed the 99th-percentile batch processing time, and enable cooperative rebalancing to reduce pause duration from 30 to 90 seconds down to sub-second.
Partition hotspots are diagnosed by per-partition lag and throughput metrics: alert when any partition's lag exceeds 2 times the group average or its throughput exceeds 10 times the median. Repartition topics with salted or composite keys, or split hot entities across multiple topics.
ISR shrink events should page the on-call immediately: under-replicated partitions mean you are one broker failure away from data loss. Check follower fetch lag, disk input/output utilization, and network saturation. Throttle producers using dynamic quotas to let followers catch up before accepting new writes.
Large message impact is nonlinear: a single 10 MB message in a stream of 1 KB messages increases tail latency by 100 times and reduces effective throughput by 90 percent on that partition. Set the broker-side max message size to 1 MB and reject larger messages at ingestion time.
Schema compatibility should be enforced at publish time, not runtime: integrate a schema registry with continuous integration and continuous deployment pipelines. Reject producer deployments that break backward or forward compatibility with active consumers, preventing deserialization crashes.
📌 Examples
A social media platform experienced rebalance storms when a deployment caused 10-second garbage collection pauses across all consumers. The session timeout was set to 10 seconds, so every pause triggered a rebalance, which paused other consumers, cascading into continuous rebalancing. Increasing the session timeout to 45 seconds stabilized the group.
An e-commerce site keyed order events by user_id. A bulk buyer placed 50,000 orders per day, creating a hot partition that lagged 10 minutes behind others. Downstream join operators timed out waiting for events from that partition. They switched to keying by (user_id, day) to spread the bulk buyer's orders across multiple partitions.
During a network brownout, follower replicas fell 5 seconds behind leaders. ISR dropped to 1 on 20 percent of partitions. Writes with quorum acks blocked, timing out order placement requests. The incident response was to throttle producers by 50 percent using dynamic quotas, allowing followers to catch up in 90 seconds and restore ISR health.
A machine learning pipeline sent 8 MB model weight updates through Kafka. Throughput dropped from 10,000 events per second to 50 per second, and consumer lag spiked to hours. They moved the model weights to S3 and sent 200-byte pointers in events, restoring throughput and reducing latency from seconds to milliseconds.