Failure Modes: Retries, Reordering, and Poison Messages

Producer Retry Duplicates:

When a publish succeeds but the acknowledgment is lost, the producer retries and the broker receives the message twice. Without idempotent producer semantics (tracking per-producer, per-partition sequence numbers), the duplicate is written again, potentially out of order. Modern Kafka producers use sequence numbers and the broker deduplicates within a window.

Consumer Async Processing Violations:

A consumer reads message 1 then message 2, dispatches both to a thread pool, and message 2 completes before message 1. If the consumer commits offsets immediately after dispatch rather than after ordered completion, the system has processed messages out of order. The solution is single-flight processing per partition or ordered completion queues.

Poison Message Head-of-Line Blocking:

A malformed message or one triggering an external service timeout causes the consumer to fail, retry, and lock up the partition. All subsequent messages wait, accumulating lag. After a bounded retry budget (commonly 3-5 retries with exponential backoff), the message must be moved to a Dead Letter Queue (DLQ).

❗ Remember: Cross-system transactions can hold partitions hostage. If database p99 commits take 500ms, the partition processes at most 2 messages/sec, creating massive lag. Microsoft and Amazon document this as the primary cause of unexpected partition stalls.

💡 Key Takeaways

✓Producer retries without sequence numbers cause duplicates and reordering. Lost acknowledgments trigger retry; broker receives message twice and may interleave with newer messages unless per producer, per partition sequencing is enforced.

✓Async consumer processing breaks order without serialization. Dispatching messages to thread pools completes them out of sequence; must use single flight processing or ordered completion queues that commit offsets in receive order.

✓Poison messages block entire partitions via head of line blocking. One malformed message or external timeout stalls all subsequent messages; requires bounded retry (3 to 5 attempts) then DLQ to unblock partition.

✓Transactional commits for exactly once semantics add latency. Coordinating message offset and database write in single transaction holds partition for database commit time; p99 commit at 500 milliseconds caps throughput at 2 messages per second.

✓Multi region concurrent writers break global order. Active active publishing from multiple regions interleaves events unpredictably; requires region scoped partitions with downstream merge using per key sequence numbers and conflict resolution.

📌 Interview Tips

1Kafka idempotent producer: enables exactly once per partition semantics by tracking producer ID and sequence number per partition, broker deduplicates retries within 5 request window automatically

2LinkedIn event processing: uses serial executor per partition in consumer, processing message N fully (including downstream writes) before reading message N plus 1, accepting lower CPU utilization for ordering guarantee

3Amazon Kinesis consumer: after 3 retries of message failing to write to DynamoDB (malformed payload), logs to DLQ, increments skip metric, and continues to next message to prevent partition stall

← Back to Message Ordering & Partitioning Overview