Message Queues & StreamingMessage Ordering & PartitioningHard⏱️ ~2 min

Failure Modes: Retries, Reordering, and Poison Messages

Producer Retry Duplicates: When a publish succeeds but the acknowledgment is lost, the producer retries and the broker receives the message twice. Without idempotent producer semantics (tracking per-producer, per-partition sequence numbers), the duplicate is written again, potentially out of order. Modern Kafka producers use sequence numbers and the broker deduplicates within a window. Consumer Async Processing Violations: A consumer reads message 1 then message 2, dispatches both to a thread pool, and message 2 completes before message 1. If the consumer commits offsets immediately after dispatch rather than after ordered completion, the system has processed messages out of order. The solution is single-flight processing per partition or ordered completion queues. Poison Message Head-of-Line Blocking: A malformed message or one triggering an external service timeout causes the consumer to fail, retry, and lock up the partition. All subsequent messages wait, accumulating lag. After a bounded retry budget (commonly 3-5 retries with exponential backoff), the message must be moved to a Dead Letter Queue (DLQ).
❗ Remember: Cross-system transactions can hold partitions hostage. If database p99 commits take 500ms, the partition processes at most 2 messages/sec, creating massive lag. Microsoft and Amazon document this as the primary cause of unexpected partition stalls.
💡 Key Takeaways
Producer retries without sequence numbers cause duplicates and reordering. Lost acknowledgments trigger retry; broker receives message twice and may interleave with newer messages unless per producer, per partition sequencing is enforced.
Async consumer processing breaks order without serialization. Dispatching messages to thread pools completes them out of sequence; must use single flight processing or ordered completion queues that commit offsets in receive order.
Poison messages block entire partitions via head of line blocking. One malformed message or external timeout stalls all subsequent messages; requires bounded retry (3 to 5 attempts) then DLQ to unblock partition.
Transactional commits for exactly once semantics add latency. Coordinating message offset and database write in single transaction holds partition for database commit time; p99 commit at 500 milliseconds caps throughput at 2 messages per second.
Multi region concurrent writers break global order. Active active publishing from multiple regions interleaves events unpredictably; requires region scoped partitions with downstream merge using per key sequence numbers and conflict resolution.
📌 Interview Tips
1Kafka idempotent producer: enables exactly once per partition semantics by tracking producer ID and sequence number per partition, broker deduplicates retries within 5 request window automatically
2LinkedIn event processing: uses serial executor per partition in consumer, processing message N fully (including downstream writes) before reading message N plus 1, accepting lower CPU utilization for ordering guarantee
3Amazon Kinesis consumer: after 3 retries of message failing to write to DynamoDB (malformed payload), logs to DLQ, increments skip metric, and continues to next message to prevent partition stall
← Back to Message Ordering & Partitioning Overview
Failure Modes: Retries, Reordering, and Poison Messages | Message Ordering & Partitioning - System Overflow