Message Queues & Streaming • Message Ordering & PartitioningHard⏱️ ~2 min
Failure Modes: Retries, Reordering, and Poison Messages
Producer retries can introduce duplicates and break ordering unless brokers enforce per sender sequence numbers. When a publish succeeds but the acknowledgment is lost, the producer retries and the broker receives the message twice. Without idempotent producer semantics (tracking per producer, per partition sequence numbers), the duplicate is written again, potentially out of order if interleaved with newer messages. Modern Kafka producers use sequence numbers and the broker deduplicates within a window, but legacy systems or misconfigured clients still hit this.
Consumer side async processing violates ordering if not carefully controlled. A consumer reads message 1 then message 2 from a partition, dispatches both to a thread pool for parallel processing, and message 2 completes before message 1. If the consumer commits offsets immediately after dispatch rather than after ordered completion, the system has processed messages out of order. The solution is single flight processing per partition (process message N fully before starting N plus 1) or ordered completion queues that buffer results and commit them in sequence.
Poison messages create head of line blocking that stalls entire partitions. A malformed message or one triggering an external service timeout causes the consumer to fail, retry, fail again, locking up the partition. All subsequent messages for that partition wait, accumulating lag and violating Service Level Agreements (SLAs). After some bounded retry budget (commonly 3 to 5 retries with exponential backoff), the message must be moved to a Dead Letter Queue (DLQ) and processing must continue, or the partition remains blocked indefinitely.
Cross system transactions for exactly once semantics can hold partitions hostage. Committing message offset and database write in a single transaction ensures exactly once but adds the database commit latency to every message. If the database is slow (p99 commits at 500 milliseconds), the partition processes at most 2 messages per second, creating massive lag. Microsoft and Amazon document that transactional commit latencies are the primary cause of unexpected partition stalls in production.
💡 Key Takeaways
•Producer retries without sequence numbers cause duplicates and reordering. Lost acknowledgments trigger retry; broker receives message twice and may interleave with newer messages unless per producer, per partition sequencing is enforced.
•Async consumer processing breaks order without serialization. Dispatching messages to thread pools completes them out of sequence; must use single flight processing or ordered completion queues that commit offsets in receive order.
•Poison messages block entire partitions via head of line blocking. One malformed message or external timeout stalls all subsequent messages; requires bounded retry (3 to 5 attempts) then DLQ to unblock partition.
•Transactional commits for exactly once semantics add latency. Coordinating message offset and database write in single transaction holds partition for database commit time; p99 commit at 500 milliseconds caps throughput at 2 messages per second.
•Multi region concurrent writers break global order. Active active publishing from multiple regions interleaves events unpredictably; requires region scoped partitions with downstream merge using per key sequence numbers and conflict resolution.
📌 Examples
Kafka idempotent producer: enables exactly once per partition semantics by tracking producer ID and sequence number per partition, broker deduplicates retries within 5 request window automatically
LinkedIn event processing: uses serial executor per partition in consumer, processing message N fully (including downstream writes) before reading message N plus 1, accepting lower CPU utilization for ordering guarantee
Amazon Kinesis consumer: after 3 retries of message failing to write to DynamoDB (malformed payload), logs to DLQ, increments skip metric, and continues to next message to prevent partition stall