
Failure Modes and Edge Cases in Delivery Guarantees

Understanding where delivery guarantees break down is critical for building robust systems. The most common failure is duplicate side effects: a consumer crashes after applying a side effect but before acknowledging the message or committing the read position. The broker redelivers the message and the side effect is applied a second time. Mitigation requires idempotent side effects, built on unique operation identifiers and conditional writes, or a transactional outbox that unifies the state change and the acknowledgment.

Lost dedup state lets duplicates slip through the exactly-once processing barrier. Dedup caches with a time to live (TTL) can evict entries before all duplicates arrive, for example during long pauses or reordering, and a retry that arrives after eviction is treated as a new message. Compaction in log-based stores can also remove historical markers. Mitigate by sizing TTLs to cover the maximum expected replay window and by using durable dedup for critical operations. For a system processing 100,000 messages per second with p99 retry delays of 2 minutes, a 5 minute TTL provides only 3 minutes of safety margin; extend it to 10 or 15 minutes or use persistent dedup.

Visibility or acknowledgment timeouts that are tuned too tight cause redelivery storms. If processing often exceeds the visibility deadline, messages are redelivered mid-processing, which magnifies duplicates under load. Size timeouts to p99 processing time with substantial headroom and implement heartbeat or deadline extension mechanisms.

Transaction coordinator failures in two-phase commit sinks leave in-flight transactions in an uncertain state, increasing latency and causing retries or aborts. Ensure bounded transaction timeouts and idempotent commit and abort actions.
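The TTL eviction failure can be made concrete with a small simulation. The sketch below is illustrative only: `TtlDedupCache`, `simulate`, and the operation identifier `order-42:charge` are hypothetical names, and the clock is injected so the timing is deterministic rather than taken from a real broker.

```python
import time
from typing import Callable, Dict


class TtlDedupCache:
    """In-memory dedup cache; a marker expires ttl_seconds after it is recorded."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._seen: Dict[str, float] = {}  # operation id -> expiry time

    def first_time(self, op_id: str) -> bool:
        """Return True if op_id is not currently remembered, then remember it."""
        now = self.clock()
        # Evict expired markers. This is the step that lets a late retry through.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        if op_id in self._seen:
            return False
        self._seen[op_id] = now + self.ttl_seconds
        return True


def simulate(ttl_seconds: float, retry_delay_seconds: float) -> int:
    """Count how often one side effect is applied for a message plus one retry."""
    fake_now = [0.0]
    cache = TtlDedupCache(ttl_seconds, clock=lambda: fake_now[0])
    applied = 0
    for arrival in (0.0, retry_delay_seconds):
        fake_now[0] = arrival
        if cache.first_time("order-42:charge"):
            applied += 1  # the side effect would run here
    return applied
```

With a 5 minute TTL and a retry after 2 minutes, `simulate(300, 120)` applies the side effect once. With a 2 minute TTL and a retry after 3 minutes, `simulate(120, 180)` applies it twice, because the marker has already been evicted when the retry arrives.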
💡 Key Takeaways
A crash after a side effect but before acknowledgment causes redelivery and duplicate application. Mitigate with idempotent side effects (unique operation identifiers, conditional writes) or a transactional outbox.
Dedup caches with a TTL can evict entries before all duplicates arrive. For 100,000 messages per second with a p99 retry delay of 2 minutes, a 5 minute TTL leaves only 3 minutes of margin. Use 10 to 15 minute TTLs or persistent dedup for critical flows.
Visibility timeouts tuned too tight (e.g., p99 processing of 2.5 seconds with a 3 second deadline) cause redelivery storms under load. Size to p99 plus substantial headroom; implement heartbeats or deadline extensions for long processing.
Exactly-once across non-transactional external systems (email, webhooks, third-party APIs) requires idempotency keys in downstream APIs or proxying through an internal gateway that enforces idempotency.
Partial batch commits raise the risk of duplicates: on failure, part of the batch may already be committed when the batch is retried, so the committed records are applied again. Use per-record idempotency inside batches or atomic batch operations with unique constraints.
Clock skew in systems using leases or fencing tokens can cause overlapping ownership and double processing. Prefer monotonic counters and compare-and-set over time-based leases; validate tokens on each write.
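The last takeaway, monotonic fencing tokens validated on every write, might look roughly like the sketch below. `FencedStore` is a hypothetical class that combines token issuance and storage in one process for brevity; in a real deployment the token would typically come from a separate lock or coordination service, and the check would be a conditional write in the storage system.

```python
import threading
from typing import Dict


class FencedStore:
    """Store that rejects writes carrying a token older than the newest one used."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next_token = 0
        self._highest_seen = 0
        self.data: Dict[str, str] = {}

    def acquire(self) -> int:
        """Hand out a strictly increasing token; no wall clock is involved."""
        with self._lock:
            self._next_token += 1
            return self._next_token

    def write(self, token: int, key: str, value: str) -> bool:
        """Compare-and-set: apply the write only if the token is not stale."""
        with self._lock:
            if token < self._highest_seen:
                return False  # stale owner, e.g. one that paused past its lease
            self._highest_seen = token
            self.data[key] = value
            return True


store = FencedStore()
old_owner = store.acquire()  # first owner, later stalls
new_owner = store.acquire()  # ownership moves to a second worker
accepted = store.write(new_owner, "shard-7", "from-new-owner")
rejected = store.write(old_owner, "shard-7", "from-old-owner")
```

Here `accepted` is `True` and `rejected` is `False`: the stalled owner's write is refused regardless of what either machine's clock says, which is the property time-based leases alone cannot give.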
📌 Examples
An order fulfillment consumer charges a credit card (external API call) and then crashes before acknowledging the message. The broker redelivers, causing a duplicate charge. Solution: pass an idempotency key to the payment API derived from order_id and action_type so retries return the original charge.
A stream processor with a 2 minute dedup TTL experiences a 3 minute network partition. When connectivity restores, replay delivers messages already processed 3 minutes ago. The dedup cache has evicted those identifiers, causing duplicate writes to the database. Solution: extend TTL to 10 minutes or use a persistent dedup table.
A Lambda function consuming Kinesis records has a 5 second timeout but a p99 processing time of 4.8 seconds. Under load, the slowest invocations (up to about 1 percent) time out and are retried, causing duplicate writes. Solution: increase the timeout to 10 seconds and use conditional writes in DynamoDB to make side effects idempotent.
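The idempotency key approach from the first example can be sketched as follows. `FakePaymentApi` is a hypothetical in-memory stand-in, not any real payment provider's client, and `idempotency_key` and `handle_order_message` are illustrative names; the only detail taken from the example is that the key is derived from the order ID and the action type.

```python
import hashlib
from typing import Dict


def idempotency_key(order_id: str, action_type: str) -> str:
    """Derive a stable key so every retry of the same action maps to the same key."""
    return hashlib.sha256(f"{order_id}:{action_type}".encode("utf-8")).hexdigest()


class FakePaymentApi:
    """Hypothetical stand-in for a payment API that honors idempotency keys."""

    def __init__(self) -> None:
        self._charges: Dict[str, dict] = {}
        self.charge_count = 0

    def charge(self, key: str, amount_cents: int) -> dict:
        # Conditional write: create the charge only if the key is unseen,
        # otherwise return the original result unchanged.
        if key in self._charges:
            return self._charges[key]
        self.charge_count += 1
        charge = {"charge_id": f"ch_{self.charge_count}", "amount_cents": amount_cents}
        self._charges[key] = charge
        return charge


def handle_order_message(api: FakePaymentApi, order_id: str, amount_cents: int) -> dict:
    """Consumer handler; safe to run again if the broker redelivers the message."""
    return api.charge(idempotency_key(order_id, "charge"), amount_cents)


api = FakePaymentApi()
first = handle_order_message(api, "order-42", 1999)   # original delivery
second = handle_order_message(api, "order-42", 1999)  # redelivery after a crash
```

Both calls return the same charge and `api.charge_count` stays at 1, so the redelivery does not bill the customer twice. The same pattern applies to the DynamoDB example, with the key check expressed as a conditional write in the database.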