
Failure Modes and Edge Cases in Idempotency and Retry Systems

Even well designed idempotency and retry systems face subtle failure modes that can break correctness or availability under edge cases.
Retry storms occur when a regional outage or brownout causes thousands of clients to retry simultaneously without jitter, overwhelming the recovering service and extending the outage. Without full jitter and per client retry budgets, retries correlate and arrive in synchronized waves. AWS research demonstrated that full jitter significantly reduces this correlation compared to plain exponential backoff, and circuit breakers provide additional protection by failing fast on persistent faults rather than continuing to retry.
Lost responses create ambiguous outcomes: the client times out after the server has already committed, and a retry can double apply the operation if it is not idempotent. The server must use idempotency keys or state based deduplication so the retry returns the original result rather than re-executing the operation.
Parameter drift on duplicate keys is a critical correctness issue: if a client reuses the same idempotency key with different parameters, such as a different payment amount or recipient, accepting the request would break the idempotency guarantee and risk unintended state changes. Stripe explicitly rejects such requests and returns an error.
Similarly, a mismatch between the key time to live and the retry window can let duplicates slip through. If the idempotency window is shorter than the retry horizon (for example, a client retries hours later but the server evicted the key after one hour), the server treats the retry as a new request and double applies. Align key time to live with the maximum retry horizon, including delayed network replays and client side queueing.
Concurrency races are another common pitfall: two processes handling the same idempotency key concurrently must be serialized by a uniqueness constraint or an atomic compare and swap. Otherwise, both can pass the not found check and double apply.
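
Lost responses, parameter drift, and concurrency races can all be handled by keeping one deduplication record per key. The following is a minimal sketch, assuming a SQLite table as the deduplication store; the table name `idempotency` and the names `handle`, `fingerprint`, and `KeyReuseError` are hypothetical, chosen for illustration, and do not reflect any particular provider's API. The primary key constraint serializes competing inserts, and a stored hash of the parameters detects drift.

```python
import hashlib
import json
import sqlite3


class KeyReuseError(Exception):
    """Raised when an idempotency key is reused with different parameters."""


def fingerprint(params: dict) -> str:
    # Canonical JSON so that dict ordering does not change the hash.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency ("
        " idem_key TEXT PRIMARY KEY,"   # uniqueness constraint serializes duplicates
        " fingerprint TEXT NOT NULL,"
        " result TEXT)"
    )
    conn.commit()
    return conn


def handle(conn, idem_key, params, operation):
    """Run `operation(params)` at most once per idem_key and return its result."""
    fp = fingerprint(params)
    try:
        # Claim the key first; a duplicate fails here instead of passing
        # a separate "not found" check.
        conn.execute(
            "INSERT INTO idempotency (idem_key, fingerprint) VALUES (?, ?)",
            (idem_key, fp),
        )
    except sqlite3.IntegrityError:
        conn.rollback()
        stored_fp, stored_result = conn.execute(
            "SELECT fingerprint, result FROM idempotency WHERE idem_key = ?",
            (idem_key,),
        ).fetchone()
        if stored_fp != fp:
            raise KeyReuseError(f"key {idem_key!r} reused with different parameters")
        return json.loads(stored_result)  # replay the original result

    try:
        result = operation(params)
        conn.execute(
            "UPDATE idempotency SET result = ? WHERE idem_key = ?",
            (json.dumps(result), idem_key),
        )
        conn.commit()  # key claim and result become visible together
    except Exception:
        conn.rollback()  # a failed operation releases the key for a later retry
        raise
    return result
```

In this sketch the key claim and the stored result commit in one transaction, so a retry either sees the finished result or sees nothing. A production store would also need an expiry policy that matches the retry horizon discussed above.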
Partial side effects pose a challenge even with idempotent writes: downstream effects such as emails, webhooks, or third party API calls can duplicate under retries. Use an outbox table or event log with a deduplication identifier per effect, written in the same transaction as the state change. A separate dispatcher then reads the outbox and delivers each effect, relying on idempotent receivers or sender side deduplication so that every effect is applied exactly once.
In stream processing, consumer restarts and orchestrator retries cause replays; if the deduplication control table write and the data write are not in the same transaction, one can commit without the other, causing either lost progress or duplicates.
Cross region deployments with active active traffic face clock skew and keyspace partitioning challenges. Idempotency windows that rely on timestamps are sensitive to skew, so use monotonic server side times for windowing rather than client times. Ensure the idempotency keyspace and uniqueness guarantees are either global or partitioned by a stable routing key, so that the same token cannot be processed in multiple regions.
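
The outbox approach can be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: it uses SQLite, and the table layout, the `create_order` and `dispatch_pending` functions, and the `email_order_<id>` naming scheme are hypothetical. The `send` callback stands in for whatever delivers the effect.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id TEXT PRIMARY KEY, item TEXT NOT NULL);
        CREATE TABLE outbox (
            effect_id  TEXT PRIMARY KEY,            -- deduplication identifier per effect
            kind       TEXT NOT NULL,
            recipient  TEXT NOT NULL,
            dispatched INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def create_order(conn, order_id, item, email):
    # The order row and its outbox entry commit together or not at all.
    # INSERT OR IGNORE makes a retried request a no-op for both rows.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO orders (order_id, item) VALUES (?, ?)",
            (order_id, item),
        )
        conn.execute(
            "INSERT OR IGNORE INTO outbox (effect_id, kind, recipient) "
            "VALUES (?, 'email', ?)",
            (f"email_order_{order_id}", email),
        )


def dispatch_pending(conn, send):
    """Deliver every undispatched effect; returns how many were sent."""
    rows = conn.execute(
        "SELECT effect_id, recipient FROM outbox WHERE dispatched = 0"
    ).fetchall()
    for effect_id, recipient in rows:
        # A crash between send() and the UPDATE causes a resend, so the
        # receiver should also deduplicate on effect_id.
        send(effect_id, recipient)
        with conn:
            conn.execute(
                "UPDATE outbox SET dispatched = 1 WHERE effect_id = ?",
                (effect_id,),
            )
    return len(rows)
```

Calling `create_order` twice with the same `order_id` leaves a single outbox row, so the dispatcher sends one email regardless of how many times the request was retried.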
💡 Key Takeaways
Retry storms from synchronized client retries during outages can overwhelm recovering services; full jitter and circuit breakers reduce correlation and provide fail fast behavior.
Lost responses after server commit create ambiguous outcomes; idempotency keys ensure retries return the original result rather than double applying the operation.
Parameter drift where the same idempotency key arrives with different parameters must be rejected; Stripe returns an error rather than accepting mismatched requests.
Time to live and retry horizon mismatch allows duplicates when clients retry after key eviction; align deduplication windows with maximum expected retry delays including queued retries.
Concurrency races require uniqueness constraints or atomic compare and swap to serialize duplicate key handling; both threads passing a not found check leads to double execution.
Partial side effects such as emails and webhooks can duplicate even with idempotent writes; use an outbox table with effect level deduplication identifiers so that each effect is dispatched from the outbox exactly once.
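
As a rough illustration of the first takeaway, full jitter draws each retry delay uniformly between zero and an exponentially growing ceiling. The sketch below assumes a base of 1 second and a cap of 30 seconds; those values, the `TransientError` class, and the function names are hypothetical. The fixed `max_attempts` acts as a very simple per call retry budget.

```python
import random
import time


class TransientError(Exception):
    """A failure that is worth retrying (hypothetical marker type)."""


def full_jitter_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
    return random.uniform(0.0, ceiling)        # anywhere in [0, ceiling]


def retry(operation, max_attempts=4, base=1.0, cap=30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempt budget is spent."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: fail instead of retrying forever
            sleep(full_jitter_delay(attempt, base, cap))
```

Because every client draws its own random delay, clients that failed at the same moment no longer retry at the same moment.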
📌 Examples
Retry storm: A regional load balancer failure causes 10,000 clients to time out simultaneously. Without jitter, all of them retry at T+1s, T+2s, and T+4s in synchronized waves, overwhelming the recovering backend. With full jitter, each client picks a random delay from 0 to 1s, then 0 to 2s, then 0 to 4s, so retries spread across each interval.
Parameter drift: Client sends payment request with Idempotency-Key: abc123 and amount: 100. Request times out, client mistakenly retries with same key but amount: 200. Server detects parameter mismatch, rejects request, returns error to prevent incorrect charge.
Concurrency race: Two API servers receive the same idempotency key simultaneously. Both query deduplication store, find no existing record, and attempt to insert. One succeeds, the other fails on unique constraint, queries again, and returns the result from the first server.
Partial side effect: Order creation writes order record idempotently but also triggers email send. Retry after timeout duplicates the email. Fix: write order and outbox entry {effect_id: email_order_123, type: email, to: [email protected]} in one transaction; separate dispatcher deduplicates by effect_id.
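Stream processing partial commit: Consumer updates trip status and attempts to insert event_id into processed_events in the same transaction, but the database fails before commit, so neither write persists and the Kafka offset is not advanced. On replay, event_id is missing from processed_events, so the update is re-applied, the transaction commits, and the offset then advances. If the offset had advanced before the commit, the event would not be replayed and the update would be lost.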
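
The stream processing example can be sketched with the deduplication write and the data write in one transaction. This is a minimal sketch assuming SQLite in place of the real database and a plain list in place of a Kafka partition; the `trips` schema, the `updates` counter, and the `apply_event` and `consume` functions are hypothetical, and the returned offset only stands in for a real offset commit.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE trips (
            trip_id TEXT PRIMARY KEY,
            status  TEXT NOT NULL,
            updates INTEGER NOT NULL DEFAULT 0   -- counts applied events
        );
        CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
        """
    )
    return conn


def apply_event(conn, event):
    """Apply one event; returns False if it was already processed."""
    try:
        with conn:  # commits on success, rolls back both writes on any error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            conn.execute(
                "INSERT OR IGNORE INTO trips (trip_id, status) VALUES (?, ?)",
                (event["trip_id"], event["status"]),
            )
            conn.execute(
                "UPDATE trips SET status = ?, updates = updates + 1 "
                "WHERE trip_id = ?",
                (event["status"], event["trip_id"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # event_id already recorded: a replayed duplicate


def consume(conn, events, start_offset=0):
    """Process events from start_offset; returns the next offset to commit."""
    offset = start_offset
    for event in events[start_offset:]:
        apply_event(conn, event)
        offset += 1  # advance only after the database transaction has finished
    return offset
```

If the consumer restarts and calls `consume` again from offset 0, the replayed events hit the primary key on `processed_events` and are skipped, so the trip is updated once per event.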