What is Idempotency and Why Distributed Systems Require It
Idempotency ensures that executing the same operation multiple times produces the same end state, and ideally the same response, as executing it once. This property is the foundation for safe retries in distributed systems, where networks are unreliable, responses can be lost, timeouts are common, and clients may retry without knowing whether the original operation succeeded. In practice, a system that must not lose requests can guarantee only at-least-once delivery, not exactly-once: under realistic failure assumptions, exactly-once delivery at the API boundary is not achievable. You get exactly-once functional behavior by combining at-least-once delivery with idempotent operations and deduplication mechanisms.
Without idempotency, retries become dangerous. Consider a payment API: if a client times out after 5 seconds but the server successfully charged the card in 3 seconds, a naive retry would double charge the customer. Similarly, an order placement retry could create duplicate orders. The core challenge is that network timeouts create ambiguous outcomes where the client cannot distinguish between "the server never received my request," "the server processed it but the response was lost," and "the server is still processing it." Idempotency allows the client to safely retry because repeating the operation either has no additional effect or returns the same result as the original execution.
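The payment scenario above can be made concrete with a small sketch. It assumes a toy in-memory service in which the client sends a unique idempotency key with each logical operation; `PaymentService`, `charge`, and the field names are hypothetical and do not reflect any real provider's API.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class PaymentService:
    """Toy in-memory payment service; every name here is hypothetical."""
    # idempotency key -> response stored by the first execution
    _responses: dict = field(default_factory=dict)
    # side effects actually performed (one entry per real charge)
    charges: list = field(default_factory=list)

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> dict:
        # Replay: this key was already processed, so return the saved
        # response without charging the card again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]

        # First time this key is seen: perform the side effect once
        # and remember the response for future replays.
        result = {
            "charge_id": str(uuid.uuid4()),
            "customer": customer,
            "amount_cents": amount_cents,
        }
        self.charges.append(result)
        self._responses[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry

first = service.charge(key, "cust_123", 5000)
retry = service.charge(key, "cust_123", 5000)  # client timed out and retried

print(first == retry)        # True: the retry gets the original response
print(len(service.charges))  # 1: the card was charged only once
```

This sketch is single-threaded. A production service would need the check and the write to happen atomically (for example, through a unique constraint in a database) so that two concurrent requests with the same key cannot both pass the check.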
Idempotency and retries are inseparable in production systems. Retries without idempotency risk duplicate payments, orders, or resource creation. Idempotency without retries leaves users exposed to transient failures, degrading the experience during routine network hiccups or service brownouts. Together, they form a resilience pattern that masks transient failures while maintaining correctness guarantees.
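As an illustration of the two halves working together, the sketch below pairs a client-side retry loop with a server that deduplicates on an idempotency key. The server simulates the "processed, but the response was lost" case. All class and function names (`FlakyOrderService`, `place_order_with_retries`, `LostResponse`) are invented for this example.

```python
import uuid


class LostResponse(Exception):
    """Simulates a timeout: the server did the work but the reply never arrived."""


class FlakyOrderService:
    """Hypothetical service that records every order but drops the first N responses."""

    def __init__(self, responses_to_drop: int):
        self._drop = responses_to_drop
        self._orders = {}  # idempotency key -> order

    def place_order(self, idempotency_key: str, item: str) -> dict:
        # Deduplicate on the key: only the first request creates an order.
        if idempotency_key not in self._orders:
            self._orders[idempotency_key] = {
                "order_id": len(self._orders) + 1,
                "item": item,
            }
        # Simulate losing the response after the order was recorded.
        if self._drop > 0:
            self._drop -= 1
            raise LostResponse("response lost after the order was recorded")
        return self._orders[idempotency_key]

    def order_count(self) -> int:
        return len(self._orders)


def place_order_with_retries(service, item: str, max_attempts: int = 3) -> dict:
    # One key per logical operation, NOT per attempt, so retries deduplicate.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return service.place_order(key, item)
        except LostResponse:
            if attempt == max_attempts:
                raise
            # A real client would usually wait here before the next attempt.


service = FlakyOrderService(responses_to_drop=2)
order = place_order_with_retries(service, "book")
print(order["order_id"], service.order_count())  # 1 1
```

The client made three attempts, yet exactly one order exists. Generating a fresh key inside the loop would defeat the deduplication and create three orders.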
💡 Key Takeaways
•At-least-once delivery is the achievable guarantee in distributed systems; exactly-once behavior requires combining at-least-once delivery with idempotent operations and deduplication.
•Network timeouts create ambiguous outcomes where clients cannot determine if the server processed their request, making retries without idempotency unsafe.
•With a transient failure probability of 0.2 per call and independent failures, a single attempt succeeds 80% of the time; adding 2 retries (3 total attempts) raises the success probability to 1 - 0.2^3 = 99.2%, and 4 retries (5 total attempts) to approximately 99.97%.
•Idempotency prevents duplicate side effects such as double charges in payment systems or duplicate orders in ecommerce when clients retry after timeouts.
•Retries without idempotency risk double application; idempotency without retries exposes users to transient network and service failures that could otherwise be masked.
•The combination enables safe automatic recovery from transient failures including connection resets, packet loss, and temporary service overload that typically resolve within milliseconds to seconds.
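The retry figures in the takeaways follow from a simple formula: if each attempt fails with probability p and failures are independent, all n attempts fail with probability p^n, so at least one succeeds with probability 1 - p^n. A short check, where `success_probability` is a helper name chosen for this example:

```python
def success_probability(failure_prob: float, retries: int) -> float:
    """Probability that at least one attempt succeeds, assuming independent failures."""
    attempts = retries + 1  # the original call plus the retries
    return 1 - failure_prob ** attempts


for retries in (0, 2, 4):
    print(retries, f"{success_probability(0.2, retries):.2%}")
# 0 80.00%
# 2 99.20%
# 4 99.97%
```

The independence assumption matters: during a sustained outage, failures are correlated and retries help far less than these numbers suggest.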
📌 Examples
A Stripe payment client times out after 5 seconds, but the charge succeeded in 3 seconds and only the response was lost. The client retries with the same idempotency key, and Stripe returns the original charge object rather than creating a duplicate charge.
An Amazon retail client double-clicks the Place Order button due to UI lag. The second request carries the same operation token, and the order service returns the existing order confirmation instead of creating a duplicate order.
An Uber trip dispatch event is replayed due to a Kafka consumer restart. The consumer performs an upsert keyed by trip UUID, ensuring the trip state converges to the same result regardless of how many times the event is processed.
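The upsert pattern in the last example can be sketched with an in-memory SQLite table standing in for the consumer's datastore. The table layout, the event fields, and `handle_trip_event` are assumptions made for illustration, not a description of any company's actual system. SQLite's `INSERT OR REPLACE` is used as a simple full-row upsert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips ("
    " trip_id TEXT PRIMARY KEY,"  # the trip UUID is the deduplication key
    " status TEXT NOT NULL,"
    " driver_id TEXT NOT NULL)"
)


def handle_trip_event(conn: sqlite3.Connection, event: dict) -> None:
    # Writing the full row keyed by trip_id means that processing the same
    # event again overwrites the row with identical values.
    conn.execute(
        "INSERT OR REPLACE INTO trips (trip_id, status, driver_id) VALUES (?, ?, ?)",
        (event["trip_id"], event["status"], event["driver_id"]),
    )
    conn.commit()


event = {
    "trip_id": "0b9f6c1e-8f1d-4c57-9a55-2d1c7a3e4b10",  # made-up UUID
    "status": "DISPATCHED",
    "driver_id": "driver-42",
}

# Simulate the event being delivered three times after a consumer restart.
for _ in range(3):
    handle_trip_event(conn, event)

rows = conn.execute("SELECT trip_id, status, driver_id FROM trips").fetchall()
print(rows)
# [('0b9f6c1e-8f1d-4c57-9a55-2d1c7a3e4b10', 'DISPATCHED', 'driver-42')]
```

Unlike the idempotency-key approach, the handler here stores no record of which messages it has seen. The operation itself is idempotent because it sets state rather than incrementing or appending to it.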