Distributed Systems Primitives • Distributed Transactions (2PC, Saga)Hard⏱️ ~3 min
Failure Modes and Edge Cases in Distributed Transactions
Both 2PC and Saga patterns face critical failure modes that can cause extended unavailability, data inconsistency, or operational burden if not carefully handled. In 2PC, the most severe failure is coordinator crash after participants have prepared but before the commit decision is issued. Participants enter an "in-doubt" state, holding locks and blocking other transactions indefinitely until the coordinator recovers or timeout policies fire. This can cascade: a single stuck transaction holds locks on hot records, causing queuing and timeouts across the system. Tail latency amplification is another concern: with multiple participants across zones or regions, the 99th percentile commit latency is bounded by the slowest participant's disk flush and network round-trip, often reaching hundreds of milliseconds. Network partitions exacerbate this: if the coordinator cannot reach all participants, the decision may be delayed indefinitely, and heuristic commits or aborts (where participants independently decide) risk permanent divergence between datastores.
Sagas face a different set of challenges centered on compensation and concurrency. Non-compensable side effects such as emails sent, external webhooks fired, or physical shipments initiated cannot be truly undone; compensations can only perform logical counter-actions like refunds or cancellations. Compensation failures are particularly dangerous: if a compensating step fails repeatedly or succeeds partially (for example, refund succeeded but inventory restock failed), the saga can only converge if compensations are idempotent and infinitely retryable; otherwise, manual intervention is required. Concurrency anomalies arise because saga steps are not isolated: two concurrent sagas might both see the last available seat and proceed to reserve it (lost update). Mitigations include semantic locks (marking entities as "pending"), version checks with optimistic concurrency control, or centralizing contention heavy invariants into a single writer pivot service.
Message delivery issues compound saga complexity: duplicate messages, out of order delivery, and poison messages require idempotent handlers, deduplication windows (typically 24 to 72 hours of message identifiers stored), and dead letter queues. Orchestrator unavailability can be a single point of failure if the orchestrator is not replicated; upon failover, the new instance must resume from durable state and safely re-drive steps with at-least-once semantics. Zombie or stranded reservations (inventory reserved but never purchased due to timeouts or retries) require time to live expirations and cleanup workflows to prevent resource exhaustion.
💡 Key Takeaways
•2PC in-doubt state: coordinator crash after prepare leaves participants holding locks indefinitely until recovery, causing cascading timeouts and availability degradation across the system.
•2PC tail latency: with N participants across zones or regions, 99th percentile commit latency often reaches hundreds of milliseconds due to slowest participant's disk flush (0.5 to 5 milliseconds) plus network round-trip times (50 to 150 milliseconds cross region).
•Saga non-compensable side effects: emails, webhooks, and shipments cannot be undone; only logical counter-actions (refunds, cancellations) are possible, requiring business process design to tolerate visible intermediate states.
•Saga concurrency anomalies: lost updates occur when two sagas read the same state (last available seat) and both proceed; mitigate with version checks, semantic locks, or centralized pivot services to serialize decisions.
•Compensation failure scenarios: partial rollback (refund succeeded, inventory restock failed) requires idempotent, infinitely retryable compensations or manual intervention, making operational complexity a key consideration.
•Message delivery challenges: at-least-once delivery requires deduplication windows (24 to 72 hours of message identifiers), idempotent handlers, and dead letter queues for poison messages to prevent lost or duplicate processing.
📌 Examples
2PC coordinator crash: participants prepared to commit a payment transfer, but coordinator crashes before issuing commit. Participants hold locks on both source and destination accounts for minutes until timeout or manual recovery, blocking all other transfers involving those accounts.
Saga lost update: two customers simultaneously book the last concert ticket; both sagas read available inventory as 1, both reserve successfully, but only one payment should be authorized. Without version checks or semantic locks, both sagas complete and oversell the inventory.
Saga compensation failure: order saga cancels after payment authorized and inventory reserved; payment refund compensating transaction succeeds, but inventory restock compensation fails due to service outage. Without infinite retries or manual cleanup, inventory count remains incorrect.
Stranded reservation: customer abandons checkout after inventory reserved; saga orchestrator crashes before timeout fires. Inventory remains reserved indefinitely unless time to live expirations and cleanup workflows release abandoned holds after a business defined window (e.g., 15 minutes).