Distributed Systems PrimitivesDistributed Transactions (2PC, Saga)Medium⏱️ ~3 min

Saga Pattern: Eventual Consistency Through Compensation

The Saga pattern breaks a business transaction into a sequence of independent local transactions, each executing and committing within its own service boundary. If a step fails, the saga executes compensating transactions to semantically undo the effects of previously completed steps, achieving eventual consistency rather than immediate atomicity. This design trades the strong guarantees of 2PC for higher availability, better fault tolerance, and improved throughput because no distributed locks are held and services can proceed independently. Sagas are particularly well suited for long running processes spanning seconds to days (such as order fulfillment, travel bookings, or approval workflows) and for cross-team, cross-service boundaries where tight coupling would be prohibitive. Sagas can be coordinated via orchestration or choreography. Orchestration uses a central workflow engine (such as Uber's Cadence/Temporal or Netflix Conductor) that persists saga state, issues commands to services, awaits replies, and schedules retries and compensations. The orchestrator adds tens to hundreds of milliseconds per state transition in managed engines, which is acceptable for business processes but not for low latency paths. Choreography decentralizes coordination: services publish domain events and react to others' events using correlation identifiers to track saga progress. Choreography avoids a single point of failure but distributes complexity across services and requires careful design to prevent cyclic dependencies. Uber's Cadence (now Temporal) orchestrates millions of workflows daily in production, with orchestration overhead in the low milliseconds for scheduling decisions, while end to end latency is dominated by the services being called. Netflix Conductor handles millions of workflow executions per day across over 100 services, making compensation, retries, and timeouts explicit. AWS Step Functions offers managed orchestration with standard workflows charging approximately $0.025 per 1,000 state transitions and adding tens to hundreds of milliseconds per transition in practice; express workflows optimize for higher throughput and lower latency at different pricing.
💡 Key Takeaways
No distributed locks means higher concurrency and availability, but intermediate states are externally visible and developers must handle anomalies such as dirty reads and lost updates via idempotency and version checks.
Orchestration (central workflow engine) adds tens to hundreds of milliseconds per state transition in managed engines; acceptable for business processes but unsuitable for low latency hot paths.
Choreography (event driven coordination) distributes complexity and avoids single points of failure, but requires robust correlation, deduplication, and bounded context design to prevent cyclic dependencies.
Uber Cadence/Temporal orchestrates millions of workflows daily with milliseconds of orchestration overhead; end to end latency is dominated by service call times, not the orchestration engine.
AWS Step Functions standard workflows cost approximately $0.025 per 1,000 state transitions; a 10 step saga incurs roughly $0.00025 per execution in state transition fees before service call costs.
Compensations must be idempotent and semantically reverse prior effects; non-compensable side effects (emails sent, external webhooks fired) can only be logically countered, not fully undone.
📌 Examples
Order fulfillment saga: reserve inventory, authorize payment, schedule shipment. If shipment scheduling fails, compensate by releasing payment authorization and restoring inventory reservation. Each step commits locally; intermediate state (payment authorized but not captured) is visible until completion.
Netflix Conductor orchestrating video encoding workflow: transcode source, generate thumbnails, update metadata, publish content. If metadata update fails after transcoding completes, compensation deletes generated artifacts and marks job as failed for retry.
Travel booking choreography: flight service publishes FlightReserved event, hotel service listens and reserves room, publishes HotelReserved event. If car rental fails, hotel service listens to CarRentalFailed event and publishes HotelCancelled, triggering flight service to cancel reservation via FlightCancelled event.
← Back to Distributed Transactions (2PC, Saga) Overview