Transactional Outbox Pattern for Reliable Publishing

The Dual Write Problem:

The dual write problem occurs when you need to atomically update a database and publish a message: if you write to the database then publish, a crash after the commit but before publishing loses the message; if you publish then write, a crash loses the database update but the message is already sent. The transactional outbox pattern solves this by writing both the domain state change and a "to be published" event record in a single database transaction.
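
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not a production design: it assumes an in-memory SQLite database, a plain Python list as a hypothetical stand-in for the message queue, and an exception (`CrashBeforePublish`, an illustrative name) in place of a real process crash.

```python
import sqlite3


class CrashBeforePublish(Exception):
    """Stands in for a process crash between the commit and the publish."""


def place_order_dual_write(conn, broker, order_id, crash=False):
    # Step 1: commit the domain change on its own.
    with conn:
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    # Step 2: publish. A crash here leaves the order committed but unannounced.
    if crash:
        raise CrashBeforePublish(order_id)
    broker.append({"event_type": "OrderPlaced", "order_id": order_id})


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
broker = []  # hypothetical in-memory stand-in for a message queue

try:
    place_order_dual_write(conn, broker, "o-1", crash=True)
except CrashBeforePublish:
    pass  # the "process" died after the commit

orders_in_db = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
lost_events = orders_in_db - len(broker)  # committed orders with no event
```

After the simulated crash, the order row exists but the broker never received an event. The outbox pattern is designed to close exactly this gap.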

Three-Component Implementation:

1. Application transaction: Write to both domain tables (orders, accounts) and an outbox table with columns id, aggregate_id, event_type, payload, created_at, published_at.
2. Relay process: Poll the outbox table for unpublished rows, publish each to the message queue, then set published_at.
3. Idempotency handling: If the relay crashes after publishing but before marking the row, it will publish the same event again on retry; queue-level deduplication or consumer idempotency must absorb the duplicate.
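
The three components can be sketched end to end as follows. This is an illustrative example under stated assumptions: SQLite stands in for the application database, the `publish` callable is a hypothetical placeholder for a real queue client, and the function names (`place_order`, `relay_once`) are invented for this sketch. Only the outbox column names come from the description above.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def now_iso():
    # Fixed-width UTC timestamps sort correctly as text.
    return datetime.now(timezone.utc).isoformat(timespec="microseconds")


def init_schema(conn):
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id TEXT PRIMARY KEY,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL,
            published_at TEXT
        );
    """)


def place_order(conn, order_id):
    # One transaction: the order row and the outbox row commit together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload, created_at)"
            " VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "OrderPlaced",
             json.dumps({"order_id": order_id}), now_iso()),
        )


def relay_once(conn, publish, batch_size=100):
    """Publish one batch of unpublished outbox rows; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox"
        " WHERE published_at IS NULL ORDER BY created_at, rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish({"id": event_id, "type": event_type, "payload": json.loads(payload)})
        # A crash between publish() and this UPDATE causes a re-publish on the next poll.
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = ? WHERE id = ?",
                (now_iso(), event_id),
            )
    return len(rows)


conn = sqlite3.connect(":memory:")
init_schema(conn)
place_order(conn, "o-1")
place_order(conn, "o-2")

sent = []  # hypothetical stand-in for the message queue
first_poll = relay_once(conn, sent.append)
second_poll = relay_once(conn, sent.append)  # nothing left to publish
```

In a real deployment the relay would call `relay_once` in a loop with a sleep between polls. With several relay instances, some form of row claiming (for example `SELECT ... FOR UPDATE SKIP LOCKED` on PostgreSQL) would likely be needed so that two relays do not pick up the same rows.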

Trade-offs:

This pattern provides guaranteed at-least-once delivery: as long as the database transaction commits, the event will eventually be published, even if the relay crashes repeatedly. The trade-off is added latency (events are published milliseconds to seconds after commit, depending on relay polling interval) and operational complexity (the relay is a new component to monitor and scale). Uber and DoorDash use variants of this pattern extensively.
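
The latency cost of polling can be estimated with simple arithmetic. The sketch below assumes that commits arrive uniformly at random within a polling interval and that the relay keeps up with the backlog. Under those assumptions a row waits half an interval on average and a full interval at worst. The function name and the `publish_time_s` parameter are illustrative.

```python
def polling_delay_seconds(poll_interval_s, publish_time_s=0.0):
    """Return (mean, worst_case) delay between commit and publish, in seconds.

    Assumes commits are uniformly distributed within the polling interval
    and that the relay drains every pending row on each poll.
    """
    mean = poll_interval_s / 2 + publish_time_s
    worst = poll_interval_s + publish_time_s
    return mean, worst


fast = polling_delay_seconds(0.1)  # poll every 100 ms
slow = polling_delay_seconds(5.0)  # poll every 5 s
```

Here `fast` is `(0.05, 0.1)` and `slow` is `(2.5, 5.0)`. If the relay falls behind, the time a row spends queued in the backlog adds to these figures.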

Alternative: Change Data Capture:

An alternative is Change Data Capture (CDC) where a tool like Debezium tails the database transaction log and publishes changes as events. This eliminates the outbox table but couples your event stream to database internals, requires operating the CDC pipeline, and makes schema evolution harder since events are shaped by table structure rather than domain logic.

💡 Key Takeaways

✓ Solves dual write atomicity: the database transaction commits both domain state and outbox record together; even if the relay crashes, the event will eventually be published since it's durably stored in the outbox table

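✓ Relay polling interval determines publishing latency: a row waits half the interval on average and a full interval at worst, so polling every 100 milliseconds adds ~50 ms average delay and polling every 5 seconds adds ~2.5 seconds; tune the interval based on latency requirements versus the query load that polling puts on the database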

✓ Outbox table adds write amplification: every business transaction writes an extra row; at 10,000 transactions per second you're writing 10,000 outbox rows per second; plan for outbox table growth and an archival or purge strategy for published rows

✓ Idempotency still required downstream: if the relay publishes but crashes before marking published_at, it will republish on the next poll; consumers must deduplicate using event ID or business key

✓ Operational complexity trade-off: adds a relay component to deploy, scale, and monitor; relay failures block event publishing and create a backlog in the outbox table; change data capture tools remove the polling relay but couple the event stream to database internals and bring a CDC pipeline of their own to operate
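
The consumer-side deduplication mentioned in the takeaways can be sketched as below. This is one possible approach under stated assumptions: the consumer keeps a `processed_events` table (a hypothetical name) in the same database as its side effects, so recording the event ID and applying the effect commit atomically. `handle_once` and `send_notification` are illustrative names, and SQLite stands in for the consumer's database.

```python
import sqlite3


def handle_once(conn, event, apply_effect):
    """Apply the event's side effect only the first time its id is seen.

    Returns True if the effect was applied, False if the event was a duplicate.
    """
    with conn:  # the dedup record and the side effect commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cur.rowcount == 0:
            return False  # id already recorded: duplicate delivery, skip
        apply_effect(conn, event)
    return True


def send_notification(conn, event):
    # Hypothetical side effect: record that a notification was sent.
    conn.execute(
        "INSERT INTO notifications (order_id) VALUES (?)",
        (event["payload"]["order_id"],),
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE notifications (order_id TEXT NOT NULL);
""")

event = {"id": "evt-1", "payload": {"order_id": "o-1"}}
# Simulate the relay delivering the same event twice.
results = [handle_once(conn, event, send_notification) for _ in range(2)]
notification_count = conn.execute(
    "SELECT COUNT(*) FROM notifications"
).fetchone()[0]
```

The second delivery is detected and skipped, so only one notification row exists. If the side effect lives outside the database (for example an email send), this transactional trick no longer applies, and deduplication would have to rely on an idempotency key that the external system honors.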

📌 Interview Tips

1. Uber's event-driven architecture: services write domain changes and events to local Postgres outbox tables; dedicated relay workers poll outboxes every 500 ms and publish to Kafka, so events are still delivered after a network partition heals and riders and drivers eventually see consistent state

2. DoorDash order service: when an order is placed, the service writes to the orders table and an outbox table in one transaction; a separate publisher service queries the outbox, sends events to SQS, and marks them published; this guarantees downstream notification and fulfillment services never miss orders
