
Failure Modes and Edge Cases

Partial Commit Failures: The most dangerous failure mode in exactly-once systems is a partial commit, where some components of the transaction succeed while others fail. Suppose a stream processor updates its internal state successfully but crashes before publishing the corresponding output events, or publishes outputs but fails before checkpointing the state and offsets. Without coordinated checkpoints and transactions, recovery produces incorrect results: replaying from an old checkpoint might re-emit events that downstream systems already processed, creating duplicates, or it might skip state transitions entirely if state was updated but never checkpointed, causing data loss. This is precisely why production systems implement coordinated two-phase commit protocols or similar mechanisms: the checkpoint and the output commit must succeed or fail atomically, as a single unit. Apache Flink achieves this by holding outputs in a pending transaction during checkpoint barrier propagation and only making them visible once the checkpoint completes successfully.

Side Effects Outside Transaction Boundaries: A subtle but critical limitation is that exactly-once semantics only cover operations within the transactional boundary. External side effects such as sending emails, calling payment gateways, or updating third-party systems usually cannot participate in your streaming engine's transactions. Even if your internal state and output topics have perfect exactly-once guarantees, users might see duplicate emails, or payment gateway logs might show repeated authorization attempts. Many production systems accept effectively-once semantics for such side effects, relying on idempotency at the external system. For example, payment requests include unique transaction IDs that the gateway uses for deduplication: if your system sends the same charge request twice during failure recovery, the gateway processes it once and returns a cached response for the duplicate. This pattern works, but it requires careful design of external integrations.
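To make the coordinated commit concrete, here is a minimal sketch of a Flink job configured so that state checkpoints and output commits succeed or fail together. It assumes the DataStream API and the newer KafkaSink connector (Flink 1.14 or later); the broker address, topic name, and transactional-id prefix are placeholders, and exact package locations vary slightly across Flink versions.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOncePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot state every 60 s. Checkpoint barriers flow through the dataflow
        // so all operators snapshot a consistent view of state and input offsets.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Transactional sink: records are written inside a Kafka transaction that is
        // pre-committed when the checkpoint is taken and committed only after the
        // checkpoint completes, so downstream consumers never observe output from a
        // checkpoint that later fails.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")                  // placeholder address
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("enriched-events")                 // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("exactly-once-demo")       // must be unique per job
                .build();

        // Toy source standing in for a real Kafka or event source.
        env.fromElements("payment-1", "payment-2", "payment-3")
           .map(String::toUpperCase)
           .sinkTo(sink);

        env.execute("exactly-once-pipeline");
    }
}
```

One practical caveat: with EXACTLY_ONCE delivery, the Kafka transaction timeout should comfortably exceed the checkpoint interval plus expected recovery time, otherwise the broker may abort transactions that are still waiting for a checkpoint to complete.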
❗ Remember: Exactly-once guarantees are only as strong as your weakest link. If any sink or output cannot participate in atomic commits, your end-to-end pipeline degrades to at-least-once for that path.
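One common mitigation for such side effects is a deterministic idempotency key. The sketch below is illustrative rather than tied to any real gateway: it assumes the provider deduplicates on an Idempotency-Key header (a widespread convention, not a specific vendor's API) and derives the key from the event's unique ID, so a replay after recovery carries the same key and is answered from the gateway's cache.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PaymentClient {
    private final HttpClient http = HttpClient.newHttpClient();

    /**
     * Sends a charge request whose idempotency key is derived from the event ID.
     * If the stream processor replays this event after a crash, the gateway sees
     * the same key and returns its cached result instead of charging twice.
     * (Hypothetical gateway endpoint; the header name is a common convention.)
     */
    public HttpResponse<String> charge(String eventId, long amountCents) throws Exception {
        String body = String.format("{\"amount_cents\":%d}", amountCents);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://gateway.example.com/v1/charges")) // placeholder URL
                .header("Idempotency-Key", "charge-" + eventId)            // deterministic, never random
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The key design point is that the idempotency key is derived from the event, not generated per attempt; a fresh random key on each retry would defeat the gateway-side deduplication.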
Availability Trade-offs During Outages: Strict exactly-once semantics can conflict with availability. If your output sink, such as a data warehouse or key-value store, becomes unavailable and you require atomicity, the pipeline must pause processing to preserve exactly-once guarantees. This can push end-to-end latency from milliseconds to minutes or hours. At very high throughput, say 1 gigabyte per second of input, buffering through a 10-minute outage means storing 600 gigabytes in memory or on disk, which can exceed capacity budgets and cause backpressure that affects upstream systems. Some systems expose configuration to trade consistency for availability: they can switch to at-least-once delivery during sink outages, allowing processing to continue while accepting potential duplicates that must be cleaned up later. The right choice depends on your availability requirements and your tolerance for temporary inconsistency.

Late Data and Out-of-Order Processing: Windowed aggregations and stream joins introduce additional complexity for exactly-once semantics. Consider a 10-minute tumbling window that closes at 10:00. If an event with timestamp 9:58 arrives at 10:03, does it get included in the 10:00 window? Most systems define watermarks that track event-time progress and trigger window closures. Once a window closes and its results are output, very late data (arriving after the watermark has passed) may be dropped or sent to a side output. The exactly-once guarantee applies to the defined processing logic, not to magical inclusion of arbitrarily late events. For stream-to-stream joins with large windows, coordinating exactly-once semantics requires careful alignment of checkpoints on both input sides: if one stream checkpoints and commits while the other rolls back to an earlier point on failure, you may see dropped or duplicated join results. Frameworks handle this through barrier alignment mechanisms that ensure both sides checkpoint consistently.
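One way to make the late-data policy explicit, again assuming Flink's DataStream API, is to declare the watermark bound, the window, and a side output for late events directly in the job definition. The event type and names below are illustrative.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataHandling {

    // Illustrative event type: a key, a numeric value, and an event-time timestamp in ms.
    public record Event(String key, long value, long timestampMillis) {}

    // Late events are routed here instead of being silently dropped.
    static final OutputTag<Event> LATE = new OutputTag<Event>("late-events") {};

    public static SingleOutputStreamOperator<Event> aggregate(DataStream<Event> events) {
        return events
                // Watermark lags the maximum observed event time by 2 minutes, so events
                // up to 2 minutes out of order still count as "on time".
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofMinutes(2))
                                .withTimestampAssigner((e, ts) -> e.timestampMillis()))
                .keyBy(Event::key)
                .window(TumblingEventTimeWindows.of(Time.minutes(10)))
                // Anything arriving after the watermark has closed its window goes to
                // the side output rather than mutating already-emitted results.
                .sideOutputLateData(LATE)
                .reduce((a, b) -> new Event(a.key(), a.value() + b.value(),
                        Math.max(a.timestampMillis(), b.timestampMillis())));
    }
}
```

Events routed to the side output (retrieved via getSideOutput(LATE) on the result stream) can be logged, reconciled in batch, or dead-lettered; the point is that the late-data policy is declared explicitly rather than left implicit.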
Failure Scenario Timeline (sink outage): NORMAL (50 ms p99 latency), SINK DOWN (processing paused), 10 MINUTES LATER (roughly 600 GB buffered).
Network Partitions and Split Brain: In distributed deployments, network partitions can create split-brain scenarios where two instances of the same operator both believe they are the active processor. Without proper coordination through leader election or distributed consensus, both might write outputs, violating exactly-once guarantees. Production systems use mechanisms like Apache ZooKeeper or etcd for distributed coordination, ensuring that only one instance is active at a time, and fencing tokens prevent old leaders from committing after a new leader takes over. These mechanisms add complexity but are essential for correctness in the face of network failures.
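A fencing check at the sink is conceptually simple. The sketch below is hypothetical and framework-agnostic: it assumes the coordination service hands each newly elected leader a strictly increasing epoch number, which callers pass along as the fencing token.

```java
/**
 * Hypothetical fenced sink. A coordination service such as ZooKeeper or etcd issues
 * each newly elected leader a strictly increasing epoch number (the fencing token);
 * the sink refuses writes that carry an older epoch, so a deposed leader resuming
 * after a network partition cannot overwrite or duplicate the new leader's output.
 */
public class FencedSink {

    private long highestEpochSeen = Long.MIN_VALUE;

    /** Returns true if the write was accepted, false if the writer was fenced off. */
    public synchronized boolean write(long writerEpoch, String record) {
        if (writerEpoch < highestEpochSeen) {
            // A newer leader has already written through this sink: reject the stale writer.
            return false;
        }
        highestEpochSeen = writerEpoch;
        commit(record);
        return true;
    }

    private void commit(String record) {
        // Placeholder for the real durable write. In production the epoch check and the
        // write must be atomic at the storage layer (for example a conditional write),
        // not just in the client, or a stale leader could still slip a write in.
        System.out.println("epoch-checked commit: " + record);
    }
}
```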
💡 Key Takeaways
Partial commit failures where state updates succeed but outputs fail (or vice versa) are the most dangerous failure mode, requiring coordinated two-phase commit
Side effects outside the transaction boundary like emails or external API calls cannot be covered by exactly-once guarantees and must be made idempotent separately
Sink outages can force a choice between pausing processing (preserving exactly-once) or continuing with at-least-once semantics, potentially buffering hundreds of gigabytes of data
Late-arriving events after window closure must be explicitly handled, with exactly-once guarantees applying to the defined logic rather than magically including arbitrarily late data
Network partitions require distributed coordination (ZooKeeper, etcd) and fencing tokens to prevent split-brain scenarios where multiple operators process the same data
📌 Examples
1. A stream processor updates account balances but crashes before checkpointing. Recovery replays from the old checkpoint, processing payments twice and doubling charges without coordinated commits.
2. A pipeline sends order confirmation emails. Despite exactly-once semantics for internal state, customers receive duplicate emails because the email service cannot participate in stream transactions.
3. During a 10-minute data warehouse outage, a pipeline processing 1 GB/sec must buffer 600 GB. It switches to at-least-once mode to avoid running out of disk, accepting duplicates that are cleaned up later.
4. A 10-minute window closes at 10:00. An event timestamped 9:58 arrives at 10:03, after the watermark has passed. The system drops it per policy, maintaining exactly-once for processed events only.