Big Data Systems • Lambda & Kappa Architectures · Hard · ⏱️ ~2 min
Idempotency and Exactly Once Semantics for Stream Correctness
Idempotency ensures that processing the same event multiple times produces the same result as processing it once, which is critical for correctness because retries and replays are inevitable in distributed streaming systems. Network partitions, transient failures, and reprocessing all cause duplicate event deliveries. Without idempotency, counters are inflated, payments are charged twice, and aggregates drift from the truth. The core technique is to assign globally unique event IDs and maintain a small deduplication index per key and time bucket, with a time-to-live (TTL) aligned to the replay window.
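A minimal sketch of such a deduplication index, with illustrative names and an in-memory dict standing in for a real state backend:

```python
class DedupIndex:
    """Per-key deduplication index with a TTL aligned to the replay
    window. Simplified sketch: a real system would back this with a
    state store, not a process-local dict."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # (key, bucket, event_id) -> insertion time

    def _bucket(self, event_time):
        # Bucket granularity equal to the TTL keeps the index small.
        return int(event_time // self.ttl)

    def is_duplicate(self, key, event_id, event_time):
        self._expire(event_time)
        entry = (key, self._bucket(event_time), event_id)
        if entry in self.seen:
            return True
        self.seen[entry] = event_time
        return False

    def _expire(self, now):
        # Drop entries older than the replay window.
        self.seen = {k: t for k, t in self.seen.items()
                     if now - t <= self.ttl}

idx = DedupIndex(ttl_seconds=3600)
assert idx.is_duplicate("user42", "evt-1", 1000.0) is False
assert idx.is_duplicate("user42", "evt-1", 1001.0) is True  # retry suppressed
```

Note the trade-off the text describes: if `ttl_seconds` is shorter than the actual replay latency, the entry expires and the replayed event slips through.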
Exactly-once semantics means the system guarantees that each event affects output state exactly once despite retries and failures. This requires coordination between the stream processor, state stores, and external sinks. For counters, use commutative and associative operations such as conflict-free replicated data type (CRDT) adders, which tolerate retries naturally. For database writes, implement upsert semantics keyed on business keys and sequence numbers, so that replaying an event overwrites the row with identical data. Where supported, transactional writes batch state updates and sink writes into atomic commits, preventing the partial updates that corrupt downstream views.
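The upsert-with-sequence-number idea can be sketched with a plain dict standing in for the database table (names are illustrative):

```python
def upsert(table, business_key, seq, row):
    """Idempotent upsert: apply the write only if the incoming sequence
    number is at least the stored one. A replay of the same event
    rewrites identical data instead of double-applying; a stale replay
    is ignored. `table` is a dict standing in for a database table."""
    current = table.get(business_key)
    if current is None or seq >= current[0]:
        table[business_key] = (seq, row)

payments = {}
upsert(payments, "txn-123", 1, {"amount": 50})
upsert(payments, "txn-123", 1, {"amount": 50})  # retry: overwrites identically
assert payments["txn-123"] == (1, {"amount": 50})  # charged once, not twice
```

In a real database this maps to an upsert on the business key (e.g. `INSERT ... ON CONFLICT`) with the sequence-number comparison in the update condition.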
Failure modes emerge when idempotency assumptions break. External sinks that are not idempotent, such as incrementing counters or appending to logs, overcount on retries. If the dedupe window is too short relative to replay latency, replayed events bypass deduplication and create duplicates. Hot keys, where a viral post generates millions of events, overload a single partition and cause out-of-memory failures in the deduplication state store. Mitigation requires composite keys to spread load, hierarchical aggregation with local then global rollups, and pre-aggregation before writes to reduce cardinality.
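A sketch of composite-key sharding with per-shard pre-aggregation before the global rollup (the shard count and key format are assumptions for illustration):

```python
from collections import defaultdict

NUM_SHARDS = 16  # assumed shard count; tune to partition parallelism

def shard_key(post_id, event_id):
    # Composite key: a shard suffix derived from the event spreads a
    # single hot post across partitions. A production system would use
    # a stable hash rather than Python's per-process salted hash().
    return f"{post_id}#{hash(event_id) % NUM_SHARDS}"

# Stage 1: local pre-aggregation per shard reduces write cardinality.
local_counts = defaultdict(int)
for i in range(1_000):
    local_counts[shard_key("post-1", f"evt-{i}")] += 1

# Stage 2: global rollup merges at most NUM_SHARDS rows per post.
total = sum(v for k, v in local_counts.items() if k.startswith("post-1#"))
assert total == 1_000
```

No single partition holds more than a shard's worth of the hot post's traffic, and the rollup stage touches at most `NUM_SHARDS` pre-aggregated rows instead of millions of raw events.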
💡 Key Takeaways
• Idempotency ensures that processing the same event multiple times produces an identical result, which is critical for correctness when network partitions and retries cause duplicate deliveries
• Exactly-once semantics requires coordination between the stream processor, state stores, and sinks, using techniques such as upserts keyed on business keys, sequence numbers, and transactional writes
• Globally unique event IDs with a small deduplication index per key and time bucket enable idempotent processing, with a TTL aligned to the maximum replay window
• Non-idempotent external sinks, such as incrementing counters or append-only logs, overcount on retries, requiring upsert semantics or commutative CRDT operations
• Hot keys from viral content overload a single partition with millions of events, exhausting memory in the deduplication state store; mitigate with composite keys and hierarchical aggregation
📌 Examples
Payment processing assigns a globally unique transaction ID and upserts on that ID in the database, so retrying the same payment overwrites the row with identical data instead of double charging
Counter aggregation uses CRDT-style commutative addition where processing an event ID twice produces the same sum, avoiding a separate deduplication index for simple metrics
Hot-key mitigation: a viral post with 5 million interactions uses a composite key of post ID plus hour bucket, spreading load across partitions and pre-aggregating per partition before the global rollup
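The CRDT-style adder can be sketched by keying each contribution on its event ID, so that both duplicate deliveries and replica merges are idempotent (a simplified illustration, not a full CRDT library):

```python
class IdempotentCounter:
    """CRDT-style adder sketch: each event's contribution is stored
    under its event ID, so adding the same event twice is a no-op and
    merging two replicas is a simple, commutative dict union."""

    def __init__(self):
        self.contributions = {}  # event_id -> value

    def add(self, event_id, value):
        self.contributions[event_id] = value  # replay rewrites same value

    def merge(self, other):
        # Commutative and associative: order of merges does not matter.
        self.contributions.update(other.contributions)

    def value(self):
        return sum(self.contributions.values())

c = IdempotentCounter()
c.add("evt-1", 5)
c.add("evt-1", 5)   # duplicate delivery: no overcount
c.add("evt-2", 3)
assert c.value() == 8
```

This trades memory (one entry per event ID) for retry safety, which is why the text pairs it with TTLs and pre-aggregation to bound state size.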