CDC Delivery Semantics: At Least Once vs Exactly Once
The Fundamental Trade Off:
CDC consistency guarantees ultimately rest on delivery semantics. The question is deceptively simple: if a change is committed in the source database, how many times does each downstream consumer process it? The answer determines whether you can tolerate duplicates, whether you can afford data loss, and what infrastructure complexity you're willing to accept.
At Least Once: The Pragmatic Default:
At least once delivery guarantees that every committed change reaches consumers at least once, but possibly more than once. This is far easier to implement in distributed systems because it tolerates partial failures gracefully.
Consider the failure mode: a CDC connector reads 1,000 events from the database log and publishes them to a stream. It then attempts to checkpoint its position (save "I've processed up to LSN 5,293,847"), but the checkpoint write fails due to a network partition. When the connector restarts, it resumes from its last successful checkpoint at LSN 5,292,000 and republishes the same 1,000 events that were already delivered downstream.
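A minimal sketch of that bookkeeping, assuming the connector persists its position in a hypothetical cdc_checkpoints table (real connectors often keep offsets in a message broker or on local disk instead):

-- Hypothetical table holding each connector's last durable position.
CREATE TABLE cdc_checkpoints (
    connector_id  text PRIMARY KEY,
    last_lsn      bigint NOT NULL,
    updated_at    timestamptz NOT NULL DEFAULT now()
);

-- After publishing a batch, the connector tries to advance its position.
-- If this write is lost, the previous value (here 5292000) survives.
UPDATE cdc_checkpoints
SET last_lsn = 5293847, updated_at = now()
WHERE connector_id = 'orders-connector';

-- On restart, the connector resumes from the last durable checkpoint
-- and republishes every event recorded after it.
SELECT last_lsn
FROM cdc_checkpoints
WHERE connector_id = 'orders-connector';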
This requires consumers to be idempotent: processing an event twice must produce the same result as processing it once. For an Elasticsearch indexer, idempotency is natural: upserting a document by primary key with the same content twice is safe. For a metrics counter, it's dangerous: incrementing a counter twice for the same event doubles your counts.
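To make the contrast concrete, here is a PostgreSQL-style sketch with hypothetical products and metrics tables; the upsert can be replayed freely, the bare counter cannot:

CREATE TABLE products (
    id     bigint PRIMARY KEY,
    name   text,
    price  numeric
);

CREATE TABLE metrics (
    metric_date  date PRIMARY KEY,
    order_count  bigint NOT NULL DEFAULT 0
);

-- Idempotent: replaying the same change produces the same final row.
INSERT INTO products (id, name, price)
VALUES (12345, 'Widget', 19.99)
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    price = EXCLUDED.price;

-- Not idempotent: replaying the same event counts the order twice.
UPDATE metrics
SET order_count = order_count + 1
WHERE metric_date = CURRENT_DATE;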
Decision Framework:
Choose at least once with idempotent consumers when you can structure operations as upserts or when duplicate processing doesn't corrupt business logic. This covers most use cases: search indexing (upsert documents), cache invalidation (rewriting a cache key is safe), and even analytics if you design for idempotency (write events with unique IDs, deduplicate in the warehouse).
Choose exactly once when duplicate processing would violate invariants and you cannot achieve idempotency. Examples include financial ledgers where double counting transactions causes incorrect balances, or inventory systems where duplicate decrement events could show negative stock. But recognize that you are typically giving up 50 to 70 percent of your throughput for this guarantee.
Real World Hybrid Approaches:
Many production systems use at least once delivery with consumer side deduplication. The CDC pipeline delivers events with unique identifiers. Consumers write to staging tables with event IDs, then use SQL deduplication (INSERT ON CONFLICT DO NOTHING or MERGE with event ID checks) to ensure final tables have exactly once semantics. This pushes the complexity to the database's transaction system, which is often more efficient than distributed exactly once protocols.
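A minimal sketch of that pattern in PostgreSQL-style SQL, with hypothetical orders_staging and orders tables keyed by the event ID:

-- Raw CDC events land here first, keyed by their unique event ID.
CREATE TABLE orders_staging (
    event_id  uuid PRIMARY KEY,
    order_id  bigint NOT NULL,
    amount    numeric NOT NULL
);

-- The final table carries the event ID so replays can be rejected.
CREATE TABLE orders (
    event_id  uuid PRIMARY KEY,
    order_id  bigint NOT NULL,
    amount    numeric NOT NULL
);

-- Deduplicate on the way into the final table: a replayed event hits
-- the primary key and is silently dropped.
INSERT INTO orders (event_id, order_id, amount)
SELECT event_id, order_id, amount
FROM orders_staging
ON CONFLICT (event_id) DO NOTHING;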
At Least Once: duplicates possible, zero data loss, simple recovery.
Exactly Once: no duplicates, complex state, write amplification.
⚠️ Common Pitfall: Systems designed for at least once often fail on aggregate operations. A consumer that does UPDATE metrics SET order_count = order_count + 1 on every order event will corrupt counts under duplicates. The fix is to track processed event IDs: UPDATE metrics SET order_count = order_count + 1 WHERE NOT EXISTS (SELECT 1 FROM processed_events WHERE event_id = $1), and to insert the event ID into processed_events in the same transaction so the check actually catches replays (see the sketch below).
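One way to wire that up, sketched in PostgreSQL-style SQL with a hypothetical processed_events table and a made-up event ID; recording the ID and incrementing the counter succeed or fail together:

-- Remembers every event this consumer has already applied.
CREATE TABLE processed_events (
    event_id      text PRIMARY KEY,
    processed_at  timestamptz NOT NULL DEFAULT now()
);

BEGIN;

WITH claimed AS (
    -- Try to record the event; a replay hits the primary key,
    -- inserts nothing, and therefore returns no row.
    INSERT INTO processed_events (event_id)
    VALUES ('evt-42')                  -- hypothetical event ID
    ON CONFLICT (event_id) DO NOTHING
    RETURNING event_id
)
UPDATE metrics                         -- the counter table from the pitfall
SET order_count = order_count + 1
WHERE EXISTS (SELECT 1 FROM claimed);  -- increment only on first delivery

COMMIT;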
Exactly Once: When Precision Matters:
Exactly once delivery guarantees each change is processed precisely once, eliminating duplicates. This requires transactional coordination between the CDC stream and consumer state.
The typical implementation couples the output and the progress marker: the consumer reads an event, computes its side effects, and commits both the side effects and the processed event offset in a single atomic transaction. If the database supports it, you write "insert order into warehouse" and "update offset to LSN 5,293,847" in one transaction. If that transaction succeeds, the event is processed exactly once. If it fails, neither the side effect nor the offset commits, so a retry processes the event again.
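A sketch of that transaction, assuming the consumer's output table and a hypothetical cdc_offsets table live in the same database so one transaction can cover both:

-- Tracks how far each consumer has applied the CDC stream.
CREATE TABLE cdc_offsets (
    consumer_id  text PRIMARY KEY,
    position     bigint NOT NULL
);

-- Hypothetical destination table in the warehouse.
CREATE TABLE warehouse_orders (
    order_id     bigint PRIMARY KEY,
    customer_id  bigint NOT NULL,
    amount       numeric NOT NULL
);

BEGIN;

-- Side effect: apply the change event to the warehouse.
INSERT INTO warehouse_orders (order_id, customer_id, amount)
VALUES (98765, 42, 129.00);

-- Progress: advance this consumer's offset in the same transaction.
UPDATE cdc_offsets
SET position = 5293847
WHERE consumer_id = 'warehouse-loader';

COMMIT;

On restart, the consumer reads its row from cdc_offsets and resumes after that position, so an event that committed successfully is never applied twice.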
The cost is substantial. For a consumer processing 100,000 events per second into a data warehouse, exactly once semantics might require maintaining a deduplication state table with 100,000 rows per second written, plus compaction logic to prevent unbounded growth. This can add 10x storage overhead and reduce throughput by 50 to 70 percent compared to at least once with idempotent operations.
Throughput Trade Off: roughly 100k events/sec with at least once versus roughly 30k events/sec with exactly once.
💡 Key Takeaways
✓At least once delivery tolerates failures gracefully by allowing duplicates, requiring consumers to implement idempotent operations
✓Exactly once delivery requires transactional coordination between stream offsets and consumer state, cutting throughput by 50 to 70 percent and adding up to 10x storage overhead
✓Idempotent consumers using upsert operations (UPDATE by primary key, INSERT ON CONFLICT) achieve effectively exactly once results on top of at least once delivery
✓Aggregate operations like counters break under at least once unless you track processed event IDs to prevent double counting
✓Production systems often use hybrid approaches with at least once delivery plus consumer side deduplication using unique event identifiers
📌 Examples
1An Elasticsearch indexer receives duplicate events during a CDC connector restart. It processes both, upserting the same document twice. The final index state is correct because PUT /products/_doc/12345 with identical JSON is idempotent.
2A fraud detection system maintains a user risk score by aggregating transaction events. Under at least once delivery, it tracks processed event IDs in a seen_events table. Before incrementing risk_score, it checks IF NOT EXISTS (SELECT 1 FROM seen_events WHERE event_id = current_event_id), preventing double counting.
3A financial ledger requires exactly once semantics. The consumer uses a transaction to both INSERT INTO transactions and UPDATE cdc_offsets SET position = new_position WHERE consumer_id = self. If the transaction fails, neither the ledger entry nor offset update commits, ensuring the event will be reprocessed.