
CDC Event Structure and Delivery Guarantees

A CDC change event envelope is more than just the new row data. It typically includes the operation type (insert, update, or delete), before and after images of the row, transaction and commit identifiers, the commit timestamp, and the source position such as a Log Sequence Number (LSN) or Global Transaction Identifier (GTID). This rich metadata enables downstream systems to replay changes in order, detect duplicates, and maintain consistency.

Most CDC systems deliver at-least-once semantics, meaning the same change event might be delivered multiple times during failures or restarts. This is a deliberate trade-off: achieving exactly-once delivery end to end is expensive and complex, requiring distributed transactions across the source, broker, and sink. Instead, production systems accept at-least-once delivery and make consumers idempotent. For example, upserts keyed by primary key plus a version number will automatically ignore duplicate or older events.

Ordering guarantees are typically per partition or per key, not global. If you partition your CDC stream by user ID, all changes for user 12345 arrive in commit order, but changes for different users may interleave. This matters when a single logical transaction updates multiple entities. At 50,000 events per second with 500-byte events, you need at least 50 Kinesis shards (each provides 1,000 records per second and 1 MB per second), and each shard maintains its own ordering.

The bootstrap phase is critical to avoid gaps or duplicates. You take a consistent snapshot of existing data tied to a precise log position, then start streaming CDC from the next position after the snapshot. If you snapshot at time T but start CDC from an earlier position, you'll see duplicates. If you start from a later position, you'll miss changes that occurred during the snapshot. Airbnb's Debezium-style MySQL CDC pipelines implement this two-phase handoff to ensure completeness.
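The envelope described above can be sketched as a small data structure. This is illustrative only; the field names loosely follow Debezium-style conventions and are not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """Illustrative CDC change-event envelope (Debezium-style field names)."""
    op: str                 # "c" (insert), "u" (update), or "d" (delete)
    before: Optional[dict]  # row image before the change (None for inserts)
    after: Optional[dict]   # row image after the change (None for deletes)
    source_position: str    # source log position, e.g. a Postgres LSN or MySQL GTID
    tx_id: str              # transaction identifier, for grouping multi-row transactions
    commit_ts_ms: int       # commit timestamp in epoch milliseconds

# An update event carries both images, which is what enables audit trails and replay
event = ChangeEvent(
    op="u",
    before={"id": "SKU123", "quantity": 40, "version": 4},
    after={"id": "SKU123", "quantity": 42, "version": 5},
    source_position="0/16B3748",
    tx_id="txn-9812",
    commit_ts_ms=1700000000000,
)
```

Carrying both `before` and `after` roughly doubles the payload relative to shipping only the new row, which is the size trade-off noted in the takeaways below.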
💡 Key Takeaways
Change events include before and after images, enabling audit trails and replay, but roughly double payload size. Column-level masking can reduce bandwidth but complicates replays
At-least-once delivery means duplicates are possible during failures. Use deterministic upserts with stable keys and version checks to achieve exactly-once effects through idempotency
Ordering is per partition, not global. With 50,000 events per second and 500-byte events on Kinesis, you need at least 50 shards (each shard provides 1,000 records per second and 1 MB per second)
Bootstrap requires a consistent snapshot tied to a precise log position (such as LSN), then streaming from the next position to avoid gaps or duplicates during the handoff
Transaction identifiers enable grouping multi row transactions, critical when a single business operation updates multiple tables or entities atomically
Commit timestamps enable last writer wins conflict resolution in multi region setups, but clock skew can cause older updates to overwrite newer ones incorrectly
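The shard arithmetic in the takeaways above can be sanity-checked directly. A minimal sketch, using the per-shard Kinesis write limits quoted in the text (1,000 records per second and 1 MB per second):

```python
import math

def kinesis_shards_needed(events_per_sec: int, event_bytes: int,
                          records_per_shard: int = 1_000,
                          bytes_per_shard: int = 1_000_000) -> int:
    """Shard count is driven by whichever per-shard limit binds first."""
    by_record_limit = math.ceil(events_per_sec / records_per_shard)
    by_byte_limit = math.ceil(events_per_sec * event_bytes / bytes_per_shard)
    return max(by_record_limit, by_byte_limit)

# 50,000 events/s at 500 bytes each: the record limit requires 50 shards,
# the byte limit only 25 (25 MB/s), so the record limit dominates.
print(kinesis_shards_needed(50_000, 500))  # 50
```

Note that with larger events the byte limit would dominate instead; at 500 bytes the stream is record-bound, not bandwidth-bound.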
📌 Examples
Idempotent upsert in SQL: UPDATE inventory SET quantity = 42, version = 5 WHERE id = 'SKU123' AND version < 5 prevents applying older duplicate events
Airbnb CDC: Uses Debezium style MySQL CDC with consistent snapshot at LSN X, then starts streaming from LSN X+1 to ensure no gaps in the data pipeline
DynamoDB Global Tables: Uses streams with last writer wins based on timestamps; requires accurate clock synchronization and idempotent updates for multi region writes
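The idempotent-upsert pattern from the first example can be demonstrated end to end. A minimal sketch using SQLite and a hypothetical inventory table; the version check makes duplicate and stale events harmless no-ops:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (id TEXT PRIMARY KEY, quantity INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO inventory VALUES ('SKU123', 40, 4)")

def apply_event(conn, row_id, quantity, version):
    """Apply a CDC event only if it is newer than the stored row."""
    conn.execute(
        "UPDATE inventory SET quantity = ?, version = ? WHERE id = ? AND version < ?",
        (quantity, version, row_id, version),
    )

apply_event(conn, "SKU123", 42, 5)  # applied: incoming version 5 > stored 4
apply_event(conn, "SKU123", 42, 5)  # redelivered duplicate: no-op, version is already 5
apply_event(conn, "SKU123", 41, 3)  # stale older event: no-op, 3 < 5

print(conn.execute("SELECT quantity, version FROM inventory").fetchone())  # (42, 5)
```

Because the predicate compares the stored version against the event's version, delivery order and redelivery count stop mattering: the row converges to the newest event regardless.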