Data Processing Patterns › ETL Pipelines & Data Integration · Medium · ⏱️ ~2 min

Change Data Capture (CDC): Capturing Deltas with Log Sequence Numbers and Idempotency

Change Data Capture streams only the changes (inserts, updates, deletes) from source databases by reading transaction logs or binlogs. CDC minimizes load on the source and reduces downstream compute by shipping incremental deltas rather than full snapshots. The complexity lies in ordering, idempotency, and handling schema drift.

CDC workflows start with a consistent snapshot of the source table, recording the Log Sequence Number (LSN) or System Change Number (SCN) at snapshot time. After the snapshot completes, the pipeline applies log changes from that position forward, strictly ordered by commit sequence. Each change carries the source primary key, LSN, and operation type (insert, update, delete). Downstream sinks perform upserts keyed by primary key and drop changes with older LSNs, so retries and consumer rebalancing remain idempotent.

The failure modes are subtle. If log retention is too short for long-running snapshots, you miss changes and create gaps. Out-of-order application of updates yields incorrect final state, especially for multi-row transactions. Schema changes in the source can break downstream parsers unless you version schemas and route to side-by-side tables during migration windows. Monitor replication lag by tracking the difference between the current source LSN and the last applied LSN; alert on gaps or lag exceeding Service Level Objectives (SLOs).

CDC is the backbone of real-time data replication. Amazon teams use CDC to replicate operational databases into data lakes and warehouses, maintaining sub-minute freshness for high-value tables. For example, a CDC pipeline on a busy Orders table might ship 50,000 changes per second during peak traffic. At 1 kilobyte per change, that is 50 megabytes per second of sustained throughput. Full snapshots would hammer the source with gigabytes of reads every hour; CDC reduces this to incremental deltas with minimal source impact.
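The upsert-and-drop-older-LSNs rule can be sketched with a minimal in-memory sink. The `Change` record and `IdempotentSink` class below are hypothetical names for illustration; a real sink would be a warehouse table with a MERGE statement, but the LSN comparison logic is the same.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Change:
    pk: int                 # source primary key
    lsn: int                # log sequence number (commit order)
    op: str                 # "insert" | "update" | "delete"
    row: Optional[dict]     # new row image; None for deletes

class IdempotentSink:
    def __init__(self):
        self.rows = {}       # pk -> latest row image
        self.last_lsn = {}   # pk -> highest LSN applied for that key

    def apply(self, change: Change) -> bool:
        # Drop anything at or below the last applied LSN for this key,
        # so replays after a retry or consumer rebalance are harmless.
        if change.lsn <= self.last_lsn.get(change.pk, -1):
            return False
        if change.op == "delete":
            self.rows.pop(change.pk, None)
        else:  # insert or update -> upsert by primary key
            self.rows[change.pk] = change.row
        self.last_lsn[change.pk] = change.lsn
        return True

sink = IdempotentSink()
sink.apply(Change(pk=1, lsn=1001, op="insert", row={"status": "placed"}))
sink.apply(Change(pk=1, lsn=1002, op="update", row={"status": "shipped"}))
# A replayed older change (e.g. after a rebalance) is silently dropped:
assert sink.apply(Change(pk=1, lsn=1001, op="insert", row={"status": "placed"})) is False
assert sink.rows[1] == {"status": "shipped"}
```

Because application is keyed by (primary key, LSN), the sink converges to the same final state regardless of how many times a batch of changes is redelivered.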
💡 Key Takeaways
CDC captures only changes by reading database logs, minimizing source load and reducing downstream compute compared to full snapshots.
Start with a consistent snapshot and record the Log Sequence Number (LSN) or System Change Number (SCN). Apply log changes from that position forward in strict commit order.
Downstream sinks must upsert by primary key and drop changes with older LSNs to achieve idempotency during retries and consumer rebalancing.
Failure modes include log retention gaps if snapshots run too long, out-of-order application causing incorrect state, and schema drift breaking parsers. Monitor LSN lag and validate continuity.
At scale: a busy Orders table shipping 50k changes per second at 1 KB each sustains 50 MB/s throughput. Full snapshots would cause gigabytes per hour of source load; CDC reduces this to incremental deltas.
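The throughput figure above is a straightforward back-of-envelope calculation, assuming decimal units (1 KB = 1,000 bytes):

```python
# 50,000 changes/s at 1 KB each -> sustained bytes/s and MB/s
changes_per_sec = 50_000
bytes_per_change = 1_000                     # 1 KB per change (decimal)
throughput_bps = changes_per_sec * bytes_per_change
throughput_mbps = throughput_bps / 1_000_000
assert throughput_mbps == 50.0               # 50 MB/s sustained
```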
📌 Examples
Amazon CDC pattern: replicate operational databases to data lakes with sub-minute freshness. Snapshot captures initial state at LSN=1000, then stream applies changes at LSN=1001, 1002, etc. Downstream sinks perform upserts keyed by (table PK, LSN) to drop duplicates.
Monitoring: track current source LSN minus last applied LSN. Alert if lag exceeds SLO (e.g., 5 minutes) or if LSN differences show gaps, indicating missed changes due to log retention expiry.
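The monitoring rule above can be sketched as two small checks. The function names, the `lsn_rate_per_sec` estimate used to convert an LSN delta into seconds, and the numeric values are all illustrative assumptions, not part of any particular CDC product:

```python
def check_lag_breach(current_source_lsn: int, last_applied_lsn: int,
                     lsn_rate_per_sec: float, slo_seconds: float) -> bool:
    """Return True if estimated replication lag breaches the SLO.

    lsn_rate_per_sec is an assumed estimate of how fast the source
    advances its LSN, used to convert an LSN delta into seconds.
    """
    lag_lsns = current_source_lsn - last_applied_lsn
    return (lag_lsns / lsn_rate_per_sec) > slo_seconds

def find_gaps(applied_lsns: list) -> list:
    """Return LSNs missing from a sorted applied sequence; gaps
    typically indicate changes lost to log retention expiry."""
    gaps = []
    for prev, cur in zip(applied_lsns, applied_lsns[1:]):
        gaps.extend(range(prev + 1, cur))
    return gaps

# 120,000 LSNs behind at ~1,000 LSN/s is 120 s of lag: within a 5-minute SLO.
assert check_lag_breach(1_120_000, 1_000_000, 1_000, 300) is False
# A hole in the applied sequence means missed changes:
assert find_gaps([1001, 1002, 1005]) == [1003, 1004]
```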