
What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a design pattern that continuously extracts committed inserts, updates, and deletes from a source database and emits them as an ordered change stream with minimal impact on the source system. Think of it as a firehose of everything changing in your database, delivered in real time. The most robust implementation is log-based CDC, which reads the database's own commit log (such as PostgreSQL's Write-Ahead Log (WAL), MySQL's binary log (binlog), or MongoDB's operations log (oplog)). This approach is powerful because the database already writes these logs for crash recovery and replication, so you are tapping into an existing stream rather than creating new overhead. For example, at Amazon, DynamoDB Streams publishes every item-level change, typically within one second of the write, scaling automatically with table partitions.

The alternative approaches trade performance for simplicity. Trigger-based CDC fires custom logic on every row change, adding write-path latency and contention to your production database. Query-based CDC periodically scans tables to find differences, which is the easiest to implement but causes high read amplification and staleness. If your system handles 50,000 transactions per second, trigger-based CDC adds overhead to every single one of those writes, while log-based CDC reads a stream the database was already producing.

CDC enables critical use cases such as near-real-time analytics, data warehouse ingestion, cache invalidation, search indexing, and cross-region replication. It decouples your analytical workloads from your operational database, protecting the hot path while providing a reliable, append-only history from which to reconstruct state elsewhere.
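To make this concrete, the snippet below sketches the shape of a single change event in the Debezium-style envelope that many log-based CDC tools emit. The field names and values here are illustrative, not tied to any particular tool or schema:

```python
# Illustrative shape of one log-based CDC change event (Debezium-style
# envelope; field names are representative, not a specific tool's contract).
change_event = {
    "op": "u",                  # operation type: "c"=insert, "u"=update, "d"=delete
    "before": {"id": 42, "email": "old@example.com"},  # row image before the change
    "after":  {"id": 42, "email": "new@example.com"},  # row image after the change
    "ts_ms": 1700000000000,     # commit timestamp (epoch milliseconds)
    "source": {
        "txId": 9912,           # transaction identifier, groups changes from one commit
        "lsn": 54321098,        # log position (e.g., WAL LSN) used for replay and ordering
    },
}
```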
💡 Key Takeaways
Log-based CDC reads the database's commit log (WAL, binlog, oplog) that already exists for recovery, adding minimal overhead to the source system
Trigger-based CDC adds logic to every write operation, introducing roughly 5 to 20 ms of additional latency per transaction and contention on production workloads
Query-based CDC scans tables periodically for changes, causing high read amplification and staleness, but is the easiest to implement when log access is unavailable
Change events include the operation type, before and after row images, transaction identifier, commit timestamp, and source log position for replay and ordering
At Amazon, DynamoDB Streams typically publishes changes in under 1 second, enabling cross-region replication with sub-second propagation under normal conditions
CDC streams typically provide at-least-once delivery, so downstream consumers must process events idempotently and preserve per-key ordering (see the sketch after this list)
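Because delivery is at least once, a consumer may see the same event more than once. A common defense is to upsert by primary key and skip any event whose log position is not newer than the last one applied for that key. Below is a minimal sketch of this pattern, assuming events shaped like the envelope above and per-key ordering from the transport (e.g., a Kafka topic partitioned by primary key):

```python
# Minimal idempotent CDC apply loop (sketch, not a production implementation).
# Assumes each event carries a monotonically increasing log position ("lsn")
# and that the transport preserves ordering per primary key.
last_applied_lsn: dict[int, int] = {}  # primary key -> last applied log position
table: dict[int, dict] = {}            # materialized replica, keyed by primary key

def apply_event(event: dict) -> None:
    row = event["after"] or event["before"]  # deletes carry only a before-image
    key = row["id"]
    lsn = event["source"]["lsn"]
    if lsn <= last_applied_lsn.get(key, -1):
        return  # duplicate or stale delivery: already applied, safe to skip
    if event["op"] == "d":
        table.pop(key, None)          # delete: drop the row from the replica
    else:
        table[key] = event["after"]   # insert/update: upsert the after-image
    last_applied_lsn[key] = lsn

apply_event(change_event)  # first delivery applies the update
apply_event(change_event)  # redelivery is a no-op thanks to the lsn check
```

The position check makes redelivery a no-op, so the replica converges to the source state regardless of how many times an event is retried.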
📌 Examples
DynamoDB Streams at Amazon: every item change published within <1s, used by Global Tables for cross-region replication with last-writer-wins conflict resolution
Uber's MySQL binlog CDC to Kafka: handles millions of messages per second for search indexing at sub-second to low-seconds latency
AWS Database Migration Service (DMS): log-based CDC from Oracle/MySQL/PostgreSQL to Kinesis/S3/Redshift, maintaining sub-second replication lag at tens of thousands of row changes per second