
What is Log-Based Change Data Capture (CDC)?

Definition
Log-Based Change Data Capture (CDC) reads a database's internal transaction log to capture every insert, update, and delete as it happens, converting these changes into structured events that other systems can consume in near real time.
The Core Problem

Imagine a MySQL database handling customer orders at 20,000 transactions per second. Your search team needs order data in Elasticsearch, your analytics team needs it in a data warehouse, and your cache needs to know when to invalidate stale entries. How do you get all these changes to those systems without crushing your primary database?

Traditional approaches fail at scale. Query-based polling (checking for changes every few minutes using timestamps) adds heavy read load to your busiest tables and can miss deletions entirely. Trigger-based CDC (running code on every row change) adds 10 to 30 percent latency overhead in your critical write path.

How Log-Based CDC Works

Every database already maintains a transaction log for durability and replication. In MySQL, this is the binary log (binlog). In PostgreSQL, it is the Write-Ahead Log (WAL) with logical decoding. When you commit a transaction, the database appends it to this log before acknowledging success. Log-based CDC simply tails this existing log, much as a database replica does. A CDC connector reads entries from the log, interprets each low-level record, and emits high-level change events like "the row with primary key 123 was updated from status pending to shipped at timestamp T, in transaction Z."
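To make that concrete, here is a minimal connector sketch using the open-source python-mysql-replication package to tail a MySQL binlog. The connection settings, server_id, and the publish() hand-off are illustrative assumptions, not part of any particular product:

```python
# Minimal log-based CDC connector sketch (assumes the open-source
# python-mysql-replication package; connection details are illustrative).
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)

MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "cdc", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=100,            # unique id; the connector poses as a replica
    blocking=True,            # tail the log, waiting for new entries
    resume_stream=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:          # one event per committed row change
    for row in event.rows:
        if isinstance(event, UpdateRowsEvent):
            change = {"op": "update", "table": event.table,
                      "before": row["before_values"],  # e.g. status=pending
                      "after": row["after_values"]}    # e.g. status=shipped
        elif isinstance(event, WriteRowsEvent):
            change = {"op": "insert", "table": event.table,
                      "after": row["values"]}
        else:  # DeleteRowsEvent
            change = {"op": "delete", "table": event.table,
                      "before": row["values"]}
        publish(change)  # hypothetical: hand off to Kafka, a queue, etc.
```

Note that the connector registers with a server_id exactly like a replica would, which is why it sees the same ordered, complete stream of changes that replication does.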
Typical End-to-End Performance

Latency: 50-200 ms
Overhead on the primary's write path: 0%
Why This Matters

Because it reads from the same mechanism that powers database replication, log-based CDC is complete (it captures all change types with before and after values), ordered (it preserves transaction sequence), and low overhead (it adds no extra work to the primary database's write path). Companies like LinkedIn, Uber, and Netflix use this approach to stream millions of database changes per second into downstream systems.
💡 Key Takeaways
Reads the database's internal transaction log (binlog in MySQL, WAL in PostgreSQL) instead of querying tables directly
Captures all change types (inserts, updates, deletes) with before and after values in near real time, typically 50 to 200 milliseconds end to end
Adds no overhead to the primary database's write path, because it tails the same log already written for durability and replication
Preserves transaction ordering and completeness, giving you the exact same ground truth that database replicas use
Enables multiple downstream systems (search, analytics, caching) to consume changes independently without touching the primary database (a minimal consumer sketch follows this list)
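As a sketch of that last point, here is a hypothetical downstream consumer that invalidates cache entries from CDC events published to Kafka; the topic name orders.changes, the Redis key format, and the event shape are illustrative assumptions:

```python
# Hypothetical downstream consumer: cache invalidation driven by CDC
# events on a Kafka topic (topic name and event shape are illustrative).
import json

import redis
from kafka import KafkaConsumer  # kafka-python package

cache = redis.Redis(host="127.0.0.1", port=6379)
consumer = KafkaConsumer(
    "orders.changes",                      # topic fed by the CDC connector
    bootstrap_servers=["127.0.0.1:9092"],
    group_id="cache-invalidator",          # each downstream system reads
    value_deserializer=json.loads,         # independently via its own group
)

for msg in consumer:
    change = msg.value                     # e.g. {"op": "update", "after": {...}}
    order_id = (change.get("after") or change.get("before"))["order_id"]
    cache.delete(f"order:{order_id}")      # drop the stale cache entry
```

Because each downstream system uses its own consumer group, search indexing and analytics can consume the same change stream at their own pace without adding any load to the primary database.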
📌 Examples
1. A MySQL database at 20,000 writes per second generates 5 to 20 MB per second of binlog data. A CDC connector reads this stream and publishes structured events to Kafka topics, which downstream consumers process within 100 to 300 milliseconds for use cases like cache invalidation and search indexing.
2. PostgreSQL enables logical decoding on its Write-Ahead Log (WAL), allowing a CDC connector to subscribe and see row-level changes (user 123 updated email from [email protected] to [email protected]) rather than low-level page modifications. A minimal subscription sketch follows this list.
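As a rough illustration of the second example, this sketch subscribes to PostgreSQL logical decoding through psycopg2's replication support. The slot name cdc_slot and the built-in test_decoding output plugin are assumptions for demonstration; production connectors typically use a plugin such as pgoutput or wal2json:

```python
# Minimal PostgreSQL logical-decoding subscriber sketch using psycopg2.
# Slot name and the test_decoding plugin are illustrative choices.
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect(
    "dbname=orders user=cdc host=127.0.0.1",
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()

# Create a replication slot so the server retains WAL until we consume it.
cur.create_replication_slot("cdc_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="cdc_slot", decode=True)

def on_change(msg):
    # Each message is one decoded row-level change, e.g.:
    #   table public.users: UPDATE: id[integer]:123 email[text]:'...'
    print(msg.payload)
    # Acknowledge so the server can recycle WAL behind this position.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(on_change)  # blocks, invoking on_change per change
```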