
How Log-Based CDC Works: From Log Entries to Change Events

The Transaction Log Mechanism: Every transactional database writes changes to a durable, append-only log before acknowledging commits. This is fundamental to ACID guarantees. When you execute an UPDATE statement, the database first writes a log record describing what changed, flushes that log to disk, and only then updates the actual data pages in memory. If the server crashes before those pages are written, recovery replays the log. For MySQL with row-based replication enabled, the binlog contains the full before and after images of each modified row. For PostgreSQL, the Write-Ahead Log (WAL) stores physical page changes by default, but with logical decoding enabled it can emit logical row-level changes suitable for CDC.

The CDC Connector Process: A CDC connector is a continuously running process that connects to the database (or a dedicated read replica) with special replication permissions. It does NOT query application tables. Instead, it subscribes to the transaction log stream at a specific log sequence number or binlog position.
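To make "subscribing to the log stream" concrete, here is a minimal sketch of tailing PostgreSQL's WAL through logical decoding with the psycopg2 replication API. The connection string, slot name, and the wal2json output plugin are assumptions for illustration, not something specified above.

```python
# Minimal sketch: tailing PostgreSQL's WAL via logical decoding with psycopg2.
# Assumptions (not from the text above): a reachable DSN, a role with the
# REPLICATION privilege, wal_level=logical on the server, and the wal2json
# output plugin installed.
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect(
    "dbname=app user=cdc_user host=db.example.internal",  # hypothetical DSN
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()

# A replication slot remembers this consumer's log position across restarts.
# Creating it is a one-time step; re-running this line on an existing slot fails.
cur.create_replication_slot("cdc_slot", output_plugin="wal2json")
cur.start_replication(slot_name="cdc_slot", decode=True)

def handle(msg):
    # msg.payload is a logical, row-level change (JSON from wal2json),
    # not a physical page diff.
    print(msg.data_start, msg.payload)
    # Acknowledge the position so the server can recycle older WAL segments.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle)  # blocks, invoking handle() for each change
```

A full connector builds on this kind of subscription; its processing loop follows five steps: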
1. Read log entry: The connector reads a raw log record, which might say "page 47 of table users, offset 512, changed bytes X to Y."
2. Interpret and enrich: The connector translates this into a logical change event: "table users, primary key 123, column email changed from [email protected] to [email protected], transaction ID abc789, source log position 45678."
3. Group by transaction: The connector buffers entries belonging to the same transaction ID and emits them together when the transaction commits, preserving atomicity.
4. Publish to event stream: Events are sent to a message bus like Kafka, typically one topic per table, partitioned by primary key to maintain per-entity ordering.
5. Persist offset: The connector saves its current log position (e.g., binlog file and offset) to a durable store so it can resume from the exact same point after a restart.
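Putting the five steps together, a minimal sketch of the connector loop might look like the following; the log reader, publisher, and offset store are illustrative stand-ins, not any real connector's API.

```python
# Sketch of the five-step loop above. The log reader, publisher, and offset
# store are illustrative stand-ins, not a real connector or library API.
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    table: str
    primary_key: int
    op: str                  # "insert" | "update" | "delete"
    before: Optional[dict]   # None for inserts
    after: Optional[dict]    # None for deletes
    txn_id: str
    log_position: int

def read_log_entries(start_position):
    """Step 1 (stand-in): tail the binlog/WAL starting at a saved position."""
    return iter([])          # a real reader would stream decoded records here

def interpret(fields):
    """Step 2: translate a raw record into a logical change event."""
    return ChangeEvent(**fields)

def publish(event):
    """Step 4 (stand-in): one topic per table, keyed by primary key so all
    changes to the same entity land on the same partition, in order."""
    topic = f"{event.table}_changes"
    print(f"-> {topic} key={event.primary_key} pos={event.log_position}")

def save_offset(position):
    """Step 5 (stand-in): durably record the log position for restarts."""
    with open("connector.offset", "w") as f:
        f.write(str(position))

def run(start_position):
    open_txns = defaultdict(list)          # step 3: buffer rows per transaction
    for record in read_log_entries(start_position):
        if record["kind"] == "change":
            open_txns[record["txn_id"]].append(interpret(record["fields"]))
        elif record["kind"] == "commit":   # emit atomically only at commit
            for event in open_txns.pop(record["txn_id"], []):
                publish(event)
            save_offset(record["log_position"])
```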
Handling Initial State: Snapshots: When you first enable CDC on an existing table with millions of rows, you need both the historical data and the ongoing changes. The standard pattern is two-phase onboarding. First, the connector takes a snapshot of the table at a specific log position (say, position P) and emits all existing rows as INSERT events. Once the snapshot completes, it switches to tailing the live log starting from position P, seamlessly merging snapshot and streaming data without gaps or duplicates.
✓ In Practice: At scale, taking a snapshot of a 100 GB table can take hours. Production systems often snapshot from a read replica to avoid impacting the primary, and they rate-limit snapshot emission to avoid overwhelming downstream consumers with a sudden burst of millions of events.
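A rough sketch of that two-phase onboarding, with a crude rate limit on snapshot emission, is shown below; all helper functions and the batch size are illustrative stand-ins, not a specific connector's behavior.

```python
# Sketch of two-phase onboarding: snapshot at position P, then stream from P.
# Helpers and the batch size are illustrative stand-ins.
import time

def current_log_position():
    """Stand-in: would read the server's current binlog/WAL position (P)."""
    return 99999

def scan_all_rows(table):
    """Stand-in: would scan the table, ideally on a read replica and within a
    transaction consistent with position P."""
    return iter([])

def as_insert_event(table, row):
    return {"table": table, "op": "insert", "after": row}

def publish_batch(events):
    """Stand-in: would send events to the message bus."""
    pass

def stream_changes_from(position):
    """Stand-in: would tail the live log starting at the given position."""
    pass

ROWS_PER_BATCH = 5000  # assumed rate limit, not from the text above

def snapshot_then_stream(table):
    # Phase 1: snapshot. Record log position P before emitting any rows.
    position_p = current_log_position()
    batch = []
    for row in scan_all_rows(table):
        batch.append(as_insert_event(table, row))
        if len(batch) >= ROWS_PER_BATCH:
            publish_batch(batch)
            batch.clear()
            time.sleep(1)  # crude rate limiting so consumers are not flooded
    publish_batch(batch)
    # Phase 2: tail the live log from exactly position P, so changes made
    # while the snapshot ran are still delivered, without gaps or duplicates.
    stream_changes_from(position_p)
```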
Metadata and Idempotence: Each CDC event includes metadata critical for downstream processing: the source table, primary key, operation type (insert, update, delete), transaction ID, source timestamp, and log position. This metadata enables consumers to deduplicate (if they see the same log position twice due to retries), reorder (if cross-partition ordering matters), and replay (rewind to a historical position to rebuild derived state).
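As a sketch of how that metadata enables idempotence, a consumer can track the highest log position it has applied and skip anything at or below it. This assumes in-order delivery within a partition; the in-memory dict here stands in for durable storage.

```python
# Sketch: idempotent consumption using the source log position from event
# metadata. Assumes in-order delivery per partition; the dict is a stand-in
# for durable storage.
def apply_to_derived_store(event):
    """Stand-in: would upsert or delete in the downstream table/cache/index."""
    print("applied", event["table"], event["primary_key"], event["op"])

last_applied = {}  # scope (here: table) -> highest log position applied

def consume(event):
    scope = event["table"]
    position = event["log_position"]
    if position <= last_applied.get(scope, -1):
        return                       # already applied; duplicate from a retry
    apply_to_derived_store(event)
    last_applied[scope] = position   # in production, persist atomically with the write
```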
💡 Key Takeaways
The connector reads the transaction log at the physical level (binlog or WAL) and translates low-level records into high-level logical change events with the primary key, before/after values, and metadata
Changes are grouped by transaction ID and emitted atomically when the transaction commits, preserving transactional semantics across the stream
One topic per table with partitioning by primary key ensures that all changes to a given entity (like user 123) arrive in order at the same consumer
Initial onboarding uses a two-phase approach: snapshot existing data at log position P, then tail the live log from position P to merge historical and real-time data seamlessly
Every event includes source log position, enabling consumers to deduplicate on retries, detect gaps, and replay from any historical point
📌 Examples
1. A CDC connector reads a MySQL binlog entry at position 45678 showing row ID 123 in table orders changed column status from pending to shipped. It emits a structured JSON event with primary key 123, before value pending, after value shipped, the transaction ID, and source position 45678 to the Kafka topic orders_changes (a sketch of such an event appears after these examples).
2. During the snapshot, a connector reads 10 million existing rows from a users table over 2 hours, emitting them as INSERT events. At snapshot completion, it notes that the log is at position 99999. It then switches to streaming mode starting at position 99999, ensuring no changes are missed between the snapshot and streaming phases.
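To make the first example concrete, the emitted event could look roughly like the sketch below; the field names and the transaction ID value are illustrative, not a particular connector's envelope format.

```python
# Rough shape of the event from example 1; field names are illustrative.
import json

event = {
    "table": "orders",
    "primary_key": 123,
    "op": "update",
    "before": {"status": "pending"},
    "after": {"status": "shipped"},
    "txn_id": "abc789",      # hypothetical value; the example names no specific ID
    "log_position": 45678,
}
# Published to the orders_changes topic, keyed by "123" for per-entity ordering.
print(json.dumps(event, indent=2))
```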