
Failure Modes and Advanced Challenges in Log-Based CDC

Consumer Lag and Disk Exhaustion: The most dangerous failure mode is runaway lag. Imagine your CDC connector is processing 50,000 events per second, keeping up with the database. Then a downstream Kafka broker has a network partition, and event publishing stalls for 2 minutes. During this time, the database continues writing at 50,000 transactions per second, generating 6 million unconsumed log entries. In PostgreSQL with logical replication, the corresponding Write-Ahead Log (WAL) segments cannot be recycled until they are consumed. If your primary database has 100 GB of available disk and the WAL is growing at 20 MB per second, you have less than 90 minutes before the disk fills completely. When disk space hits zero, PostgreSQL refuses new writes, and your application goes down.
Disk Exhaustion Timeline: NORMAL (5 GB WAL) → LAG STARTS (30 GB WAL) → CRITICAL (95 GB WAL)
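One way to see how close you are to that timeline is to ask PostgreSQL how much WAL each logical replication slot is still holding back. A minimal sketch, assuming PostgreSQL 10 or later (the system view and functions are standard, but the alert thresholds belong to your deployment):

-- WAL retained on disk because a logical replication slot has not advanced.
-- A large or steadily growing value means the CDC consumer is falling behind.
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';

Feeding this number (plus the consumer-offset delta on the Kafka side) into the alert thresholds described below is usually enough to catch runaway lag well before the disk fills.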
Production systems prevent this with monitoring and automated remediation. Monitor the delta between the current log position and the consumer's position. Alert when lag exceeds thresholds: warning at 5 minutes or 500 MB, critical at 15 minutes or 2 GB. Remediation can include throttling non-critical consumers, scaling out consumer capacity, or, in emergencies, temporarily disabling CDC to allow log recycling.

Duplicate Events and Idempotence: Log-based CDC typically provides at-least-once delivery. If the CDC connector crashes after reading log position 45000 but before persisting that offset, it restarts from the last saved position, say 44500. It reprocesses 500 entries, emitting duplicate events. Downstream consumers MUST be idempotent. The standard pattern is upserting by primary key. For example, a data warehouse table has columns user_id, email, and updated_at. When processing a CDC event, upsert with a condition: only update if the incoming updated_at is newer than the existing value. This handles both duplicates and out-of-order delivery. For audit logs or event stores where duplicates are unacceptable, include the source log position in the event and deduplicate based on it. A consumer maintains a table of processed log positions and skips events it has already seen (a sketch of this table follows the schema discussion below).

Schema Evolution Challenges: When you add a NOT NULL column without a default to a table, older CDC events do not have that field. If downstream consumers expect the new schema immediately, they fail. The safe pattern is backward-compatible evolution: add columns as nullable, deploy consumers that handle both old and new schemas, wait for all old events to drain (check that lag is zero), then optionally make the column NOT NULL. Changing primary keys is even harder. If you change a user's user_id from 123 to 456, CDC emits a DELETE for 123 and an INSERT for 456. Downstream consumers that partition by primary key will route these to different partitions, potentially processing them out of order. Robust systems use a schema registry (like Confluent Schema Registry) to enforce compatibility rules and include both old and new schemas in events during transitions.
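A minimal sketch of the dedup-by-log-position table mentioned in the idempotence discussion above, in PostgreSQL syntax (the processed_positions table, its columns, and the example LSN are illustrative assumptions, not part of any particular CDC connector):

-- Record each source WAL position exactly once; a conflict means the event
-- was already processed and should be skipped.
CREATE TABLE processed_positions (
    source_lsn   pg_lsn PRIMARY KEY,
    processed_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO processed_positions (source_lsn)
VALUES ('0/16B3748')   -- log position carried inside the CDC event
ON CONFLICT (source_lsn) DO NOTHING;
-- Apply the event's changes only if this INSERT affected one row, and do both
-- in the same transaction so the position and the effect commit atomically.

Running the position insert and the downstream write in one transaction is what turns at-least-once delivery into effectively-once processing for that consumer.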
❗ Remember: Schema changes require coordination. The sequence is: add the column as nullable in the database, deploy consumers that tolerate the new field, wait for lag to clear, then make the column required if needed. Reversing this order causes consumer crashes.
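The database side of that sequence, sketched as PostgreSQL DDL for a hypothetical phone column on a users table (steps 2 and 3 happen in consumer deployments and lag monitoring, not in SQL):

-- Step 1: add the column as nullable so events produced before the change
-- remain valid for every consumer.
ALTER TABLE users ADD COLUMN phone text;

-- Steps 2-3: deploy consumers that tolerate the new field, then wait until
-- CDC lag is zero so no old-schema events remain in flight.

-- Step 4 (optional): backfill existing NULLs, then enforce the constraint.
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;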
Cross-Region and Multi-Source Ordering: In multi-region setups with bidirectional replication, changes can originate in different regions and affect the same entity. Suppose user 123's email is updated in Region A at timestamp T1, and their phone is updated concurrently in Region B at timestamp T2, where T2 reads slightly earlier than T1 due to clock skew. CDC events from both regions arrive at a central consumer, but network delays cause the T2 event to arrive first. Without conflict resolution, the consumer simply applies whichever row image arrives last, silently overwriting the other region's change and ending with stale data. Strategies include last-writer-wins using a logical clock (like a Lamport timestamp or vector clock), explicit version numbers that increment on each change, or operational transformation that merges concurrent updates. These are complex and often require application-level logic, not just CDC infrastructure.

Long-Running Transactions: A transaction that stays open for 30 seconds (maybe a large batch update) is invisible to CDC until it commits. When it finally commits, CDC emits all of its changes in a burst, potentially thousands of events in a few milliseconds. This can overwhelm downstream systems that expect a steady rate. Production systems handle this with rate limiting or buffering in the CDC connector, or by designing consumers to absorb temporary spikes. Monitoring transaction durations and alerting on long-running transactions can also prevent surprises.
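Monitoring those transaction durations can be as simple as a periodic query against pg_stat_activity; a sketch, with an arbitrary 30-second threshold:

-- Transactions that have been open longer than 30 seconds. Each one will hit
-- the CDC connector as a single burst of events when it finally commits.
SELECT pid,
       now() - xact_start AS transaction_age,
       state,
       left(query, 80) AS current_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '30 seconds'
ORDER BY transaction_age DESC;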
💡 Key Takeaways
Consumer lag causing Write-Ahead Log (WAL) or binlog growth is the most critical failure: at 20 MB per second of log generation, a 2-minute lag consumes 2.4 GB of disk, and if lag continues, disk exhaustion will stop the database entirely
CDC provides at-least-once delivery, so downstream consumers must be idempotent: upsert by primary key with version or timestamp checks, or deduplicate using the source log position
Schema evolution requires backward-compatible changes and careful sequencing: add columns as nullable, deploy consumers that handle both schemas, wait for lag to clear, then optionally enforce NOT NULL
Multi-region or multi-source CDC introduces ordering challenges: changes to the same entity from different regions can arrive out of order, requiring conflict resolution strategies like last-writer-wins with logical clocks or version numbers
Long-running transactions appear as bursts of events when they commit, potentially overwhelming consumers: design consumers to handle spikes or rate-limit event emission in the CDC connector
📌 Examples
1. A PostgreSQL primary with 100 GB of free disk and logical replication enabled generates WAL at 15 MB per second. A downstream Kafka broker fails, causing CDC lag to grow. In 30 minutes, 27 GB of WAL accumulates. Monitoring alerts at 2 GB of lag trigger automatic consumer scaling, which catches up within 10 minutes before disk becomes critical.
2. A data warehouse consumer receives duplicate CDC events after a connector restart. It uses upsert logic: INSERT ... ON CONFLICT (user_id) DO UPDATE SET email = EXCLUDED.email WHERE EXCLUDED.updated_at > users.updated_at (in PostgreSQL the existing row is referenced by the target table name, here assumed to be users). Duplicate and stale events are safely ignored.
3. A batch job updates 100,000 rows in a single transaction that takes 40 seconds. When the transaction commits, CDC emits 100,000 events in a burst. A downstream consumer sized for 5,000 events per second experiences a temporary spike to 20,000 events per second, triggering backpressure and a 10-second delay in processing.