
CDC Failure Modes and Edge Cases

The Hard Problems in Production: CDC systems fail in subtle ways that can cause silent data loss, duplication, or out-of-order processing. Understanding these failure modes is critical for building reliable pipelines and for answering system design interview questions about correctness.

At-Least-Once Delivery and Idempotency: The most common failure pattern is the connector crash. A CDC connector reads 10,000 events from the database log, publishes them to Kafka, then crashes before committing its offset bookmark. On restart, it resumes from the last committed offset and re-emits those same 10,000 events. This is why production CDC systems combine at-least-once delivery with idempotent consumers. Each CDC event carries a source offset or a logical sequence number per primary key; consumers store the last applied offset per key and ignore events with an equal or lower offset. For database sinks, an UPSERT guarded by a condition such as WHERE source_offset > last_applied_offset gives effectively exactly-once processing on top of at-least-once delivery (see the sketch below). Without idempotency, duplicate events corrupt aggregations: a payment processing system that counts transactions by summing CDC events would double-count, causing financial discrepancies.

Log Retention and Backpressure: A critical edge case occurs when downstream consumers lag while the database purges old transaction logs. Suppose your data warehouse has a 6-hour outage. CDC events accumulate in Kafka, and backpressure (or a stalled connector) keeps the connector's bookmark in the database log from advancing. If the database retains logs for only 4 hours and the connector is 6 hours behind, you lose the ability to resume: the log segments containing the missed events are gone. You must either accept data loss or perform a full table rescan, which is expensive and creates load spikes.
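A minimal sketch of the idempotent-apply pattern just described, in Python. The event shape, the in-memory offset map, and the apply_to_sink helper are illustrative assumptions rather than any particular connector's or sink's API; in a real database sink the offset check and the write would be a single atomic UPSERT.

```python
# Minimal idempotent-apply sketch: ignore any event whose source offset is not
# strictly greater than the last offset applied for that primary key.
# Names (CdcEvent, apply_to_sink) are illustrative, not from a real library.
from dataclasses import dataclass

@dataclass
class CdcEvent:
    key: str            # primary key of the changed row
    source_offset: int  # log position (e.g. LSN/GTID) at which the change committed
    payload: dict       # new row image

last_applied: dict[str, int] = {}  # in production this state lives in the sink itself

def apply_to_sink(event: CdcEvent) -> bool:
    """Apply the event once; return False if it is a duplicate replay."""
    if last_applied.get(event.key, -1) >= event.source_offset:
        return False                       # already applied: drop the duplicate
    # ... write event.payload to the sink here, e.g. an UPSERT guarded by
    # "WHERE source_offset > last_applied_offset" so the check and write are atomic ...
    last_applied[event.key] = event.source_offset
    return True

# Replaying the same event after a connector restart becomes a no-op:
e = CdcEvent(key="order-42", source_offset=1001, payload={"status": "paid"})
assert apply_to_sink(e) is True
assert apply_to_sink(e) is False  # duplicate from the replayed batch is ignored
```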
Critical Lag Scenario: log retention (24 hours) > maximum expected lag (6 hours)
Production teams size log retention to at least 2x their maximum expected lag, often 24 to 48 hours. They also monitor lag metrics and alert when consumers fall behind by more than a threshold, for example 30 minutes for near-real-time pipelines.

Schema Evolution: Adding a nullable column to a table is straightforward, but renaming a column or changing a data type breaks downstream consumers that expect a fixed schema. A consumer reading user_email will fail if you rename it to email_address. Systems like Debezium emit schema metadata alongside each event and integrate with schema registries such as Confluent Schema Registry. Compatibility strategies include forward compatibility (old consumers can read data written with the new schema) and backward compatibility (consumers on the new schema can read old data). In practice, you enforce additive-only changes: new columns must be nullable or have defaults. Breaking changes require versioned topics or dual writing during migrations (a simple additive-only check is sketched below).

Ordering Across Tables and Transactions: Log-based CDC preserves commit order within a single partition, but if you partition events by primary key, a transaction that updates both the users and orders tables may emit events to different partitions that are processed out of order by different consumer instances. For example, an order placement transaction atomically inserts an order and decrements inventory. If the order event arrives before the inventory event because of partitioning, a materialized view might briefly show inconsistent state. Solutions include using transaction IDs and commit timestamps to reorder events at the consumer (see the sketch below), or accepting eventual consistency for cross-table aggregations.
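A sketch of the additive-only gate described above, assuming schemas are available as simple column-name-to-spec mappings; the function name and the spec fields are hypothetical, not part of any schema registry API.

```python
# Sketch of an "additive-only" schema compatibility check (names are illustrative).
# old/new map column name -> {"type": ..., "nullable": ..., "default": ...}.

def is_additive_only(old: dict, new: dict) -> tuple[bool, list[str]]:
    """Return (ok, violations) for a proposed schema change."""
    violations = []
    for name, spec in old.items():
        if name not in new:
            violations.append(f"column dropped or renamed: {name}")
        elif new[name]["type"] != spec["type"]:
            violations.append(f"type changed for {name}: {spec['type']} -> {new[name]['type']}")
    for name, spec in new.items():
        if name not in old and not (spec.get("nullable") or "default" in spec):
            violations.append(f"new column {name} must be nullable or have a default")
    return (not violations, violations)

old_schema = {"user_email": {"type": "string", "nullable": False}}
new_schema = {"email_address": {"type": "string", "nullable": False}}
ok, why = is_additive_only(old_schema, new_schema)
# ok is False: the rename looks like a drop plus a non-nullable add, so the change
# would be rejected and routed through a versioned topic or dual-write migration.
```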
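And a sketch of reordering at the consumer using transaction metadata. The field names (tx_id, tx_total_events, log_pos) are assumptions for illustration; Debezium, for example, can emit comparable transaction metadata (transaction ID, event counts, per-event order) when configured to do so.

```python
# Sketch of transaction-aware buffering at the consumer: hold events until every
# event sharing a transaction ID has arrived, then apply the whole transaction in
# source log order so downstream views never see a partial transaction.
from collections import defaultdict

class TxnBuffer:
    def __init__(self) -> None:
        self.pending = defaultdict(list)   # tx_id -> events received so far

    def add(self, event: dict) -> list[dict]:
        """Buffer the event; return the complete transaction once all parts arrive."""
        tx = event["tx_id"]
        self.pending[tx].append(event)
        if len(self.pending[tx]) < event["tx_total_events"]:
            return []                      # transaction still incomplete: emit nothing yet
        complete = self.pending.pop(tx)
        return sorted(complete, key=lambda e: e["log_pos"])  # apply in source log order

buf = TxnBuffer()
# the inventory event arrives first because it landed on a different partition
assert buf.add({"tx_id": "tx-9", "tx_total_events": 2, "log_pos": 101, "table": "inventory"}) == []
txn = buf.add({"tx_id": "tx-9", "tx_total_events": 2, "log_pos": 100, "table": "orders"})
assert [e["table"] for e in txn] == ["orders", "inventory"]  # applied together, in order
```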
❗ Remember: Multi-table transactions in CDC often require consumers to use transaction metadata and commit timestamps to maintain consistency, or the system must be designed to tolerate eventual consistency.
Operational Edge Cases: Primary database failovers introduce complexity. When a database fails over to a replica, the new primary's transaction log starts from a different position, and CDC connectors must detect this and switch log positions without gaps or duplicates. Similarly, network partitions between the database and the message bus can cause connector retries and potential event reordering if not handled carefully. Monitoring is essential: track lag per consumer, error rates, and schema compatibility failures, and alert when lag exceeds SLAs (for example, 30 seconds for real-time pipelines, 5 minutes for near-real-time ones). During incidents, you need tooling to pause, rewind, or fast-forward consumers, and to replay specific time ranges after bug fixes.
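A minimal sketch of the lag alerting described above, assuming consumer lag is available as a number of seconds; the thresholds and the half-of-retention heuristic (derived from the 2x sizing rule) are illustrative choices, not a standard.

```python
# Sketch of a lag check combining the SLA alert and the retention-headroom rule
# discussed above. The metric source and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class LagPolicy:
    sla_seconds: int            # e.g. 30 for real-time, 300 for near-real-time
    log_retention_seconds: int  # how long the source database keeps its log

def check_lag(lag_seconds: int, policy: LagPolicy) -> list[str]:
    """Return alert messages for the current consumer lag (empty list = healthy)."""
    alerts = []
    if lag_seconds > policy.sla_seconds:
        alerts.append(f"SLA breach: lag {lag_seconds}s exceeds {policy.sla_seconds}s")
    # the 2x rule restated: page loudly once lag consumes half of the retention window,
    # because beyond retention the only options are data loss or a full table rescan
    if lag_seconds > policy.log_retention_seconds // 2:
        alerts.append("retention risk: lag is past half of the source log retention")
    return alerts

policy = LagPolicy(sla_seconds=300, log_retention_seconds=24 * 3600)
print(check_lag(6 * 3600, policy))  # 6h lag: SLA breach, but still within retention headroom
```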
💡 Key Takeaways
At-least-once delivery plus idempotent consumers keyed on source offsets prevents duplicate processing after connector crashes and restarts
Log retention must be at least 2x the maximum expected consumer lag; 24 to 48 hours is typical to survive multi-hour downstream outages without data loss
Schema evolution requires compatibility strategies such as additive-only changes, schema registries, and versioned topics to avoid breaking consumers
Multi-table transactions partitioned by primary key can be processed out of order; consumers must use transaction IDs and commit timestamps, or tolerate eventual consistency
📌 Examples
1. CDC connector crashes after publishing 10,000 events but before committing its offset, causing duplicate emissions on restart; consumers use WHERE source_offset > last_applied to deduplicate
2. Data warehouse outage lasting 6 hours causes CDC lag to exceed a 4-hour log retention window, resulting in data loss; production systems use 24-hour retention to handle p99 outage durations
3. Schema migration renames user_email to email_address, breaking 5 downstream consumers until the dual-write period completes and consumers are updated