
CDC Failure Modes and Edge Cases

CDC pipelines fail in subtle and catastrophic ways if you don't plan for lag, schema evolution, and idempotency. The most severe failure is consumer lag exceeding log retention, which causes irreversible data loss: if your consumer falls behind for longer than the source retains its logs, the required commit log entries are purged and the gap cannot be recovered. You must either re-snapshot (which may take hours or days for large tables) or accept permanent data loss. Always monitor lag in bytes or LSNs behind the head, not just in time, because the log generation rate varies.

Large transactions create burst amplification and out-of-order observations. A single transaction touching millions of rows may be emitted as a long burst of events that spikes downstream backpressure. For example, a bulk update across 5 million rows generates 5 million change events in rapid succession; if your consumer handles 10,000 events per second, that is a 500 second backlog. While commit order is preserved, per-table or per-partition interleaving can surprise consumers expecting strictly serial updates across related entities. Apply per-key ordering and backpressure-aware batching to absorb spikes.

Schema evolution breaks pipelines when CDC delivers Data Definition Language (DDL) changes out of band or not at all. A consumer expecting 5 columns suddenly receives 6, or a column type changes from integer to string, causing deserialization failures and pipeline stalls. Some CDC implementations miss schema changes entirely. Use a schema registry with versioning, enforce backward and forward compatibility rules, and gate deployments so that both producers and consumers handle the new schema before changes appear in the stream.

Cross-region replication with CDC typically uses last-writer-wins conflict resolution based on timestamps. Clock skew between regions can cause older updates to incorrectly overwrite newer ones, silently losing data. For example, if region A's clock is 2 seconds ahead, region A writes at true time T, and region B then writes at true time T plus 1 second, region B's genuinely later write appears older (its timestamp is T plus 1, while region A's skewed timestamp is T plus 2) and is discarded. Use monotonic version counters or hybrid logical clocks instead of wall-clock timestamps to reduce this risk.
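As a rough illustration of why wall-clock last-writer-wins is fragile, the sketch below contrasts a timestamp-based merge with a monotonic per-key version counter. The `Write` record and `merge_*` helpers are assumptions for the example, not any particular database's conflict-resolution API.

```python
from dataclasses import dataclass

@dataclass
class Write:
    key: str
    value: str
    ts_ms: int    # wall-clock timestamp assigned by the writing region
    version: int  # monotonic counter incremented on every update to the key

def merge_lww(current: Write, incoming: Write) -> Write:
    """Timestamp-based last-writer-wins: loses data under clock skew."""
    return incoming if incoming.ts_ms > current.ts_ms else current

def merge_versioned(current: Write, incoming: Write) -> Write:
    """Monotonic version counter: unaffected by wall-clock skew.
    Equal versions fall back to a deterministic tiebreaker (here, the greater value)."""
    if incoming.version != current.version:
        return incoming if incoming.version > current.version else current
    return incoming if incoming.value > current.value else current

# Region A's clock is 2 s fast. It writes first (version 1, skewed timestamp),
# then region B writes the genuinely newer value (version 2, true timestamp).
a = Write("user:42", "email=old@example.com", ts_ms=10_002_000, version=1)
b = Write("user:42", "email=new@example.com", ts_ms=10_001_000, version=2)

print(merge_lww(a, b).value)        # keeps the stale write from region A
print(merge_versioned(a, b).value)  # keeps the newer write from region B
```

Note that a plain counter only resolves causally ordered updates, where the later writer has seen the earlier version; truly concurrent writes still need a deterministic tiebreaker, or a hybrid logical clock that folds wall-clock time into a monotonic ordering.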
💡 Key Takeaways
Consumer lag exceeding log retention causes irreversible data loss. At 30 megabytes per second with 512 gigabytes retention, you have only 4.7 hours of headroom before purges begin
Large transactions create burst amplification. A 5 million row bulk update emits 5 million events; at 10,000 events per second consumer rate, this creates a 500 second backlog spike
Schema evolution failures occur when DDL changes arrive out of band or are missed entirely. Use schema registry with versioning and enforce compatibility before deploying changes
Trigger-based CDC can pollute buffer caches by evicting hot pages, increasing latency for the primary workload. Prefer log-based CDC to avoid touching base tables during capture
Cross-region last-writer-wins with timestamp-based conflict resolution silently loses data when clock skew exists. Use monotonic version counters or hybrid logical clocks instead
Snapshot-to-CDC cutover gaps occur if the handoff is imprecise. Always snapshot at a precise log position and start CDC from the next position to avoid duplicates or missing rows, as sketched below
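A minimal sketch of that cutover ordering, assuming a Postgres-style source where `pg_current_wal_lsn()` reports the current write-ahead log position; `copy_all_rows` and `stream_changes_from` are hypothetical stand-ins for a real snapshot copier and CDC client.

```python
import psycopg2  # assumed driver; the same ordering applies to any log-based CDC source

def copy_all_rows(cur, table: str):
    """Hypothetical snapshot copier: in practice this streams rows to the sink."""
    cur.execute(f"SELECT * FROM {table}")  # illustration only; not injection-safe
    return cur.fetchall()

def stream_changes_from(dsn: str, start_lsn: str) -> None:
    """Hypothetical CDC client: begin consuming the change stream at start_lsn."""
    print(f"starting CDC for {dsn} at {start_lsn}")

def snapshot_then_stream(dsn: str, table: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.set_session(isolation_level="REPEATABLE READ")  # consistent snapshot reads
    with conn, conn.cursor() as cur:
        # Record the log position the snapshot corresponds to. Production systems
        # usually take this from a replication slot's consistent point, which closes
        # the small race between reading the position and opening the snapshot.
        cur.execute("SELECT pg_current_wal_lsn()")
        cutover_lsn = cur.fetchone()[0]

        snapshot_rows = copy_all_rows(cur, table)  # 1. bulk copy as of the snapshot

    # 2. Start CDC from the recorded position onward: replaying from earlier
    #    duplicates the copied rows; starting later drops changes committed
    #    while the snapshot was running.
    stream_changes_from(dsn, start_lsn=cutover_lsn)
```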
📌 Examples
DynamoDB Global Tables conflict resolution: Uses last writer wins based on timestamps, requiring accurate clock synchronization and idempotent updates to prevent data loss from skew
Incorrect snapshot handoff: Snapshot at time T, CDC from earlier position → Duplicates. CDC from later position → Missing changes during snapshot window
Broker message size limits: Large rows with before and after images can exceed Kafka's default 1 megabyte limit, causing drops. Use column filtering or payload chunking, as sketched below
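The payload-chunking idea can be sketched roughly as follows; the envelope fields (`chunk_id`, `seq`, `total`) and the size threshold are assumptions for illustration, not a standard CDC wire format. In practice you would also key every chunk of an event to the same partition so the pieces arrive in order, or raise the broker and producer limits (`message.max.bytes`, `max.request.size`) instead.

```python
import base64
import json
from uuid import uuid4

# Kafka's default message.max.bytes is roughly 1 MB; leave headroom for the
# envelope and base64 expansion (~4/3). The exact threshold is an assumption.
MAX_CHUNK_BYTES = 700_000

def chunk_event(event: dict, max_bytes: int = MAX_CHUNK_BYTES) -> list[bytes]:
    """Split one oversized change event into ordered, reassemblable chunks."""
    payload = json.dumps(event).encode("utf-8")
    pieces = [payload[i:i + max_bytes] for i in range(0, len(payload), max_bytes)]
    chunk_id = str(uuid4())
    return [
        json.dumps({
            "chunk_id": chunk_id,   # groups the pieces of one event for reassembly
            "seq": seq,             # reassembly order
            "total": len(pieces),
            "data": base64.b64encode(piece).decode("ascii"),
        }).encode("utf-8")
        for seq, piece in enumerate(pieces)
    ]

def reassemble(chunks: list[bytes]) -> dict:
    """Consumer side: sort by seq and decode once every piece has arrived."""
    parts = sorted((json.loads(c) for c in chunks), key=lambda p: p["seq"])
    assert len(parts) == parts[0]["total"], "incomplete chunk set"
    raw = b"".join(base64.b64decode(p["data"]) for p in parts)
    return json.loads(raw)
```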