CDC Failure Modes and Recovery at Scale
The Biggest Risk: Falling Behind: Your database produces 50,000 changes per second. Your CDC capture normally processes 55,000 per second. But a network hiccup slows it to 30,000 per second for 2 hours, leaving a backlog of 144 million unprocessed changes. The real danger is log truncation: if recovery drags on until the capture's read position is older than the database's log retention window (48 hours, say), the database discards log segments the capture has never read, and those changes are permanently lost.
In production, this is the nightmare scenario. You monitor two lag metrics: log sequence number lag (how many transactions behind) and time lag (how many seconds behind). Set database log retention to 3x your maximum expected catch-up time. If normal lag is under 1 second and worst-case recovery is 6 hours, retain logs for at least 24 hours, with alerts at 12 hours.
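To make the arithmetic concrete, here is a back-of-the-envelope sketch in plain Python, with every rate taken from the hypothetical scenario above. Notice that with only 5,000 events per second of catch-up headroom, draining the backlog alone takes about 8 hours:

```python
# Lag math for the scenario above; every rate is this section's
# hypothetical number, not a measurement.

PRODUCE_RATE = 50_000          # changes/sec the database emits
NORMAL_CAPTURE_RATE = 55_000   # changes/sec CDC processes when healthy
DEGRADED_RATE = 30_000         # changes/sec during the network hiccup
OUTAGE_SECONDS = 2 * 3600      # the 2-hour slowdown

# Backlog grows at the deficit between production and degraded capture.
backlog = (PRODUCE_RATE - DEGRADED_RATE) * OUTAGE_SECONDS
print(f"backlog: {backlog:,} events")                  # 144,000,000

# Afterwards, only the spare headroom drains the backlog.
headroom = NORMAL_CAPTURE_RATE - PRODUCE_RATE          # 5,000 events/sec
catch_up_hours = backlog / headroom / 3600
print(f"catch-up time: {catch_up_hours:.0f} hours")    # 8 hours

# Rule of thumb from this section: retain logs for 3x worst-case recovery.
print(f"retention: at least {3 * catch_up_hours:.0f} hours")  # 24 hours
```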
Schema Evolution Breaking Consumers: A developer adds a column to the orders table. The CDC capture includes this new field in change events, serialized with schema version 47. But your search indexing consumer was built for schema version 45 and crashes on unknown fields. Changes pile up unprocessed.
Ordering Violations During Failover: Your active CDC capture instance dies. A standby takes over, resuming from the last checkpointed offset at log sequence 1,000,000. But the crashed instance had already published events up to sequence 1,000,050 before dying. The standby replays 1,000,000 to 1,000,050, creating duplicates. Worse, if offset tracking is slightly wrong, you might replay 999,990 to 1,000,050, causing consumers to see events out of order for keys in that range.
The solution is idempotent consumers. Each change event includes a unique identifier (table name, primary key, and log sequence number). Consumers track the highest sequence number they have seen per primary key and ignore any event with a lower or equal sequence. This makes replays safe.
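A minimal sketch of that pattern, assuming an event shape with exactly those three fields plus a payload (the class and field names here are illustrative, not from any particular CDC library):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ChangeEvent:
    table: str
    primary_key: str
    sequence: int   # log sequence number (LSN) of this change
    payload: dict = field(default_factory=dict)

class IdempotentConsumer:
    def __init__(self):
        # Highest sequence applied per (table, primary key). In production
        # this map lives in durable storage, not process memory.
        self._applied: dict[tuple[str, str], int] = {}

    def handle(self, event: ChangeEvent) -> bool:
        key = (event.table, event.primary_key)
        # Drop replays: anything at or below the last applied sequence
        # for this key was already processed.
        if event.sequence <= self._applied.get(key, -1):
            return False
        self._apply(event)
        self._applied[key] = event.sequence
        return True

    def _apply(self, event: ChangeEvent) -> None:
        ...  # update the search index, cache, replica, etc.
```

With this filter, the failover replay of sequences 1,000,000 through 1,000,050 is harmless: every affected key has already recorded an equal or higher sequence, so the duplicates are dropped.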
The schema problem above has a standard three-part defense:

1. Schema Registry: Store all schemas with version IDs. Each change event includes its schema ID.
2. Forward Compatibility: Consumers ignore unknown fields by default, allowing new columns without breaking (see the sketch after this list).
3. Gradual Rollout: Deploy schema changes to CDC first, verify consumers handle them, then enable in application code.
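Forward compatibility is what keeps consumers alive mid-rollout. A hedged sketch, assuming JSON-like payloads and hypothetical field names: the consumer projects each event down to the fields it was built for, so a version 47 event does not crash a version 45 consumer.

```python
# Fields this consumer was built against (hypothetical schema v45).
KNOWN_FIELDS = {"order_id", "customer_id", "status", "total"}

def project_known_fields(payload: dict) -> dict:
    """Keep only the fields this consumer understands; ignore the rest."""
    return {k: v for k, v in payload.items() if k in KNOWN_FIELDS}

# A v47 event with a new `discount_code` column still processes cleanly.
v47_event = {"order_id": 1, "customer_id": 9, "status": "paid",
             "total": 42.0, "discount_code": "SPRING"}
assert project_known_fields(v47_event) == {
    "order_id": 1, "customer_id": 9, "status": "paid", "total": 42.0}
```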
❗ Large Transaction Spike: A batch job updates 5 million rows in one transaction. The commit appears atomically, but CDC emits 5 million change events, overwhelming a single partition if the affected rows share a key range. Result: that partition lags minutes behind while others stay current.
Network Partitions and Partial Failures: What if CDC capture can reach the database but not the distributed log? It cannot publish changes. Options: buffer in memory (risks out of memory), write to local disk (adds latency and complexity), or block (risks falling behind). Most production systems choose buffering with circuit breakers: if the buffer exceeds a threshold, stop reading from the database log and alarm.
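One way to sketch that policy, with `publish` and `alert` as hypothetical stand-ins for the distributed-log client and the paging system:

```python
from collections import deque

BUFFER_LIMIT = 100_000  # illustrative threshold; tune to available memory

class BufferedPublisher:
    """Buffer events when the log is unreachable; trip a breaker when full."""

    def __init__(self, publish, alert):
        self._publish = publish   # callable: send one event to the log
        self._alert = alert       # callable: page the on-call
        self._buffer: deque = deque()
        self.paused = False       # when True, stop reading the database log

    def send(self, event) -> None:
        try:
            self._drain()            # flush older events first, in order
            self._publish(event)
        except ConnectionError:
            self._buffer.append(event)
            if len(self._buffer) >= BUFFER_LIMIT and not self.paused:
                self.paused = True   # circuit breaker: halt log reads
                self._alert("CDC publish buffer full; capture paused")

    def _drain(self) -> None:
        # Pop only after a successful publish so ordering survives retries.
        while self._buffer:
            self._publish(self._buffer[0])
            self._buffer.popleft()
        self.paused = False
```

The capture loop checks `paused` before reading more of the database log, which turns an unbounded memory risk into the bounded lag problem discussed at the top of this section.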
Hot Key Edge Case: A single popular entity (viral post, flash sale item) receives 5,000 updates per second while the average is 50 per key. That entity's partition becomes a bottleneck. One partition at 5,000 writes per second hits capacity limits, causing lag for all keys in that partition. Some systems detect hot keys and temporarily route them to a separate high-throughput partition, as sketched below.
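A simplified sketch of that detection, using an illustrative per-second counter and threshold. Real systems use sliding windows and must drain in-flight events before moving a key, since switching partitions mid-stream can reorder that key's events:

```python
import time
from collections import Counter

HOT_THRESHOLD = 1_000   # updates/sec that marks a key "hot" (illustrative)

class HotKeyRouter:
    def __init__(self, normal_partitions: int):
        self._normal = normal_partitions
        self._counts: Counter = Counter()
        self._window_start = time.monotonic()

    def partition_for(self, key: str) -> str:
        now = time.monotonic()
        if now - self._window_start >= 1.0:   # crude 1-second window
            self._counts.clear()
            self._window_start = now
        self._counts[key] += 1
        if self._counts[key] > HOT_THRESHOLD:
            # Route the viral entity to a dedicated high-throughput
            # partition so it cannot lag every other key in its range.
            return "hot-overflow"
        # Builtin hash() is per-process; production code uses a stable
        # hash (e.g. murmur) so keys map consistently across restarts.
        return f"p{hash(key) % self._normal}"
```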
💡 Key Takeaways
✓Falling behind is the critical failure mode: at 50k writes per second, a 2 hour slowdown creates a 144 million event backlog requiring hours to recover
✓Set database log retention to 3x maximum expected recovery time and monitor both sequence number lag and time lag with alerts
✓Schema evolution requires a registry with version IDs, forward compatible consumers that ignore unknown fields, and gradual rollout to verify compatibility
✓Idempotent consumers using sequence numbers per primary key make failover safe even when offsets cause duplicate or slightly out of order replays
✓Large transactions (5 million rows) and hot keys (5k writes per second to one entity) can overwhelm single partitions, requiring special handling or detection
📌 Examples
1. A CDC system with 1-second normal lag and 6-hour worst-case recovery retains database logs for 24 hours, with alerts at 12 hours, to prevent data loss.
2. During a schema change adding a column, the search consumer with forward compatibility ignores the new field and continues processing, while an analytics consumer waits for the schema version upgrade.