
CDC Failure Modes and Edge Cases

When Consistency Guarantees Break

Theoretical CDC consistency is clean: transactional capture, ordered delivery, idempotent consumption. Production CDC is messy: log retention windows expire, schemas evolve incompatibly, clocks skew, and poison pill events crash consumers. Understanding these failure modes separates candidates who've built real systems from those who've only read the documentation.

Log Truncation and the Unrecoverable Gap

Databases don't retain transaction logs forever. PostgreSQL's write-ahead log is divided into 16 MB segments whose retention is governed by replication slots and configured limits; MySQL binary logs might be retained for only 7 days. If your CDC connector falls behind by more than the retention window, the database deletes log segments you haven't processed yet.

The math is unforgiving. A database generating 500 MB of log data per hour with 7-day retention gives you 84 GB of buffer. If a CDC outage lasts 8 days, you have permanently lost 24 hours of changes. The only recovery is a full resnapshot, which for a 10 TB database takes 12 to 18 hours and creates a consistency gap during the cutover.

Production systems combat this with aggressive monitoring. They track replication lag in both time (minutes behind) and position (LSN delta). Alerts fire at 15 minutes of lag, pages fire at 30 minutes, and runbooks trigger emergency capacity increases or snapshot preparation at 2 hours, all well before the 7-day retention limit. A monitoring sketch follows the timeline below.
Retention Failure Timeline: T0 normal (lag: 2 minutes) → T1 outage (lag: 8 days) → T2 log truncated (permanent data loss).
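To make the monitoring tiers above concrete, here is a minimal Python sketch that reads the byte lag of a logical replication slot from a PostgreSQL source and maps minutes-behind to the escalation levels described in the text. The slot name, connection string, and notify() hook are hypothetical placeholders; it assumes the psycopg2 driver and PostgreSQL 10 or newer.

```python
# Minimal sketch: tiered replication-lag monitoring for a CDC connector.
# Assumes psycopg2 and PostgreSQL 10+; the slot name, DSN, and notify()
# hook are hypothetical placeholders, not any specific product's API.
import psycopg2

SLOT_NAME = "cdc_connector_slot"   # hypothetical logical replication slot
DSN = "dbname=app host=primary"    # hypothetical connection string

LAG_BYTES_SQL = """
    SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
    FROM pg_replication_slots
    WHERE slot_name = %s;
"""

def slot_lag_bytes() -> int:
    """Byte distance between the current WAL position and what the
    connector has confirmed, i.e. how much log it still has to read."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(LAG_BYTES_SQL, (SLOT_NAME,))
        row = cur.fetchone()
        return int(row[0]) if row and row[0] is not None else 0

def escalation(lag_minutes: float) -> str:
    """Map time lag to the escalation tiers described above."""
    if lag_minutes >= 120:   # 2 hours: emergency capacity / snapshot prep
        return "emergency"
    if lag_minutes >= 30:    # 30 minutes: page the on-call
        return "page"
    if lag_minutes >= 15:    # 15 minutes: alert
        return "alert"
    return "ok"

def notify(tier: str, lag_minutes: float, lag_bytes: int) -> None:
    """Placeholder for the team's alerting integration."""
    print(f"[{tier}] lag={lag_minutes:.1f} min, {lag_bytes} bytes behind")

# lag_minutes would come from a connector heartbeat or source timestamp;
# hard-coded here purely for illustration.
lag_minutes = 17.0
notify(escalation(lag_minutes), lag_minutes, slot_lag_bytes())
```

Tracking both dimensions matters: byte lag tells you how close you are to the retention cliff, while time lag is what operators and runbooks reason about.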
Schema Evolution and Breaking Changes

A developer renames a column from user_email to email_address in the source database. CDC events suddenly contain email_address instead of user_email. Consumers expecting the old schema crash or silently drop the field.

Without schema governance, this breaks consistency. Consumers that inserted records successfully before the change now fail on every event, stalling the entire partition. Or worse, they succeed but write NULL for the email field, corrupting downstream data.

Mature CDC systems use schema registries with compatibility enforcement. Before publishing an event with a new schema version, the producer validates compatibility (backward, forward, or full). Backward compatibility means consumers upgraded to the new schema can still read events written with the old one; forward compatibility means consumers still on the old schema can read events written with the new one. The registry rejects incompatible changes, forcing coordinated rollouts.
❗ Remember: Adding a nullable column or one with a default is usually safe (it stays backward compatible). Removing or renaming columns breaks old consumers unless you deploy schema-compatible consumers first, then update producers. This requires coordination windows where both schemas are valid.
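As an illustration of the registry check described above, the sketch below asks a Confluent-style schema registry whether a candidate Avro schema is compatible with the latest registered version before any producer starts publishing it. The registry URL and subject name are assumptions; the call uses the registry's REST compatibility endpoint via the requests library.

```python
# Minimal sketch: pre-flight compatibility check against a Confluent-style
# schema registry. The registry URL and subject name are assumptions.
import json
import requests

REGISTRY_URL = "http://schema-registry:8081"   # hypothetical
SUBJECT = "users-value"                        # hypothetical subject

def is_compatible(candidate_schema: dict) -> bool:
    """True if the registry accepts the schema under the subject's
    configured compatibility mode (backward, forward, or full)."""
    resp = requests.post(
        f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": json.dumps(candidate_schema)}),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("is_compatible", False)

# Proposed schema after the user_email -> email_address rename.
renamed = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email_address", "type": ["null", "string"], "default": None},
    ],
}
if not is_compatible(renamed):
    raise SystemExit("Rejected by the registry: plan a coordinated rollout.")
```

Gating deploys on a check like this turns a silent consumer-breaking change into a failed CI step rather than a production incident.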
Poison Pill Events and Stalled Partitions

A malformed event enters the stream: perhaps a bug in the CDC connector serializes a NULL value as an invalid JSON string, or a database trigger writes a 50 MB TEXT field that exceeds consumer memory limits. The first consumer to read this event crashes. It restarts, reads the same event, and crashes again. The partition is stuck. With strong ordering guarantees, you cannot skip the event without violating "no data loss," but you also cannot make progress.

Production systems handle this with dead letter queues: after N failed attempts (typically 3 to 5), the consumer writes the problematic event to a quarantine topic, logs detailed diagnostics, and continues processing from the next event. Operators manually review quarantined events to fix bugs or data corruption. A sketch of this pattern appears after the next section.

Clock Skew and Temporal Semantics

CDC ordering is based on commit sequence (LSN), not wall-clock time, but downstream systems often use timestamps for windowing or late-event handling. If the source database's clock is 10 minutes fast, events arrive with future timestamps. If a fraud detection system uses tumbling windows based on event time, these future-timestamped events land in the wrong windows.

This isn't a CDC bug; it's a temporal semantics mismatch. Fixing it requires watermarking strategies: downstream systems track the maximum timestamp seen per partition and reject or quarantine events with timestamps more than some threshold (say, 5 minutes) ahead of the watermark. Alternatively, they switch to processing-time semantics, using arrival time instead of event time.
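The dead letter queue pattern from the poison pill section can be sketched as follows, assuming the confluent-kafka Python client; the topic names, group id, and process_event() body are hypothetical placeholders.

```python
# Minimal dead-letter-queue sketch, assuming the confluent-kafka client.
# Topic names, group id, and process_event() are hypothetical placeholders.
from confluent_kafka import Consumer, Producer

MAX_ATTEMPTS = 5
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-orders-consumer",
    "enable.auto.commit": False,       # commit only after handling the event
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["cdc.orders"])

def process_event(raw: bytes) -> None:
    """Business logic; raises on malformed (poison pill) events."""
    ...

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process_event(msg.value())
            break
        except Exception as exc:
            last_error = exc
    else:
        # All attempts failed: quarantine the event so the partition advances.
        producer.produce("cdc.orders.dlq", value=msg.value(),
                         headers=[("error", str(last_error).encode())])
        producer.flush()
    consumer.commit(message=msg, asynchronous=False)
```

The key design choice is committing the offset even when the event is quarantined: ordering is preserved for every event that succeeds, and the handful that cannot be processed are parked for human review instead of blocking the partition.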
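And for the clock skew issue, a per-partition watermark check might look like the following; the 5-minute threshold mirrors the example in the text, and what to do with a rejected event (quarantine, drop, or fall back to processing time) is left to the caller.

```python
# Minimal per-partition watermarking sketch for skewed event timestamps.
# The 5-minute threshold mirrors the example above; rejection handling is
# left to the caller (quarantine, drop, or fall back to processing time).
from datetime import datetime, timedelta, timezone

MAX_AHEAD = timedelta(minutes=5)
watermarks = {}  # partition id -> maximum event time observed so far

def accept(partition: int, event_time: datetime) -> bool:
    """Reject events whose timestamps run too far ahead of the watermark,
    which usually indicates source clock skew rather than real data."""
    wm = watermarks.get(partition)
    if wm is not None and event_time > wm + MAX_AHEAD:
        return False
    watermarks[partition] = max(wm or event_time, event_time)
    return True

# An event stamped 10 minutes in the future gets flagged.
now = datetime.now(timezone.utc)
assert accept(0, now)
assert not accept(0, now + timedelta(minutes=10))
```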
💡 Key Takeaways
Log truncation creates unrecoverable data loss when CDC lag exceeds the retention window, requiring aggressive monitoring with alerts at 15-minute lag thresholds
Schema evolution without compatibility enforcement breaks consumers silently, requiring schema registries that validate backward or forward compatibility before accepting changes
Poison pill events stall partitions indefinitely under strict ordering, requiring dead letter queues that quarantine malformed events after 3 to 5 retry attempts
Clock skew between source and consumers causes timestamp-based windowing errors, requiring watermarking strategies that reject events with timestamps more than 5 minutes ahead of observed maximums
Production CDC systems at Netflix and Uber handle billions of events per day with error rates below 0.0001 percent by combining monitoring, schema governance, and graceful degradation patterns
📌 Examples
1. A CDC connector processing MySQL binlogs falls behind during a 10-hour database migration. The source database has 7-day retention. Lag reaches 6 days before operators notice. They immediately trigger a parallel snapshot job while the connector catches up, completing it in 18 hours. The resnapshot covers the gap, but downstream systems see a 20-hour consistency window where some data was stale.
2. A microservice updates its user table schema, dropping the legacy_id column. CDC events suddenly omit this field. A downstream cache service that relied on legacy_id for backwards compatibility starts writing NULL values, breaking old API clients. The team rolls back the schema change, adds legacy_id back as a computed column, and coordinates a two-phase migration over 3 weeks.
3. An e-commerce CDC pipeline processing order events encounters a poison pill: an order with a 100 MB product description that exceeds Kafka message size limits. The consumer crashes on every read attempt. After 5 failures, the event moves to a dead letter topic. Operators investigate, discover the oversized field, patch the CDC connector to truncate large text fields to 1 MB, and manually reprocess the quarantined event.