Failure Modes and Edge Cases

Timestamp-based CDC fails in subtle and often silent ways. In system design interviews, demonstrating awareness of these edge cases shows a deep understanding of distributed systems constraints.

Boundary Loss from Timestamp Collisions: The most insidious failure mode happens at the watermark boundary. Suppose your database stores timestamps with microsecond precision, but your CDC system stores watermarks with millisecond precision. You process 10 orders, the last at 14:22:35.742600, and the stored watermark is rounded to 14:22:35.743. Next run, you query for timestamps > 14:22:35.743. An order written a moment later at 14:22:35.742900 falls below the watermark and is skipped forever. (If the watermark is truncated rather than rounded, the same mismatch produces duplicates instead of loss.) Exact collisions cause the same problem even without a precision mismatch: when several rows share the watermark's timestamp and only some were read before the watermark advanced, a strict > comparison excludes the rest. This is data loss at the boundary, and it is silent. Your metrics show successful runs. Your row counts look reasonable. But you have permanently lost events. The fix requires either matching precision exactly between source and watermark, or using a composite key approach: order by updated_at, primary_key and track both in your watermark. This adds complexity but removes the ambiguity at the boundary.

Clock Skew and Non-Monotonic Timestamps: If your application servers generate timestamps from their local clocks, and those clocks are not perfectly synchronized, timestamp order can disagree with real order. Server A, whose clock is 2 seconds fast, writes order 1001 and stamps it 14:22:36 (true time 14:22:34). Server B, whose clock is correct, writes order 1002 at 14:22:35. From a wall clock perspective, order 1002 happened after 1001, but the timestamps say the opposite. Your CDC system processes changes in timestamp order, so it may deliver 1002 before 1001 to downstream systems. Worse, if a poll runs between the two writes and advances the watermark to 14:22:36, order 1002 arrives with a timestamp below the watermark and is never picked up. For analytics, occasional reordering might be acceptable. For systems maintaining derived state (like account balances), it can cause inconsistency.
❗ Remember: Database generated timestamps (like PostgreSQL NOW() or MySQL CURRENT_TIMESTAMP) are much safer than application generated timestamps, but they can still exhibit clock skew across replicas in multi-region deployments.
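The composite watermark described under Boundary Loss can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not a production poller: the orders table layout, the date in the sample timestamps, the batch size, and the function names (make_demo_db, poll_naive, poll_composite, drain_naive, drain_composite) are all assumptions made for the demo. It forces a collision by giving five rows the same timestamp and reading in batches of two.

```python
import sqlite3


def make_demo_db():
    """Hypothetical orders table: rows 1-5 share one timestamp, row 6 is later."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL)"
    )
    same_ts = "2024-01-01T14:22:35.742000"  # assumed date, fixed-width ISO text
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(i, same_ts) for i in range(1, 6)]
    )
    conn.execute("INSERT INTO orders VALUES (6, '2024-01-01T14:22:35.742900')")
    return conn


def poll_naive(conn, last_ts, limit):
    """Timestamp-only watermark with a strict > comparison."""
    return conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at, id LIMIT ?",
        (last_ts, limit),
    ).fetchall()


def poll_composite(conn, last_ts, last_id, limit):
    """Composite watermark: resume strictly after the pair (updated_at, id)."""
    return conn.execute(
        "SELECT id, updated_at FROM orders "
        "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
        "ORDER BY updated_at, id LIMIT ?",
        (last_ts, last_ts, last_id, limit),
    ).fetchall()


def drain_naive(conn, limit):
    """Poll until empty, tracking only the last timestamp seen."""
    seen, last_ts = [], ""
    while True:
        rows = poll_naive(conn, last_ts, limit)
        if not rows:
            return seen
        seen.extend(row_id for row_id, _ in rows)
        last_ts = rows[-1][1]


def drain_composite(conn, limit):
    """Poll until empty, tracking both the last timestamp and the last id."""
    seen, last_ts, last_id = [], "", 0
    while True:
        rows = poll_composite(conn, last_ts, last_id, limit)
        if not rows:
            return seen
        seen.extend(row_id for row_id, _ in rows)
        last_id, last_ts = rows[-1]


if __name__ == "__main__":
    conn = make_demo_db()
    print(drain_naive(conn, 2))      # [1, 2, 6]: rows 3-5 are silently skipped
    print(drain_composite(conn, 2))  # [1, 2, 3, 4, 5, 6]
```

The expanded OR form is used instead of a row-value comparison so the query does not depend on a particular SQL dialect. Note that the composite key only resolves ties among rows that are already visible; it does not help with rows that receive an older timestamp after the watermark has moved on.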
Missing Timestamp Updates: A developer adds a new code path that updates user profile data but forgets to touch the last_modified column. This update becomes invisible to CDC. You have created a silent divergence between source and replicas. Preventing this failure mode requires organizational discipline: code reviews, database triggers, or Object Relational Mapping (ORM) level hooks that automatically maintain timestamps. Even then, every exception or batch update tool that bypasses those mechanisms becomes a potential correctness hole.

Long Running Transactions: Consider a transaction that updates 100,000 rows over 30 seconds while your CDC poller runs every 60 seconds. The rows may all carry the same timestamp (in PostgreSQL, NOW() returns the transaction start time) or values spread across the 30 seconds, but none of them is visible to the poller until commit. If a poll advances the watermark past those timestamps before the commit lands, the rows are missed. If the job instead commits in batches, or the poller pages through results with a row limit, one polling cycle might see the first 50,000 rows and the next cycle the remaining 50,000, so downstream consumers see a partial, torn view of the change. For many analytics use cases, this is tolerable. For use cases requiring transactional consistency (like double-entry bookkeeping), timestamp-based CDC provides no atomicity guarantees.

Handling Deletes at Scale: Soft deletes work until they do not. If your users table has 500 million rows and 100 million are soft deleted, a CDC scan that filters on deleted_at IS NULL still has to read and discard the deleted rows it encounters. The query plan walks the timestamp index first and applies the deleted_at IS NULL filter afterwards, so performance degrades as the share of deleted rows grows. The alternative, a separate delete events table, requires application code to write to two tables on every delete. This adds complexity and is another potential failure point.
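Of the safeguards listed under Missing Timestamp Updates, a database trigger is the easiest to show in a few lines. The sketch below uses SQLite through Python's sqlite3 module purely so that it runs anywhere; the users table, the trigger name users_touch_last_modified, and the helper functions are hypothetical, and trigger syntax differs in PostgreSQL and MySQL. The trigger fires only when a writer left last_modified unchanged, and it relies on SQLite's default of non-recursive triggers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    email         TEXT NOT NULL,
    last_modified TEXT NOT NULL
);

-- Fires only when the UPDATE did not change last_modified itself.
CREATE TRIGGER users_touch_last_modified
AFTER UPDATE ON users
FOR EACH ROW
WHEN NEW.last_modified = OLD.last_modified
BEGIN
    UPDATE users
    SET last_modified = strftime('%Y-%m-%d %H:%M:%f', 'now')
    WHERE id = NEW.id;
END;
"""


def make_users_db():
    """In-memory database with one user carrying a deliberately old timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO users VALUES (1, 'old@example.com', '2000-01-01 00:00:00.000')"
    )
    return conn


def forgetful_update(conn, user_id, email):
    """A code path that changes data but never mentions last_modified."""
    conn.execute("UPDATE users SET email = ? WHERE id = ?", (email, user_id))


def last_modified(conn, user_id):
    """Read back the timestamp that a CDC poller would compare to its watermark."""
    row = conn.execute(
        "SELECT last_modified FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = make_users_db()
    forgetful_update(conn, 1, "new@example.com")
    # The trigger has replaced the year-2000 value with the current time.
    print(last_modified(conn, 1))
```

A trigger covers application code paths, but tools that load data with triggers disabled still slip past it, which is why the reconciliation checks remain necessary.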
Failure Timeline Example: T1 Normal ops → T2 Code deploy → T3 Silent loss
Detecting These Failures: Because many of these failures are silent, production systems need active reconciliation. Common strategies:
Periodic reconciliation that compares source and destination row counts and checksums, possibly sampling a percentage of rows for deep comparison.
Monitoring for unexpected drops in CDC throughput. If you normally see 50,000 rows per run and suddenly see 500, either traffic dropped or something broke.
End-to-end data quality tests that write known test records to production, then verify they appear downstream within expected latency bounds.
These safety mechanisms add operational complexity, but without them, you have silent data drift.
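The first two detection strategies can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a production reconciler: tables are represented as in-memory dicts keyed by primary key, and the names row_digest, reconcile, and throughput_alert, the 10 percent default sample rate, and the 10 percent throughput threshold are all invented for the example. A real job would stream keys and checksums from both systems instead of loading them.

```python
import hashlib
import random
import statistics


def row_digest(row):
    """Stable checksum of one row (a dict of column name to value)."""
    canonical = repr(sorted(row.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source, dest, sample_rate=0.1, seed=0):
    """Compare row counts, then deep-compare a random sample of shared keys.

    source and dest map primary key -> row dict.
    """
    common = sorted(set(source) & set(dest))
    sample_size = min(len(common), max(1, round(len(common) * sample_rate)))
    sampled = random.Random(seed).sample(common, sample_size) if common else []
    mismatched = sorted(
        pk for pk in sampled if row_digest(source[pk]) != row_digest(dest[pk])
    )
    return {
        "source_rows": len(source),
        "dest_rows": len(dest),
        "count_mismatch": len(source) != len(dest),
        "sampled": len(sampled),
        "mismatched_keys": mismatched,
    }


def throughput_alert(recent_row_counts, current_row_count, min_ratio=0.1):
    """Flag a run whose row count falls far below the recent median."""
    baseline = statistics.median(recent_row_counts)
    return current_row_count < baseline * min_ratio


if __name__ == "__main__":
    source = {1: {"email": "a@example.com"}, 2: {"email": "new@example.com"}}
    dest = {1: {"email": "a@example.com"}, 2: {"email": "stale@example.com"}}
    # Counts match, so only the checksum comparison exposes the stale row.
    print(reconcile(source, dest, sample_rate=1.0))
    # 500 rows against a baseline near 50,000 triggers the alert.
    print(throughput_alert([50_000, 49_000, 51_000], 500))
```

Row counts alone would pass in the usage example, which is the reason the sampled checksum comparison is worth its cost: a missed update leaves the count unchanged.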
💡 Key Takeaways
Boundary loss occurs when watermark precision mismatches source timestamp precision: storing millisecond watermarks with microsecond source timestamps causes permanent event loss at exact boundary values
Clock skew between application servers generating timestamps creates non-monotonic ordering: an event stamped 14:22:36 by a fast clock sorts after an event stamped 14:22:35 by an accurate clock even though it happened first, violating ordering assumptions
Missing timestamp updates from forgotten code paths cause silent divergence: developer adds update without touching last_modified, creating invisible changes that never replicate downstream
Long running transactions updating 100,000 rows create torn reads: CDC poller may see partial transaction across multiple polling cycles, violating atomicity for downstream consumers
Soft delete tables with 100 million deleted rows among 500 million total cause performance degradation: every CDC scan must read entire index then filter, dramatically slowing queries as delete ratio grows
📌 Examples
1. A payment system lost 0.01 percent of transactions due to a precision mismatch: the watermark stored milliseconds while PostgreSQL timestamps used microseconds, silently skipping roughly 100 transactions per day at exact boundary milliseconds
2. An inventory system experienced ordering violations during deployment: old servers with clocks 3 seconds fast stamped their writes ahead of real time, so later writes from correctly synchronized servers carried earlier timestamps, causing occasional stock count mismatches in downstream analytics
3. A user profile CDC missed 5 percent of email updates for 2 weeks: a bulk email change tool updated the database directly without updating the updated_at column, discovered only after customer complaints about stale emails in the marketing system