
How Change Detection Works in Incremental Loads

The Change Detection Challenge: Incremental loads need a reliable way to answer one question: what changed since my last run? The mechanism you choose determines correctness, performance, and operational complexity.

Three Primary Approaches:

First, timestamp-based detection uses a last_modified_at or updated_at column. Your pipeline tracks a watermark (the maximum timestamp processed), and each run queries for rows whose timestamp exceeds that watermark. Simple and low overhead. The catch: you miss deletes entirely unless the source implements soft deletes with a deleted_at flag. Clock skew can also bite you: if a source server writes a timestamp 10 seconds in the past due to clock drift and your watermark is already ahead, you permanently miss that update.

Second, sequence-based detection relies on monotonically increasing IDs, such as auto-increment primary keys or database sequence numbers. Query for IDs greater than your last processed ID. This works well for append-only tables like event logs, but for tables with updates you still need a separate mechanism to detect which existing rows were modified.

Third, Change Data Capture (CDC) reads the source database's transaction log (the binlog in MySQL, the Write-Ahead Log in Postgres). CDC captures every insert, update, and delete as it commits, without querying the source tables. This is powerful because it gives you a complete change history, including deletes, and it adds no query load to production OLTP databases.
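The timestamp-based approach above can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite database; the orders table and its columns are hypothetical names chosen for the example, not part of any real schema.

```python
import sqlite3

# Hypothetical source table: orders(id, status, updated_at).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT);
    INSERT INTO orders VALUES
        (1, 'shipped',  '2024-01-01T10:00:00'),
        (2, 'pending',  '2024-01-01T11:30:00'),
        (3, 'refunded', '2024-01-01T12:15:00');
""")

def extract_changes(conn, watermark):
    """Timestamp-based detection: fetch only rows modified after the watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the maximum timestamp processed this run.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

changes, wm = extract_changes(conn, "2024-01-01T11:00:00")
print(changes)  # only rows 2 and 3 -- row 1 predates the watermark
print(wm)       # '2024-01-01T12:15:00'
```

Note what the sketch cannot do: if row 1 were hard-deleted at the source, nothing in this query would ever report it, which is exactly the delete-blindness described above.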
CDC Advantages at Scale: zero OLTP query load; 100% of deletes captured.
Real World Example: Consider a high-traffic orders table receiving 10,000 writes per second. Timestamp-based queries like SELECT * FROM orders WHERE updated_at > :watermark add index-scan overhead to your production database. CDC taps the existing transaction log stream without adding any query load, making it the preferred choice for high-QPS (Queries Per Second) systems.
⚠️ Common Pitfall: Timestamp-based detection fails silently under clock skew. A source server writing timestamps 30 seconds in the past will have those updates permanently missed if your watermark has already advanced past them. Always validate clock synchronization or add a lookback window.
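The lookback-window mitigation mentioned above can be sketched as a small helper that rewinds the stored watermark before issuing the incremental query. The function name and the 300-second default are illustrative assumptions; the right window depends on how badly your source clocks can drift.

```python
from datetime import datetime, timedelta

def effective_watermark(stored_watermark: str, lookback_seconds: int = 300) -> str:
    """Rewind the stored watermark by a lookback window so rows written with
    slightly stale clocks are re-selected rather than silently missed.
    Downstream writes must then be idempotent (e.g. upsert/MERGE by key),
    since re-selected rows will be processed more than once."""
    wm = datetime.fromisoformat(stored_watermark)
    return (wm - timedelta(seconds=lookback_seconds)).isoformat()

print(effective_watermark("2024-01-01T12:00:00", lookback_seconds=30))
# -> 2024-01-01T11:59:30
```

The trade-off is deliberate: a lookback window converts potential data loss into duplicate processing, which idempotent writes absorb cheaply.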
State Management: Every incremental pipeline needs a state store to track progress. After processing changes and writing to the target, you atomically update the watermark (timestamp, sequence ID, or log offset). If the job fails mid-run, you restart from this checkpoint. At scale, companies store watermarks in metadata tables or distributed coordination services like ZooKeeper, ensuring exactly-once processing semantics even across retries.
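The key detail in the paragraph above is atomicity: the target write and the watermark advance must commit together. A minimal sketch with SQLite, where a pipeline_state metadata table (a hypothetical name) lives in the same database as the target so a single transaction covers both:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE pipeline_state (pipeline TEXT PRIMARY KEY, watermark TEXT);
    INSERT INTO pipeline_state VALUES ('orders_sync', '2024-01-01T11:00:00');
""")

def commit_batch(conn, rows, new_watermark):
    """Write the batch AND advance the watermark in one transaction.
    A failure mid-run rolls back both, so the next run safely restarts
    from the previous checkpoint instead of skipping or duplicating data."""
    with conn:  # sqlite3 context manager: commit on success, rollback on error
        conn.executemany(
            "INSERT OR REPLACE INTO target (id, status) VALUES (?, ?)", rows
        )
        conn.execute(
            "UPDATE pipeline_state SET watermark = ? WHERE pipeline = 'orders_sync'",
            (new_watermark,),
        )

commit_batch(conn, [(2, "pending"), (3, "refunded")], "2024-01-01T12:15:00")
print(conn.execute(
    "SELECT watermark FROM pipeline_state WHERE pipeline = 'orders_sync'"
).fetchone()[0])
# -> 2024-01-01T12:15:00
```

When state lives in a separate system (e.g. ZooKeeper or a Kafka offset store), the same guarantee requires idempotent target writes plus a checkpoint that is only advanced after the write is durable.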
💡 Key Takeaways
Timestamp-based detection queries <code>updated_at > watermark</code> but misses hard deletes and is vulnerable to clock skew between source servers
Change Data Capture reads transaction logs (binlog, WAL) to capture all inserts, updates, and deletes without adding query load to production databases
Watermark state must be updated atomically with target writes to enable exactly-once processing semantics across failures and retries
CDC is preferred for high-QPS systems (over 10,000 writes per second), where incremental queries would add unacceptable index-scan overhead
📌 Examples
1. Orders table receiving 10,000 writes/second: timestamp queries add overhead; CDC taps the transaction log with zero query impact
2. Clock skew scenario: a source server writes a timestamp 30 seconds in the past while the watermark is at current time, causing permanent data loss in the timestamp-based approach