
Change Detection Mechanisms and Watermarks

The Challenge: Incremental processing lives or dies by your ability to reliably detect what has changed. If you miss data, your analytics are wrong. If you reprocess too much, you waste money and blow latency budgets. Getting this right requires understanding three primary detection mechanisms and their trade-offs.

Append-Only Event Streams: The cleanest pattern is an append-only log like Kafka or AWS Kinesis. Every event gets a monotonically increasing offset or sequence number. Your consumer tracks the last successfully processed offset, commits it durably, and resumes from there on restart. A ride-sharing platform might generate 500k events per second at peak, with each consumer group maintaining its own offset per partition. If your ETL job crashes at offset 1,203,456,789 after processing 10 million events, it restarts from exactly that offset: no data loss and no duplication, assuming your downstream writes are idempotent.
[Figure: Offset-based progress tracking — last offset 1.2B · 10M new events · 2 min processing time]
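The resume-from-offset pattern above can be sketched as follows. This is a minimal in-memory simulation, not a real Kafka client: `OffsetStore` stands in for durable offset storage, and all names here are illustrative assumptions.

```python
# Minimal sketch of offset-based progress tracking (simulated, not a real
# Kafka consumer). The consumer durably records the last processed offset
# per partition and, after a crash, skips anything at or below it.

class OffsetStore:
    """Stands in for durable offset storage (e.g. Kafka's __consumer_offsets)."""
    def __init__(self):
        self._offsets = {}

    def load(self, partition):
        # -1 means "nothing processed yet"
        return self._offsets.get(partition, -1)

    def commit(self, partition, offset):
        self._offsets[partition] = offset


def process_batch(events, store, partition):
    """Process (offset, payload) pairs, committing after each successful write."""
    last = store.load(partition)
    processed = []
    for offset, payload in events:
        if offset <= last:          # already seen: skip, making replay idempotent
            continue
        processed.append(payload)   # downstream write would happen here
        store.commit(partition, offset)
    return processed


store = OffsetStore()
batch = [(100, "a"), (101, "b"), (102, "c")]
process_batch(batch, store, partition=0)          # processes all three events
again = process_batch(batch, store, partition=0)  # simulated crash-restart replay
print(store.load(0), again)  # 102 [] — resumed past committed offset, no duplicates
```

In a real deployment the commit must be durable (and ideally atomic with the downstream write) for the no-loss, no-duplication guarantee to hold.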
Change Data Capture from Databases: For mutable data in OLTP systems, CDC monitors the database transaction log (the write-ahead log in Postgres, the binlog in MySQL) and emits insert, update, and delete events as a stream. Each change gets a log sequence number (LSN) and a commit timestamp, turning mutable tables into append-only change streams. A user updates their email? CDC emits an update event with before and after values. Your incremental pipeline consumes this event, identifies the user by primary key, and merges the change into your analytics warehouse. Companies like LinkedIn and Uber use Debezium or proprietary CDC systems to capture billions of database changes daily. The catch: CDC introduces operational complexity. You must handle schema evolution, monitor replication lag (often kept under 5 seconds for real-time use cases), and deal with large transactions that can stall the log.

Timestamp-Based Watermarks: When CDC is not available, many systems fall back on timestamp columns. You track a high watermark, the maximum updated_at value processed so far, and on the next run query for rows where updated_at is greater than that watermark. This is simple but dangerous. If you set the watermark to the current wall-clock time and assume all updates arrive within 15 minutes, any write delayed by a network partition or backlog causes permanent data loss: the watermark advances past the delayed event, and future runs never see it.
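The merge-by-primary-key step can be sketched as below. The event shape (`op`, `before`, `after`) loosely follows Debezium's change-event envelope, but the field names and the in-memory table are illustrative assumptions, not a specific connector's API.

```python
# Minimal sketch of merging CDC change events into an analytics table,
# keyed by primary key. `table` is a dict of pk -> row standing in for a
# warehouse table; event fields are illustrative, Debezium-style names.

def apply_cdc_event(table, event):
    """Apply one insert/update/delete change event to the table in place."""
    op = event["op"]
    if op in ("insert", "update"):
        row = event["after"]
        table[row["id"]] = row               # upsert by primary key
    elif op == "delete":
        table.pop(event["before"]["id"], None)
    return table


users = {}
apply_cdc_event(users, {"op": "insert", "before": None,
                        "after": {"id": 1, "email": "old@example.com"}})
apply_cdc_event(users, {"op": "update",
                        "before": {"id": 1, "email": "old@example.com"},
                        "after": {"id": 1, "email": "new@example.com"}})
print(users[1]["email"])  # new@example.com
```

Real pipelines additionally order events by LSN before merging, so that out-of-order delivery cannot apply an older version of a row over a newer one.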
❗ Remember: Timestamp-based watermarks require buffer windows. Process up to (current time minus 30 minutes) to allow for late arrivals, then validate completeness with row counts or checksums.
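The buffered-watermark rule can be sketched as a small helper. The 30-minute buffer matches the note above; the function and table names are illustrative assumptions, not any particular warehouse's API.

```python
# Minimal sketch of a timestamp watermark with a late-arrival buffer.
# The window's upper bound is held back by BUFFER so a write delayed by
# up to 30 minutes still falls inside a future window instead of being
# skipped forever once the watermark advances past it.

from datetime import datetime, timedelta

BUFFER = timedelta(minutes=30)


def incremental_window(last_watermark, now):
    """Return (low, high] bounds for this run's updated_at filter."""
    high = now - BUFFER
    return last_watermark, max(high, last_watermark)  # never move backwards


low, high = incremental_window(
    datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 13, 0))
print(low, high)  # window ends at 12:30, not 13:00
# The query this drives (illustrative):
#   SELECT * FROM orders WHERE updated_at > :low AND updated_at <= :high
```

After each run, `high` becomes the new stored watermark, and a row-count or checksum comparison against the source confirms nothing slipped past the buffer.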
💡 Key Takeaways
- Append-only streams with consumer offsets provide the strongest guarantees: exactly-once semantics when combined with idempotent writes, and zero data loss on failure
- CDC converts mutable database tables into append-only change logs, capturing inserts, updates, and deletes from transaction logs with sequence numbers and commit timestamps
- Timestamp-based watermarks are simpler but risky: writes delayed by network issues can be permanently lost if the watermark advances beyond a safe buffer window
- At scale, systems often combine mechanisms: append-only streams for events, CDC for transactional data, and timestamp columns with 30-minute lag buffers for batch sources
📌 Examples
1. A Kafka consumer processing 500k events per second maintains an offset per partition: if the job crashes at offset 1,203,456,789, it restarts from that exact position with zero data loss
2. LinkedIn uses CDC to capture 2 billion daily changes from production MySQL databases, emitting update events with before and after values, processed by incremental pipelines with sub-5-second lag
3. An e-commerce platform queries its orders table with WHERE updated_at > last_watermark, but uses (current_time minus 30 minutes) as the cutoff to handle delayed database replicas