Change Data Capture (CDC)Timestamp-based CDCMedium⏱️ ~3 min

The Mechanics: Watermarks and Polling

The entire correctness of timestamp based CDC depends on careful watermark management. Understanding this mechanism deeply is critical for interviews, because subtle bugs here cause silent data loss. How Watermark Tracking Works: For each table being replicated, the CDC system stores a watermark timestamp in a durable checkpoint store (often a small metadata database or distributed key value store). This watermark represents the last successfully processed timestamp.
1
Read watermark: Fetch the last processed timestamp for this table, say 2024-01-15 10:30:00.
2
Query changes: Execute SELECT * FROM orders WHERE updated_at > '2024-01-15 10:30:00' ORDER BY updated_at, order_id LIMIT 100000.
3
Process batches: Ship the retrieved rows to Kafka, data warehouse, or search index, handling them idempotently.
4
Advance watermark: After all batches succeed, update the watermark to the highest updated_at seen, say 2024-01-15 10:32:15.
The Boundary Problem: Here is where it gets tricky. If multiple rows have identical timestamps, and you use a simple greater than comparison, you risk skipping rows at the boundary. Consider 5 orders all updated at exactly 10:30:00.500. If your watermark stores only millisecond precision, after processing these orders you set the watermark to 10:30:00.500. On the next run, you query for updated_at > 10:30:00.500, missing any other orders at that exact millisecond.
❗ Remember: Always use a composite ordering of timestamp plus primary key, and either use greater than or equal with deduplication downstream, or re-scan a small overlap window on each run.
Real World Configuration: A production CDC system for a table with 10 million daily updates might configure:
Typical CDC Parameters
60 sec
POLL INTERVAL
50K
BATCH SIZE
30 sec
P50 LATENCY
Polling every 60 seconds with batch size of 50,000 rows keeps database load manageable while achieving median freshness around 30 seconds for downstream consumers. If a single run exceeds the batch size, the job processes multiple batches sequentially, increasing latency for that cycle but preventing memory exhaustion. Handling High Churn: For tables with 50 million updates per day, even with an index on updated_at, repeatedly scanning for changes creates measurable database CPU load. At extreme scale (billions of rows, millions of daily changes), teams often partition the CDC process by key ranges, running independent pollers with separate watermarks per partition. This distributes load and allows parallel processing.
💡 Key Takeaways
Each table maintains a durable watermark timestamp representing the last successfully processed change, stored in a checkpoint system separate from the source database
The CDC query uses WHERE <code>updated_at</code> > watermark ORDER BY <code>updated_at</code>, <code>primary_key</code> to ensure consistent ordering and prevent skipping rows with identical timestamps
Typical production configuration: 60 second polling interval, 50,000 row batch size, achieving p50 data freshness of 30 seconds and p99 around 60 seconds
Boundary loss occurs when multiple rows share the exact watermark timestamp: solution is composite ordering by timestamp and primary key, or re-scanning a small overlap window
For high churn tables with millions of daily updates, partition the CDC process by key ranges with independent watermarks to distribute database load and enable parallel processing
📌 Examples
1A CDC job queries SELECT * FROM orders WHERE <code>updated_at</code> > '2024-01-15 10:30:00' ORDER BY <code>updated_at</code>, <code>order_id</code> LIMIT 50000, processes the batch, then advances watermark to the highest timestamp seen
2With 10 million daily updates and 60 second polling, each run finds approximately 7,000 changed rows (10M per day / 1,440 minutes), well under the 50,000 batch size limit
3An overlap strategy re-scans the last 5 seconds on every run: if watermark is 10:30:00, query for > 10:29:55, rely on idempotent upserts downstream to handle duplicates
← Back to Timestamp-based CDC Overview