Change Data Capture (CDC)Timestamp-based CDCEasy⏱️ ~3 min

What is Timestamp-based CDC?

Definition
Change Data Capture (CDC) is a pattern for tracking and moving only the changes (inserts, updates, deletes) from a source database to downstream systems, rather than copying entire tables repeatedly.
Timestamp based CDC solves a fundamental data engineering problem: how do you keep a data warehouse, search index, or cache synchronized with your production database without overwhelming it? The Core Problem: Imagine you have a 2 TB e-commerce database with 500 million product and order records. Your analytics team needs fresh data every few minutes for dashboards. Copying the entire database repeatedly is impossible. Scanning 500 million rows every 5 minutes would crush your production database and take hours to complete. How Timestamp Based CDC Works: Every table includes a timestamp column like updated_at or last_modified. Your database automatically updates this timestamp whenever a row is created or modified. The CDC system maintains a "watermark" representing the last time it checked for changes.
How It Reduces Work
FULL SCAN
500M rows
CDC SCAN
50K rows
On each run (typically every 30 seconds to 5 minutes), the CDC job queries: "Give me all rows where updated_at is greater than my watermark." It might find only 50,000 changed rows instead of scanning all 500 million. After processing those changes, it updates the watermark to the newest timestamp it saw. What Makes It Different: This is a pull based, polling approach. The CDC system periodically asks "what changed?" rather than listening to a continuous stream of database events. It works with any database that supports timestamp columns, including systems that do not expose transaction logs. This simplicity makes it the most accessible form of CDC, though it comes with important limitations around latency and correctness that we will explore in later cards.
💡 Key Takeaways
CDC moves only incremental changes instead of full table copies, reducing load from scanning 500 million rows to processing perhaps 50,000 changed rows per run
Timestamp based CDC queries for rows where a timestamp column like <code>updated_at</code> exceeds a stored watermark, which advances after each successful run
This is a pull based polling pattern that runs periodically (every 30 seconds to 5 minutes), making data freshness tied to polling frequency
Works across any database supporting timestamp columns, including systems without transaction log access, making it the most portable CDC approach
Deletes require special handling through soft delete flags or separate audit tables, since hard deletes leave no timestamp trace
📌 Examples
1An e-commerce platform with 10,000 writes per second uses timestamp based CDC running every 2 minutes to sync order data to BigQuery, achieving p50 freshness of 1 minute for analytics dashboards
2A user profile database with a <code>last_modified</code> column on every table runs CDC every 30 seconds, finding typically 20,000 to 100,000 updated profiles per batch to replicate to a search index
3A soft delete implementation adds <code>deleted_at</code> timestamp column: rows with non null <code>deleted_at</code> are treated as deletes by the CDC poller and marked inactive downstream
← Back to Timestamp-based CDC Overview