Definition
Change Data Capture (CDC) is a pattern where you stream all inserts, updates, and deletes from a primary database into other downstream systems, capturing what changed rather than periodically querying for current state.
The Core Problem: Your ecommerce site has an orders database that serves customer queries. But you also need that same data in your search index, analytics warehouse, and recommendation cache. How do you keep them all synchronized without overloading the primary database?
The naive approach is periodic full table scans: every hour, query the entire orders table and update downstream systems. This breaks catastrophically at scale. A table with 100 million orders takes 30 minutes to scan, overloading the Online Transaction Processing (OLTP) database during peak hours. You also miss near real time requirements and cannot easily tell what changed between scans.
How CDC Solves This: Instead of querying tables, CDC reads the database's internal commit log or write ahead log (WAL). Every database transaction appends records to this log before marking the transaction as committed. CDC tails this log continuously, converts log entries into structured change events, and publishes them to a durable event stream.
Think of it like following a ledger: the database writes "order 12345 created", "order 12345 status updated to shipped", and CDC captures each change as it happens. Downstream consumers subscribe to this stream and update their systems independently.
Why This Matters: CDC decouples your source of truth from derived systems. The primary database only worries about serving user traffic with low latency. Everything else consumes changes asynchronously. At companies like Netflix and Uber, CDC pipelines process hundreds of thousands of changes per second, keeping search indexes fresh within seconds and data warehouses updated within minutes, all without impacting customer facing query performance.
✓CDC captures database changes by reading the commit log rather than querying tables, avoiding overload on the primary database
✓Log based CDC preserves the exact order and timing of changes, something periodic full table scans cannot provide
✓Changes flow through a durable event stream where multiple independent consumers can subscribe without affecting each other
✓At scale, CDC enables near real time propagation with p50 latency typically under 500ms from commit to downstream visibility