CDC Consumer Patterns and Sink Optimization
The Consumer Challenge: You have a stream of 50,000 change events per second arriving with sub-second latency. You need to write them into three different systems: a search index that needs 1-second freshness, a cache that needs immediate invalidation, and a data warehouse that batches for 5-minute freshness. Each sink has different write characteristics and cost profiles.
Idempotent Upsert Pattern: Every downstream sink must handle duplicates gracefully. CDC guarantees at-least-once delivery: during failover or retries, you may see the same change event twice. Your sink applies an idempotent upsert keyed by a composite of table name, primary key, and log sequence number. Before applying a change, check whether you have already processed the same or a higher sequence number for that primary key; if so, skip it. This makes your sink safe for replays.
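A minimal sketch of the pattern in Python. The `ChangeEvent` shape, the `InMemorySink` stand-in, and its `get_applied_lsn`/`upsert` methods are all illustrative, not a real sink API; only the composite key and the sequence-number comparison come from the pattern above.

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    table: str
    primary_key: str
    lsn: int        # log sequence number from the CDC source
    payload: dict   # column values after the change

class InMemorySink:
    """Toy stand-in for a real sink, keyed by (table, primary_key)."""
    def __init__(self):
        self.rows = {}  # key -> (lsn, payload)

    def get_applied_lsn(self, key):
        entry = self.rows.get(key)
        return entry[0] if entry else None

    def upsert(self, key, payload, lsn):
        # A real sink would write the row and its LSN atomically.
        self.rows[key] = (lsn, payload)

def apply_event(sink, event: ChangeEvent) -> bool:
    """Apply a CDC event idempotently. Returns True if the sink was updated."""
    key = (event.table, event.primary_key)       # composite key
    applied_lsn = sink.get_applied_lsn(key)      # None if this row was never seen
    if applied_lsn is not None and applied_lsn >= event.lsn:
        return False                             # duplicate or stale event: skip
    sink.upsert(key, event.payload, lsn=event.lsn)
    return True
```

Because the check uses `>=`, an exact replay of an already-applied event is a no-op, which is what makes reprocessing after a failover safe.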
Micro-Batching for High-Overhead Sinks: Writing to a data warehouse one row at a time is prohibitively expensive; network round trips and transaction overhead dominate. Instead, batch events into groups of 10,000 to 100,000 rows or 1 to 5 minutes of data, whichever comes first, and write each batch as a single Parquet file to cloud storage. This reduces write operations by up to 10,000x while keeping freshness within 5 to 10 minutes.
The trade-off is memory and complexity. Your consumer buffers events in memory, tracks batch boundaries, and handles partial-batch failures. If your process crashes mid-batch, you replay from the last committed offset, potentially producing duplicate batches, so the sink must deduplicate using file checksums or batch IDs (see the sketch below).
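A sketch of the buffer under those assumptions: flush on whichever trigger fires first, and derive a deterministic batch ID from the offset range so a replayed batch produces the same ID and the warehouse can drop the duplicate. The `write_parquet` callable is a placeholder for whatever loader you actually use.

```python
import hashlib
import time

class MicroBatcher:
    """Buffer CDC events and flush on row count or age, whichever comes first."""

    def __init__(self, write_parquet, max_rows=10_000, max_age_secs=300):
        self.write_parquet = write_parquet  # callable: (batch_id, rows) -> None
        self.max_rows = max_rows
        self.max_age_secs = max_age_secs
        self.rows = []
        self.first_offset = None
        self.last_offset = None
        self.opened_at = None

    def add(self, offset: int, row: dict):
        if not self.rows:
            self.first_offset = offset
            self.opened_at = time.monotonic()
        self.rows.append(row)
        self.last_offset = offset
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.opened_at >= self.max_age_secs):
            self.flush()

    def flush(self):
        if not self.rows:
            return
        # Deterministic batch ID: replaying the same offset range yields the
        # same ID, so the sink can deduplicate instead of loading it twice.
        batch_id = hashlib.sha256(
            f"{self.first_offset}-{self.last_offset}".encode()
        ).hexdigest()[:16]
        self.write_parquet(batch_id, self.rows)
        self.rows = []
```

Note that the age trigger here only fires when a new event arrives; a production consumer would also run a timer so an idle stream still flushes its partial batch.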
Checkpointed Consumer Groups: Each consumer maintains its position per partition of the CDC topic. After successfully writing a batch to the sink, it commits its new offset; on restart, it resumes from the last committed position. This decouples consumer progress across sinks: search can be at offset 1 million while the warehouse is at offset 500,000.
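Assuming the CDC stream lands in Kafka, here is a skeleton using the confluent-kafka client with auto-commit disabled: the offset is committed only after the sink write succeeds, so a crash replays at most one uncommitted batch. The broker address, topic name, group ID, and `write_to_sink` stub are all illustrative.

```python
from confluent_kafka import Consumer

def write_to_sink(batch):
    ...  # placeholder: the actual sink write (search index, warehouse, etc.)

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "warehouse-sink",    # each sink type gets its own group
    "enable.auto.commit": False,     # commit manually, after the sink write
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["cdc.orders"])

try:
    while True:
        msgs = consumer.consume(num_messages=500, timeout=1.0)
        if not msgs:
            continue
        batch = [m.value() for m in msgs if m.error() is None]
        write_to_sink(batch)
        consumer.commit(asynchronous=False)  # checkpoint only after success
finally:
    consumer.close()
```

Running the search indexer and the warehouse loader as separate `group.id`s against the same topic is exactly what lets them progress independently.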
Consumer Lag Targets by Sink Type:
Search index: 1-2 seconds
Cache invalidation: 100-500 ms
Data warehouse: 5-10 minutes
✓ In Practice: Netflix uses separate consumer groups for each sink type. Search indexing consumes events individually for low latency. Data lake ingestion buffers 5 minutes into large Parquet files for storage efficiency. They run independently on the same CDC stream.
Backpressure and Circuit Breaking: What if your search index is overloaded and taking 5 seconds per write instead of 50ms? Without backpressure, your consumer falls further and further behind, eventually hitting memory limits or topic retention. Production consumers implement circuit breakers: if sink latency exceeds a threshold for N consecutive attempts, pause consumption and raise an alert. This prevents cascading failures; a sketch follows.
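A minimal circuit-breaker sketch. The thresholds (latency limit, consecutive-slow-write count, cooldown) are illustrative, and the class name is hypothetical; the trip-on-N-consecutive-failures logic is the pattern described above.

```python
import time

class SinkCircuitBreaker:
    """Open the circuit after N consecutive slow writes; consumer pauses while open."""

    def __init__(self, latency_limit_secs=1.0, max_slow_writes=5, cooldown_secs=30):
        self.latency_limit_secs = latency_limit_secs
        self.max_slow_writes = max_slow_writes
        self.cooldown_secs = cooldown_secs
        self.slow_count = 0
        self.opened_at = None

    def record_write(self, latency_secs: float):
        if latency_secs > self.latency_limit_secs:
            self.slow_count += 1
            if self.slow_count >= self.max_slow_writes:
                self.opened_at = time.monotonic()  # trip: pause and alert here
        else:
            self.slow_count = 0                    # a healthy write resets the streak

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown_secs:
            self.opened_at = None                  # half-open: let writes retry
            self.slow_count = 0
            return False
        return True
```

In the poll loop, skip polling (or call the client's partition `pause`/`resume`, which confluent-kafka exposes) while `is_open()` returns True, and emit an alert when the breaker trips.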
Operational Metrics: Monitor per-consumer-group lag in messages and time, write throughput to each sink, error rate per schema version, and p99 latency from event timestamp to sink visibility. At scale, you track these per partition. Uneven lag across partitions signals hot-key problems or consumer imbalance.
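As a sketch of per-partition lag measurement with confluent-kafka: compare each partition's committed offset against its log-end watermark. The `partition_lag` helper is hypothetical; measuring time lag would additionally require the event timestamp of the last processed message.

```python
from confluent_kafka import Consumer

def partition_lag(consumer: Consumer) -> dict:
    """Return {(topic, partition): messages_behind} for the group's assignment."""
    lags = {}
    committed = consumer.committed(consumer.assignment(), timeout=10.0)
    for tp in committed:
        # High watermark = offset of the next message to be produced.
        low, high = consumer.get_watermark_offsets(tp, timeout=10.0)
        position = tp.offset if tp.offset >= 0 else low  # no commit yet
        lags[(tp.topic, tp.partition)] = high - position
    return lags
```

Exporting this map per partition (rather than summed per group) is what surfaces the uneven-lag signal mentioned above.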
💡 Key Takeaways
✓ Idempotent upsert using composite keys of table, primary key, and sequence number makes sinks safe for at-least-once delivery and replays
✓ Micro-batching into 10k to 100k row groups reduces warehouse write operations by up to 10,000x while keeping freshness within 5 to 10 minutes
✓ Each consumer group checkpoints offsets per partition after successful writes, allowing independent progress across different sink types
✓ Circuit breakers pause consumption when sink latency exceeds thresholds, preventing cascading failures and unbounded memory growth
✓ Different sinks have different latency targets: search at 1 to 2 seconds, cache invalidation at 100 to 500ms, data warehouse at 5 to 10 minutes
📌 Examples
1. A data warehouse consumer batches 100k events over 5 minutes into a single Parquet file, reducing 100k write operations to one file write
2. Netflix runs separate consumer groups where search processes events individually for 1-second lag while the data lake buffers 5 minutes for storage efficiency