Production Reality: Running CDC at Scale
Scale Numbers from Real Systems:
At companies like Meta and LinkedIn, primary MySQL clusters handling user-facing traffic can generate 50,000 write transactions per second during peak hours. With an average of 2 to 3 row changes per transaction, that is roughly 100,000 to 150,000 individual change events per second. At this rate, the binlog stream produces 15 to 25 MB per second of raw log data.
A CDC connector processing this stream typically runs on dedicated infrastructure separate from the primary database. The connector reads the binlog, transforms entries into structured events, and publishes to a Kafka cluster. With proper partitioning (say, 100 partitions per table topic), each partition handles around 1,000 to 1,500 events per second, keeping broker latency under 50 milliseconds at the 99th percentile.
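As a sanity check, here is a back-of-envelope version of these numbers. The per-event byte size is an assumption chosen to land in the stated 15 to 25 MB per second range; everything else comes from the figures above.

```python
# Back-of-envelope sizing for a CDC pipeline. The inputs are the illustrative
# figures from the text, not measurements from any one system.

PEAK_TPS = 50_000          # write transactions/sec on the primary at peak
ROWS_PER_TXN = (2, 3)      # average row changes per transaction (low, high)
BYTES_PER_EVENT = 165      # assumed average serialized change-event size
PARTITIONS = 100           # partitions for a hot table topic

low, high = (PEAK_TPS * r for r in ROWS_PER_TXN)
print(f"change events/sec: {low:,} - {high:,}")          # 100,000 - 150,000

mb_low = low * BYTES_PER_EVENT / 1e6
mb_high = high * BYTES_PER_EVENT / 1e6
print(f"binlog stream: ~{mb_low:.0f}-{mb_high:.0f} MB/s")  # roughly 16-25 MB/s

print(f"events/sec per partition: {low // PARTITIONS:,} - {high // PARTITIONS:,}")
# 1,000 - 1,500 events/sec per partition
```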
Architecture Patterns:
The common architecture is a data platform with three layers. The primary OLTP database layer writes to its transaction log. The CDC layer consists of one or more connector processes that tail the log and publish events to a messaging layer (Kafka, Kinesis, or internal equivalents). The consumer layer includes multiple independent services: search indexing (Elasticsearch or OpenSearch), real-time analytics (Flink or Spark Streaming writing to data warehouses), cache invalidation (Redis updates within 100 to 300 ms of database changes), and materialized views maintained by individual microservices.
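To make the consumer layer concrete, here is a minimal sketch of a cache-invalidation consumer, assuming Debezium-style JSON change events on an illustrative user_profiles_cdc topic and a Redis cache keyed as user:<id>. The hosts, group name, and event field names are assumptions for illustration, not details from the source.

```python
import json

import redis
from kafka import KafkaConsumer

cache = redis.Redis(host="cache.internal", port=6379)  # illustrative host

consumer = KafkaConsumer(
    "user_profiles_cdc",
    bootstrap_servers=["kafka.internal:9092"],  # illustrative broker
    group_id="cache-invalidation",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    # Debezium-style envelope: "after" holds the new row, "before" the old one
    # (deletes have no "after"). Field names here are assumed.
    row = event.get("after") or event.get("before") or {}
    user_id = row.get("user_id")
    if user_id is None:
        continue
    # Invalidate rather than update: the next read repopulates the cache from
    # the primary, which avoids writing stale values on out-of-order events.
    cache.delete(f"user:{user_id}")
```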
Topics are typically organized one per source table. For a table like user_profiles, you might have a topic user_profiles_cdc with 50 to 200 partitions keyed by user ID. This ensures all changes for user 123 go to the same partition and are consumed in order. Downstream, each consumer group can process partitions in parallel while maintaining per-entity ordering.
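On the producing side, per-entity ordering falls out of keying each event by its entity ID, as in the sketch below. A real connector does this internally (for example, keying by primary key), so treat this as an illustration of the keying idea; the broker address and event shape are assumptions.

```python
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka.internal:9092"],  # illustrative broker
    key_serializer=lambda k: str(k).encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_change(event: dict) -> None:
    # Kafka's default partitioner hashes the key to a fixed partition, so
    # keying by user_id keeps all changes for one user in order, even with
    # 50-200 partitions on the topic.
    producer.send("user_profiles_cdc", key=event["user_id"], value=event)

publish_change({"op": "u", "user_id": 123, "after": {"display_name": "Ada"}})
producer.flush()
```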
Operational Challenges:
The biggest operational concern is lag. If consumers slow down or the CDC connector hiccups while the database continues writing at 50,000 transactions per second, the transaction log grows faster than it is consumed. In PostgreSQL with logical replication, WAL segments that a replication slot has not yet consumed cannot be recycled, inflating disk usage on the primary. If the disk fills up, the database can become unavailable, taking the application down with it.
Production systems monitor replication lag by comparing the current log position to the position the CDC connector has confirmed consuming. Alerts trigger when lag exceeds thresholds like 5 minutes or 1 GB of unconsumed log. Remediation can involve scaling out consumer capacity, throttling non-critical consumers, or temporarily prioritizing critical paths.
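For PostgreSQL logical replication, a lag check can be as simple as comparing each slot's confirmed position to the current WAL write position. This sketch assumes the connector reads from a logical replication slot; the connection string is illustrative, and the 1 GB threshold mirrors the alerting example above.

```python
import psycopg2

LAG_BYTES_THRESHOLD = 1 * 1024**3  # alert when > 1 GB of WAL is unconsumed

# Illustrative connection details.
conn = psycopg2.connect("dbname=app host=primary.internal user=monitor")
with conn.cursor() as cur:
    # pg_wal_lsn_diff() returns how far the slot's confirmed position trails
    # the current WAL write position, in bytes.
    cur.execute(
        """
        SELECT slot_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
        FROM pg_replication_slots
        WHERE slot_type = 'logical'
        """
    )
    for slot_name, lag_bytes in cur.fetchall():
        if lag_bytes is None:
            continue
        lag_bytes = int(lag_bytes)
        if lag_bytes > LAG_BYTES_THRESHOLD:
            print(f"ALERT: slot {slot_name} is {lag_bytes / 1e9:.1f} GB behind")
conn.close()
```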
Typical Production Metrics:
Events/sec: 100k-150k
P99 latency: 1-2 sec
Availability: 99.9%
⚠️ Common Pitfall: A slow consumer in one consumer group does not block other groups, but a slow CDC connector blocks everyone. This is why connector reliability and performance are critical, and many teams deploy connectors in active/standby or active/active configurations with automatic failover.
Decoupling Producers and Consumers:
One of the main benefits at scale is decoupling. Different teams can subscribe to CDC topics without coordinating with the database team or understanding database internals. A new analytics use case does not require schema changes or new read queries on the primary; teams simply subscribe to the topics they need. This also improves observability (every consumer's lag and throughput are visible in the messaging system) and enables replay: if a bug corrupts derived data, you can rewind to a historical log position and rebuild the state.
💡 Key Takeaways
✓ Production systems at companies like LinkedIn and Meta process 100,000 to 150,000 CDC events per second with end-to-end latency (database commit to consumer processing) within 1 to 2 seconds at the 99th percentile
✓ One topic per table with 50 to 200 partitions keyed by primary key maintains per-entity ordering while allowing parallel processing across consumers
✓ The biggest operational risk is lag: if the CDC connector or consumers cannot keep up, the transaction log grows unbounded, risking disk exhaustion on the primary database and application downtime
✓ Monitoring replication lag, both as a log-position (byte) offset and as time behind the primary, is critical, with alerts at thresholds like 5 minutes or 1 GB and automated remediation to scale consumers or throttle non-critical workloads
✓ CDC decouples data producers and consumers: new use cases subscribe to existing topics without touching the primary database, and teams can replay from historical log positions to rebuild derived state after bugs or schema changes
📌 Examples
1. A MySQL cluster at 50,000 writes per second generates 20 MB per second of binlog data. A CDC connector publishes to Kafka topics with 100 partitions, each handling 1,500 events per second. Downstream, a cache invalidation consumer processes updates within 200 milliseconds, a search indexer within 1 second, and an analytics pipeline within 5 seconds.
2. During a deployment bug, a consumer writes corrupted data to a data warehouse for 30 minutes. The team identifies the issue, fixes the bug, rewinds the consumer to the log position from before the incident, and replays 90 million events over 2 hours to rebuild correct state.
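A replay like the one in example 2 might look roughly like this with a Kafka consumer: map a pre-incident timestamp to concrete offsets on each partition, seek there, and reprocess. The topic, group, broker, and timestamp are illustrative.

```python
from kafka import KafkaConsumer, TopicPartition

INCIDENT_START_MS = 1_700_000_000_000  # epoch millis just before the bad deploy (illustrative)

consumer = KafkaConsumer(
    bootstrap_servers=["kafka.internal:9092"],  # illustrative broker
    group_id="warehouse-loader",
    enable_auto_commit=False,  # commit only after the fixed logic has run
)

partitions = [
    TopicPartition("user_profiles_cdc", p)
    for p in consumer.partitions_for_topic("user_profiles_cdc")
]
consumer.assign(partitions)

# Translate the pre-incident timestamp into an offset per partition, then seek
# there so consumption restarts from before the corruption.
offsets = consumer.offsets_for_times({tp: INCIDENT_START_MS for tp in partitions})
for tp, offset_and_ts in offsets.items():
    if offset_and_ts is not None:
        consumer.seek(tp, offset_and_ts.offset)

for msg in consumer:
    pass  # re-run the (fixed) load logic against each replayed event
```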