CDC Capacity Planning and Flow Control
Sizing a CDC pipeline requires translating your change rate into concrete resource requirements across the entire flow: log generation, capture, transport, processing, and sink. Start with the fundamentals: event size multiplied by events per second determines bandwidth. If your events average 500 bytes and you expect 50,000 events per second at peak, you need 25 megabytes per second of sustained throughput. Add 2x headroom for spikes and rebalances, bringing your requirement to 50 megabytes per second.
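A quick back-of-the-envelope version of that arithmetic, as a minimal Python sketch. The event size, peak rate, and headroom factor are the example figures from this section, so substitute your own measurements:

```python
# Bandwidth sizing: event size x events per second, plus headroom.
# These constants are the illustrative numbers from the text, not
# measurements from a real system.
AVG_EVENT_BYTES = 500
PEAK_EVENTS_PER_SEC = 50_000
HEADROOM_FACTOR = 2  # spikes, rebalances, backfills

sustained_mb_per_sec = AVG_EVENT_BYTES * PEAK_EVENTS_PER_SEC / 1_000_000
provisioned_mb_per_sec = sustained_mb_per_sec * HEADROOM_FACTOR

print(f"sustained:   {sustained_mb_per_sec:.0f} MB/s")    # 25 MB/s
print(f"provisioned: {provisioned_mb_per_sec:.0f} MB/s")  # 50 MB/s
```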
Broker limits often become the bottleneck. Kinesis shards provide approximately 1 megabyte per second and 1,000 records per second of write capacity per shard. For 50,000 events per second, the record-rate limit dominates: you need at least 50 shards for the record count alone, even though bandwidth requires only 25 shards. Always take the maximum of the bandwidth-driven and record-driven shard counts. AWS Database Migration Service (DMS) commonly handles tens of thousands of row changes per second with sub-second replication lag, but only if you provision enough target capacity.
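The shard math generalizes to a simple rule, sketched below with the approximate per-shard limits quoted above (verify current Kinesis quotas for your region before relying on these figures):

```python
import math

# Approximate per-shard write limits from the text.
SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1_000

def required_shards(mb_per_sec: float, records_per_sec: float) -> int:
    """Take the max of the bandwidth-driven and record-driven counts."""
    by_bandwidth = math.ceil(mb_per_sec / SHARD_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    return max(by_bandwidth, by_records)

print(required_shards(25, 50_000))  # 50: the record rate dominates
```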
Log retention planning prevents catastrophic gaps. If your consumer falls behind by more than the source's log retention window, the required logs will be purged and you will suffer irreversible data loss. At 30 megabytes per second of change volume, 24 hours of retention requires approximately 2.6 terabytes of log storage. If you provision only 512 gigabytes, you have just 4.7 hours of headroom before purges begin. Monitor lag in both time (seconds since commit) and position (bytes or LSNs behind the head of the log).
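A sketch of the retention arithmetic, using the section's example numbers (30 MB/s change rate, a 24-hour target window, and 512 GB provisioned):

```python
# How much storage the target retention window needs, and how many
# hours the provisioned storage actually buys at the given change rate.
CHANGE_MB_PER_SEC = 30
RETENTION_HOURS = 24
PROVISIONED_GB = 512

required_tb = CHANGE_MB_PER_SEC * RETENTION_HOURS * 3600 / 1_000_000
headroom_hours = PROVISIONED_GB * 1_000 / CHANGE_MB_PER_SEC / 3600

print(f"storage for {RETENTION_HOURS}h window: {required_tb:.1f} TB")   # ~2.6 TB
print(f"headroom at {PROVISIONED_GB} GB: {headroom_hours:.1f} hours")   # ~4.7 h
```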
End-to-end latency budgets must account for every stage. For sub-1-second pipelines: capture should take under 100 milliseconds, the broker under 200 milliseconds, stream processing under 300 milliseconds, and the sink write under 300 milliseconds under normal load. Netflix's Kafka-based streaming achieves latencies of hundreds of milliseconds to seconds at trillions of messages per day by carefully budgeting each stage and using micro-batches only where needed to bound tail latencies.
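One way to make that budget operational is to compare measured per-stage latency against it. The sketch below uses the stage limits from this section; the p99 figures are hypothetical numbers you would pull from your own monitoring:

```python
# Per-stage latency budget for a sub-1-second CDC pipeline.
BUDGET_MS = {"capture": 100, "broker": 200, "processing": 300, "sink": 300}

# Hypothetical observed p99 latencies for illustration only.
measured_p99_ms = {"capture": 80, "broker": 150, "processing": 340, "sink": 210}

for stage, limit in BUDGET_MS.items():
    observed = measured_p99_ms[stage]
    status = "OK" if observed <= limit else "OVER BUDGET"
    print(f"{stage:<10} {observed:>4} ms / {limit} ms  {status}")

print(f"total budget: {sum(BUDGET_MS.values())} ms")  # 900 ms, under 1 second
```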
💡 Key Takeaways
•Calculate bandwidth as event size times events per second. At 500 bytes and 50,000 events per second, you need 25 megabytes per second sustained, plus 2x headroom for spikes
•Kinesis shards provide approximately 1 megabyte per second and 1,000 records per second of write capacity. Take the maximum of the bandwidth-driven (25) and record-driven (50) shard counts
•Log retention must cover worst-case consumer outages. At 30 megabytes per second, 24 hours requires approximately 2.6 terabytes. With 512 gigabytes, you have only 4.7 hours before purges begin
•End-to-end sub-1-second latency requires careful stage budgeting: capture under 100 milliseconds, broker under 200 milliseconds, processing under 300 milliseconds, sink under 300 milliseconds
•Monitor lag in both time (seconds since commit) and position (bytes or LSNs behind the head). Position lag tells you how close you are to a log purge; time lag tells you data freshness
•Consumer processing rate must exceed peak production rate. If consumers handle 2,500 events per second per partition and peak is 50,000 events per second, you need at least 20 partitions (see the sketch after this list)
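A minimal sketch of that partition-count check, assuming the per-partition consumer rate has been measured at 2,500 events per second:

```python
import math

# Partitions needed so that aggregate consumer throughput exceeds peak
# production. The consumer rate is an assumed measurement for illustration.
PER_PARTITION_CONSUMER_EVENTS_PER_SEC = 2_500
PEAK_EVENTS_PER_SEC = 50_000

min_partitions = math.ceil(PEAK_EVENTS_PER_SEC / PER_PARTITION_CONSUMER_EVENTS_PER_SEC)
print(min_partitions)  # 20, before adding headroom for spikes or rebalances
```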
📌 Examples
AWS DMS steady state: Handles tens of thousands of row changes per second with sub-second to low-second replication lag when the target can keep up and source logs are retained long enough
Netflix Kafka streaming: Trillions of messages per day with typical latencies of hundreds of milliseconds to seconds by budgeting each pipeline stage
Uber MySQL binlog CDC: Millions of messages per second across Kafka clusters with sub-second to low-second latencies under normal operation, given proper capacity provisioning