Change Data Capture (CDC) • CDC Data Consistency Guarantees
Snapshot Plus Stream Consistency Pattern
The Bootstrap Problem:
When you add a new consumer to an existing CDC pipeline, you face a cold start problem. Your database already contains 500 million rows of historical data, but your CDC stream only captures new changes going forward. How do you initialize the consumer with existing data while ensuring you don't miss changes that happen during the initialization?
The naive approach fails catastrophically. If you snapshot the database at time T1 but only start consuming the CDC stream at time T2, several hours later once the snapshot has finished loading, you miss every change committed between T1 and T2. Conversely, if you start consuming the stream at T1 and then load a snapshot taken at T2, every change between T1 and T2 appears in both the snapshot and the stream, so it is applied twice.
The Coordinated Cutover Solution:
Production CDC systems solve this with snapshot plus stream coordination. The key insight is binding the snapshot to a specific log position, then consuming from exactly that position forward.
The sequence works like this:
First, you initiate a consistent snapshot of the source database and record the exact LSN or binlog position at which that snapshot represents the database state. Modern databases support this: a PostgreSQL snapshot can be tied to a transaction ID, and a MySQL snapshot to a binlog coordinate.
Second, you load this snapshot data into the consumer. For a large table with 2 billion rows at 5 KB average row size, this might be 10 TB of data taking 6 to 12 hours to transfer and load.
Third, and critically, you start CDC consumption from the exact LSN recorded at snapshot time. Every change committed after that LSN flows through the stream.
Consistency Timeline: T0: start snapshot at LSN 5M → T0 to T6: load 10 TB → T6: resume stream from LSN 5M
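To make the ordering concrete, here is a minimal sketch in Python. The Source and Sink classes are in-memory stand-ins for a real database and consumer, not any particular CDC framework; the point is the sequence: record the LSN, load the snapshot, then stream from exactly that LSN.
```python
# Minimal in-memory sketch of the snapshot-plus-stream cutover. Source and
# Sink are stand-ins for a real database and consumer; the LSN here is just
# an integer sequence number.
from dataclasses import dataclass, field


@dataclass
class Change:
    lsn: int     # log position at which the change committed
    key: str
    value: str


@dataclass
class Source:
    rows: dict = field(default_factory=dict)   # current table state
    log: list = field(default_factory=list)    # append-only change log

    def snapshot_with_lsn(self):
        # A consistent snapshot plus the exact LSN it corresponds to.
        lsn = self.log[-1].lsn if self.log else 0
        return dict(self.rows), lsn

    def changes_after(self, lsn):
        # Stand-in for a CDC stream resumed from a recorded position.
        return [c for c in self.log if c.lsn > lsn]


@dataclass
class Sink:
    rows: dict = field(default_factory=dict)

    def bulk_load(self, snapshot):
        self.rows.update(snapshot)

    def apply(self, change):
        self.rows[change.key] = change.value


def bootstrap(source: Source, sink: Sink) -> None:
    # 1. Snapshot bound to a specific LSN.
    snapshot, snapshot_lsn = source.snapshot_with_lsn()
    # 2. Bulk-load historical data (hours in production, instant here).
    sink.bulk_load(snapshot)
    # 3. Resume the stream from exactly that LSN: no gap, no double count.
    for change in source.changes_after(snapshot_lsn):
        sink.apply(change)
```
Changes that commit while the bulk load is still running sit in the log above snapshot_lsn and are replayed once the load finishes, which is exactly the no-gap, no-duplicate property the pattern is after.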
Handling Very Large Tables:
For tables with billions of rows, even a coordinated snapshot can take too long. Production systems parallelize by key range: you might divide a user table into 100 ranges by user_id modulo 100, snapshot each range in parallel, and have all 100 ranges reference the same LSN.
This cuts snapshot time from 12 hours to roughly 20 minutes using 100 parallel workers, while maintaining the same consistency guarantee: every change is seen either in one of the 100 range snapshots or in the stream from the shared LSN forward.
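A sketch of that parallel range snapshot, assuming hypothetical record_current_lsn and snapshot_range helpers in place of whatever export mechanism your database actually provides (the S3 paths are likewise illustrative):
```python
# Sketch: parallel range snapshots that all reference one shared LSN.
from concurrent.futures import ThreadPoolExecutor

NUM_RANGES = 100


def record_current_lsn() -> int:
    # Hypothetical placeholder: read the current WAL/binlog position once,
    # before any range export begins.
    return 5_000_000


def snapshot_range(bucket: int, shared_lsn: int) -> str:
    # Hypothetical placeholder: export rows WHERE user_id % NUM_RANGES = bucket
    # using a repeatable-read snapshot consistent with shared_lsn.
    return f"s3://snapshots/users/lsn-{shared_lsn}/bucket-{bucket:03d}.parquet"


def parallel_snapshot() -> tuple[int, list[str]]:
    shared_lsn = record_current_lsn()          # one LSN shared by every range
    with ThreadPoolExecutor(max_workers=NUM_RANGES) as pool:
        files = list(pool.map(lambda b: snapshot_range(b, shared_lsn),
                              range(NUM_RANGES)))
    # Once all ranges are loaded, CDC resumes from shared_lsn exactly as in
    # the single-snapshot case.
    return shared_lsn, files
```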
✓ In Practice: LinkedIn's Brooklin and Netflix's DBLog both implement snapshot plus stream patterns with LSN coordination. For their largest tables (user profiles, viewing history), they use parallelized range snapshots completing in under 30 minutes while maintaining zero gap consistency.
Validation and Monitoring:
Even with correct implementation, production teams validate consistency. After snapshot plus stream completes, they run reconciliation jobs comparing row counts and checksums between source and sink. Acceptable mismatch rates are typically under 0.0001 percent, with any discrepancy triggering investigation.
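A reconciliation sketch along those lines, comparing per-bucket row counts and checksums between source and sink. The MySQL-flavored SQL, the 64-bucket split, and the run_query helper are illustrative assumptions, not any specific team's tooling:
```python
# Sketch: post-backfill reconciliation by per-bucket row counts and checksums.
RECONCILE_SQL = """
    SELECT order_id % 64 AS bucket,
           COUNT(*)      AS row_count,
           SUM(CRC32(CONCAT(order_id, status, updated_at))) AS bucket_checksum
    FROM orders
    GROUP BY bucket
"""

MAX_MISMATCH_RATE = 0.000001   # 0.0001 percent, per the text above


def reconcile(source_conn, sink_conn, run_query):
    # run_query(conn, sql) is a hypothetical helper returning rows as tuples.
    src = {bucket: (count, checksum)
           for bucket, count, checksum in run_query(source_conn, RECONCILE_SQL)}
    dst = {bucket: (count, checksum)
           for bucket, count, checksum in run_query(sink_conn, RECONCILE_SQL)}

    total_rows = sum(count for count, _ in src.values())
    # Any bucket whose (row count, checksum) pair disagrees is treated as
    # suspect; all of its source rows count against the mismatch budget.
    suspect_rows = sum(count for bucket, (count, _) in src.items()
                       if dst.get(bucket) != src[bucket])
    rate = suspect_rows / total_rows if total_rows else 0.0

    if rate > MAX_MISMATCH_RATE:
        raise RuntimeError(f"reconciliation failed: mismatch rate {rate:.6%}")
    return rate
```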
Replication lag monitoring becomes critical during backfill. If the snapshot load saturates the consumer and CDC lag grows beyond 15 to 30 minutes, you risk log retention issues: the source database may purge log segments before the consumer catches up.
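A PostgreSQL-flavored monitoring sketch using psycopg2: it reads the lag of a logical replication slot in bytes, since that is what the server exposes directly. The slot name, DSN, and 20 GB budget are illustrative; translating a byte budget into the 15 to 30 minute window above depends on your WAL generation rate.
```python
# Sketch: check how far a logical replication slot is lagging during backfill.
import psycopg2

LAG_QUERY = """
    SELECT slot_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
    FROM pg_replication_slots
    WHERE slot_name = %s
"""

MAX_LAG_BYTES = 20 * 1024**3   # illustrative budget: ~20 GB of retained WAL


def check_slot_lag(dsn: str, slot_name: str) -> int:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(LAG_QUERY, (slot_name,))
        row = cur.fetchone()
        if row is None:
            raise RuntimeError(f"slot {slot_name} not found")
        _, lag_bytes = row
        if lag_bytes > MAX_LAG_BYTES:
            # Throttle the bulk load or alert before the server recycles WAL.
            raise RuntimeError(f"slot {slot_name} lagging by {lag_bytes} bytes")
        return int(lag_bytes)
```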
💡 Key Takeaways
✓ Snapshot plus stream coordination binds the snapshot to a specific LSN, then consumes CDC from exactly that position so every change appears exactly once
✓ Naive approaches that snapshot at T1 and stream from T2 create gaps, while streaming from T1 and then snapshotting creates duplicates
✓ Parallelizing snapshots by key range reduces 12-hour backfills to 20 to 30 minutes while maintaining the same LSN-based consistency
✓ Production systems at LinkedIn and Netflix use this pattern for tables with billions of rows, validating with a sub-0.0001 percent mismatch tolerance
✓ Replication lag during backfill must stay within log retention windows, typically monitored against 15 to 30 minute thresholds
📌 Examples
1. A data warehouse team adds a new fact table consuming from a 2 billion row orders table. They initiate a PostgreSQL snapshot at transaction ID 87654321, export 10 TB to S3, load it into Snowflake over 8 hours, then start CDC consumption from transaction ID 87654321. All orders committed after that ID flow through CDC without gaps or duplicates.
2. An Elasticsearch cluster needs to index 500 million product records. The team splits by product_id into 50 ranges, snapshots all 50 in parallel referencing binlog position mysql-bin.003847:29384719, completes the parallel load in 25 minutes, then starts CDC from that exact position.