
Production Scale SCD: Change Detection and Processing

The Ingestion Challenge: At Amazon or Walmart scale, operational databases generate 50,000 to 200,000 change events per minute across all entities. Customer profiles, addresses, product attributes, and store configurations change constantly. Your SCD pipeline must detect these changes, apply versioning logic, and land updated dimension tables, often within 5-to-15-minute Service Level Agreements (SLAs) for near-real-time analytics. Two fundamental approaches exist: snapshot-based comparison and Change Data Capture (CDC).

Snapshot-Based Detection: You extract the full current state from source systems daily or hourly, then compare it against your warehouse dimension to find differences. For a 500-million-product catalog, this means comparing every attribute of 500 million rows to detect perhaps 2 million changes. This is conceptually simple but computationally brutal: comparing 500 million rows can take 4 to 8 hours of compute time even with distributed processing. At Target scale, with millions of daily price changes, snapshot comparison becomes a bottleneck. False positives from transient fields like load timestamps can create millions of spurious version rows, exploding your Type 2 dimensions.

The algorithm is straightforward. Hash all business attributes for each business key in both the old and new snapshots, then compare the hashes. Where they differ, apply SCD logic. For Type 2, this means: look up the current version by business key, set its effective end to now and its current flag to false, and insert a new version with a new surrogate key and an effective start of now.

CDC-Based Processing: Change Data Capture reads the transaction log of operational databases and streams only the changed rows. Instead of comparing 500 million products to find 2 million changes, you process just those 2 million change events. This reduces processing time from hours to minutes and enables near-real-time SCD updates. Netflix uses CDC-powered streaming to maintain dimension tables with sub-minute latency for personalization analytics. The operational price you pay is complexity: setting up CDC connectors, handling schema evolution, managing streaming state, and dealing with out-of-order events.
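The hash-based comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a distributed implementation: the snapshots are plain dicts keyed by business key, and the names `row_hash`, `detect_changes`, and `BUSINESS_ATTRS` are invented for this example. Note that the hash deliberately covers only business attributes, so transient fields like load timestamps cannot trigger spurious versions.

```python
import hashlib

# Business attributes to version; volatile fields (load_timestamp, etc.)
# are intentionally excluded from the hash.
BUSINESS_ATTRS = ["name", "price"]

def row_hash(row: dict) -> str:
    """Hash only the business attributes of a row."""
    payload = "|".join(str(row[a]) for a in BUSINESS_ATTRS)
    return hashlib.sha256(payload.encode()).hexdigest()

def detect_changes(old: dict, new: dict) -> dict:
    """Compare snapshots keyed by business key; return new or changed rows.
    Deletes (keys missing from the new snapshot) are ignored in this sketch."""
    changes = {}
    for key, row in new.items():
        if key not in old or row_hash(old[key]) != row_hash(row):
            changes[key] = row
    return changes

old_snap = {1: {"name": "Widget", "price": 9.99},
            2: {"name": "Gadget", "price": 19.99}}
new_snap = {1: {"name": "Widget", "price": 8.99},   # price changed
            2: {"name": "Gadget", "price": 19.99},  # unchanged
            3: {"name": "Gizmo",  "price": 4.99}}   # new entity

print(sorted(detect_changes(old_snap, new_snap)))  # → [1, 3]
```

Only the rows whose hashes differ (or that are entirely new) flow on to the SCD logic; at a 0.4% change rate, that is the 2 million rows rather than all 500 million.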
✓ In Practice: Hybrid approaches are common. Use CDC for high-change dimensions like Product and Price, where millions of rows change daily. Use snapshot comparison for slow-moving dimensions like Store or Organizational Hierarchy, where only hundreds of rows change per day.
The Type 2 Algorithm in Detail: For each changed business key, the core Type 2 logic must be atomic. First, query for the existing current row WHERE business_key = X AND current_flag = true. If none exists, this is a new entity: insert a row with a new surrogate key, an effective start equal to the processing time, an effective end equal to a sentinel like December 31, 9999, and current flag true. If a current row exists, compare attributes. If they differ, this is an update: execute an UPDATE to set the existing row's effective end to now and its current flag to false, then INSERT a new row with a new surrogate key, the new attribute values, an effective start of now, an effective end of December 31, 9999, and current flag true.

In distributed systems, you must partition by business key so that all versions for an entity are processed together. Some teams use optimistic locking with version numbers to detect concurrent updates. At 200,000 events per minute, parallelizing across thousands of business keys is essential to meet latency targets.

Edge Case: Late-Arriving Changes: Suppose a CDC event for a customer address change from Monday arrives on Wednesday due to a source system delay. If you naively set the effective start to Wednesday, you've created a temporal gap: facts from Tuesday will have no matching dimension row for that business key with an effective period covering Tuesday. Robust implementations use the source system's transaction timestamp as the effective start, not the processing time. This requires carrying event time metadata through the pipeline and handling backfill scenarios where you must insert a version into the middle of an existing timeline.
💡 Key Takeaways
Snapshot comparison must process all rows (500 million products) to find the changes (2 million), taking 4 to 8 hours, while CDC processes only the changed rows in 5 to 15 minutes
At a 0.4% daily change rate on large dimensions, CDC reduces processing volume by 99.6%, making near-real-time SCD feasible for high-throughput systems
The Type 2 algorithm must be atomic per business key: close the existing current row by setting its effective end and current flag false, then insert a new row with a new surrogate key
Late-arriving changes require using the source transaction timestamp as the effective start, not the processing time, to avoid temporal gaps in the dimension timeline
At 200,000 change events per minute, you must partition by business key and parallelize across thousands of workers to meet sub-15-minute SLA targets
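The backfill scenario from the late-arriving edge case, inserting a version into the middle of an existing timeline, can be sketched as splitting the version whose period covers the event date. This is a hypothetical Python sketch with invented names (`backfill`, half-open `[effective_start, effective_end)` periods as dicts); a warehouse implementation would do the same split with UPDATE plus INSERT statements.

```python
from datetime import date

def backfill(versions: list, new_sk: int, attrs: dict,
             event_date: date) -> None:
    """Insert a late-arriving version mid-timeline: the version covering
    event_date now ends at event_date, and the new version inherits its
    old end date (and its current flag, if it was the current row)."""
    for v in versions:
        if v["effective_start"] <= event_date < v["effective_end"]:
            versions.append({"surrogate_key": new_sk, "attrs": attrs,
                             "effective_start": event_date,
                             "effective_end": v["effective_end"],
                             "current_flag": v["current_flag"]})
            v["effective_end"] = event_date  # truncate the covering version
            v["current_flag"] = False
            break
    versions.sort(key=lambda v: v["effective_start"])

# Monday's address change (event_time 2024-01-08) arrives Wednesday:
versions = [{"surrogate_key": 1, "attrs": {"city": "Seattle"},
             "effective_start": date(2024, 1, 1),
             "effective_end": date(9999, 12, 31), "current_flag": True}]
backfill(versions, 2, {"city": "Portland"}, date(2024, 1, 8))

# Tuesday (2024-01-09) facts now find a covering dimension row:
tue = [v for v in versions
       if v["effective_start"] <= date(2024, 1, 9) < v["effective_end"]]
print(tue[0]["attrs"]["city"])  # → Portland
```

Because the new version's effective_start is the event time rather than Wednesday's processing time, Tuesday's facts join to the Portland row instead of falling into a temporal gap.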
📌 Interview Tips
1. Walmart Product dimension: 200 million products, 1 million price changes per day; CDC-based SCD completes in 10 minutes vs. 6 hours with snapshot comparison
2. Type 2 update: Customer ID 12345 currently has surrogate key 98765 with city Seattle. An update arrives: set row 98765 effective_end to 2024-01-15 and current_flag to false, then insert a new row with surrogate key 98766, city Portland, effective_start 2024-01-15, current_flag true
3. Late-arrival scenario: an address change event from Monday (event_time = 2024-01-08) arrives Wednesday. Use 2024-01-08 as effective_start, not 2024-01-10, so Tuesday facts join correctly
4. False-positive explosion: including load_timestamp in snapshot comparison makes every row appear changed, creating 500 million spurious version rows