
SCD Failure Modes: Overlapping Periods and Late Facts

The Overlap Catastrophe: For any business key in a Type 2 dimension, you must never have two rows marked as current simultaneously, and effective periods must never overlap. Suppose customer 12345 has two rows, both with current_flag = true. When a fact joins to pick up current customer attributes, which row wins? A plain join matches both and silently duplicates the fact row, while a lookup that expects exactly one match returns one non-deterministically; either way you get random inconsistencies in your analytics. At Amazon scale, even a 0.01% error rate means hundreds of thousands of corrupted dimension records per day. In a distributed SCD pipeline processing 200,000 changes per minute, race conditions are the primary cause of overlaps: two workers simultaneously process updates for the same business key, both read the current row, and both try to close it and insert a new version. You end up with the old row closed twice and two new current rows.
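To make the fan-out concrete, here is a minimal sketch; orders and customer_dim are illustrative names, not from any particular system:

SELECT o.order_id, o.amount, c.tier
FROM   orders o
JOIN   customer_dim c
  ON   c.business_key = o.customer_id
 AND   c.current_flag = true;
-- With two current rows for customer 12345, each of that customer's
-- orders appears twice in the result, double-counting revenue.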
❗ Remember: A traditional uniqueness constraint cannot help you here. current_flag = true is shared by every business key's current row, so the column must be unique only within each business key, not across the table. You need a composite constraint scoped to current rows, or application-level locking.
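Where the database actually enforces constraints, a partial (filtered) unique index expresses exactly this per-business-key rule. A minimal sketch in PostgreSQL syntax; SQL Server filtered indexes are analogous, and note that many distributed warehouses accept but do not enforce uniqueness, which is why the application-level patterns below still matter:

-- At most one current row per business key.
CREATE UNIQUE INDEX uq_customer_dim_current
    ON customer_dim (business_key)
    WHERE current_flag = true;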
Prevention requires atomicity per business key. Common patterns include: partitioning so that all versions of a business key are handled by a single worker; optimistic locking with version numbers, where the closing UPDATE includes WHERE version = expected_version (a transaction sketch follows below); or serializing updates per business key through a keyed state store in streaming systems. Detection is equally critical. A daily validation query should check for duplicate current rows:

SELECT business_key, COUNT(*)
FROM   dimension
WHERE  current_flag = true
GROUP  BY business_key
HAVING COUNT(*) > 1;

and a self-join should check for overlapping intervals:

SELECT d1.business_key
FROM   dimension d1
JOIN   dimension d2
  ON   d1.business_key    = d2.business_key
 AND   d1.surrogate_key  != d2.surrogate_key
 AND   d1.effective_start < d2.effective_end
 AND   d2.effective_start < d1.effective_end;

Late Arriving Facts: A more subtle failure happens when facts arrive late, after you have already processed dimension changes. Suppose an order event occurs on Monday while the customer is in the Gold tier. On Tuesday the customer upgrades to Platinum, creating a new dimension version. On Wednesday the Monday order finally arrives in a delayed batch. If your fact enrichment always joins to the current dimension row (WHERE current_flag = true), that late Monday order is incorrectly attributed to Platinum tier. Your churn analysis will show this customer generating revenue while in Platinum, when they were actually in Gold. The correct approach requires facts to carry event_time and dimension tables to support temporal lookups: instead of WHERE current_flag = true, join WHERE fact.event_time BETWEEN dim.effective_start AND dim.effective_end (sketched below). This is significantly more expensive, requiring range scans on temporal columns, but it is the only way to maintain historical correctness. At Netflix scale, with millions of events per second, maintaining in-memory caches of dimension timelines for sub-millisecond lookups becomes necessary. The cache must hold not just current versions but recent historical versions to handle late arrivals within an acceptable window, typically 7 to 30 days.

Soft Deletes and Reactivation: Dimension entities sometimes deactivate and reactivate. A seller on a marketplace might close their account, then reopen it six months later. If you treat reactivation as just another Type 2 update, you create a new version row with an active status. But what about surrogate key stability? Should facts referencing the old surrogate key still be valid? Incorrect handling causes double counting: if a metric counts active sellers as distinct surrogate keys, the reactivated seller now has two surrogate keys in history, each marked current at different times, and a naive query over a reporting window counts that one seller twice. One solution is to treat deactivation as setting effective_end and flipping current_flag to false on the existing row, without inserting a deleted version. Reactivation then updates that same row, restoring the sentinel effective_end and current_flag = true, reusing the same surrogate key (sketched below). This requires more complex state management but avoids surrogate key proliferation.
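A minimal sketch of the optimistic-locking transaction described above, assuming a version column on the dimension; the :parameter placeholders and table name are illustrative:

BEGIN;

-- Close the version this worker read earlier. The version predicate
-- fails if a concurrent worker already closed the row.
UPDATE customer_dim
   SET effective_end = :change_time,
       current_flag  = false
 WHERE business_key  = :business_key
   AND current_flag  = true
   AND version       = :expected_version;

-- Application logic: if zero rows were updated, another worker won the
-- race; ROLLBACK and retry from a fresh read instead of inserting.

INSERT INTO customer_dim
       (business_key, tier, effective_start, effective_end, current_flag, version)
VALUES (:business_key, :new_tier, :change_time, DATE '9999-12-31', true, :expected_version + 1);

COMMIT;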
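The temporal join for late-arriving facts might look like the following sketch, assuming the open version carries the sentinel end date 9999-12-31 so current rows need no special case:

SELECT o.order_id,
       o.event_time,
       c.tier   -- Gold for the Monday order, even after Tuesday's upgrade
FROM   orders o
JOIN   customer_dim c
  ON   c.business_key  = o.customer_id
 AND   o.event_time   >= c.effective_start
 AND   o.event_time   <  c.effective_end;

One design note: the half-open comparison (>= start, < end) keeps a fact stamped exactly at a changeover from matching two versions; the inclusive BETWEEN form in the text works too, provided effective_end is stored as one tick before the next version's effective_start.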
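And a sketch of the reactivation handling, reusing the seller's existing surrogate key; seller_dim is an illustrative name, and the LIMIT-in-subquery form is PostgreSQL-flavored:

-- Deactivation: close the row but do not insert a 'deleted' version.
UPDATE seller_dim
   SET effective_end = :deactivated_at,
       current_flag  = false
 WHERE business_key  = :seller_id
   AND current_flag  = true;

-- Reactivation: reopen the most recent version, keeping its surrogate key,
-- so historical facts continue to reference a single key for this seller.
UPDATE seller_dim
   SET effective_end = DATE '9999-12-31',
       current_flag  = true
 WHERE surrogate_key = (SELECT surrogate_key
                        FROM   seller_dim
                        WHERE  business_key = :seller_id
                        ORDER  BY effective_start DESC
                        LIMIT  1);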
💡 Key Takeaways
Race conditions in distributed pipelines cause overlapping current rows; at Amazon scale, even a 0.01% error rate means hundreds of thousands of corrupted records per day
Preventing overlaps requires atomicity per business key: partitioning, optimistic locking with version numbers, or serialized keyed state in streaming systems
Late-arriving facts joined to the current dimension row are misattributed, such as a Monday order from a Gold-tier customer incorrectly showing as Platinum after a Tuesday upgrade
Temporal lookups with event_time BETWEEN effective_start AND effective_end maintain correctness but require expensive range scans and careful indexing
Soft deletes and reactivations cause surrogate key proliferation and double counting unless handled by reopening the deactivated row rather than inserting a new version
📌 Examples
Detection query for overlaps: SELECT business_key, COUNT(*) FROM dim WHERE current_flag = true GROUP BY business_key HAVING COUNT(*) > 1
Temporal join for late facts: FROM orders o JOIN customer_dim c ON o.customer_id = c.business_key AND o.order_time BETWEEN c.effective_start AND c.effective_end
Netflix scale: maintain a 30-day in-memory dimension timeline cache to handle late arrivals, serving sub-millisecond lookups for millions of events per second
Reactivation scenario: seller deactivated June 1 (effective_end = June 1, current_flag = false), reactivated December 1: reopen the same row with effective_end = 9999-12-31 and current_flag = true, avoiding a new surrogate key