Point in Time Joins and Slowly Changing Dimensions
The Point in Time Join Problem
Point in time joins ensure that a feature computed as of timestamp t uses only dimension and aggregate values valid strictly before t. For a user purchase count feature evaluated on January 15th, you must join to the user profile, product catalog, and historical aggregates as they existed on January 14th or earlier. Joining against current state leaks future information into training features, which inflates offline metrics relative to what the model can achieve in production.
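The leakage can be shown in miniature. The sketch below uses a hypothetical purchase history (the dates and `purchase_count_as_of` helper are illustrative, not from any specific library): counting all purchases as they exist today gives a different, inflated answer than counting only those known strictly before the feature timestamp.

```python
from datetime import date

# Hypothetical purchase history for one user (illustrative data).
purchases = [date(2024, 1, 2), date(2024, 1, 10), date(2024, 1, 20)]

def purchase_count_as_of(events, t):
    """Count only events strictly before the feature timestamp t."""
    return sum(1 for e in events if e < t)

t = date(2024, 1, 15)
correct = purchase_count_as_of(purchases, t)  # only the two pre-Jan-15 purchases
leaky = len(purchases)                        # current state also counts the Jan 20 purchase
```

The one-purchase gap between `correct` and `leaky` is exactly the future information a point in time join is designed to exclude.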
Slowly Changing Dimensions
User attributes change over time: addresses, subscription tiers, preferences. A Slowly Changing Dimension (SCD) table tracks these changes with valid from and valid to timestamps. A point in time join selects the SCD record whose interval covers the feature timestamp (valid from less than or equal to t, and valid to greater than t), reconstructing the entity's state at that historical moment.
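A minimal sketch of that interval lookup, assuming an SCD table reduced to a list of (valid_from, valid_to, value) rows for one user; the tier values and the far-future sentinel date marking the open record are hypothetical:

```python
from datetime import date

# Hypothetical SCD rows for one user: (valid_from, valid_to, tier).
# A far-future valid_to marks the currently active record.
scd_rows = [
    (date(2023, 1, 1), date(2023, 6, 1), "free"),
    (date(2023, 6, 1), date(2024, 2, 1), "plus"),
    (date(2024, 2, 1), date(9999, 1, 1), "pro"),
]

def state_as_of(rows, t):
    """Select the record where valid_from <= t < valid_to."""
    for valid_from, valid_to, value in rows:
        if valid_from <= t < valid_to:
            return value
    return None  # no record covers t

state_as_of(scd_rows, date(2023, 12, 15))  # reconstructs the "plus" tier
```

The half-open interval (inclusive valid from, exclusive valid to) matters: it guarantees each timestamp matches exactly one record even when one record ends on the same day the next begins.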
Implementation in Spark
Implement as of joins using window functions or range joins. Partition facts by entity and date, and partition dimensions by entity. For each fact row, find the most recent dimension row with a timestamp less than or equal to the fact timestamp. Optimize by bucketing on entity ID and sorting by timestamp so Spark can use sort merge joins instead of broadcast or shuffle hash joins.
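The per-entity logic those Spark plans execute can be sketched single-machine with a binary search; this is an illustrative analogue, not Spark code, and the `dim` table and entity IDs are hypothetical. The dimension rows are assumed pre-sorted by timestamp, as bucketing plus sorting would guarantee in Spark:

```python
import bisect

# Hypothetical dimension history per entity, sorted by timestamp:
# entity_id -> list of (timestamp, value).
dim = {
    "u1": [(1, "bronze"), (5, "silver"), (9, "gold")],
}

def as_of_join(facts, dim):
    """For each (entity, ts) fact row, attach the most recent dimension
    value with dimension timestamp <= fact timestamp (an as-of join)."""
    out = []
    for entity, ts in facts:
        rows = dim.get(entity, [])
        keys = [r[0] for r in rows]
        i = bisect.bisect_right(keys, ts) - 1  # last index with key <= ts
        out.append((entity, ts, rows[i][1] if i >= 0 else None))
    return out

as_of_join([("u1", 4), ("u1", 5), ("u1", 100)], dim)
```

Because both sides are sorted on the same (entity, timestamp) key, the distributed version degenerates into a linear merge per bucket, which is what makes the sort merge join cheaper than a shuffle.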
Aggregate Feature Backfills
Rolling aggregates (purchases in last 30 days) require special handling. For each training row, you cannot simply compute the current 30 day aggregate; you must compute the 30 day aggregate as of that row timestamp. This means the aggregation window shifts for each row, dramatically increasing compute versus a single current window.
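A sketch of the per-row shifting window over a sorted event list; the day numbers are hypothetical, and the window here includes events at t itself (shift the bounds if the feature must exclude the row's own timestamp):

```python
import bisect

# Hypothetical purchase timestamps (day numbers) for one user, sorted.
events = [1, 3, 20, 25, 40, 41]

def rolling_count(events, t, window=30):
    """Count purchases in (t - window, t] as of row timestamp t.
    The window bounds shift for every training row, unlike a single
    'current' window computed once for all rows."""
    lo = bisect.bisect_right(events, t - window)  # first event inside the window
    hi = bisect.bisect_right(events, t)           # events up to and including t
    return hi - lo
```

Each training row pays two binary searches here; in a distributed backfill the equivalent cost is a range join or windowed aggregation per row, which is why as-of backfills are far more expensive than refreshing one current window.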
Validation Checks
Assert that no joined dimension timestamp exceeds the fact timestamp. Run anti join queries in CI to catch point in time violations before training begins. A single leaked future value can corrupt thousands of training examples.
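The anti join check reduces to filtering for rows where the joined dimension timestamp exceeds the fact timestamp; the `joined` rows below are hypothetical, with the last one deliberately leaking a future value:

```python
# Hypothetical joined training rows: (fact_ts, joined_dim_ts).
joined = [(10, 9), (15, 15), (20, 21)]  # last row joined a future dimension value

def point_in_time_violations(rows):
    """Anti-join-style check: return rows whose dimension timestamp
    exceeds the fact timestamp (future leakage)."""
    return [r for r in rows if r[1] > r[0]]

violations = point_in_time_violations(joined)
# In CI: assert not violations, f"point in time violations: {violations}"
```

Failing the build on a non-empty result keeps a single leaked record from silently corrupting the training set.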