
Point in Time Joins and Slowly Changing Dimensions

The Point in Time Join Problem

Point in time joins ensure that a feature computed as of timestamp t uses only data observable at t: dimension records valid at t and events that occurred strictly before t. For a user purchase count feature evaluated on January 15th, you must join to the user profile, product catalog, and historical aggregates as they existed at that moment, not as they exist today. Joining to current state introduces label leakage that inflates training metrics.
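A minimal plain-Python sketch (with hypothetical subscription data) of the difference between joining to current state and joining as of the feature timestamp:

```python
# Hypothetical subscription-tier history for one user: (effective_date, tier),
# sorted ascending by effective date.
tier_history = [
    ("2024-01-01", "free"),
    ("2024-01-20", "premium"),  # upgrade happens AFTER the training row below
]

def tier_as_of(history, ts):
    """Return the tier in effect at timestamp ts (latest record with effective_date <= ts)."""
    tier = None
    for effective, value in history:
        if effective <= ts:
            tier = value
        else:
            break
    return tier

feature_ts = "2024-01-15"
leaky = tier_history[-1][1]                      # current state: "premium" (leaks the future upgrade)
correct = tier_as_of(tier_history, feature_ts)   # "free": the state at t
print(leaky, correct)  # premium free
```

A model trained on the leaky value learns that premium users behave a certain way before they were actually premium, which is exactly the inflation described above.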

Slowly Changing Dimensions

User attributes change over time: addresses, subscription tiers, preferences. A Slowly Changing Dimension (SCD) table tracks these changes with valid_from and valid_to timestamps. Point in time joins select the SCD record where valid_from ≤ feature timestamp < valid_to, reconstructing the entity's state at that historical moment.
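A short sketch of the SCD lookup predicate, using hypothetical address rows where the open-ended current record carries a sentinel valid_to:

```python
from datetime import date

# Hypothetical SCD type-2 rows for one user's city; the current row is
# open-ended, represented here with date.max as its valid_to.
scd_rows = [
    {"valid_from": date(2023, 1, 1), "valid_to": date(2023, 6, 1), "city": "Austin"},
    {"valid_from": date(2023, 6, 1), "valid_to": date(2024, 2, 1), "city": "Denver"},
    {"valid_from": date(2024, 2, 1), "valid_to": date.max,         "city": "Seattle"},
]

def scd_lookup(rows, ts):
    """Point-in-time select: the row where valid_from <= ts < valid_to."""
    for row in rows:
        if row["valid_from"] <= ts < row["valid_to"]:
            return row
    return None  # no record covers ts (entity did not exist yet)

print(scd_lookup(scd_rows, date(2023, 7, 15))["city"])  # Denver
```

The half-open interval (inclusive valid_from, exclusive valid_to) guarantees that adjacent records never overlap and every timestamp matches at most one row.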

Implementation in Spark

Implement as of joins using window functions or range joins. Partition facts by entity and date, partition dimensions by entity. For each fact row, find the most recent dimension row with timestamp less than or equal to fact timestamp. Optimize using bucketing on entity ID and sorting by timestamp to enable merge joins instead of broadcast or shuffle hash joins.
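The merge-join pass that bucketing and sorting enable can be sketched in plain Python (hypothetical data; a real Spark implementation would express this with window functions or range join predicates):

```python
def as_of_join(facts, dims):
    """As-of join over two lists sorted by (entity_id, ts): for each fact row,
    attach the most recent dim row with dim.ts <= fact.ts for the same entity.
    A single O(len(facts) + len(dims)) pass, like a sort-merge join."""
    out = []
    j = 0
    last = {}  # entity_id -> latest dim value seen so far
    for entity, ts, event in facts:
        # Advance the dim pointer past every row usable for this fact.
        while j < len(dims) and (dims[j][0], dims[j][1]) <= (entity, ts):
            last[dims[j][0]] = dims[j][2]
            j += 1
        out.append((entity, ts, event, last.get(entity)))
    return out

# Hypothetical rows: (entity_id, timestamp, value), both sorted by (entity, ts).
facts = [("u1", 5, "purchase"), ("u1", 9, "purchase"), ("u2", 4, "view")]
dims  = [("u1", 1, "free"), ("u1", 7, "premium"), ("u2", 2, "free")]
print(as_of_join(facts, dims))
# [('u1', 5, 'purchase', 'free'), ('u1', 9, 'purchase', 'premium'), ('u2', 4, 'view', 'free')]
```

Because both sides are pre-sorted on (entity, timestamp), neither a shuffle nor a broadcast is needed at join time; that is the payoff of bucketing on entity ID.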

Aggregate Feature Backfills

Rolling aggregates (purchases in the last 30 days) require special handling. For each training row, you cannot simply compute the current 30 day aggregate; you must compute the 30 day aggregate as of that row's timestamp. This means the aggregation window shifts for each row, dramatically increasing compute versus a single current window.
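A sketch of the per-row shifting window over the half-open interval [t − window, t), using hypothetical event timestamps in days:

```python
from bisect import bisect_left

def purchases_last_n_days(event_times, t, window=30):
    """Count events in [t - window, t): within the window and strictly before t.
    event_times must be sorted ascending."""
    lo = bisect_left(event_times, t - window)
    hi = bisect_left(event_times, t)  # bisect_left excludes events at exactly t
    return hi - lo

# Hypothetical purchase timestamps (in days) for one user.
events = [1, 5, 28, 40, 45]

# The window shifts per training row: each row scans its own 30-day range.
print(purchases_last_n_days(events, 30))  # window [0, 30): events 1, 5, 28 -> 3
print(purchases_last_n_days(events, 46))  # window [16, 46): events 28, 40, 45 -> 3
```

Using bisect_left for the upper bound excludes events at exactly t, matching the strict [t − window, t) semantics; using an inclusive bound there would be the off-by-one error that causes training-serving skew.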

Validation Checks

Assert that no joined dimension timestamp exceeds the fact timestamp. Run anti join queries in CI to catch point in time violations before training begins. A single leaked future value can corrupt thousands of training examples.
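The anti-join check can be sketched as a filter over the joined output (hypothetical row shape; in a warehouse this would be a NOT EXISTS / anti join query run in CI):

```python
def point_in_time_violations(joined_rows):
    """Return rows whose joined dimension timestamp exceeds the fact
    timestamp, i.e., rows where a future value leaked into the feature."""
    return [r for r in joined_rows if r["dim_ts"] > r["fact_ts"]]

rows = [
    {"entity": "u1", "fact_ts": 10, "dim_ts": 7},   # ok: dimension precedes fact
    {"entity": "u2", "fact_ts": 10, "dim_ts": 12},  # violation: leaked future value
]

violations = point_in_time_violations(rows)
print(violations)  # non-empty -> fail the CI job before training starts
```

An empty result is the pass condition; any surviving row pinpoints the exact entity and timestamps to debug.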

💡 Key Takeaways
Point in time joins use as of semantics where fact at time t joins dimensions valid at t (valid_from ≤ t < valid_to) to prevent label leakage from future dimension changes
Slowly changing dimensions without valid from and valid to timestamps cause silent leakage; joining to current state can shift model accuracy by 5% to 10% between training and production
Aggregates must compute over [t minus window, t) strictly excluding events at or after t; including current time events is an off by one error causing training serving skew
Meta Feature Store enforces greater than 99.9% exact match between offline backfilled and online computed values through automated parity checks on sampled entities
Late arriving data requires maximum allowed lateness policies (e.g., 7 days) and straggler sweep jobs to correct recent partitions, trading freshness for correctness
📌 Interview Tips
1. A fraud detection model joins user account status (active, suspended, closed) as of transaction time; using current status leaks future fraud labels, inflating training AUC from 0.85 to 0.92 but failing in production
2. Airbnb Zipline automatically generates as of joins with valid from and valid to predicates, reducing feature onboarding from weeks to days by preventing manual join errors
3. A 7 day purchase count feature computed offline includes events from Jan 8 to Jan 14 for a Jan 15 training row, while online serving excludes Jan 15 purchases in real time; misalignment causes 0.3% skew and ranking degradation