Feature Engineering & Feature Stores • Backfilling & Historical Features
Point in Time Joins and Slowly Changing Dimensions
Point-in-time joins ensure that features computed as of timestamp t use only dimension values valid at t and aggregates computed over data strictly before t. For a user purchase count feature evaluated on January 15th, you must join to the user's profile, product catalog, and historical aggregates as they existed on January 14th or earlier. Joining to current state instead introduces label leakage, where future information contaminates training, inflating offline metrics but causing production failures.
Slowly changing dimensions (SCDs) are particularly treacherous. A product price that changed from $10 to $15 on January 10th must be joined correctly: rows before January 10th see $10; rows on or after see $15. This requires dimension tables with valid_from and valid_to timestamps, or versioned snapshots. The join predicate becomes: fact.event_time >= dim.valid_from AND fact.event_time < dim.valid_to. Airbnb's Zipline automates these as-of joins, preventing the silent leakage that occurs when analysts accidentally join to current dimension state.
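The join predicate above can be demonstrated end to end with SQLite. This is a minimal sketch, not any particular feature store's implementation; the table and column names (dim_product, fact_purchase) are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INT, price REAL,
                          valid_from TEXT, valid_to TEXT);
-- Price changed from $10 to $15 on 2024-01-10: two versioned rows.
INSERT INTO dim_product VALUES
  (1, 10.0, '2024-01-01', '2024-01-10'),
  (1, 15.0, '2024-01-10', '9999-12-31');

CREATE TABLE fact_purchase (product_id INT, event_time TEXT);
INSERT INTO fact_purchase VALUES (1, '2024-01-05'), (1, '2024-01-12');
""")

# As-of join: each fact row sees the dimension version valid at its event_time.
rows = conn.execute("""
SELECT f.event_time, d.price
FROM fact_purchase f
JOIN dim_product d
  ON f.product_id = d.product_id
 AND f.event_time >= d.valid_from
 AND f.event_time <  d.valid_to
ORDER BY f.event_time
""").fetchall()
print(rows)  # [('2024-01-05', 10.0), ('2024-01-12', 15.0)]
```

The January 5th purchase correctly sees the old $10 price even though the current price is $15; dropping the valid_from/valid_to predicates and joining to a single "current" row is exactly the silent leakage described above.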
For aggregates like a 7-day rolling purchase sum, point-in-time semantics mean computing over the half-open window [t − 7 days, t), strictly excluding events at or after t. Off-by-one errors are common: including the current day's events, or using processing time instead of event time. Even a 0.1% to 0.5% mismatch between offline and online computations can degrade ranking models in production. Meta's feature store enforces automated parity checks, comparing offline backfilled values against online computed values for sampled entities and targeting a greater-than-99.9% exact match rate.
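The half-open window semantics can be sketched in a few lines. The function name and event schema here are hypothetical, chosen only to show where the off-by-one mistake would creep in:

```python
from datetime import date, timedelta

def rolling_purchase_sum(events, t, window_days=7):
    """Sum purchase amounts over [t - window_days, t).

    events: list of (event_date, amount) pairs, keyed by event time
    (not processing time). Events at or after t are excluded.
    """
    start = t - timedelta(days=window_days)
    # start <= d < t is the half-open window; writing `d <= t` instead
    # is the off-by-one error that includes the current day's events.
    return sum(amount for d, amount in events if start <= d < t)

events = [(date(2024, 1, 8), 20.0),   # inside the window for t = Jan 15
          (date(2024, 1, 14), 5.0),   # inside
          (date(2024, 1, 15), 99.0)]  # at t: must be excluded
print(rolling_purchase_sum(events, date(2024, 1, 15)))  # 25.0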
Late-arriving data complicates point-in-time correctness. An event with event time January 5th that arrives on January 12th should update the January 5th aggregates, but if the backfill has already published that partition, you have stale data. Solutions include defining a maximum allowed lateness (e.g., 7 days), extending backfill ranges to capture the tail, and running straggler sweep jobs that correct recent partitions once the lateness window closes.
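A lateness policy like this reduces to a simple classification per arriving event. This is a sketch under assumed semantics (the 7-day bound and the function name are illustrative, not from any specific system):

```python
from datetime import date, timedelta

MAX_LATENESS = timedelta(days=7)  # assumed policy bound

def classify_event(event_time, arrival_time):
    """Decide how an event is handled under a max-lateness policy."""
    lateness = arrival_time - event_time
    if lateness <= timedelta(0):
        return "on_time"
    if lateness <= MAX_LATENESS:
        # Within the window: mark the event's partition dirty so a
        # straggler sweep job re-aggregates it after the window closes.
        return "reprocess_partition"
    return "dropped"  # beyond allowed lateness; excluded for correctness

# The event from the text: event time Jan 5, arriving Jan 12 (7 days late).
print(classify_event(date(2024, 1, 5), date(2024, 1, 12)))  # reprocess_partition
```

The trade-off in the text is visible in MAX_LATENESS: a larger bound captures more of the tail (correctness) but delays the point at which a partition can be declared final (freshness).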
💡 Key Takeaways
•Point-in-time joins use as-of semantics, where a fact at time t joins the dimension version valid at t (valid_from ≤ t < valid_to), preventing label leakage from future dimension changes
•Slowly changing dimensions without valid_from and valid_to timestamps cause silent leakage; joining to current state can shift model accuracy by 5% to 10% between training and production
•Aggregates must compute over the half-open window [t − window, t), strictly excluding events at or after t; including current-time events is an off-by-one error that causes training/serving skew
•Meta's feature store enforces a greater-than-99.9% exact match between offline backfilled and online computed values through automated parity checks on sampled entities
•Late-arriving data requires maximum allowed lateness policies (e.g., 7 days) and straggler sweep jobs that correct recent partitions, trading freshness for correctness
📌 Examples
A fraud detection model joins user account status (active, suspended, closed) as of transaction time; using current status leaks future fraud labels, inflating training AUC from 0.85 to 0.92 while failing in production
Airbnb's Zipline automatically generates as-of joins with valid_from and valid_to predicates, reducing feature onboarding from weeks to days by preventing manual join errors
A 7-day purchase count feature computed offline includes events from Jan 8 to Jan 14 for a Jan 15 training row, while online serving excludes Jan 15 purchases in real time; any misalignment between the two paths causes 0.3% skew and ranking degradation