
Implementing Temporal As-Of Joins for PIT Correctness

The as-of join is the fundamental algorithm for point-in-time (PIT) correctness: it matches each training label to the exact feature values that were available at label time. For a label row with an entity ID and a label timestamp, the join selects the latest feature record whose event timestamp is less than or equal to the label timestamp. This differs from standard joins, which match on exact equality or simply use current values.

The naive approach of sorting all feature history and scanning it for each label scales poorly. Production systems partition by entity ID first, then sort by event time within each partition, enabling last-observation-carried-forward semantics. At 100-million-plus row scale, a further optimization buckets data by day or hour to bound scan sizes. Uber and Airbnb cache frequent as-of lookups keyed by entity ID and day, reducing repeated temporal scans. Even so, the compute cost is 1.5 to 4 times higher than a simple latest-value join because of the additional partitioning, sorting, and windowing.

Window semantics require explicit definition: a 7-day feature window at label time t means aggregating over the interval [t - 7 days, t) with clear inclusive/exclusive boundaries. Off-by-one errors at the boundaries (using t versus t minus 1 second) change counts and introduce subtle train-serve skew. All timestamps must use a single standard such as UTC to avoid timezone confusion. Airbnb's Zipline enforces these semantics declaratively, computing aggregates over well-defined windows during dataset generation.

Handling late-arriving data adds complexity. If a feature event with event time 2pm arrives at 3pm, it must be inserted at its event-time position, potentially rewriting history. Systems define watermarks (for example, 99.9 percent of events arrive within 2 hours) and late-data acceptance windows (typically 24 to 72 hours). Events beyond the window may be rejected for online serving but applied retroactively to offline training through backfills, trading real-time freshness for stability.
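One way to express the partition-and-sort approach is with a window function. The query below is a minimal sketch, assuming Postgres-flavored SQL and the labels(entity_id, label_time) and feature_history(entity_id, event_time, clicks) tables used in the example query in the Examples section below:

-- For each label row, rank that entity's feature records at or before label_time
-- by recency, then keep only the most recent one (last observation carried forward).
SELECT entity_id, label_time, clicks
FROM (
    SELECT l.entity_id,
           l.label_time,
           f.clicks,
           ROW_NUMBER() OVER (
               PARTITION BY l.entity_id, l.label_time   -- one partition per label row
               ORDER BY f.event_time DESC               -- newest eligible feature first
           ) AS rn
    FROM labels l
    LEFT JOIN feature_history f
           ON f.entity_id = l.entity_id
          AND f.event_time <= l.label_time              -- PIT constraint: no future features
) ranked
WHERE rn = 1;

In production this logic usually runs in a distributed engine rather than as hand-written SQL, with the feature table additionally bucketed by day or hour so each partition scans only a bounded slice of history.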
💡 Key Takeaways
Partition by entity ID, then sort by event time within partitions for last-observation-carried-forward semantics, avoiding full-table scans at 100-million-plus row scale
Pre-bucket by day or hour to bound scan sizes, and cache frequent entity-ID-and-day lookups to reduce repeated temporal scans in production pipelines
Window semantics must be explicit, with clear inclusive/exclusive boundaries such as [t - 7 days, t), to avoid off-by-one errors that inflate offline metrics but degrade online serving (see the sketch after this list)
Late-arriving data requires watermarks (for example, p99.9 within 2 hours) and acceptance windows (24 to 72 hours), with events beyond the window applied retroactively offline only
Offline PIT join throughput is typically 50 to 200 million joined rows per hour per node for wide datasets of 50 to 200 features, given proper partitioning
All timestamps stored in a single standard such as UTC, with per-entity monotonicity validation to prevent clock skew from causing leakage or missed joins at boundaries
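As a minimal sketch of the boundary and late-data rules above (same assumed tables and Postgres-flavored SQL as before; ingest_time is a hypothetical column recording when each event landed in the feature store):

-- 7-day event count with explicit boundaries: [label_time - 7 days, label_time).
-- The lower bound is inclusive, the upper bound exclusive, so events stamped
-- exactly at label_time never leak into the feature.
SELECT l.entity_id,
       l.label_time,
       COUNT(f.event_time) AS events_7d
FROM labels l
LEFT JOIN feature_history f
       ON f.entity_id  = l.entity_id
      AND f.event_time >= l.label_time - INTERVAL '7 days'
      AND f.event_time <  l.label_time
GROUP BY l.entity_id, l.label_time;

-- Online materialization with a 72-hour late-data acceptance window:
-- events arriving later than this are skipped online and only picked up
-- by offline backfills.
SELECT entity_id, event_time, clicks
FROM feature_history
WHERE ingest_time - event_time <= INTERVAL '72 hours';

The exclusive upper bound is what keeps label-time events out of the feature, matching what the online path could actually have seen at serving time.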
📌 Examples
Uber's Palette point-in-time joiner service builds training sets by partitioning 1-billion-row datasets by entity, sorting by event time, and caching frequent temporal lookups to achieve hour-scale dataset generation
Airbnb's Zipline computes windowed aggregates such as 7/30/90-day counts with the window end pinned at event time, enforcing UTC timestamps and explicit inclusive/exclusive semantics to prevent boundary bugs
Real join query: SELECT labels.*, features.clicks FROM labels LEFT JOIN LATERAL (SELECT clicks FROM feature_history WHERE feature_history.entity_id = labels.entity_id AND feature_history.event_time <= labels.label_time ORDER BY event_time DESC LIMIT 1) AS features ON TRUE