Feature Engineering & Feature Stores • Point-in-Time Correctness (Hard, ~3 min)
PIT Correctness Failure Modes and Edge Cases
Point-in-time (PIT) correctness fails in subtle ways that can go undetected for months while silently degrading model quality. Out-of-order and late-arriving data is the most common failure: streaming systems deliver most events on time (p95), but the p99 tail arrives minutes to hours late. If the system gates by processing time instead of event time, late arrivals overwrite history and contaminate training labels. A user action at 2pm that arrives at 4pm appears available at 2pm in naive systems, leaking future information.
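The leak above can be made concrete with a minimal sketch. All names and data here are hypothetical; the point is that a PIT-correct training read must gate on both the event time and the time the event actually became available, reproducing what serving could have seen:

```python
from datetime import datetime

# Hypothetical event log: (entity_id, event_time, arrival_time, value).
# The first event happened at 2pm but only *arrived* at 4pm.
events = [
    ("u1", datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 16, 5), 1.0),
    ("u1", datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 13, 1), 2.0),
]

def feature_sum_as_of(entity_id, t):
    """Sum values visible at label time t. Gating on event_time alone
    would leak the 2pm event into a 2:30pm label; also requiring
    arrival_time <= t reproduces what serving actually saw."""
    return sum(
        v for eid, ev_t, arr_t, v in events
        if eid == entity_id and ev_t <= t and arr_t <= t
    )

print(feature_sum_as_of("u1", datetime(2024, 1, 1, 14, 30)))  # 2.0: late 2pm event excluded
print(feature_sum_as_of("u1", datetime(2024, 1, 1, 17, 0)))   # 3.0: now visible
```

A naive backfill that drops the `arr_t <= t` condition would report 3.0 for the 2:30pm label, which is exactly the future-information leak described above.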
Clock skew across distributed services creates boundary leakage. Service clocks skewed by seconds to minutes cause subtle violations at window boundaries: a 7-day window computed on one service includes events that a different service excludes because their clocks disagree. This manifests as feature-value mismatches of 1 to 5 percent between training and serving at window edges. Production systems enforce UTC timestamps, validate monotonicity per entity, and quantify skew budgets with monitoring.
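The UTC-enforcement and monotonicity checks might look like the following sketch; the 30-second skew budget and function name are illustrative assumptions, not a prescribed value:

```python
from datetime import datetime, timezone, timedelta

SKEW_BUDGET = timedelta(seconds=30)  # assumed budget for illustration

def check_event(last_seen, event_time):
    """Enforce UTC and per-entity near-monotonicity within a skew budget.
    Returns False for events that regress further than the budget allows,
    so they can be quarantined and alerted on instead of silently ingested."""
    if event_time.tzinfo != timezone.utc:
        raise ValueError("feature timestamps must be UTC")
    if last_seen is not None and last_seen - event_time > SKEW_BUDGET:
        return False
    return True

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(check_event(now, now - timedelta(seconds=10)))  # True: within budget
print(check_event(now, now - timedelta(minutes=5)))   # False: beyond budget
```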
Entity key churn breaks PIT semantics when users or devices merge or split. If unhandled, features from one entity contaminate another at join time: a user who merges two accounts inherits history from both, but a model trained before the merge used separate feature vectors. Maintain an explicit entity resolution history with effective start and end times, and join on the resolved identity as of time t. Airbnb handles this for users who book from multiple devices or accounts.
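A minimal sketch of such a resolution history, with entirely hypothetical identifiers and a merge effective 2024-06-01, shows how the as-of-time join keeps pre-merge training rows on their original identities:

```python
from datetime import datetime

# Hypothetical resolution history: raw key -> [(resolved_id, start, end)],
# with half-open [start, end) effective intervals. Both devices merge on 2024-06-01.
RESOLUTION = {
    "device_A": [("user_1", datetime(2024, 1, 1), datetime(2024, 6, 1)),
                 ("user_merged", datetime(2024, 6, 1), datetime.max)],
    "device_B": [("user_2", datetime(2024, 1, 1), datetime(2024, 6, 1)),
                 ("user_merged", datetime(2024, 6, 1), datetime.max)],
}

def resolve(raw_id, t):
    """Resolved identity as of time t: training rows before the merge keep
    the separate identities their features were actually computed under."""
    for resolved, start, end in RESOLUTION[raw_id]:
        if start <= t < end:
            return resolved
    raise KeyError(f"no resolution for {raw_id} at {t}")

print(resolve("device_A", datetime(2024, 3, 1)))  # user_1 (pre-merge)
print(resolve("device_A", datetime(2024, 7, 1)))  # user_merged (post-merge)
```

Joining on `resolve(raw_id, t)` rather than on the current resolved identity is what prevents post-merge history from contaminating pre-merge training examples.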
Aggregation window boundary bugs are off-by-one errors in inclusive-versus-exclusive semantics. Using [t - 7 days, t] instead of [t - 7 days, t) changes counts by 1 at every boundary. Over millions of training examples this systematically inflates offline metrics by 0.5 to 2 percent, and degrades online serving if the serving-side boundary logic differs. Explicitly document boundary conditions and test them with synthetic data placed exactly at window edges.
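The synthetic-edge test suggested above can be as small as this sketch (names are illustrative): place one event on each boundary and assert the half-open semantics you documented.

```python
from datetime import datetime, timedelta

def count_in_window(event_times, t, days=7):
    """Half-open window [t - days, t): left edge included, t itself excluded."""
    lo = t - timedelta(days=days)
    return sum(1 for ev in event_times if lo <= ev < t)

t = datetime(2024, 1, 8)
edge_events = [t - timedelta(days=7), t]  # one synthetic event on each boundary
print(count_in_window(edge_events, t))  # 1; a closed window [t-7d, t] would count 2
```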
Partial failures and replays: streaming consumers may replay a segment, duplicating updates. Without idempotency keys and event-time conflict resolution, last-write-wins by processing time corrupts the timeline, inserting duplicate feature updates that inflate aggregates.
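One way to sketch the defense, assuming each update carries a hypothetical idempotency key and an event timestamp: fold the stream by key and resolve conflicts by event time, so a replayed segment cannot double-count.

```python
from datetime import datetime

def fold_updates(updates):
    """Collapse a (possibly replayed) update stream by idempotency key,
    resolving conflicts by *event* time rather than arrival order, so
    replayed duplicates cannot inflate the aggregate."""
    state = {}
    for idem_key, event_time, value in updates:
        prev = state.get(idem_key)
        if prev is None or event_time > prev[0]:
            state[idem_key] = (event_time, value)
    return sum(v for _, v in state.values())

stream = [
    ("evt-1", datetime(2024, 1, 1, 9), 10.0),
    ("evt-2", datetime(2024, 1, 1, 10), 5.0),
    ("evt-1", datetime(2024, 1, 1, 9), 10.0),  # replayed duplicate
]
print(fold_updates(stream))  # 15.0, not 25.0
```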
💡 Key Takeaways
•Out-of-order data with p99 lateness of minutes to hours requires watermarks (for example, accepting data up to 24 hours late) and retraction policies to prevent late arrivals from incorrectly overwriting history
•Clock skew of seconds to minutes across distributed services creates 1 to 5 percent feature mismatches at window boundaries, requiring UTC enforcement and per-entity monotonicity validation
•Entity key churn from user or device merges contaminates feature vectors unless handled with an explicit entity resolution history maintaining effective start and end times
•Aggregation boundary bugs from inclusive-versus-exclusive semantics (closed [t - 7 days, t] versus half-open [t - 7 days, t)) systematically inflate offline metrics by 0.5 to 2 percent while degrading online serving
•Partial replay failures in streaming consumers duplicate updates when idempotency keys and event-time conflict resolution are missing, inflating aggregate features by 5 to 20 percent
•Hard deletes for GDPR or data retention erase history needed for time-travel training, requiring tombstones with effective times to maintain reproducibility while respecting deletion
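The tombstone idea in the last takeaway can be sketched as follows; this is one simplistic illustration under assumed names, not a complete GDPR design. The deletion redacts personal values but keeps row existence, effective times, and a deletion marker, so count-style features and join alignment remain reproducible while the values themselves are erased:

```python
from datetime import datetime

# Hypothetical feature history rows for one entity: (event_time, value).
rows = [(datetime(2024, 1, 5), 1.0), (datetime(2024, 2, 1), 3.0)]

def apply_tombstone(rows, deleted_at):
    """Redact values but keep effective times and a deletion marker, so a
    backfill can still explain why a historical aggregate changed."""
    return [{"event_time": t, "value": None, "deleted_at": deleted_at}
            for t, _ in rows]

redacted = apply_tombstone(rows, datetime(2024, 3, 1))
print(len(redacted), redacted[0]["value"])  # 2 None: rows survive, values are gone
```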
📌 Examples
Ad click prediction system: p95 of clicks arrive within 1 minute, p99 within 1 hour. Without watermarks, late clicks leak into past training examples, causing a 10 percent precision drop until the issue is detected and fixed
Multi-region service with 30-second clock skew: a 7-day rolling window computed at the boundary includes 30 seconds more data in one region, causing 2 percent feature-value divergence detected in an A/B test
User account merge at Airbnb: a user books from the mobile app (entity A) and the web (entity B), then merges accounts. The feature join must use the entity resolution effective date to avoid cross-contamination in historical training data