Point-in-Time Correctness and Preventing Leakage
The Point-in-Time Problem
When training on historical data, features must reflect only information available at prediction time. If you compute a "30-day rolling mean" for a prediction made January 15, 2024, you must use data from December 16, 2023 through January 14, 2024, not data that includes January 15 or later. Using future information is called data leakage; it inflates offline metrics while the model fails in production.
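As a minimal sketch of the cutoff rule above (hypothetical daily series; `rolling_mean_asof` is an illustrative helper, not a library function), the window ends strictly before the prediction timestamp:

```python
import pandas as pd

def rolling_mean_asof(series: pd.Series, cutoff: pd.Timestamp,
                      window_days: int = 30) -> float:
    """Mean over the window_days ending the day BEFORE cutoff.

    The cutoff day itself is excluded: features for a prediction made
    at `cutoff` may only see data strictly earlier than `cutoff`.
    """
    start = cutoff - pd.Timedelta(days=window_days)
    window = series[(series.index >= start) & (series.index < cutoff)]
    return float(window.mean())

# Hypothetical daily values, Dec 2023 through Jan 2024.
idx = pd.date_range("2023-12-01", "2024-01-20", freq="D")
s = pd.Series(range(len(idx)), index=idx, dtype=float)

# For a prediction on 2024-01-15, this uses 2023-12-16 .. 2024-01-14 only.
feat = rolling_mean_asof(s, pd.Timestamp("2024-01-15"))
```

The half-open interval (`>= start`, `< cutoff`) is the key design choice: an off-by-one that includes the cutoff day is itself a leakage bug.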
Warning: Leakage often produces models with 95%+ accuracy offline that perform at 50% in production. The gap is diagnostic: if production metrics are dramatically worse than offline, audit for leakage.
Common Leakage Sources
- Lag features that are unavailable at serving time: a lag-1 feature cannot feed a 7-day-ahead forecast, because at serving time the most recent observed value is 7 days old.
- Rolling statistics whose window includes the prediction day itself.
- Target encoding computed on the entire dataset, including test rows.
- Shuffled (random) train/test splits that mix time periods.
- Late-arriving data: revenue finalized days after the transaction but treated as if it were available immediately.
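Target encoding is a good concrete case. A minimal sketch (hypothetical toy data) contrasting the leaky version, fit on all rows, with the safe version, fit on the training split only:

```python
import pandas as pd

# Hypothetical dataset: timestamp, categorical feature, binary target.
df = pd.DataFrame({
    "ts":  pd.date_range("2024-01-01", periods=8, freq="D"),
    "cat": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "y":   [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
})

# Time-based split: everything before the cutoff trains, the rest tests.
cutoff = pd.Timestamp("2024-01-05")
train, test = df[df["ts"] < cutoff], df[df["ts"] >= cutoff]

# LEAKY: per-category target mean computed over ALL rows, test included.
leaky_enc = df.groupby("cat")["y"].mean()

# SAFE: encoding fit on training rows only, then applied to both splits.
safe_enc = train.groupby("cat")["y"].mean()
test_feature = test["cat"].map(safe_enc)
```

The two encodings differ precisely because the leaky one has seen the test targets; on real data that difference is what inflates offline metrics.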
Prevention Strategies
- Always use time-based splits, never random splits.
- Define a cutoff timestamp for each prediction and filter features to data strictly before it.
- Build feature pipelines that take prediction_time as an explicit parameter.
- Unit test features by verifying that no future data is accessed.
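The "explicit prediction_time parameter" point can be sketched as follows (`build_features` and its event schema are hypothetical; the pattern is that every feature flows through one hard cutoff):

```python
import pandas as pd

def build_features(events: pd.DataFrame, prediction_time: pd.Timestamp) -> dict:
    """Every feature is derived from events strictly before prediction_time."""
    visible = events[events["ts"] < prediction_time]  # single hard cutoff
    return {
        "n_events_7d": int(
            (visible["ts"] >= prediction_time - pd.Timedelta(days=7)).sum()
        ),
        "total_amount": float(visible["amount"].sum()),
    }

# Hypothetical event log; the 2024-01-15 row arrives AT prediction time.
events = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-14", "2024-01-15"]),
    "amount": [10.0, 20.0, 30.0, 40.0],
})

feats = build_features(events, pd.Timestamp("2024-01-15"))
# The 2024-01-15 event is excluded, so total_amount counts only 10 + 20 + 30.
```

Because the same function serves both training (with historical cutoffs) and production (with "now"), the cutoff logic cannot silently diverge between the two.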
Validation Pattern: For each training example predicting at time t with horizon h (target observed at t + h), verify that every feature uses only data from before t. Log serving-time features and compare them to batch-computed features for the same prediction; divergence indicates leakage or training/serving skew.
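One way to unit test the "no future data" property, sketched below: append fake events at and after the cutoff and assert the feature output is unchanged (`assert_no_future_data` and `spend_before` are illustrative names; the schema is assumed from the examples above):

```python
import pandas as pd

def assert_no_future_data(feature_fn, events: pd.DataFrame,
                          prediction_time: pd.Timestamp) -> None:
    """Leakage check: events at/after prediction_time must not change features."""
    baseline = feature_fn(events, prediction_time)
    future = pd.DataFrame({
        "ts": [prediction_time, prediction_time + pd.Timedelta(days=1)],
        "amount": [999.0, 999.0],  # sentinel values a leak would pick up
    })
    perturbed = feature_fn(pd.concat([events, future], ignore_index=True),
                           prediction_time)
    assert baseline == perturbed, (
        f"feature_fn leaked future data: {baseline} != {perturbed}"
    )

# A correctly cutoff feature passes the check.
def spend_before(events, prediction_time):
    return float(events.loc[events["ts"] < prediction_time, "amount"].sum())

events = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01", "2024-01-10"]),
    "amount": [10.0, 20.0],
})
assert_no_future_data(spend_before, events, pd.Timestamp("2024-01-15"))
```

A feature function that ignores the cutoff would see the sentinel values and fail the assertion, which makes the check a cheap regression test for every new feature.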
Feature Store Considerations
Feature stores must support point-in-time queries: "give me the feature values as they would have been at time T." Without this, training features differ from serving features: production values reflect current state, while training needs the historical state at each example's timestamp. Time-travel capability is essential for correct backtesting.
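The point-in-time query pattern can be sketched with pandas' as-of join (`pd.merge_asof` is a real pandas API; the feature log and example tables are hypothetical):

```python
import pandas as pd

# Hypothetical log of a feature's value as it changed over time.
feature_log = pd.DataFrame({
    "ts":    pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-20"]),
    "score": [0.2, 0.5, 0.9],
})

# Training examples, each with its own prediction timestamp.
examples = pd.DataFrame({
    "prediction_time": pd.to_datetime(["2024-01-05", "2024-01-15", "2024-01-25"]),
})

# As-of join: for each example, take the latest feature value
# written STRICTLY BEFORE its prediction time.
training = pd.merge_asof(
    examples.sort_values("prediction_time"),
    feature_log.sort_values("ts"),
    left_on="prediction_time",
    right_on="ts",
    direction="backward",
    allow_exact_matches=False,  # exclude values written exactly at prediction time
)
```

This is the query a point-in-time-correct feature store answers natively; a plain "current value" lookup would hand every training row the latest score and silently leak.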