Time Series Forecasting • Feature Engineering (Lag Features, Rolling Stats, Seasonality)
Point in Time Correctness and Preventing Leakage
Point in time correctness ensures that every feature value used for a prediction at time t is computable from data available strictly before t. This constraint prevents leakage, where future information accidentally improves training accuracy but cannot exist at inference time, causing catastrophic performance drops in production. It is the single most critical correctness property for time series systems.
Leakage happens subtly with rolling windows. Suppose you're predicting end of day sales and compute a 7 day rolling mean at end of day. That mean includes today's sales, which is the label you're trying to predict. During training, the model learns to rely on this leaked signal. In production, when you predict tomorrow's sales, today's sales won't be available yet, so the feature is missing or stale. Accuracy plummets.
The fix is strict timestamp discipline. For training, backfill features with a cutoff timestamp before the label timestamp. If the label is sales on day t, compute all rolling aggregates over the window ending at start of day t, or t minus epsilon where epsilon is one time unit. This ensures the feature uses only data observable at prediction time. For online serving, the feature service must apply the same cutoff logic: when a prediction request arrives at time t, return aggregates over windows ending before t.
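The difference between the leaky and the cutoff-correct rolling mean can be sketched in pandas (the data here is illustrative; the key point is the `shift(1)` before `rolling`, which moves the window end to t minus one time unit):

```python
import pandas as pd

# Hypothetical daily sales series; the values are illustrative only.
sales = pd.Series(
    [100, 120, 90, 110, 130, 95, 105, 115],
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
    name="sales",
)

# LEAKY: the 7 day window ending at day t includes day t's own sales,
# which is the label being predicted.
leaky_mean = sales.rolling(window=7, min_periods=1).mean()

# CORRECT: shift by one day first so the window ends at day t-1.
# The aggregate now uses only data observable before the label timestamp.
safe_mean = sales.shift(1).rolling(window=7, min_periods=1).mean()

print(leaky_mean.iloc[-1])  # window includes today's sales
print(safe_mean.iloc[-1])   # window ends yesterday
```

The same `shift`-before-`rolling` pattern applies to any trailing aggregate (sums, counts, standard deviations), and the shift amount is exactly the epsilon described above.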
Implementing this at scale requires point in time joins. A naive join on entity key alone mixes past and future data. A time aware join matches each label with the feature snapshot valid at a cutoff before the label time. In SQL, this looks like joining on entity key and feature timestamp less than label timestamp minus offset. In batch pipelines, materialize feature snapshots at fixed intervals (hourly or daily), then join labels to the most recent snapshot before each label time. Systems like Airbnb Zipline and Uber Michelangelo build this logic into their feature platform, enforcing it declaratively so engineers cannot accidentally introduce leakage.
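A point in time join over materialized snapshots can be expressed with `pandas.merge_asof`, which for each label picks the most recent feature snapshot before the label time (table and column names here are illustrative, not from any specific platform):

```python
import pandas as pd

# Hypothetical materialized feature snapshots (e.g. produced every 12 hours).
snapshots = pd.DataFrame({
    "entity_id": ["store_1"] * 3,
    "snapshot_ts": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 12:00", "2024-01-02 00:00"]),
    "rolling_mean_7d": [100.0, 104.0, 108.0],
})

# Hypothetical training labels.
labels = pd.DataFrame({
    "entity_id": ["store_1"],
    "label_ts": pd.to_datetime(["2024-01-01 18:00"]),
    "sales": [130.0],
})

# Time aware join: for each label, take the latest snapshot with
# snapshot_ts strictly before label_ts (allow_exact_matches=False),
# matched per entity. A naive join on entity_id alone would also
# attach the later 2024-01-02 snapshot, leaking future data.
training = pd.merge_asof(
    labels.sort_values("label_ts"),
    snapshots.sort_values("snapshot_ts"),
    left_on="label_ts", right_on="snapshot_ts",
    by="entity_id",
    direction="backward",
    allow_exact_matches=False,
)
print(training[["label_ts", "snapshot_ts", "rolling_mean_7d"]])
```

The label at 18:00 is matched to the 12:00 snapshot, never the next day's. In SQL the equivalent is a lateral or windowed join selecting the max feature timestamp below the label timestamp.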
Validation catches leakage before production. Unit tests inject synthetic sequences where leakage would flip predictions. For example, create a series with a spike at time t. If a rolling mean feature incorrectly includes time t, the model will predict the spike accurately in training. If the feature correctly excludes time t, the model cannot predict it. Compare training accuracy on these synthetic examples to expected bounds. Additionally, run shadow scoring where you recompute online features from offline data on sampled requests and measure divergence. If online features systematically differ from offline by more than 2 percent relative error, investigate cutoff logic.
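The synthetic-spike test described above can be written as a compact assertion: on a flat series with one spike at time t, a correctly cutoff feature cannot see the spike, while a leaky one can (the series and window length are arbitrary choices for the test):

```python
import pandas as pd

# Synthetic sequence: flat at 10.0 with a single spike at the last timestamp.
series = pd.Series(
    [10.0] * 9 + [1000.0],
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)
t = series.index[-1]

# A leaky rolling mean includes day t; a safe one ends the window at t-1.
leaky = series.rolling(3, min_periods=1).mean()
safe = series.shift(1).rolling(3, min_periods=1).mean()

# The leaky feature moves with the label at the spike; the safe one is flat.
assert leaky.loc[t] > 100.0   # leaked signal from the label itself
assert safe.loc[t] == 10.0    # spike is invisible before time t
```

Running this in CI for every rolling feature definition catches cutoff regressions before they reach training.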
Late arriving data adds complexity. Events may arrive seconds or minutes after their event time due to network delays or processing lag. Streaming systems use watermarks to track progress. A watermark at time w means all events with event time before w have arrived. Features computed at watermark w are safe to use for predictions at times after w. Configure bounded lateness, for example 1 hour, to allow corrections within that window, then finalize. Track the rate of late arrivals beyond the lateness bound. If more than 1 percent arrive late, widen the bound or accept known bias and monitor impact.
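A minimal sketch of watermark tracking with bounded lateness (this is a simplified toy model, not the API of any real streaming engine; real systems like Flink or Beam manage watermarks per partition and handle out-of-order merging):

```python
from dataclasses import dataclass

@dataclass
class WatermarkTracker:
    """Toy watermark: trails the max observed event time by a lateness bound."""
    allowed_lateness: float   # seconds of lateness tolerated
    max_event_time: float = 0.0
    late_count: int = 0
    total_count: int = 0

    @property
    def watermark(self) -> float:
        return self.max_event_time - self.allowed_lateness

    def observe(self, event_time: float) -> bool:
        """Return True if the event arrived within the lateness bound."""
        self.total_count += 1
        if event_time < self.watermark:
            self.late_count += 1   # beyond the bound: log, but do not include
            return False
        self.max_event_time = max(self.max_event_time, event_time)
        return True

    def late_rate(self) -> float:
        return self.late_count / max(self.total_count, 1)

tracker = WatermarkTracker(allowed_lateness=3600)   # 1 hour bound
for ts in [1000, 2000, 5000, 900]:                  # 900 arrives far behind
    tracker.observe(ts)
print(tracker.watermark, tracker.late_rate())
```

Monitoring `late_rate()` is exactly the check described above: if it exceeds roughly 1 percent, widen the bound or accept and track the bias.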
Real systems face tension between freshness and correctness. Fresher features improve accuracy, but waiting for data increases latency. Uber's ETA model balances this by using different lateness bounds per feature: 30 seconds for current traffic conditions (high value, tolerate some late data), 5 minutes for rolling averages (lower urgency, wait for completeness). This tiered approach keeps end to end latency under 100 milliseconds at p95 while maintaining point in time correctness for 99 percent of predictions.
💡 Key Takeaways
• Point in time correctness requires every feature at time t to be computable from data strictly before t. Leakage occurs when rolling windows or aggregates accidentally include the label timestamp, inflating training accuracy but causing production failures
• A 7 day rolling mean for end of day prediction must aggregate days 0 through 6, not days 1 through 7, or it includes today's sales (the label). Use cutoff at start of day or t minus epsilon to enforce this
• Implement point in time joins that match labels to feature snapshots with timestamp strictly before label time. Airbnb Zipline and Uber Michelangelo enforce this declaratively, preventing engineers from introducing leakage through naive joins on entity key alone
• Validation with synthetic sequences catches leakage before production. Create a series with a spike at time t. If rolling features incorrectly include t, model predicts the spike accurately in training but fails in production. Compare accuracy on these tests to expected bounds
• Late arriving events require watermarks and bounded lateness. Uber allows 30 seconds for traffic features (high value) and 5 minutes for rolling averages (lower urgency), finalizing features after lateness bounds to maintain p95 latency under 100 milliseconds
• Shadow scoring compares online features to offline recomputed values on sampled requests. If divergence exceeds 2 percent relative error, investigate cutoff logic, late data handling, or incremental aggregation bugs that cause training serving skew
📌 Examples
Amazon demand forecasting backfills training data with features computed at start of day cutoff. For a label on day t, all rolling windows aggregate through day t minus 1. This prevents leakage from same day sales and maintains production accuracy within 2 percent of offline validation
Stripe fraud detection uses watermarks with 1 minute bounded lateness for transaction count features. Events arriving more than 1 minute late (less than 0.5 percent of volume) are logged but not included, preventing unbounded state growth while maintaining 99.5 percent feature completeness
Netflix recommendation training pipeline materializes hourly feature snapshots, then joins each user session label to the snapshot valid at session start time minus 5 minutes. This ensures real time features like current watch progress don't leak into training, keeping offline and online Mean Average Precision (MAP) within 1 percent