
Event Time Semantics and Point in Time Correctness

Event time is when an event actually occurred in the real world, as opposed to processing time (when your system handled it). All freshness computation and monitoring must use event time to handle late-arriving and out-of-order data correctly. For example, a user click at 2:00:00 PM that reaches your pipeline at 2:00:45 PM because of network delays has an event time of 2:00:00 PM. If you measure by processing time, you will misstate freshness and potentially compute wrong aggregates such as "clicks in the last 5 minutes."

Point-in-time correctness prevents label leakage during training, one of the most insidious bugs in production Machine Learning (ML) systems. Features must reflect only information available at the training example's timestamp, never future data. If you are predicting whether a user will convert on January 15th at 3 PM, your features must be computed from data available before 3 PM on January 15th, typically with a small operational delay buffer. Training on features that include future information inflates offline metrics but causes the model to fail in production.

Implementing point-in-time correctness requires a time-travelable offline store or versioned snapshots. When joining labels with features, you perform an "as of" join in which each label at time T gets features computed from data up to time T minus operational delays. LinkedIn's Feathr and Uber's Michelangelo both enforce identical transformations between training and serving by defining feature logic once in feature views and materializing it to both batch and online stores, which ensures training-serving consistency.

Each feature in production should carry metadata: a last-updated-at timestamp, a source watermark (how far the upstream pipeline has processed), a version identifier, and a computation window (such as a 30-minute sliding window). The online feature assembler uses this metadata to enforce freshness SLAs per feature. If a feature's age exceeds its Time To Live (TTL), the system can degrade gracefully by falling back to a stale snapshot, using a default value, or dropping the feature and relying on model robustness.
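To make the "as of" join concrete, here is a minimal sketch using pandas.merge_asof. The table layout, column names, and five-minute operational delay buffer are illustrative assumptions, not any particular feature store's API.

```python
import pandas as pd

# Hypothetical label and feature tables; column names are illustrative.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "label_time": pd.to_datetime(
        ["2024-01-15 15:00", "2024-01-15 18:00", "2024-01-15 15:00"]
    ),
    "converted": [0, 1, 1],
})

features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(
        ["2024-01-15 14:55", "2024-01-15 17:30", "2024-01-15 14:40"]
    ),
    "clicks_last_30m": [3, 7, 1],
})

# Operational delay buffer: a label at time T may only see features
# computed at or before T minus the buffer.
buffer = pd.Timedelta(minutes=5)
labels["join_time"] = labels["label_time"] - buffer

# merge_asof requires both frames to be sorted on the join key.
labels = labels.sort_values("join_time")
features = features.sort_values("feature_time")

# "As of" join: for each label, take the most recent feature row
# whose feature_time <= join_time, per user.
training_set = pd.merge_asof(
    labels,
    features,
    left_on="join_time",
    right_on="feature_time",
    by="user_id",
    direction="backward",
)
print(training_set[["user_id", "label_time", "clicks_last_30m", "converted"]])
```

Each label row picks up the most recent feature row at or before its buffered timestamp, so no feature computed after the prediction time can leak into training.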
💡 Key Takeaways
Event time versus processing time matters for correctness. A 5 minute sliding window using processing time will miscount events that arrive late, while event time with watermarks handles delays up to a bounded lateness (typically 5 to 15 minutes).
Label leakage from incorrect time joins is common and devastating. One company reported 15% Area Under Curve (AUC) drop when deploying a model trained with future feature values that weren't actually available at serving time.
Watermarks bound how late data can arrive. Setting a 5 minute watermark means events more than 5 minutes late are dropped or sent to a dead letter queue, preventing unbounded state growth in streaming jobs.
Feature metadata enables runtime freshness enforcement. If a feature has a TTL of 60 seconds and its current age is 90 seconds, the online assembler can log a violation, substitute a default, or include an age feature for the model (a minimal sketch follows this list).
Identical transformation logic between training and serving is critical. Defining features once and materializing to both batch (for training) and online stores (for serving) prevents subtle bugs from code drift.
Time travel queries or versioned snapshots add storage cost. Maintaining 90 days of point in time queryable features can be 3x to 10x more expensive than keeping only current values, but it's essential for correct retraining.
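As a rough illustration of the runtime enforcement described above, here is a minimal sketch of an online assembler check. The FeatureRecord fields and fallback policy are hypothetical, not drawn from any specific feature store.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical metadata record carried alongside each online feature value.
@dataclass
class FeatureRecord:
    value: Optional[float]
    last_updated_at: float     # unix seconds, event time of the last refresh
    ttl_seconds: float         # freshness SLA for this feature
    default: float = 0.0       # fallback when the value is too stale

def resolve_feature(record: FeatureRecord, now: Optional[float] = None) -> dict:
    """Enforce the freshness SLA at serving time.

    Returns the value to feed the model plus an age signal, degrading
    gracefully to a default when the TTL is exceeded.
    """
    now = time.time() if now is None else now
    age = now - record.last_updated_at
    if record.value is None or age > record.ttl_seconds:
        # Violation: log it and fall back to the default value.
        print(f"freshness violation: age={age:.0f}s > ttl={record.ttl_seconds:.0f}s")
        return {"value": record.default, "age_seconds": age, "stale": True}
    return {"value": record.value, "age_seconds": age, "stale": False}

# The takeaway's scenario: TTL of 60 seconds, current age of 90 seconds.
record = FeatureRecord(value=12.0, last_updated_at=1_700_000_000, ttl_seconds=60)
print(resolve_feature(record, now=1_700_000_090))  # stale -> default value
print(resolve_feature(record, now=1_700_000_030))  # fresh -> real value
```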
📌 Examples
DoorDash maintains event time windows for features like "orders in last 30 minutes for this store." Late arriving orders due to mobile network delays are correctly included if within the 5 minute watermark, ensuring accurate busy state (see the sketch after these examples).
Uber enforces point in time correctness by snapshotting feature values hourly in offline stores. When training an Estimated Time of Arrival (ETA) model, labels from 3 PM on Jan 15 join with features from the 2:55 PM snapshot, never using data after 3 PM.
A fraud detection team discovered their model had 0.92 offline AUC but only 0.78 online. Root cause was training features included transaction outcomes that occurred hours after the prediction time, leaking future labels into training data.
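For illustration, here is a minimal in-memory sketch of an event-time window with bounded lateness, in the spirit of the late-order handling above. The class name, window sizes, and drop-instead-of-dead-letter-queue behavior are assumptions for the example, not DoorDash's actual implementation.

```python
from collections import deque
from datetime import datetime, timedelta

class EventTimeWindowCounter:
    """Count events in a sliding event-time window, accepting late arrivals
    up to a bounded lateness (the watermark)."""

    def __init__(self, window=timedelta(minutes=30), lateness=timedelta(minutes=5)):
        self.window = window
        self.lateness = lateness
        self.events = deque()          # buffered event times
        self.max_event_time = None     # drives the watermark

    def add(self, event_time: datetime) -> bool:
        # Advance the watermark from the largest event time seen so far.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.lateness
        if event_time < watermark:
            return False               # too late: drop (or route to a DLQ)
        self.events.append(event_time)
        return True

    def count(self, as_of: datetime) -> int:
        # Evict events that have fallen out of the window, then count.
        cutoff = as_of - self.window
        self.events = deque(t for t in self.events if t >= cutoff)
        return len(self.events)

counter = EventTimeWindowCounter()
now = datetime(2024, 1, 15, 15, 0)
counter.add(now)                                # on-time event
print(counter.add(now - timedelta(minutes=4)))  # late but within 5 min -> True
print(counter.add(now - timedelta(minutes=6)))  # beyond the watermark -> False
print(counter.count(as_of=now))                 # -> 2
```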