
Training-Serving Skew and Point-in-Time Feature Correctness

Training-serving skew occurs when features computed during training differ from the features computed at serving time, causing offline metrics to overestimate online performance. The most common source is temporal leakage: using information during training that will not be available at inference. For example, a 7-day click-through rate (CTR) feature for an impression at time T is computed by aggregating clicks from T minus 7 days to T plus 2 days because logs arrive late. The model learns to rely on those future clicks, inflating offline precision from 0.78 to 0.82; at serving time the future clicks do not exist yet, and precision drops to 0.74.

Point-in-time correctness requires snapshotting features exactly as they existed at the impression timestamp. If a user views an item on Monday at 10am, the training example must use the 7-day CTR computed from the prior Monday 10am up to the current Monday 10am, excluding any events after that moment. This constraint applies to every time-windowed aggregate and to target-encoded features. For target encoding, compute category conversion rates using only data strictly before the impression time, with out-of-fold splits and smoothing to avoid overfitting.

Feature computation must also handle late-arriving events consistently between training and serving. In training, decide on a cutoff, for example accept events up to 15 minutes late, then freeze the aggregate; in serving, apply the same 15-minute buffer. If training uses all eventually consistent data but serving uses only immediately available data, the distributions diverge. Streaming pipelines that maintain online aggregates should write snapshots to the offline store with the same late-event tolerance, so that training and serving compute features identically.

Another source of skew is default-value handling. If a feature is missing at serving time because of a cache miss or a new entity, the system substitutes a default, often zero or a global mean. If training data only includes cases where the feature was present, the model never learns to handle the default. Production systems inject synthetic missing values during training or use schema contracts that define fallback strategies, for example hierarchical backoff from item level to category level to a global prior.

The failure mode is subtle degradation: offline Area Under the Curve (AUC) is 0.85, but online AUC is 0.79, and A/B tests show no lift or even a regression despite the better offline metrics. Diagnosis requires logging feature values at serving time and comparing their distributions to the training distributions. Measure coverage, the fraction of candidates that have each feature populated, and compute divergence metrics such as Kolmogorov-Smirnov or KL divergence between training and serving feature histograms. Companies like Meta and Google enforce automated parity tests that flag when serving feature distributions drift beyond thresholds from training expectations. The sketches below illustrate these practices.
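To make the point-in-time constraint concrete, here is a minimal sketch of a 7-day CTR aggregation that respects both the impression timestamp and a 15-minute late-event buffer. The record layout and the `point_in_time_ctr` helper are illustrative assumptions, not the API of any particular feature store.

```python
from datetime import timedelta

def point_in_time_ctr(impressions, clicks, item_id, as_of,
                      window=timedelta(days=7),
                      late_tolerance=timedelta(minutes=15)):
    """7-day CTR for `item_id` exactly as it was knowable at `as_of`.

    Events are dicts with `item_id`, `event_time` (when the event happened)
    and `arrival_time` (when the log record landed). Only events that
    occurred inside [as_of - window, as_of] count, and events that arrived
    more than `late_tolerance` after they occurred are dropped, mirroring
    the freeze policy of the online aggregate.
    """
    window_start = as_of - window

    def counted(event):
        occurred, arrived = event["event_time"], event["arrival_time"]
        in_window = window_start <= occurred <= as_of      # no future events
        on_time = (arrived - occurred) <= late_tolerance   # same buffer as serving
        return event["item_id"] == item_id and in_window and on_time

    n_impressions = sum(1 for e in impressions if counted(e))
    n_clicks = sum(1 for e in clicks if counted(e))
    # Return None when there is no history; the caller decides the fallback.
    return n_clicks / n_impressions if n_impressions else None
```

In a backfill job this would be evaluated once per training example at the impression's own timestamp, so the same policy can back both the offline snapshot and the online aggregate.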
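For target encoding, the same "strictly before the impression" rule can be enforced with an expanding, shifted aggregate. A sketch assuming a pandas DataFrame ordered by event time; the column names (`category`, `converted`, `timestamp`) and the smoothing constant are placeholders.

```python
import pandas as pd

def time_aware_target_encode(df, cat_col="category", target_col="converted",
                             time_col="timestamp", smoothing=20.0):
    """Encode `cat_col` with the conversion rate seen strictly before each row.

    The cumulative sum has the current row's own label subtracted so it never
    leaks into its feature, and counts are smoothed toward the global prior so
    rare categories do not overfit.
    """
    df = df.sort_values(time_col).copy()
    grp = df.groupby(cat_col)[target_col]
    prior_conversions = grp.cumsum() - df[target_col]  # conversions strictly before this row
    prior_count = grp.cumcount()                       # impressions strictly before this row
    # Note: a production pipeline would compute this prior from data before
    # the training window rather than over the whole frame.
    global_prior = df[target_col].mean()
    df[f"{cat_col}_te"] = (prior_conversions + smoothing * global_prior) / (prior_count + smoothing)
    return df
```

A category's first impression receives the smoothed global prior, and every later row only ever sees conversions that happened earlier in time.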
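Default-value handling can be kept consistent by pairing a hierarchical backoff lookup with deliberate missingness injection at training time. A sketch with hypothetical item-, category-, and global-level CTR dictionaries; the miss rate would be set to the coverage gap actually measured from serving logs.

```python
import random

def backoff_ctr(item_id, category_id, item_ctr, category_ctr, global_ctr):
    """Hierarchical fallback: item level -> category level -> global prior."""
    if item_id in item_ctr:
        return item_ctr[item_id]
    if category_id in category_ctr:
        return category_ctr[category_id]
    return global_ctr

def inject_missingness(rows, feature, miss_rate, fallback, seed=0):
    """Blank out `feature` in a random fraction of training rows so the model
    sees the same fallback value it will see on serving-time cache misses."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < miss_rate:
            row[feature] = fallback
    return rows
```

The fallback written during training must be the same value the serving path substitutes; otherwise the injection just creates a second, different kind of skew.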
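Detection comes down to comparing logged serving values against the training distribution. A sketch of a per-feature parity report using NumPy and SciPy; the 0.05 alert threshold echoes the example below and would be tuned per feature in practice.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_parity_report(train_vals, serve_vals, bins=50, kl_threshold=0.05):
    """Compare one feature's training distribution to its logged serving values.

    Reports serving-side coverage (fraction non-null), the two-sample
    Kolmogorov-Smirnov statistic, and a histogram-based KL(train || serve),
    flagging the feature when KL exceeds the alert threshold.
    """
    serve = np.asarray(serve_vals, dtype=float)
    coverage = float(np.mean(~np.isnan(serve)))
    serve = serve[~np.isnan(serve)]
    train = np.asarray(train_vals, dtype=float)

    ks_stat = ks_2samp(train, serve).statistic

    # Histogram both samples on shared bin edges, then compute KL divergence.
    edges = np.histogram_bin_edges(np.concatenate([train, serve]), bins=bins)
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(serve, bins=edges)
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    kl = float(np.sum(p * np.log(p / q)))

    return {"coverage": coverage, "ks": ks_stat, "kl": kl,
            "alert": kl > kl_threshold}
```

Run daily against a sample of logged requests, a report like this surfaces silent default-value regressions and pipeline bugs before they show up as an unexplained online AUC gap.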
💡 Key Takeaways
Temporal leakage from late-arriving events inflates offline metrics: using clicks up to T plus 2 days for an impression at T boosts offline precision from 0.78 to 0.82, but online precision is 0.74 because future clicks are unavailable
Point-in-time correctness snapshots features at the impression timestamp: for an impression at Monday 10am, use the 7-day CTR computed from the prior Monday 10am, excluding any events after the current Monday 10am
Late-event tolerance must match between training and serving: if training accepts events up to 15 minutes late, streaming pipelines must write online aggregates with the same 15-minute buffer to maintain consistency
Default-value handling causes skew when training data only includes populated features: the model never learns to handle cache misses or new entities that receive zero or global-mean defaults at serving time
Failure symptoms include offline AUC of 0.85 but online AUC of 0.79, with A/B tests showing no lift or even a regression despite better offline metrics
Detection requires logging serving features and comparing distributions: measure coverage, compute KL divergence between training and serving histograms, and enforce automated parity tests with alerts on drift
📌 Examples
Meta ranking models showed 0.84 offline AUC but 0.78 online AUC due to target encoding computed with future conversions. Fixing point-in-time correctness aligned offline and online metrics, reducing false confidence in model improvements
Airbnb found 12 percent of listings at serving time had missing calendar availability features due to cache misses, receiving default values. Training data had 100 percent coverage. Injecting 12 percent synthetic missingness during training improved online precision by 4 percent
Google enforces feature parity tests that compute checksums on sampled serving requests and compare to expected distributions from training. Alerts fire when KL divergence exceeds 0.05, catching data pipeline bugs before they degrade online metrics