ML-Powered Search & Ranking › Feature Engineering for Ranking · Hard · ⏱️ ~3 min

Training-Serving Skew and Point-in-Time Feature Correctness

Definition
Training-serving skew occurs when features computed during training differ from features computed at serving time, causing offline metrics to overestimate online performance.

Temporal Leakage: The Most Common Cause

The most common skew source is using future information during training. Example: computing a 7-day CTR feature for an impression at time T by aggregating clicks from T-7d to T+2d because logs arrived late. The model learns to rely on future clicks, inflating offline precision from 0.78 to 0.82. At serving time, those future clicks don't exist, and precision drops to 0.74.
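The leaky vs. correct window can be sketched in a few lines. This is a minimal illustration with made-up timestamps (the dates and click times are hypothetical, not from the source):

```python
from datetime import datetime, timedelta

def clicks_in_window(events, start, end):
    """Count click events with start <= t < end."""
    return sum(1 for t in events if start <= t < end)

impression_time = datetime(2024, 1, 8, 10, 0)  # time T of the impression
click_times = [
    datetime(2024, 1, 2, 14, 0),   # inside [T-7d, T): legal
    datetime(2024, 1, 6, 9, 0),    # inside [T-7d, T): legal
    datetime(2024, 1, 9, 11, 0),   # after T: a future click, leaks if included
]

# Correct point-in-time feature: only events strictly before T.
correct = clicks_in_window(
    click_times, impression_time - timedelta(days=7), impression_time)

# Leaky feature: window extended to T+2d because logs "arrived late".
leaky = clicks_in_window(
    click_times, impression_time - timedelta(days=7),
    impression_time + timedelta(days=2))

print(correct, leaky)  # → 2 3
```

The leaky variant counts the Jan 9 click, information that cannot exist when the model scores the impression live, which is exactly how the offline metric gets inflated.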

Point-in-Time Correctness

If a user views an item Monday 10am, the training example must use the 7-day CTR computed over the window from the prior Monday 10am up to that Monday 10am; no event after the impression may contribute. This applies to all time-windowed aggregates. For target-encoded features (using the target variable to encode categories), compute rates using only data strictly before the impression time.
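A leak-free target encoding can be implemented as a single expanding-window pass: encode each row with the category's running click rate, then update the counters with the row's own label. The log, the prior of 0.5, and the smoothing weight below are illustrative assumptions, not values from the source:

```python
# Hypothetical impression log, sorted by timestamp: (ts, category, clicked).
log = [
    (1, "shoes", 1),
    (2, "shoes", 0),
    (3, "shoes", 1),
    (4, "shoes", 0),
]

def point_in_time_target_encode(log, prior=0.5, prior_weight=1.0):
    """Encode each row's category CTR using only strictly earlier rows,
    smoothed toward a global prior so the first rows aren't 0/0."""
    clicks, views = {}, {}
    encoded = []
    for ts, cat, clicked in log:           # assumes log is sorted by ts
        c, v = clicks.get(cat, 0), views.get(cat, 0)
        encoded.append((c + prior * prior_weight) / (v + prior_weight))
        clicks[cat] = c + clicked          # update AFTER encoding, so the
        views[cat] = v + 1                 # current row's label never leaks
    return encoded

enc = point_in_time_target_encode(log)
print(enc)  # → [0.5, 0.75, 0.5, 0.625]
```

The key design choice is update-after-encode: each row sees the state of the world as of its own timestamp, which is the point-in-time guarantee stated above.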

Late Event Handling

Feature computation must handle late-arriving events consistently. In training, decide on a cutoff: accept events up to 15 minutes late, then freeze. In serving, use the same 15-minute buffer. If training uses all eventually-consistent data but serving uses only immediately-available data, distributions diverge. Write snapshots to the offline store with the same late-event tolerance used in serving.
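One way to make the tolerance consistent is to track both the event time and the log arrival time, and apply the same visibility rule offline and online. The event records below are invented for illustration; the 15-minute buffer matches the example in the text:

```python
from datetime import datetime, timedelta

LATE_BUFFER = timedelta(minutes=15)  # same tolerance in training and serving

# Each event carries when it happened and when it reached the log.
events = [
    {"event_time": datetime(2024, 1, 8, 9, 40),
     "arrival_time": datetime(2024, 1, 8, 9, 41)},   # on time: kept
    {"event_time": datetime(2024, 1, 8, 9, 50),
     "arrival_time": datetime(2024, 1, 8, 10, 2)},   # 12 min late: kept
    {"event_time": datetime(2024, 1, 8, 9, 55),
     "arrival_time": datetime(2024, 1, 8, 10, 30)},  # 35 min late: frozen out
]

def visible_events(events, snapshot_time, buffer=LATE_BUFFER):
    """Events that occurred before the snapshot AND arrived within the
    buffer after it. Applying this one rule both when writing offline
    snapshots and when serving keeps the two distributions aligned."""
    return [e for e in events
            if e["event_time"] < snapshot_time
            and e["arrival_time"] <= snapshot_time + buffer]

kept = visible_events(events, datetime(2024, 1, 8, 10, 0))
print(len(kept))  # → 2
```

The third event would eventually appear in a fully backfilled log, which is exactly the "training uses eventually-consistent data" trap: the snapshot must freeze at the same buffer serving uses.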

Default Value Handling

If a feature is missing at serving time (cache miss, new entity), the system substitutes a default (zero or global mean). If training data only includes cases where the feature was present, the model never learns to handle defaults. Fix: inject synthetic missing values during training, or define fallback strategies (item-level → category-level → global prior).
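Both fixes are small. The sketch below shows a fallback chain (item → category → global prior) and random injection of the serving-time default into training values; the CTR numbers, `item_42`, and the 5% injection rate are all hypothetical:

```python
import random

def feature_with_fallback(item_ctr, category_ctr, global_ctr, item_id, category):
    """Resolve a CTR feature item-level → category-level → global prior."""
    if item_id in item_ctr:
        return item_ctr[item_id]
    if category in category_ctr:
        return category_ctr[category]
    return global_ctr

def inject_missing(features, rate=0.05, default=0.0, rng=None):
    """Blank a random fraction of training values so the model learns to
    handle the same defaults it will see on serving-time cache misses."""
    rng = rng or random.Random(0)
    return [default if rng.random() < rate else f for f in features]

item_ctr = {"item_42": 0.12}       # hypothetical per-item stats
category_ctr = {"books": 0.08}     # hypothetical per-category stats
GLOBAL_CTR = 0.05

known = feature_with_fallback(item_ctr, category_ctr, GLOBAL_CTR, "item_42", "books")
cold = feature_with_fallback(item_ctr, category_ctr, GLOBAL_CTR, "item_99", "toys")
print(known, cold)  # → 0.12 0.05
```

Whichever default the fallback chain bottoms out at serving time is the one `inject_missing` should plant in training, so the model's view of "missing" matches production.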

⚠️ Diagnosis: Log feature values at serving time and compare distributions to training. Compute feature coverage (what fraction of candidates have each feature) and statistical divergence between training and serving histograms. Automated parity tests flag when distributions drift beyond thresholds.
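One common divergence statistic for such parity tests is the Population Stability Index (PSI) over binned feature histograms. The sketch below, with an illustrative 10-bin histogram and the widely used ~0.2 alert threshold (an assumption, not from the source), flags a shifted serving distribution:

```python
import math

def psi(train, serve, bins=10, eps=1e-6):
    """Population Stability Index between training- and serving-time
    samples of one feature: sum of (p - q) * ln(p / q) over shared bins."""
    lo = min(min(train), min(serve))
    hi = max(max(train), max(serve))
    width = (hi - lo) / bins or 1.0          # guard against a zero range

    def hist(xs):
        h = [0] * bins
        for x in xs:
            h[min(int((x - lo) / width), bins - 1)] += 1
        return [c / len(xs) for c in h]      # normalize to proportions

    p, q = hist(train), hist(serve)
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

train_sample = [0.1 * i for i in range(100)]        # synthetic feature values
serve_same = list(train_sample)                      # identical distribution
serve_shifted = [x + 5 for x in train_sample]        # e.g. a stale cache

psi_same = psi(train_sample, serve_same)             # ≈ 0: no drift
psi_shift = psi(train_sample, serve_shifted)         # well above 0.2: alert
```

Running this per feature on logged serving values, alongside a coverage check (fraction of candidates with the feature present), turns the diagnosis above into an automated parity test.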
💡 Key Takeaways
- Temporal leakage (using future info during training) is the most common cause of offline/online metric gaps
- Point-in-time correctness: training examples must use features as they existed at impression time, not current values
- Late-event handling must be consistent: if serving uses a 15-minute buffer, training must use the same buffer
- Inject synthetic missing values during training so the model learns to handle the defaults it will see at serving
- Diagnose by logging serving features and comparing distributions to training; automate parity tests
📌 Interview Tips
1. Give the classic example: a 7-day CTR computed over T-7d to T+2d in training vs. T-7d to T at serving
2. Mention specific numbers: offline precision 0.82, online drops to 0.74 due to leakage
3. Explain the fix hierarchy: snapshot features at impression time, same late-event buffer, inject missing values