
Training-Serving Skew and Point-in-Time Feature Correctness

Training-serving skew occurs when features computed during training differ from the features computed at serving time, causing offline metrics to overestimate online performance. The most common source is temporal leakage: using information during training that will not be available at inference. For example, a 7-day click-through rate (CTR) feature for an impression at time T is computed by aggregating clicks from T minus 7 days to T plus 2 days because logs arrive late. The model learns to rely on those future clicks, inflating offline precision from 0.78 to 0.82; at serving time the future clicks do not exist yet, and precision drops to 0.74.

Point-in-time correctness requires snapshotting features exactly as they existed at the impression timestamp. If a user views an item on Monday at 10am, the training example must use the 7-day CTR computed from the prior Monday 10am up to the current Monday 10am, excluding any events after that moment. This constraint applies to every time-windowed aggregate and to target-encoded features. For target encoding, compute category conversion rates using only data strictly before the impression time, with out-of-fold splits and smoothing to avoid overfitting.

Feature computation must also handle late-arriving events consistently between training and serving. In training, decide on a cutoff, for example accept events up to 15 minutes late, then freeze the aggregate; in serving, apply the same 15-minute buffer. If training uses all eventually consistent data but serving uses only immediately available data, the distributions diverge. Streaming pipelines that maintain online aggregates should write snapshots to the offline store with the same late-event tolerance, so that training and serving compute features identically.

Another source of skew is default-value handling. If a feature is missing at serving time because of a cache miss or a new entity, the system substitutes a default, often zero or a global mean. If training data only includes cases where the feature was present, the model never learns to handle the default. Production systems inject synthetic missing values during training or use schema contracts that define fallback strategies, for example hierarchical backoff from item level to category level to a global prior.

The failure mode is subtle degradation: offline Area Under the Curve (AUC) is 0.85, but online AUC is 0.79, and A/B tests show no lift or even a regression despite the better offline metrics. Diagnosis requires logging feature values at serving time and comparing their distributions to the training distributions. Measure coverage, the fraction of candidates that have each feature populated, and compute divergence metrics such as Kolmogorov-Smirnov or KL divergence between training and serving feature histograms. Companies like Meta and Google enforce automated parity tests that flag when serving feature distributions drift beyond thresholds from training expectations. The sketches below illustrate these practices.
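To make the point-in-time constraint concrete, here is a minimal sketch of a 7-day CTR aggregation that respects both the impression timestamp and a 15-minute late-event buffer. The record layout and the `point_in_time_ctr` helper are illustrative assumptions, not the API of any particular feature store.

```python
from datetime import timedelta

def point_in_time_ctr(impressions, clicks, item_id, as_of,
                      window=timedelta(days=7),
                      late_tolerance=timedelta(minutes=15)):
    """7-day CTR for `item_id` exactly as it was knowable at `as_of`.

    Events are dicts with `item_id`, `event_time` (when the event happened)
    and `arrival_time` (when the log record landed). Only events that
    occurred inside [as_of - window, as_of] count, and events that arrived
    more than `late_tolerance` after they occurred are dropped, mirroring
    the freeze policy of the online aggregate.
    """
    window_start = as_of - window

    def counted(event):
        occurred, arrived = event["event_time"], event["arrival_time"]
        in_window = window_start <= occurred <= as_of      # no future events
        on_time = (arrived - occurred) <= late_tolerance   # same buffer as serving
        return event["item_id"] == item_id and in_window and on_time

    n_impressions = sum(1 for e in impressions if counted(e))
    n_clicks = sum(1 for e in clicks if counted(e))
    # Return None when there is no history; the caller decides the fallback.
    return n_clicks / n_impressions if n_impressions else None
```

In a backfill job this would be evaluated once per training example at the impression's own timestamp, so the same policy can back both the offline snapshot and the online aggregate.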
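For target encoding, the same "strictly before the impression" rule can be enforced with an expanding, shifted aggregate. A sketch assuming a pandas DataFrame ordered by event time; the column names (`category`, `converted`, `timestamp`) and the smoothing constant are placeholders.

```python
import pandas as pd

def time_aware_target_encode(df, cat_col="category", target_col="converted",
                             time_col="timestamp", smoothing=20.0):
    """Encode `cat_col` with the conversion rate seen strictly before each row.

    The cumulative sum has the current row's own label subtracted so it never
    leaks into its feature, and counts are smoothed toward the global prior so
    rare categories do not overfit.
    """
    df = df.sort_values(time_col).copy()
    grp = df.groupby(cat_col)[target_col]
    prior_conversions = grp.cumsum() - df[target_col]  # conversions strictly before this row
    prior_count = grp.cumcount()                       # impressions strictly before this row
    # Note: a production pipeline would compute this prior from data before
    # the training window rather than over the whole frame.
    global_prior = df[target_col].mean()
    df[f"{cat_col}_te"] = (prior_conversions + smoothing * global_prior) / (prior_count + smoothing)
    return df
```

A category's first impression receives the smoothed global prior, and every later row only ever sees conversions that happened earlier in time.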
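Default-value handling can be kept consistent by pairing a hierarchical backoff lookup with deliberate missingness injection at training time. A sketch with hypothetical item-, category-, and global-level CTR dictionaries; the miss rate would be set to the coverage gap actually measured from serving logs.

```python
import random

def backoff_ctr(item_id, category_id, item_ctr, category_ctr, global_ctr):
    """Hierarchical fallback: item level -> category level -> global prior."""
    if item_id in item_ctr:
        return item_ctr[item_id]
    if category_id in category_ctr:
        return category_ctr[category_id]
    return global_ctr

def inject_missingness(rows, feature, miss_rate, fallback, seed=0):
    """Blank out `feature` in a random fraction of training rows so the model
    sees the same fallback value it will see on serving-time cache misses."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < miss_rate:
            row[feature] = fallback
    return rows
```

The fallback written during training must be the same value the serving path substitutes; otherwise the injection just creates a second, different kind of skew.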
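Detection comes down to comparing logged serving values against the training distribution. A sketch of a per-feature parity report using NumPy and SciPy; the 0.05 alert threshold echoes the example below and would be tuned per feature in practice.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_parity_report(train_vals, serve_vals, bins=50, kl_threshold=0.05):
    """Compare one feature's training distribution to its logged serving values.

    Reports serving-side coverage (fraction non-null), the two-sample
    Kolmogorov-Smirnov statistic, and a histogram-based KL(train || serve),
    flagging the feature when KL exceeds the alert threshold.
    """
    serve = np.asarray(serve_vals, dtype=float)
    coverage = float(np.mean(~np.isnan(serve)))
    serve = serve[~np.isnan(serve)]
    train = np.asarray(train_vals, dtype=float)

    ks_stat = ks_2samp(train, serve).statistic

    # Histogram both samples on shared bin edges, then compute KL divergence.
    edges = np.histogram_bin_edges(np.concatenate([train, serve]), bins=bins)
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(serve, bins=edges)
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    kl = float(np.sum(p * np.log(p / q)))

    return {"coverage": coverage, "ks": ks_stat, "kl": kl,
            "alert": kl > kl_threshold}
```

Run daily against a sample of logged requests, a report like this surfaces silent default-value regressions and pipeline bugs before they show up as an unexplained online AUC gap.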
💡 Key Takeaways
Temporal leakage from late-arriving events inflates offline metrics: using clicks up to T plus 2 days for an impression at T boosts offline precision from 0.78 to 0.82, but online precision is 0.74 because future clicks are unavailable
Point-in-time correctness snapshots features at the impression timestamp: for an impression at Monday 10am, use the 7-day CTR computed from the prior Monday 10am, excluding any events after the current Monday 10am
Late-event tolerance must match between training and serving: if training accepts events up to 15 minutes late, streaming pipelines must write online aggregates with the same 15-minute buffer to maintain consistency
Default-value handling causes skew when training data only includes populated features: the model never learns to handle cache misses or new entities that receive zero or global-mean defaults at serving time
Failure symptoms include offline AUC of 0.85 but online AUC of 0.79, with A/B tests showing no lift or even a regression despite better offline metrics
Detection requires logging serving features and comparing distributions: measure coverage, compute KL divergence between training and serving histograms, and enforce automated parity tests with alerts on drift
📌 Examples
Meta ranking models showed 0.84 offline AUC but 0.78 online AUC due to target encoding computed with future conversions. Fixing point-in-time correctness aligned offline and online metrics, reducing false confidence in model improvements
Airbnb found 12 percent of listings at serving time had missing calendar availability features due to cache misses, receiving default values. Training data had 100 percent coverage. Injecting 12 percent synthetic missingness during training improved online precision by 4 percent
Google enforces feature parity tests that compute checksums on sampled serving requests and compare to expected distributions from training. Alerts fire when KL divergence exceeds 0.05, catching data pipeline bugs before they degrade online metrics