Training-Serving Skew Detection and Prevention
WHAT IS TRAINING-SERVING SKEW
Training-serving skew occurs when features computed during training differ from features computed during serving. The model learned from one feature definition but receives a different one in production. This causes silent prediction degradation.
Example: During training, user_activity_last_7d is computed using all events. In serving, it is computed using only pageview events (due to a bug). The feature values differ, predictions degrade, but no error is thrown.
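A minimal sketch of how that bug looks in code. The event schema and function names here are hypothetical, not from a real codebase:

from datetime import datetime, timedelta

def training_user_activity_last_7d(events, now):
    # Training pipeline: counts ALL events in the last 7 days.
    cutoff = now - timedelta(days=7)
    return sum(1 for e in events if e["timestamp"] >= cutoff)

def serving_user_activity_last_7d(events, now):
    # Serving pipeline: the bug -- silently counts only pageviews.
    cutoff = now - timedelta(days=7)
    return sum(1 for e in events
               if e["timestamp"] >= cutoff and e["type"] == "pageview")

now = datetime(2024, 1, 8)
events = [
    {"timestamp": datetime(2024, 1, 7), "type": "pageview"},
    {"timestamp": datetime(2024, 1, 6), "type": "click"},
    {"timestamp": datetime(2024, 1, 5), "type": "purchase"},
]
print(training_user_activity_last_7d(events, now))  # 3
print(serving_user_activity_last_7d(events, now))   # 1 -- same user, skewed feature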
COMMON CAUSES
Code duplication: Training and serving have separate feature computation code. They drift apart over time as one is updated without the other.
Data freshness differences: Training uses batch-computed features (point-in-time snapshots). Serving uses real-time computed features (current values). The timing difference changes feature values.
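A sketch of why the timing matters, using a hypothetical point-in-time helper. The training snapshot should only see events visible at the label timestamp; serving sees everything that has arrived since:

from datetime import datetime, timedelta

def activity_count_as_of(events, as_of, window_days=7):
    # Point-in-time value: only events visible at `as_of` count.
    cutoff = as_of - timedelta(days=window_days)
    return sum(1 for e in events if cutoff <= e["timestamp"] <= as_of)

label_time = datetime(2024, 1, 8, 12, 0)    # training snapshot
serving_time = datetime(2024, 1, 8, 18, 0)  # live request, 6 hours later
events = [
    {"timestamp": datetime(2024, 1, 7)},
    {"timestamp": datetime(2024, 1, 8, 15, 0)},  # arrived after the snapshot
]
print(activity_count_as_of(events, label_time))    # 1
print(activity_count_as_of(events, serving_time))  # 2 -- freshness skew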
Missing value handling: Training imputes missing values one way. Serving imputes them differently. Different imputation rules produce different feature values.
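One way to keep imputation aligned, sketched as a single hypothetical module that both the training and serving code import:

# Shared module, imported by BOTH pipelines; a change here applies to both.
MISSING_DEFAULTS = {"user_activity_last_7d": 0.0, "account_age_days": -1.0}

def impute(feature_name, value):
    # One imputation rule per feature, defined exactly once.
    if value is None:
        return MISSING_DEFAULTS[feature_name]
    return value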
Feature transformation bugs: Normalization parameters differ. Training normalizes with mean=10, std=5. Serving uses mean=12, std=6. Predictions shift.
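A common safeguard, sketched here with plain JSON: fit the normalization statistics once at training time, ship them alongside the model artifact, and load them at serving time instead of recomputing them. The file name and format are illustrative:

import json

# Training time: fit stats on the training data, save next to the model.
train_values = [4.0, 8.0, 12.0, 16.0]
mean = sum(train_values) / len(train_values)
std = (sum((v - mean) ** 2 for v in train_values) / len(train_values)) ** 0.5
with open("normalization.json", "w") as f:
    json.dump({"mean": mean, "std": std}, f)

# Serving time: load the SAME stats; never refit on live data.
with open("normalization.json") as f:
    stats = json.load(f)

def normalize(x):
    return (x - stats["mean"]) / stats["std"]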
DETECTION STRATEGIES
Shadow scoring: Run serving features through the training pipeline. Compare the results. Differences indicate skew.
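A minimal shadow-scoring check, assuming hypothetical compute_training and compute_serving callables that return a feature dict for the same entity:

def compare_feature_paths(entity_id, compute_training, compute_serving,
                          tol=1e-6):
    # Compute one entity's features through both paths and report
    # every feature whose values disagree beyond a tolerance.
    train_feats = compute_training(entity_id)
    serve_feats = compute_serving(entity_id)
    mismatches = {}
    for name in train_feats.keys() | serve_feats.keys():
        t, s = train_feats.get(name), serve_feats.get(name)
        if t is None or s is None or abs(t - s) > tol:
            mismatches[name] = (t, s)
    return mismatches  # empty dict => no skew detected for this entity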
Feature distribution monitoring: Compare serving feature distributions to training distributions. Significant divergence may indicate skew (or drift—investigate to distinguish).
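One way to quantify the divergence is the Population Stability Index (PSI) over binned feature values. The implementation below is a self-contained sketch, and the usual rule-of-thumb thresholds (roughly 0.1 for moderate shift, 0.25 for significant shift) are conventions, not guarantees:

import math

def psi(train_values, serve_values, bins=10):
    # Population Stability Index between training and serving samples.
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            i = min(max(int((v - lo) / width), 0), bins - 1)
            counts[i] += 1
        # Small epsilon keeps empty bins from blowing up the log.
        return [(c + 1e-6) / (len(values) + 1e-6 * bins) for c in counts]

    p = bin_fractions(train_values)
    q = bin_fractions(serve_values)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))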
Logging for offline comparison: Log serving features and predictions. Replay them through the training pipeline. Compare.
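A sketch of the logging side, one JSON record per request. The replay job reads these records, recomputes the features through the training pipeline for the same entity and timestamp, and diffs against what serving actually used. Field names are illustrative:

import json
import time

def log_prediction(log_file, entity_id, features, prediction):
    # One JSON line per request: everything needed to replay the
    # feature computation offline and compare against serving.
    record = {
        "entity_id": entity_id,
        "timestamp": time.time(),
        "features": features,
        "prediction": prediction,
    }
    log_file.write(json.dumps(record) + "\n")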
PREVENTION
Use a feature store that computes features once and serves them to both training and inference. Shared feature definitions eliminate code divergence. This is the most effective prevention strategy.
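The core idea in miniature. The registry below is an illustrative sketch, not a specific feature-store API: each feature is defined exactly once, and both the batch training job and the online server call the same compute_features:

from datetime import timedelta

FEATURE_REGISTRY = {}

def feature(name):
    # Decorator registering one canonical definition per feature.
    def register(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return register

@feature("user_activity_last_7d")
def user_activity_last_7d(events, as_of):
    cutoff = as_of - timedelta(days=7)
    return sum(1 for e in events if cutoff <= e["timestamp"] <= as_of)

def compute_features(events, as_of):
    # Called by BOTH the batch training job and the online server,
    # so there is exactly one definition to keep correct.
    return {name: fn(events, as_of) for name, fn in FEATURE_REGISTRY.items()}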