Training-Serving Skew Detection and Prevention
Training-serving skew occurs when feature values computed during offline training differ from the same features computed during online inference, even for identical raw inputs. It is one of the most insidious failure modes in production ML because model accuracy metrics during offline evaluation look excellent while live performance degrades by 5 to 20 percent. The root causes are subtle: batch feature computation uses different code paths than real-time serving, time zone handling differs between systems, aggregation windows use different boundary semantics, null handling has different defaults, or floating-point precision varies between training (often 64-bit) and serving (often 32-bit for speed).
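To make the precision point concrete, here is a small illustration (not from the source; the data and the naive float32 accumulation loop are hypothetical) of how a 64-bit batch aggregation and a 32-bit serving path drift apart on identical raw inputs:

```python
import numpy as np

# Hypothetical data: per-event amounts feeding a long-window aggregate feature.
rng = np.random.default_rng(0)
event_amounts = rng.uniform(0.01, 5.0, size=500_000)

# Batch path: float64 aggregation (e.g. a Spark job).
batch_sum = float(np.sum(event_amounts, dtype=np.float64))

# Serving path: latency-optimized float32 running sum (hand-coded analogue).
serving_sum = np.float32(0.0)
for x in event_amounts.astype(np.float32):
    serving_sum += x

rel_err = abs(float(serving_sum) - batch_sum) / batch_sum
print(f"relative error between paths: {rel_err:.2e}")  # nonzero despite identical inputs
```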
Detection requires dual-write comparison: computing features through both the batch training pipeline and the online serving pipeline for the same sample of entities, then measuring agreement. For numeric features, define agreement as values within a tolerance (typically 0.1 percent relative error, or 0.01 absolute for normalized features). For categorical features, require an exact string match after normalization. Production systems sample 1 to 5 percent of traffic, log both batch and online feature values with shared join keys (user_id, timestamp, request_id), then run hourly or daily comparison jobs. At Meta, feed ranking maintains a parity dashboard showing per-feature agreement rates, with alerts when any critical feature drops below 99 percent exact match or 99.5 percent within-tolerance match.
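A minimal sketch of such a comparison job, assuming the sampled batch and online feature values are already logged with request_id and feature_name join keys; the column names, tolerances, and pandas implementation are illustrative rather than any specific company's pipeline:

```python
import pandas as pd

def feature_parity_report(batch_df: pd.DataFrame,
                          online_df: pd.DataFrame,
                          rel_tol: float = 1e-3,    # 0.1 percent relative error
                          abs_tol: float = 1e-2) -> pd.DataFrame:
    """Join batch and online feature logs and compute per-feature agreement."""
    # Expected columns in both frames: request_id, feature_name, value, is_categorical
    joined = batch_df.merge(online_df,
                            on=["request_id", "feature_name"],
                            suffixes=("_batch", "_online"))

    def row_agrees(row) -> bool:
        if row["is_categorical_batch"]:
            # Categorical features: exact string match after normalization.
            return (str(row["value_batch"]).strip().lower()
                    == str(row["value_online"]).strip().lower())
        b, o = float(row["value_batch"]), float(row["value_online"])
        # Numeric features: within relative or absolute tolerance.
        return abs(b - o) <= max(abs_tol, rel_tol * abs(b))

    joined["agree"] = joined.apply(row_agrees, axis=1)
    # Per-feature agreement rate: the quantity a parity dashboard alerts on.
    return (joined.groupby("feature_name")["agree"]
                  .mean()
                  .rename("agreement_rate")
                  .reset_index()
                  .sort_values("agreement_rate"))

# Example alerting rule: page if any critical feature's agreement_rate < 0.995.
```

In production this typically runs as a scheduled Spark or warehouse job rather than pandas, but the join-then-compare structure is the same.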
Prevention strategies start with code reuse. The gold standard is feature computation logic defined once and executed in both batch (Spark, Beam) and streaming (Flink, Kafka Streams) contexts through a feature transformation framework. Netflix uses a shared feature DSL that compiles to both Spark SQL for batch training pipelines and Java for online microservices, ensuring identical semantics. When full code sharing is infeasible, rigorous integration tests become critical: generate synthetic test cases covering edge cases like nulls, boundary timestamps, and extreme values; compute features through both paths; and assert bitwise or tolerance-based equality. Uber maintains 500-plus test cases per feature category (user features, trip features, geospatial features) that run on every commit.
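When the two paths cannot share code, the integration tests amount to asserting that two independent implementations agree on curated edge cases. A hedged sketch, with a hypothetical feature, null default, and edge-case list:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
import math

# Two hypothetical implementations of the same feature, standing in for the
# batch path (e.g. a Spark UDF) and the online path (service code).
def batch_days_since_last_purchase(last_ts: Optional[datetime], now: datetime) -> float:
    if last_ts is None:
        return -1.0                                   # agreed-upon null default
    return (now - last_ts).total_seconds() / 86400.0

def online_days_since_last_purchase(last_ts: Optional[datetime], now: datetime) -> float:
    if last_ts is None:
        return -1.0                                   # must match the batch default
    return (now - last_ts) / timedelta(days=1)

# Edge cases covering nulls, boundary timestamps, and extreme values.
EDGE_CASES = [
    (None, datetime(2024, 1, 1, tzinfo=timezone.utc)),
    (datetime(2023, 12, 31, 23, 59, 59, tzinfo=timezone.utc),    # day boundary
     datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)),
    (datetime(1970, 1, 1, tzinfo=timezone.utc),                  # extreme value
     datetime(2024, 1, 1, tzinfo=timezone.utc)),
]

def test_batch_and_online_paths_agree():
    """Runs on every commit; any divergence fails the build before deployment."""
    for last_ts, now in EDGE_CASES:
        b = batch_days_since_last_purchase(last_ts, now)
        o = online_days_since_last_purchase(last_ts, now)
        assert math.isclose(b, o, rel_tol=1e-9, abs_tol=1e-9), (last_ts, now)
```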
The most subtle skew sources are temporal. A batch job computing "user 7-day purchase count" on date D processes all events with timestamps before the end of day D in the training data's time zone (often UTC), while the online system at 14:00 local time on day D+7 sees a different event set due to time zone conversion, late-arriving events, or event-time versus processing-time semantics. Mitigation requires time-travel testing: replay historical requests through the online serving path and compare against batch-computed ground truth. Airbnb's pricing model replays 10,000 historical pricing requests daily, measuring that 98.5 percent of features match within tolerance at baseline; when the rate drops below 98 percent, it signals newly introduced skew and blocks deployment.
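A sketch of the replay gate itself, under the assumption that historical requests and their batch-computed features have been logged; the record structure, tolerances, and placement of the 98 percent threshold are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ReplayRecord:
    request_id: str
    raw_request: dict                      # original online request payload
    batch_features: Dict[str, float]       # ground truth from the batch pipeline

def within_tolerance(batch: float, online: float,
                     rel_tol: float = 1e-3, abs_tol: float = 1e-2) -> bool:
    return abs(batch - online) <= max(abs_tol, rel_tol * abs(batch))

def replay_match_rate(records: List[ReplayRecord],
                      online_feature_fn: Callable[[dict], Dict[str, float]]) -> float:
    """Fraction of (request, feature) pairs where the candidate online path matches batch."""
    matched = total = 0
    for rec in records:
        online_features = online_feature_fn(rec.raw_request)    # candidate serving code
        for name, batch_value in rec.batch_features.items():
            total += 1
            matched += within_tolerance(batch_value, online_features.get(name, float("nan")))
    return matched / total if total else 1.0

def deployment_gate(records: List[ReplayRecord],
                    online_feature_fn: Callable[[dict], Dict[str, float]],
                    threshold: float = 0.98) -> None:
    """Run before promoting a new serving build; raises to block the deploy."""
    rate = replay_match_rate(records, online_feature_fn)
    if rate < threshold:
        raise RuntimeError(f"Feature parity {rate:.3%} is below {threshold:.0%}; blocking deployment")
```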
💡 Key Takeaways
• Training-serving skew causes 5 to 20 percent live performance degradation despite excellent offline metrics, rooted in differing code paths, time zone handling, aggregation boundaries, null defaults, or float precision between batch (64-bit) and serving (32-bit)
• Dual-write detection samples 1 to 5 percent of traffic, computes features through both the batch and online pipelines with shared join keys, and measures numeric agreement within 0.1 percent relative error and categorical exact match, hourly or daily
• Meta feed ranking alerts when any critical feature drops below 99 percent exact match or 99.5 percent within-tolerance match, maintaining per-feature parity dashboards updated hourly
• Prevention via a shared feature DSL that compiles to both Spark SQL for batch and Java for online ensures identical semantics; Netflix uses this approach to eliminate code divergence as a skew source
• Integration testing with 500-plus edge-case test cases per feature category (nulls, boundary timestamps, extreme values) running on every commit catches skew introduction before production at Uber scale
• Time-travel testing replays 10,000 historical requests through online serving daily, comparing to batch ground truth; Airbnb blocks deployment when the match rate drops below 98 percent from a 98.5 percent baseline
📌 Examples
Meta News Feed: a feature for "user engagement last 7 days" was computed in batch using UTC day boundaries while the online path used local time then converted to UTC, causing 15 percent skew for users near time zone boundaries; fixed by standardizing both paths on UTC event timestamps
Uber ETA: batch training computed average_speed with Float64 Spark aggregations while the online path used Float32 in hand-coded Java; 2 percent of predictions differed by more than 10 percent; unified via shared Scala functions invoked from both the Spark batch job and the online JVM service
Airbnb pricing: late-arriving booking events (5 to 10 percent arrive 1 to 3 hours late) were included in batch 7-day aggregates but missed by online sliding windows; fixed by adding event-time watermarking with a 3-hour lateness tolerance to the online path so it matches batch semantics (see the sketch below)
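The source does not name the streaming engine behind Airbnb's fix; as one way to express a 3-hour lateness tolerance on a sliding window, here is a sketch using Spark Structured Streaming with placeholder broker, topic, and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.functions import sum as sum_
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("booking_7d_online").getOrCreate()

event_schema = StructType([
    StructField("listing_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical Kafka topic of booking events (broker and topic are placeholders).
bookings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "bookings")
            .load()
            .select(from_json(col("value").cast("string"), event_schema).alias("e"))
            .select("e.*"))

# Allow events up to 3 hours late before a window is finalized, so the online
# sliding-window aggregate converges to the same event set as the batch 7-day job.
booking_amount_7d = (bookings
                     .withWatermark("event_time", "3 hours")
                     .groupBy(window(col("event_time"), "7 days", "1 day"),
                              col("listing_id"))
                     .agg(sum_("amount").alias("booking_amount_7d")))
```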