Training-Serving Skew: Root Causes and Mitigation
Training-serving skew occurs when the features a model sees at prediction time differ from those it was trained on, causing it to underperform in production. The model learned patterns from one data distribution but receives another at inference. This is among the most insidious ML production bugs because the system appears to work correctly: no errors, no crashes, just quietly degraded predictions.
Root Cause: Different Code Paths
The most common cause is duplicate implementation. The training pipeline computes features using PySpark on historical data; the serving path recomputes the same features in Java on real-time streams. Despite best intentions, the implementations drift: different handling of nulls, different timestamp parsing, different rounding behavior. A feature defined as "clicks in the last 7 days" might use calendar days in training but a rolling 168-hour window in serving. Both definitions are reasonable, but they are not identical.
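The calendar-day versus rolling-window discrepancy is easy to demonstrate. A minimal sketch with hypothetical click timestamps (the data and function names are illustrative, not from any real pipeline):

```python
from datetime import datetime, timedelta

# Hypothetical click log for one user (illustrative data).
clicks = [
    datetime(2024, 3, 1, 9, 0),
    datetime(2024, 3, 5, 9, 0),
    datetime(2024, 3, 8, 10, 0),
]

def clicks_last_7_calendar_days(clicks, now):
    # "Training" definition: count clicks since midnight 7 calendar
    # days ago (a day-boundary cutoff).
    cutoff = now.date() - timedelta(days=7)
    return sum(1 for c in clicks if c.date() >= cutoff)

def clicks_last_168_hours(clicks, now):
    # "Serving" definition: a rolling 168-hour window ending now.
    cutoff = now - timedelta(hours=168)
    return sum(1 for c in clicks if c >= cutoff)

now = datetime(2024, 3, 8, 12, 0)
print(clicks_last_7_calendar_days(clicks, now))  # 3
print(clicks_last_168_hours(clicks, now))        # 2
```

The March 1, 9:00 click falls inside the calendar-day window but outside the rolling 168-hour window, so the model trains on a feature value of 3 and serves on a value of 2 for the same user at the same moment.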
Root Cause: Future Data Leakage
Training data is processed in batch after events occur. Without care, features can incorporate information that was not available at prediction time. Example: training computes "user lifetime value" including purchases made after the prediction timestamp. The model learns to rely on this signal, but at serving time those future purchases do not yet exist. Point-in-time joins are essential: a feature computed for prediction time T may use only data available before T.
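The leak is concrete when written out. A minimal sketch with a hypothetical purchase log (data and variable names are illustrative):

```python
from datetime import datetime

# Hypothetical purchase history: (timestamp, amount).
purchases = [
    (datetime(2024, 1, 10), 30.0),
    (datetime(2024, 2, 1), 50.0),
    (datetime(2024, 3, 15), 200.0),  # occurs AFTER the prediction time
]

prediction_time = datetime(2024, 2, 20)

# Leaky feature: sums the full history as seen by a batch job,
# including purchases that happened after prediction_time.
ltv_leaky = sum(amount for _, amount in purchases)

# Point-in-time correct: only purchases strictly before prediction_time.
ltv_correct = sum(amount for ts, amount in purchases if ts < prediction_time)

print(ltv_leaky)    # 280.0
print(ltv_correct)  # 80.0
```

The model trained on the leaky value implicitly sees the customer's future; at serving time only the 80.0 is knowable, so the feature distribution shifts exactly where the model depends on it most.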
Mitigation Strategies
Single code path: a feature store computes each feature once and serves the result to both training and inference. No duplicate implementation means no drift.
Logged features: log the exact feature values used at serving time and build training sets from those logs. This guarantees identical values by construction.
Feature monitoring: compare feature distributions between training and serving, and alert on divergence (e.g., KL divergence or PSI).
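The monitoring piece can be sketched with the Population Stability Index. This is a minimal, self-contained implementation (the binning scheme and the small epsilon for empty bins are assumptions; production systems typically use fixed reference bins):

```python
import math

def psi(expected, actual, bins=10):
    # Population Stability Index between two samples of one feature:
    # sum over bins of (p_actual - p_expected) * ln(p_actual / p_expected).
    lo = min(expected + actual)
    hi = max(expected + actual)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)
            counts[i] += 1
        # Floor at a small epsilon so empty bins don't produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

train_values = [i / 10 for i in range(100)]       # training distribution
serve_values = [v + 3 for v in train_values]      # shifted serving distribution
print(psi(train_values, train_values))            # ~0.0: no shift
print(psi(train_values, serve_values) > 0.25)     # True: large shift
```

A common rule of thumb (a convention, not a hard standard) treats PSI below 0.1 as stable and above 0.25 as a significant shift worth alerting on.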
Detection: compare offline evaluation to online A/B results. A large gap (e.g., offline AUC 0.85 versus online 0.72) strongly indicates skew. Investigate feature distributions before blaming the model.
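This heuristic amounts to a simple threshold check. A sketch, where the 0.05 tolerance is an illustrative assumption rather than a standard value:

```python
def skew_suspected(offline_metric: float, online_metric: float,
                   tolerance: float = 0.05) -> bool:
    # Illustrative threshold (an assumption, not a standard): gaps larger
    # than `tolerance` warrant a feature-distribution investigation
    # before retraining or blaming the model itself.
    return (offline_metric - online_metric) > tolerance

print(skew_suspected(0.85, 0.72))  # True: the 0.13 gap from the example
print(skew_suspected(0.85, 0.83))  # False: small gaps are expected noise
```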