Train-Serve Skew from Point-in-Time Violations
What Train-Serve Skew Is
Train-serve skew occurs when the features a model sees offline during training differ systematically from the features it sees online at serving time, so the model underperforms in production despite strong offline metrics. Point-in-time (PIT) violations are a primary cause: if training joins in leaked future data, or serving reads stale values, the resulting distribution mismatch can degrade production accuracy by 5 to 20 percent.
Processing Time vs Event Time Bug
The most common violation is joining features to labels on processing time instead of event time when constructing the training set. Consider a fraud feature, "transactions in the last hour": if the labeled event occurred at 2pm but the join uses a 3pm processing timestamp, the feature window includes an hour of future data. The model learns to exploit the leaked signal, achieving an inflated offline AUC that collapses in production, where only true real-time features are available.
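A point-in-time-correct join can be sketched with pandas, whose `merge_asof` with `direction="backward"` selects the latest feature row at or before each label's event time. The table contents and column names here are illustrative:

```python
import pandas as pd

# A label event at 2pm must only see feature values computed at or
# before 2pm; a naive join on processing time would pull in the 3pm
# snapshot, which contains future data.
labels = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-01-01 14:00"]),
    "txn_id": [101],
})
features = pd.DataFrame({
    "feature_time": pd.to_datetime(["2024-01-01 13:00", "2024-01-01 15:00"]),
    "txns_last_hour": [3, 9],  # the 15:00 value includes the future hour
})

# direction="backward" takes the latest feature row whose timestamp is
# <= the label's event_time, so the 15:00 row is never joined.
joined = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    direction="backward",
)
print(joined["txns_last_hour"].iloc[0])  # 3, not the leaked 9
```

The same as-of semantics are what feature stores implement internally for point-in-time-correct backfills.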
Stale Feature Serving
The inverse problem occurs when online serving reads stale features while training used fresh ones. If batch materialization runs daily but training labels are generated hourly, the serving path sees features roughly 12 hours staler on average than the training path did. Models learn to expect fresh signals and degrade when those signals arrive delayed.
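The "12 hours staler on average" figure follows from simple arithmetic: with a daily batch run, a request arriving t hours after materialization sees features that are t hours old, so ages are spread uniformly over a 24-hour cycle. A minimal sketch:

```python
# Average feature age under daily batch materialization, sampled at one
# hypothetical serving request per hour across the 24-hour cycle.
ages = list(range(24))  # hours since the last batch run, per request
avg_age = sum(ages) / len(ages)
print(avg_age)  # 11.5 hours, i.e. roughly 12 hours of average staleness
```

Training, by contrast, can join each hourly label to a point-in-time-fresh feature value, which is what creates the systematic gap.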
Detection Methods
Compare offline and online feature distributions using the Population Stability Index (PSI) or KL divergence. By the common rule of thumb, a PSI above 0.25 indicates meaningful drift warranting investigation, while 0.1 to 0.25 suggests moderate shift. Log serving requests together with the features actually used, replay them through the offline pipeline, and diff the results. LinkedIn runs continuous shadow comparisons of this kind, detecting divergence before it impacts business metrics.
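A minimal PSI sketch (the function and variable names are illustrative, not a standard library API): bin the offline (expected) and online (actual) feature values over a shared range, then sum (a - e) * ln(a / e) across bins, flooring empty bins to avoid division by zero:

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def frac(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)
            counts[i] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

offline = [0.1 * i for i in range(100)]       # training distribution
online = [0.1 * i + 3.0 for i in range(100)]  # shifted serving distribution
print(psi(offline, offline) < 0.1)   # identical samples -> PSI near 0
print(psi(offline, online) > 0.25)   # clear shift -> investigate
```

In a shadow-comparison setup, the `expected` sample would come from replaying logged requests through the offline pipeline and `actual` from the logged serving-time feature values.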
Prevention Architecture
Define each feature once and compile the same transformation logic to both batch and streaming pipelines. Version feature definitions and pin each model deployment to specific feature versions. Inject small synthetic timestamp jitter during training to build robustness to minor temporal misalignment.