Training-Serving Skew and Compatibility Failures in Rollback
What Is Training-Serving Skew
Training-serving skew occurs when feature transformations, data preprocessing, or schemas differ between offline training and online serving. The model was trained on one data distribution but receives a different one in production, so its quality degrades silently. Everything looks fine at the infrastructure layer: latency is good, there are no errors, and GPU utilization is healthy. But the predictions are wrong.
A Classic Failure Scenario
The new model expects feature F_v2, normalized with statistics computed over the past 30 days, but serving still emits F_v1, normalized over 7 days. The canary passes its latency and error-rate checks, yet accuracy drops 10 to 15 percent. The drop is discovered only after full rollout, because the business KPIs that would reveal it lag behind the deployment. By the time the drop in conversion rate is statistically significant, thousands of users have received degraded predictions.
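To make the mismatch concrete, here is a minimal sketch of how the same raw value turns into two different model inputs depending on the normalization window. The z-score normalization, the `zscore` helper, and the synthetic `daily` series are assumptions chosen for illustration; the scenario above does not specify which normalization is used.

```python
import statistics

def zscore(value, history):
    """Normalize a value against the mean and population std of a history window."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return (value - mean) / std if std > 0 else 0.0

# Hypothetical daily feature values, oldest first, trending upward over 30 days.
daily = [100 + i for i in range(30)]

raw = 135.0
f_v1 = zscore(raw, daily[-7:])  # what serving emits: 7-day window
f_v2 = zscore(raw, daily)       # what the new model was trained on: 30-day window

print(f"F_v1 (7-day):  {f_v1:.2f}")
print(f"F_v2 (30-day): {f_v2:.2f}")
```

Both values are valid floats, so nothing fails at the infrastructure layer; the model simply receives an input on a different scale from the one it was trained on.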
Schema Incompatibility During Rollback
Schema incompatibility surfaces during rollback when the old model expects features that are no longer computed or retained. If the new model introduced feature F_new and deprecated F_old, rolling back requires F_old to be backfilled; otherwise the old model falls back to default values, which spikes null rates and degrades predictions. LinkedIn and Uber enforce feature TTL policies aligned with rollback windows: historical feature definitions are retained for at least 30 to 90 days so that any recently deployed production model can still be served.
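A pre-rollback check along these lines can flag the problem before the old model is redeployed. This is a sketch under stated assumptions: the `registry` mapping (feature name to retirement date, or `None` if still active), the function names, and the dates are all hypothetical and do not describe any particular company's feature store.

```python
from datetime import date, timedelta

def servable_features(registry, today, retention_days=30):
    """Features that are active, or retired but still inside the retention window."""
    keep = set()
    for name, retired_on in registry.items():
        if retired_on is None or today - retired_on <= timedelta(days=retention_days):
            keep.add(name)
    return keep

def rollback_blockers(required, registry, today, retention_days=30):
    """Features the old model needs that serving can no longer provide."""
    available = servable_features(registry, today, retention_days)
    return sorted(set(required) - available)

# Hypothetical registry: F_old was retired when F_new shipped.
registry = {
    "F_new": None,
    "F_old": date(2024, 1, 10),
    "F_shared": None,
}
old_model_features = ["F_old", "F_shared"]
today = date(2024, 3, 1)

blocked_30 = rollback_blockers(old_model_features, registry, today, retention_days=30)
blocked_90 = rollback_blockers(old_model_features, registry, today, retention_days=90)
print(blocked_30, blocked_90)
```

With a 30-day retention window, F_old has already expired and blocks the rollback; with a 90-day window it is still servable, which is the point of aligning feature TTLs with the rollback window.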
Mitigation Strategies
Shared feature definitions between training and serving prevent divergence. Schema validation gates at model promotion catch incompatibilities before deployment. Compatibility tests during canary verify that the model receives the feature distributions it expects. Airbnb's Airflow-orchestrated backfills maintain training-serving parity so that reverted models can run on current data. For breaking changes, dual-run windows in which both F_v1 and F_v2 are computed in parallel allow gradual migration and safe rollback during the transition period. Feature stores with time-travel capability (point-in-time reads) make it possible to reconstruct historical feature values for forensic analysis.
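Two of these checks, the schema validation gate and the canary distribution test, can be sketched in a few lines. The schema representation (feature name mapped to a dtype string), the function names, and the use of the population stability index (PSI) as the drift metric are assumptions made for illustration. The 0.2 threshold is a common rule of thumb rather than a value taken from this text.

```python
import math

def schema_violations(expected, served):
    """Compare feature-name -> dtype mappings and list the problems found."""
    problems = []
    for name, dtype in expected.items():
        if name not in served:
            problems.append(f"missing feature: {name}")
        elif served[name] != dtype:
            problems.append(
                f"type mismatch for {name}: expected {dtype}, got {served[name]}"
            )
    return problems

def psi(train_values, serve_values, bins=10, eps=1e-6):
    """Population stability index between a training and a serving sample."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    p, q = fractions(train_values), fractions(serve_values)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Promotion gate: the new model expects F_v2, but serving still emits F_v1.
expected_schema = {"F_v2": "float32", "user_id": "int64"}
serving_schema = {"F_v1": "float32", "user_id": "int64"}
violations = schema_violations(expected_schema, serving_schema)

# Canary check: compare the training sample with what the canary actually sees.
train_sample = [i / 100 for i in range(1000)]
canary_sample = [v + 5 for v in train_sample]  # hypothetical shifted inputs
drift = psi(train_sample, canary_sample)

promote = not violations and drift < 0.2
print(violations, round(drift, 2), promote)
```

The gate blocks promotion on either signal, so a model like the one in the failure scenario would be stopped even though its latency and error rate look healthy.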