Training Serving Skew and Distribution Drift
What Training Serving Skew Is
Training serving skew occurs when the features used during training differ from those served at inference, so offline AUC significantly exceeds online performance. Common causes include divergent transformation code paths (for example, training uses Spark UDFs while serving uses Python), incorrect time filters that leak future data into training, and schema mismatches where a feature's type changes between the offline and online stores. A typical symptom is a large gap between offline and online metrics, for example an offline AUC of 0.92 dropping to an online AUC of 0.76. The blast radius is large: a 10 percent accuracy drop can reduce CTR by 15 to 25 percent.
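The time filter leak mentioned above can be illustrated with a small sketch. The data, the `feature_history` layout, and the helper names are hypothetical and chosen only for illustration; timestamps are plain integers. A lookup that ignores the example timestamp returns values the model could not have seen at that moment, while an "as of" lookup returns only the newest value available at or before the example time.

```python
from bisect import bisect_right

# Hypothetical feature history per entity: (available_at, clicks_7d), sorted by time.
feature_history = {
    "user_1": [(1, 3), (5, 8), (10, 20)],
}

# Hypothetical training examples: (entity, event_time, label).
examples = [("user_1", 3, 0), ("user_1", 7, 1)]


def latest_value(entity):
    """Leaky lookup: ignores the example timestamp and returns the newest value."""
    return feature_history[entity][-1][1]


def as_of_value(entity, event_time):
    """Point in time lookup: newest value with available_at <= event_time."""
    history = feature_history[entity]
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time) - 1
    return history[idx][1] if idx >= 0 else None


# The leaky join gives every example the value from time 10, which lies in the future
# relative to both examples.
leaky = [latest_value(entity) for entity, _, _ in examples]

# The as-of join gives each example only what was available at its own timestamp.
point_in_time = [as_of_value(entity, ts) for entity, ts, _ in examples]

print(leaky)          # [20, 20]
print(point_in_time)  # [3, 8]
```

A model trained on the leaky values would look strong offline and then degrade online, where only the as-of values exist at request time.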
Mitigation Through Unified Logic
Feature platforms such as Feast and Tecton address skew by using the same transformation definitions for both offline backfills and online materialization. Airbnb Zipline requires that feature pipelines produce both offline datasets and online values from identical code, preventing divergence. Point in time joins with "as of" semantics ensure training examples only see features that were available at the example timestamp. Automated validation compares offline and online feature distributions using the population stability index (PSI) or Kullback-Leibler (KL) divergence; a PSI above 0.2 or a KL divergence above 0.1 triggers alerts before model deployment.
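As a rough sketch of the distribution check, the following computes PSI between an offline and an online sample over shared bins and applies the 0.2 alert threshold quoted above. The function names, the bin edges, the `eps` floor for empty bins, and the sample values are assumptions made for illustration, not part of any particular feature store's API.

```python
import math
from bisect import bisect_right


def bin_proportions(values, bin_edges, eps=1e-6):
    """Share of values per bin; empty bins are floored at eps to keep log() finite."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        if bin_edges[0] <= v <= bin_edges[-1]:
            # The rightmost edge is included in the last bin.
            idx = min(bisect_right(bin_edges, v) - 1, len(counts) - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin range")
    return [max(c / total, eps) for c in counts]


def psi(expected, actual, bin_edges):
    """Population stability index: sum over bins of (q - p) * ln(q / p)."""
    p = bin_proportions(expected, bin_edges)
    q = bin_proportions(actual, bin_edges)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


PSI_ALERT_THRESHOLD = 0.2  # threshold taken from the text above

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
offline_sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
online_sample = [0.6, 0.7, 0.8, 0.9, 0.95, 0.85, 0.7, 0.65, 0.9, 0.3]

score = psi(offline_sample, online_sample, edges)
should_alert = score > PSI_ALERT_THRESHOLD
```

In practice the bin edges would usually be derived from the offline (training) distribution, for example its deciles, so that the same bins are applied to both samples.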
Online Offline Drift
Online offline drift happens when a feature group is deployed to one store without updating the other. For example, deploying a new feature view to the online key value store without backfilling the offline lake means models train on the old logic while serving uses the new logic. The mitigation is versioned feature groups with release gates: backfill offline first, validate that the distributions match, then cut over online serving. Shadow reads during cutover compare both versions in production.
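The release gate ordering can be sketched as a small state check. `FeatureGroupRelease` and its method names are hypothetical, and the PSI threshold of 0.2 is reused from the validation step described earlier; a real system would attach these checks to its deployment pipeline rather than to an in-memory object.

```python
from dataclasses import dataclass


@dataclass
class FeatureGroupRelease:
    """Hypothetical release record for one version of a feature group."""
    name: str
    version: int
    offline_backfilled: bool = False
    distributions_match: bool = False
    online_live: bool = False

    def record_backfill(self):
        # Step 1: the offline store has been backfilled with this version's logic.
        self.offline_backfilled = True

    def record_validation(self, psi_value, psi_threshold=0.2):
        # Step 2: validation is only meaningful once the backfill exists.
        if not self.offline_backfilled:
            raise RuntimeError("backfill the offline store before validating")
        self.distributions_match = psi_value <= psi_threshold

    def cut_over_online(self):
        # Step 3: the gate stays closed until steps 1 and 2 have both passed.
        if not (self.offline_backfilled and self.distributions_match):
            raise RuntimeError("release gate closed: backfill and validate first")
        self.online_live = True


release = FeatureGroupRelease(name="user_clicks", version=2)

# Cutting over first is rejected by the gate.
try:
    release.cut_over_online()
except RuntimeError as err:
    blocked_reason = str(err)

# The intended order: backfill, validate, then cut over.
release.record_backfill()
release.record_validation(psi_value=0.05)
release.cut_over_online()
```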
Late Data Drift
Late data drift arises when events arrive after their aggregation window has already closed. An event arriving 10 minutes late may miss the window close in streaming aggregation but appear in the next day's batch backfill, creating count mismatches between the offline and online stores. The fix is event time processing with watermarks that delay window close to wait for late events, plus idempotent upserts keyed by entity, window end, and version. Compensating updates can correct closed windows when very late events arrive beyond the watermark.
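The watermark and upsert mechanics can be sketched in a few lines of plain Python. This is a simplified, single-process illustration, not a streaming engine: the window length, the allowed lateness, the class name, and the in-memory `store` dictionary are all assumptions. The watermark trails the largest event time seen so far, windows close only once the watermark passes their end, and every write overwrites the full count under the key (entity, window end, version), so repeating a write leaves the store unchanged.

```python
WINDOW_SECONDS = 60        # assumed tumbling window length
ALLOWED_LATENESS = 600     # assumed watermark delay (10 minutes)


class WindowedCounter:
    def __init__(self, version=1):
        self.version = version
        self.counts = {}         # (entity, window_end) -> running count in state
        self.closed = set()      # windows already emitted to the store
        self.store = {}          # (entity, window_end, version) -> count
        self.max_event_time = 0

    @staticmethod
    def window_end(event_time):
        return (event_time // WINDOW_SECONDS + 1) * WINDOW_SECONDS

    def upsert(self, key):
        # Idempotent write: the full count overwrites whatever is stored under the key.
        self.store[key + (self.version,)] = self.counts[key]

    def on_event(self, entity, event_time):
        key = (entity, self.window_end(event_time))
        self.counts[key] = self.counts.get(key, 0) + 1
        if key in self.closed:
            # Very late event, beyond the watermark: compensating update.
            self.upsert(key)
        self.max_event_time = max(self.max_event_time, event_time)
        self.close_ready_windows()

    def close_ready_windows(self):
        watermark = self.max_event_time - ALLOWED_LATENESS
        for key in self.counts:
            if key not in self.closed and key[1] <= watermark:
                self.closed.add(key)
                self.upsert(key)


counter = WindowedCounter()
counter.on_event("u1", 10)    # window ending at 60
counter.on_event("u1", 630)   # watermark = 30, window 60 stays open
counter.on_event("u1", 50)    # late, but still within the watermark: counted
counter.on_event("u1", 700)   # watermark = 100, window 60 closes with count 2
count_at_close = counter.store[("u1", 60, 1)]
counter.on_event("u1", 30)    # very late: compensating update corrects the count
count_after_correction = counter.store[("u1", 60, 1)]
```

Note that this sketch makes the store write idempotent but does not deduplicate events; a replayed event would still be counted twice unless event IDs are tracked separately.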