
Single Source of Truth: Unified Feature Definitions

The foundational strategy for preventing training-serving skew is establishing a single source of truth for all features: one declarative feature registry that describes the logic, keys, freshness requirements, and both training and serving semantics for every feature your models consume. Without this, teams inevitably write separate implementations: data scientists build features in Python notebooks for training, while engineers rewrite them in Java or C++ for production, introducing subtle bugs at each translation.

A production feature store provides two execution modes from one definition. The offline mode performs batch computation with time-travel capabilities, allowing you to backfill features as they would have appeared at any historical timestamp. This ensures point-in-time correctness: when training on data from March 15th, you only use features that were available on March 15th, never leaking future information. The online mode materializes features into a low-latency key-value store with Time To Live (TTL) and freshness Service Level Agreements (SLAs), typically targeting p95 fetch latency under 5 milliseconds and staleness under 60 seconds for real-time aggregates.

At Uber, this pattern powers features across pricing, fraud detection, and matching systems: a single feature definition like "rider 7-day trip count" computes identically whether you're building training data for last year or serving a real-time prediction. Meta's feature store serves billions of feature reads per second for News Feed ranking, with the same transformation code running in Spark for training and in optimized C++ for serving. Netflix uses a similar architecture to ensure that user engagement features (watch history, genre preferences) match exactly between model training and homepage personalization.

The key is versioning everything together: feature definitions, vocabularies, transformation functions, and model artifacts all carry consistent version identifiers. When you deploy model version 47, it explicitly depends on feature registry version 23, so the serving infrastructure loads exactly the transform logic that training used. This prevents silent divergence when someone updates a feature definition but forgets to retrain the model.
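The sketch below illustrates the idea of a single declarative definition using plain Python dataclasses rather than any particular feature-store product; the feature name, TTL, SLA values, and version number are illustrative assumptions, not Uber's or Meta's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable

@dataclass(frozen=True)
class FeatureDefinition:
    name: str                     # stable identifier shared by training and serving
    entity_key: str               # join key, e.g. rider_id
    transform: Callable           # the single transformation used by both paths
    ttl: timedelta                # expiry in the online key-value store
    freshness_sla: timedelta      # max acceptable staleness for real-time aggregates
    registry_version: int         # bumped whenever logic or semantics change

def seven_day_trip_count(trip_timestamps: Iterable[datetime], as_of: datetime) -> int:
    """Count trips in the 7 days before `as_of` (point-in-time correct)."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for t in trip_timestamps if window_start <= t < as_of)

# One definition; the offline backfill job and the online materializer both read it.
RIDER_7D_TRIP_COUNT = FeatureDefinition(
    name="rider_7d_trip_count",
    entity_key="rider_id",
    transform=seven_day_trip_count,
    ttl=timedelta(hours=24),
    freshness_sla=timedelta(seconds=60),
    registry_version=23,
)
```

Because both execution modes consume the same `transform` callable and the same metadata, there is no second implementation to drift out of sync.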
💡 Key Takeaways
Feature registry maintains one declarative definition supporting both offline batch computation (time travel, point-in-time correctness) and online materialization (low-latency key-value store with TTL)
Production Service Level Agreements (SLAs): p95 feature fetch latency under 5 milliseconds for critical features, aggregate fetch budgets of 10 to 20 milliseconds, staleness under 60 seconds for real-time aggregates
Version locking prevents divergence: model version 47 explicitly depends on feature registry version 23, ensuring serving loads the exact transform logic used in training (see the sketch after this list)
Scale reference: Meta's feature store serves billions of reads per second for feed ranking; Uber's feature platform powers features across all prediction systems with single-digit millisecond latency
Trade-off: platform investment and runtime constraints slow experimentation with bespoke transforms, but they eliminate an entire class of skew bugs and cut incident response time from days to hours
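As a rough illustration of the version-locking takeaway above, here is a minimal sketch of a deploy-time compatibility check; the manifest fields and version numbers are hypothetical, and real systems would typically read them from the model artifact's metadata.

```python
# Hypothetical manifest recorded alongside the model artifact at training time.
MODEL_MANIFEST = {
    "model_version": 47,
    "feature_registry_version": 23,  # pinned when the training data was built
}

def assert_registry_compatible(loaded_registry_version: int) -> None:
    """Refuse to serve if the running registry differs from the pinned version."""
    expected = MODEL_MANIFEST["feature_registry_version"]
    if loaded_registry_version != expected:
        raise RuntimeError(
            f"Feature registry v{loaded_registry_version} does not match "
            f"v{expected} pinned by model v{MODEL_MANIFEST['model_version']}; "
            "refusing to start to avoid training-serving skew."
        )

# Called during service startup, before the model begins taking traffic.
assert_registry_compatible(loaded_registry_version=23)
```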
📌 Examples
Uber feature store: "rider 7-day trip count" defined once, computed in Spark for training-data backfills and materialized in Redis for real-time trip matching with 3 millisecond p95 read latency (a point-in-time backfill sketch follows these examples)
Netflix user engagement features: watch history and genre preferences use identical transformation code in training (Spark) and serving (optimized C++), preventing skew in homepage personalization
Google Ads bidding: advertiser spend aggregates defined in the feature registry, computed in batch for training, streamed to the online store for serving with a 1-minute freshness guarantee
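For the backfill side referenced in the Uber example, here is a minimal point-in-time join sketch using pandas (the production pipelines described above run in Spark); the column names and DataFrame layout are assumptions for illustration.

```python
import pandas as pd

def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Attach, to each labeled event, the latest feature value observed at or
    before that event's timestamp, so training rows never see future data."""
    return pd.merge_asof(
        labels.sort_values("event_time"),
        features.sort_values("feature_time"),
        left_on="event_time",
        right_on="feature_time",
        by="rider_id",              # match rows belonging to the same rider
        direction="backward",       # only look backward in time, never forward
    )
```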