
Single Source of Truth: Unified Feature Definitions

The foundational strategy for preventing training-serving skew is establishing a single source of truth for all features: one declarative feature registry that describes the logic, keys, freshness requirements, and both training and serving semantics for every feature your models consume. Without this, teams inevitably write separate implementations: data scientists build features in Python notebooks for training, while engineers rewrite them in Java or C++ for production, introducing subtle bugs at each translation.

A production feature store provides two execution modes from one definition. The offline mode performs batch computation with time-travel capabilities, allowing you to backfill features as they would have appeared at any historical timestamp. This ensures point-in-time correctness: when training on data from March 15th, you only use features that were available on March 15th, never leaking future information. The online mode materializes features into a low-latency key-value store with Time To Live (TTL) and freshness Service Level Agreements (SLAs), typically targeting p95 fetch latency under 5 milliseconds and staleness under 60 seconds for real-time aggregates.

At Uber, this pattern powers features across pricing, fraud detection, and matching systems: a single feature definition like "rider 7-day trip count" computes identically whether you're building training data for last year or serving a real-time prediction. Meta's feature store serves billions of feature reads per second for News Feed ranking, with the same transformation code running in Spark for training and in optimized C++ for serving. Netflix uses a similar architecture to ensure that user engagement features (watch history, genre preferences) match exactly between model training and homepage personalization.

The key is versioning everything together: feature definitions, vocabularies, transformation functions, and model artifacts all carry consistent version identifiers. When you deploy model version 47, it explicitly depends on feature registry version 23, so the serving infrastructure loads exactly the transform logic that training used. This prevents silent divergence when someone updates a feature definition but forgets to retrain the model.
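The sketch below illustrates the idea of a single declarative definition using plain Python dataclasses rather than any particular feature-store product; the feature name, TTL, SLA values, and version number are illustrative assumptions, not Uber's or Meta's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable

@dataclass(frozen=True)
class FeatureDefinition:
    name: str                     # stable identifier shared by training and serving
    entity_key: str               # join key, e.g. rider_id
    transform: Callable           # the single transformation used by both paths
    ttl: timedelta                # expiry in the online key-value store
    freshness_sla: timedelta      # max acceptable staleness for real-time aggregates
    registry_version: int         # bumped whenever logic or semantics change

def seven_day_trip_count(trip_timestamps: Iterable[datetime], as_of: datetime) -> int:
    """Count trips in the 7 days before `as_of` (point-in-time correct)."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for t in trip_timestamps if window_start <= t < as_of)

# One definition; the offline backfill job and the online materializer both read it.
RIDER_7D_TRIP_COUNT = FeatureDefinition(
    name="rider_7d_trip_count",
    entity_key="rider_id",
    transform=seven_day_trip_count,
    ttl=timedelta(hours=24),
    freshness_sla=timedelta(seconds=60),
    registry_version=23,
)
```

Because both execution modes consume the same `transform` callable and the same metadata, there is no second implementation to drift out of sync.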
💡 Key Takeaways
Feature registry maintains one declarative definition supporting both offline batch computation (time travel, point-in-time correctness) and online materialization (low-latency key-value store with TTL)
Production Service Level Agreements (SLAs): p95 feature fetch latency under 5 milliseconds for critical features, aggregate fetch budgets of 10 to 20 milliseconds, staleness under 60 seconds for real-time aggregates
Version locking prevents divergence: model version 47 explicitly depends on feature registry version 23, ensuring serving loads the exact transform logic used in training (see the sketch after this list)
Scale reference: Meta's feature store serves billions of reads per second for feed ranking; Uber's feature platform powers features across all prediction systems with single-digit millisecond latency
Trade-off: platform investment and runtime constraints slow experimentation with bespoke transforms, but they eliminate an entire class of skew bugs and cut incident response time from days to hours
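As a rough illustration of the version-locking takeaway above, here is a minimal sketch of a deploy-time compatibility check; the manifest fields and version numbers are hypothetical, and real systems would typically read them from the model artifact's metadata.

```python
# Hypothetical manifest recorded alongside the model artifact at training time.
MODEL_MANIFEST = {
    "model_version": 47,
    "feature_registry_version": 23,  # pinned when the training data was built
}

def assert_registry_compatible(loaded_registry_version: int) -> None:
    """Refuse to serve if the running registry differs from the pinned version."""
    expected = MODEL_MANIFEST["feature_registry_version"]
    if loaded_registry_version != expected:
        raise RuntimeError(
            f"Feature registry v{loaded_registry_version} does not match "
            f"v{expected} pinned by model v{MODEL_MANIFEST['model_version']}; "
            "refusing to start to avoid training-serving skew."
        )

# Called during service startup, before the model begins taking traffic.
assert_registry_compatible(loaded_registry_version=23)
```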
📌 Examples
Uber feature store: "rider 7-day trip count" defined once, computed in Spark for training-data backfills and materialized in Redis for real-time trip matching with 3 millisecond p95 read latency (a point-in-time backfill sketch follows these examples)
Netflix user engagement features: watch history and genre preferences use identical transformation code in training (Spark) and serving (optimized C++), preventing skew in homepage personalization
Google Ads bidding: advertiser spend aggregates defined in the feature registry, computed in batch for training, streamed to the online store for serving with a 1-minute freshness guarantee
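For the backfill side referenced in the Uber example, here is a minimal point-in-time join sketch using pandas (the production pipelines described above run in Spark); the column names and DataFrame layout are assumptions for illustration.

```python
import pandas as pd

def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Attach, to each labeled event, the latest feature value observed at or
    before that event's timestamp, so training rows never see future data."""
    return pd.merge_asof(
        labels.sort_values("event_time"),
        features.sort_values("feature_time"),
        left_on="event_time",
        right_on="feature_time",
        by="rider_id",              # match rows belonging to the same rider
        direction="backward",       # only look backward in time, never forward
    )
```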