Feature Engineering & Feature Stores • Point-in-Time Correctness
Trading Off Storage Cost, Freshness, and PIT Guarantees
Achieving point-in-time (PIT) correctness requires explicit trade-offs between storage cost, feature freshness, and correctness guarantees. Maintaining historical feature versions amplifies storage 1.5 to 3 times versus current-state-only tables, with cost scaling linearly with the retention window (7 to 90 days) and update churn rate. High-frequency features updated every second cost 10 to 100 times more to version than daily batch features due to log growth and compaction overhead.
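To make the scaling concrete, here is a back-of-envelope sketch of the amplification model these numbers imply; the churn rates and retained-version counts are illustrative assumptions, not measurements from any particular feature store.

```python
# Back-of-envelope model: a versioned offline store keeps one delta row per
# retained feature version, on top of a current-state table of size 1.0.
# All churn and version counts below are illustrative assumptions.

def amplification(retained_versions_per_day: float, retention_days: int) -> float:
    """Versioned-table size relative to a current-state-only table."""
    return 1.0 + retained_versions_per_day * retention_days

# Daily batch feature where roughly 2-7% of entities actually change per day.
low = amplification(0.02, 30)    # ~1.6x
high = amplification(0.07, 30)   # ~3.1x
print(f"daily batch, 30d retention: {low:.1f}x to {high:.1f}x")

# High-frequency feature: even after compaction collapses per-second updates
# down to ~10 retained versions per day, the cost is far above the batch case.
hf = amplification(10, 7)        # ~71x
print(f"high-frequency, 7d retention: {hf:.0f}x "
      f"(~{hf / low:.0f}x the daily-batch cost)")
```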
Freshness versus safety creates operational tension. Aggressively applying late-arriving corrections to online stores can destabilize real-time models when retroactive changes shift feature distributions. Many teams adopt a dual correctness model: the online store is eventually correct, using last-write-wins with event-time guards, while the offline store is fully correct, with late data backfilled. This accepts that online predictions use slightly stale or incomplete data (p99 age of 1 to 5 minutes) in exchange for stability, while training gets the complete, corrected history.
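A minimal sketch of the event-time guard on the online path, assuming a simple in-memory key-value store; FeatureValue, OnlineStore, and the feature keys are hypothetical names, not any particular feature store's API.

```python
from dataclasses import dataclass

@dataclass
class FeatureValue:
    value: float
    event_time: float  # epoch seconds when the underlying event occurred

class OnlineStore:
    """Eventually-correct online store: last write wins, guarded by event time."""

    def __init__(self):
        self._table: dict[str, FeatureValue] = {}

    def upsert(self, key: str, update: FeatureValue) -> bool:
        current = self._table.get(key)
        # Event-time guard: drop out-of-order writes instead of letting a
        # late-arriving correction retroactively shift the serving value.
        if current is not None and update.event_time <= current.event_time:
            return False  # stale write ignored; offline backfill handles it
        self._table[key] = update
        return True

store = OnlineStore()
store.upsert("user:42/txn_count_1h", FeatureValue(3.0, event_time=1_700_000_100))
# A correction for an earlier event arrives late: the online store skips it,
# but the offline store would still backfill it into training history.
accepted = store.upsert("user:42/txn_count_1h", FeatureValue(4.0, event_time=1_700_000_050))
print(accepted)  # False
```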
Complexity versus generality limits feature expressiveness. Fixed time windows (7, 30, 90 days) with standard aggregations (count, sum, mean) are easier to make PIT-correct and to cache at low latency. Arbitrary Python user-defined functions (UDFs) or complex transformations increase leakage risk and prevent caching, raising serving latency from 10 milliseconds to 100-plus milliseconds. Teams therefore constrain feature definitions to a declarative domain-specific language (DSL) to encode PIT guarantees and enable optimization.
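A sketch of what such a constrained DSL might look like in Python; FeatureDef and the allowed vocabularies are illustrative assumptions, and a real platform would compile these specs into batch and streaming pipelines.

```python
from dataclasses import dataclass

ALLOWED_WINDOWS = {"7d", "30d", "90d"}
ALLOWED_AGGS = {"count", "sum", "mean"}

@dataclass(frozen=True)
class FeatureDef:
    """Hypothetical declarative feature spec. Because the window and
    aggregation come from closed vocabularies, the platform can verify the
    feature is PIT-safe and precompute/cache it, unlike an arbitrary UDF."""
    entity: str
    source_event: str
    aggregation: str
    window: str

    def __post_init__(self):
        if self.window not in ALLOWED_WINDOWS:
            raise ValueError(f"window {self.window!r} not in {ALLOWED_WINDOWS}")
        if self.aggregation not in ALLOWED_AGGS:
            raise ValueError(f"aggregation {self.aggregation!r} not in {ALLOWED_AGGS}")

# Equivalent of user.click_count(window=7d, aggregation=sum):
clicks_7d = FeatureDef(entity="user", source_event="click",
                       aggregation="sum", window="7d")

# An unsupported aggregation is rejected at registration time,
# instead of leaking or blowing up latency at serving time:
try:
    FeatureDef(entity="user", source_event="click",
               aggregation="median", window="7d")
except ValueError as e:
    print(e)
```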
You can relax PIT correctness for purely cross-sectional problems with truly instantaneous labels and static features, or for exploratory prototyping where leakage is acceptable. But any production supervised learning with time-dependent features, delayed labels (fraud, recommendations, ads), or streaming data requires strict PIT enforcement. The cost is 1.5 to 4 times higher compute for temporal joins and 2 to 3 times more storage, recovered through improved model accuracy (5 to 20 percent) and reproducibility.
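For concreteness, here is a minimal PIT-correct temporal join using pandas' merge_asof; the tables and column names are toy data, and production systems run the same as-of semantics at far larger scale.

```python
import pandas as pd

# Feature values with the event time at which each version became known.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "clicks_7d": [3, 9, 4],
}).sort_values("event_time")

# Training labels with the time each prediction would have been made.
labels = pd.DataFrame({
    "user_id": [1, 2],
    "label_time": pd.to_datetime(["2024-01-04", "2024-01-03"]),
    "label": [0, 1],
}).sort_values("label_time")

# PIT-correct as-of join: for each label, take the latest feature value with
# event_time <= label_time, per user. A naive "latest value" join would leak
# the 2024-01-05 feature update into the 2024-01-04 training row.
train = pd.merge_asof(
    labels, features,
    left_on="label_time", right_on="event_time",
    by="user_id", direction="backward",
)
print(train[["user_id", "label_time", "clicks_7d", "label"]])
```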
💡 Key Takeaways
• Storage amplification of 1.5 to 3 times versus current-state tables scales with retention window (7 to 90 days) and update frequency, with high-churn features costing 10 to 100 times more than batch
• The dual correctness model trades real-time freshness for stability: online is eventually correct with p99 age of 1 to 5 minutes; offline is fully correct with backfills applied retroactively
• Constraining features to a declarative DSL with fixed windows and standard aggregations enables PIT guarantees and caching at 10-millisecond latency, versus 100-plus milliseconds for arbitrary UDFs
• Temporal as-of joins cost 1.5 to 4 times more compute than naive latest-value joins at 100-million-plus-row scale due to partitioning, sorting, and windowing overhead
• It is acceptable to relax PIT for cross-sectional problems with instantaneous labels and static features, or for exploratory prototyping, but production time-dependent models require strict enforcement
• Improved model accuracy of 5 to 20 percent, plus reproducibility for audits and rollbacks, justifies the 2 to 3 times storage and compute cost in production ML systems
📌 Examples
Uber Palette: Uses 30-day retention for rapid experimentation and 90-day retention for regulated ride-safety models. High-frequency GPS features use aggressive compaction, while daily batch features use simple append-only logs
Netflix recommendation: Online serving reads from a cache with a p99 feature age of 5 minutes for stability; offline training uses the backfilled complete history, with late arrivals corrected up to 24 hours after the original event
Feature DSL example: user.click_count(window=7d, aggregation=sum) is PIT-safe and cacheable at 5 ms p99, versus an arbitrary Python lambda that recomputes on each request at 50 to 100 ms latency