Feature Engineering & Feature Stores • Feature Freshness & Staleness
What is Feature Freshness and Why Does It Matter?
Feature freshness is the age of a feature value relative to when it is used for prediction, calculated as the current time minus the event time that produced the feature. When this age exceeds an agreed Service Level Agreement (SLA), the feature is considered stale. For example, if a fraud detection feature showing "number of transactions in last 5 minutes" was computed 3 minutes ago, its freshness is 3 minutes.
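A minimal sketch of that calculation in Python; the 5-minute SLA and function names here are illustrative assumptions, not the API of any particular feature store:

```python
from datetime import datetime, timedelta, timezone

# Assumed SLA for a "transactions in last 5 minutes" fraud feature.
FRESHNESS_SLA = timedelta(minutes=5)

def freshness(event_time: datetime) -> timedelta:
    """Age of a feature value: current time minus the event time
    that produced it (not the time the pipeline finished)."""
    return datetime.now(timezone.utc) - event_time

def is_stale(event_time: datetime) -> bool:
    return freshness(event_time) > FRESHNESS_SLA

# A value computed from an event 3 minutes ago is 3 minutes fresh.
event_time = datetime.now(timezone.utc) - timedelta(minutes=3)
print(freshness(event_time))  # ~0:03:00
print(is_stale(event_time))   # False under the 5-minute SLA
```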
Freshness requirements vary dramatically by use case. Fraud signals and live inventory at companies like Uber must have 95th-percentile (p95) freshness under 5 to 10 seconds, because stale data leads to incorrect pricing or fraud going undetected. In contrast, user embeddings or long-term purchase history at Netflix can tolerate 24-hour staleness since they capture stable patterns. The key insight is that freshness is a product metric tied to business impact, not just a pipeline detail.
The architectural choice between precompute-and-serve versus compute-on-demand directly controls the trade-off between freshness, latency, and cost. Precomputing features and caching them delivers low-latency lookups (single-digit milliseconds) but risks staleness between refresh cycles. Computing features at request time maximizes freshness but consumes precious latency budget and requires more compute resources. Most production systems use a hybrid approach, precomputing stable features while computing high-volatility signals on demand.
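One way the hybrid pattern might look in code; `compute_live` and the cache layout are hypothetical stand-ins for whatever online store and on-demand path a real system uses:

```python
import time
from typing import Callable

STALENESS_BUDGET_S = 60  # assumed SLA for this feature, in seconds

def get_feature(key: str,
                cache: dict,
                compute_live: Callable[[str], float]) -> float:
    """Serve the precomputed value when it is fresh enough; fall back
    to on-demand computation once the cached value breaches its SLA."""
    entry = cache.get(key)  # (value, event_time_epoch_seconds) or None
    if entry is not None:
        value, event_time = entry
        if time.time() - event_time <= STALENESS_BUDGET_S:
            return value  # fast path: cheap cached lookup
    # Slow path: spend latency budget to maximize freshness.
    value = compute_live(key)
    cache[key] = (value, time.time())
    return value
```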
Companies enforce freshness through tiered SLAs. DoorDash targets p95 under 60 seconds for operational features like store busy state, and under 10 seconds during demand spikes. LinkedIn's Venice derived-data store delivers single-digit-millisecond p99 read latency, with nearline updates completing in seconds to minutes. These concrete targets drive infrastructure decisions about streaming pipelines, online stores, and fallback strategies.
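Such tiers are often expressed as declarative configuration that both monitoring and routing can read. The tier names and p95 targets below are illustrative assumptions (they mirror the three-tier breakdown in the takeaways that follow), not any company's published numbers:

```python
# Illustrative tier definitions; the targets are assumptions, not
# published SLAs from Uber, DoorDash, LinkedIn, or Netflix.
FRESHNESS_TIERS = {
    "realtime": {"p95_target_s": 5,      "pipeline": "stream -> online store"},
    "nearline": {"p95_target_s": 300,    "pipeline": "micro-batch -> online store"},
    "batch":    {"p95_target_s": 86_400, "pipeline": "daily job -> offline store"},
}

def p95_budget(tier: str) -> int:
    """p95 freshness budget, in seconds, for a feature's tier."""
    return FRESHNESS_TIERS[tier]["p95_target_s"]
```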
💡 Key Takeaways
• Freshness is calculated as now minus event time, not processing time. A feature computed from a 10-minute-old event is 10 minutes stale even if the computation just finished.
• Uber marketplace inference runs at over 100k Queries Per Second (QPS) globally during peaks with only a 20 to 50ms total prediction budget, leaving 5 to 15ms p99 for feature retrieval.
• Staleness harms business metrics measurably. DoorDash found that delivery time predictions degrade significantly when store busy-state features exceed 60 seconds of age during peak hours.
• Most production systems define three tiers: realtime (p95 under 5 seconds), nearline (p95 under 5 minutes), and batch (p95 under 24 hours), with different infrastructure for each.
• Monitoring must track distributions, not averages. A p50 freshness of 2 seconds with a p99 of 5 minutes means 1% of predictions use critically stale data, causing bad user experiences (see the sketch after this list).
• Freshness requirements should be validated through A/B testing. Netflix only pushes features to real-time infrastructure when experiments prove that reducing staleness improves Click-Through Rate (CTR) or engagement.
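A sketch of distribution-aware monitoring, assuming observed feature ages are collected per serving window; the sample data and SLA threshold are illustrative:

```python
import numpy as np

def freshness_report(ages_s: list[float], p95_sla_s: float) -> dict:
    """Summarize observed feature ages for one monitoring window.
    Tracking percentiles catches the stale tail that averages hide."""
    p50, p95, p99 = np.percentile(ages_s, [50, 95, 99])
    return {"p50": p50, "p95": p95, "p99": p99,
            "p95_sla_breached": bool(p95 > p95_sla_s)}

# A healthy median can hide a critically stale tail:
ages = [2.0] * 98 + [300.0] * 2  # p50 = 2 s, but p99 = 300 s (5 minutes)
print(freshness_report(ages, p95_sla_s=5.0))
```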
📌 Examples
Uber dynamic pricing uses features like nearby driver supply with p95 freshness under 10 seconds. If this data goes stale by 5 minutes during rush hour, surge multipliers become inaccurate and drivers are misallocated.
LinkedIn feed ranking combines daily batch user embeddings (24-hour staleness acceptable) with nearline engagement signals (updated within 60 seconds) to balance freshness and cost.
Netflix homepage ranking accepts 24-hour staleness for heavy recommendation embeddings while computing context features like time of day and device type at request time for sub-50ms latency.