Feature Engineering & Feature Stores: Feature Freshness & Staleness
Production Implementation: Metadata, Tiering, and Capacity Planning
Production feature stores tag each feature with a freshness tier and a numeric Service Level Agreement (SLA): realtime (p95 age under 5 seconds, p99 under 15 seconds), nearline (p95 under 5 minutes, p99 under 15 minutes), or batch (p95 under 24 hours, p99 under 48 hours). Each feature carries metadata: event time (when the underlying event occurred), last updated at (when the feature was computed and written), computation window (e.g., a 30-minute sliding window), soft TTL (warn threshold), and hard TTL (fallback threshold). This metadata enables the online feature assembler to compute age at request time, enforce SLAs, and degrade gracefully.
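A minimal sketch of what this per-feature metadata and the request-time age check could look like. The FeatureMetadata fields, tier thresholds, and status labels below are illustrative assumptions, not a particular store's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative p95 targets per tier, taken from the SLAs above.
TIER_P95_SLA = {
    "realtime": timedelta(seconds=5),
    "nearline": timedelta(minutes=5),
    "batch": timedelta(hours=24),
}

@dataclass
class FeatureMetadata:
    name: str
    tier: str                      # "realtime" | "nearline" | "batch"
    event_time: datetime           # when the underlying event occurred
    last_updated_at: datetime      # when the feature value was written
    computation_window: timedelta  # e.g. a 30-minute sliding window
    soft_ttl: timedelta            # warn threshold
    hard_ttl: timedelta            # fallback threshold

def freshness_status(meta: FeatureMetadata, now: datetime) -> str:
    """Classify a feature's age at request time."""
    age = now - meta.last_updated_at
    if age > meta.hard_ttl:
        return "stale_hard"   # assembler must fall back (default / drop)
    if age > meta.soft_ttl:
        return "stale_soft"   # serve, but emit a staleness warning metric
    if age > TIER_P95_SLA[meta.tier]:
        return "sla_breach"   # within TTL, but violating the tier SLA
    return "fresh"
```

The assembler can log or act on the returned status per feature, which is what makes the graceful degradation described above possible.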
Capacity planning for freshness starts with latency budgets. If the total p99 inference latency budget is 50ms and model execution takes 30ms, you have 20ms for feature retrieval. Fetching 100 features from an online store with 5ms p99 read latency requires batching and prefetching. Uber batches lookups by entity (user, trip, driver) to reduce round trips, achieving 100 feature lookups in 10 to 15ms p99. For write capacity, estimate update queries per second (QPS): 100k entities averaging 1 update per minute is roughly 1,667 writes per second baseline, but provision for a 5x to 10x burst factor to handle hotspots and peak hours.
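The arithmetic is simple enough to encode as a back-of-the-envelope check. The helper names below are illustrative; the numbers mirror the text.

```python
def feature_retrieval_budget_ms(total_p99_ms: float, model_exec_ms: float) -> float:
    """Latency left for feature retrieval after model execution."""
    return total_p99_ms - model_exec_ms

def provisioned_write_qps(entities: int, updates_per_minute: float,
                          burst_factor: float) -> float:
    """Provisioned write QPS = baseline update rate times a burst factor."""
    baseline = entities * updates_per_minute / 60.0
    return baseline * burst_factor

print(feature_retrieval_budget_ms(50, 30))           # 20 ms left for retrieval
print(provisioned_write_qps(100_000, 1, 1.0))        # ~1,667 writes/s baseline
print(provisioned_write_qps(100_000, 1, 10.0))       # ~16,667 writes/s at 10x burst
```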
Cross-region freshness strategy is feature specific. For non-critical features, read locally and accept possible replication lag (seconds to minutes of staleness, but low latency). For critical features, read from the primary write region to maximize freshness despite higher latency. Netflix uses regional caching for user embeddings (accepting minutes of staleness) but reads device and session state from the primary region within 20ms p99. Expose replication lag as a metric and choose dynamically: if lag is under 10 seconds, read locally; if lag spikes above the threshold, route to primary, budget permitting.
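A sketch of lag-aware routing under those rules. The thresholds, penalty estimate, and function signature are assumptions for illustration, not a specific vendor's API.

```python
REPLICATION_LAG_THRESHOLD_S = 10.0   # from the text: read locally if lag < 10s
CROSS_REGION_PENALTY_MS = 100.0      # assumed cross-continent round-trip cost

def choose_read_region(is_critical: bool,
                       replication_lag_s: float,
                       remaining_budget_ms: float) -> str:
    """Return "local" or "primary" for a single feature read."""
    if is_critical:
        # Critical features always read the primary write region for freshness.
        return "primary"
    if replication_lag_s < REPLICATION_LAG_THRESHOLD_S:
        return "local"
    # Lag spiked: pay the cross-region hop only if the latency budget absorbs it;
    # otherwise serve the (staler) local replica.
    return "primary" if remaining_budget_ms >= CROSS_REGION_PENALTY_MS else "local"
```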
Backfill and recomputation require separate lanes to avoid poisoning online freshness. When recomputing 90 days of features for a model retrain, route writes to a versioned offline store, not the online store, and only promote to online serving after validating correctness. Use version numbers or namespaces so backfills don't overwrite fresher values. DoorDash schedules heavy recomputations during off-peak hours and throttles the write rate to protect online serving traffic. For online-store updates that do originate from backfills, apply a guard: only write if the backfill timestamp is newer than the current online value's.
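The guard itself is a compare-then-write on the timestamp. The OnlineStore interface here is hypothetical; a real store would want this check to be atomic (e.g., a conditional write).

```python
from datetime import datetime
from typing import Optional, Protocol

class OnlineStore(Protocol):
    def get_updated_at(self, feature: str, entity_id: str) -> Optional[datetime]: ...
    def put(self, feature: str, entity_id: str, value: object,
            updated_at: datetime) -> None: ...

def guarded_backfill_write(store: OnlineStore, feature: str, entity_id: str,
                           value: object, backfill_ts: datetime) -> bool:
    """Write a backfilled value only if it is newer than what is already online."""
    current_ts = store.get_updated_at(feature, entity_id)
    if current_ts is not None and backfill_ts <= current_ts:
        return False   # the online value is fresher; skip to avoid poisoning it
    store.put(feature, entity_id, value, backfill_ts)
    return True
```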
💡 Key Takeaways
•Latency budgets force trade-offs. With a 50ms total p99 budget and 30ms of model execution, feature retrieval has only 20ms. Batching 100 features by entity reduces round trips from 100 to 3, fitting within 15ms p99.
•Write capacity must handle burst factors of 5x to 10x, not just averages. Uber provisions nearline stores for p99 load during peak hours, which can be 10x average load during events like New Year's Eve.
•Cross-region reads trade freshness for latency. Reading from the primary region adds 50 to 150ms of cross-continent latency but guarantees fresh data. Netflix reads embeddings locally (accepting 2-minute replication lag) but reads session state from the primary region.
•Backfills in separate lanes prevent online poisoning. DoorDash routes 90 day historical recomputations to versioned offline stores. Only after validation do they promote to online serving, with guards against overwriting fresher values.
•Feature metadata enables runtime decisions. Including last updated at and TTL lets the assembler substitute defaults, drop features, or include age as a model input when freshness SLAs are violated (see the sketch after this list).
•Monitoring replication lag is critical for geo distributed systems. LinkedIn tracks offset deltas between regions per feature store partition. When lag exceeds 5 minutes, alert and route critical reads to primary.
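A sketch of that degradation step in the assembler: substitute a default (or drop) past the hard TTL, and optionally expose age to the model. The policy shape and return format are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

def resolve_feature(value: object,
                    last_updated_at: datetime,
                    hard_ttl: timedelta,
                    now: datetime,
                    default: Optional[object] = None,
                    expose_age: bool = False) -> dict:
    """Decide what to serve for one feature, plus an optional age signal."""
    age = now - last_updated_at
    if age > hard_ttl:
        # Past the hard TTL: substitute a default (None effectively drops it).
        served = default
    else:
        served = value
    out = {"value": served}
    if expose_age:
        # Surfacing age lets the model learn how much to trust the feature.
        out["age_seconds"] = age.total_seconds()
    return out
```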
📌 Examples
Uber Michelangelo batches feature lookups by entity type. For a trip prediction, it fetches all rider features in one lookup (10ms), all driver features in another (8ms), and contextual features in a third (5ms), totaling 23ms p99 for 100+ features.
Netflix maintains two tiers of feature storage: regional read replicas for user embeddings with 2 to 5 minute replication lag and primary region lookups for session state with 10ms p99 latency, choosing based on criticality.
DoorDash discovered a backfill job had overwritten fresh store-busyness signals with 3-hour-old values during a nightly recomputation. Adding a version check (only write if new_version > current_version) prevented the regression.