
Failure Modes: Silent Staleness and Training-Serving Skew

Silent Staleness

Silent staleness occurs when features look fresh by pipeline metrics but are actually stale because of hidden issues. Clock skew between the host that measures "now" and the host that timestamped the feature can produce negative or understated ages. If feature timestamps come from a server whose clock runs 30 seconds ahead, the measured age is 30 seconds too low, so stale features pass freshness checks. The fix is to compute age in one place, server side, rather than on each client; to store both event time and ingestion time; and to enforce NTP discipline across the infrastructure. Monotonic clocks help when measuring elapsed time within a single host, but monotonic readings are not comparable across hosts.
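A minimal sketch of a skew-tolerant age calculation follows. It assumes that `ingestion_time` and `now` are read from the same (store side) wall clock and that `event_time` comes from a possibly skewed producer. The names `FeatureValue`, `feature_age`, and `max_skew` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureValue:
    value: float
    event_time: float      # producer wall clock, seconds since epoch (may be skewed)
    ingestion_time: float  # store wall clock at write time, seconds since epoch


def feature_age(fv: FeatureValue, now: float, max_skew: float = 5.0) -> tuple[float, bool]:
    """Return (age_seconds, skew_suspected) using the store's clock for `now`."""
    # An event cannot be ingested before it happened, so an event time that is
    # well ahead of ingestion time points at a fast producer clock.
    skew_suspected = (fv.event_time - fv.ingestion_time) > max_skew
    # Never trust a timestamp that lies ahead of ingestion: use the earlier one.
    reference = min(fv.event_time, fv.ingestion_time)
    # Clamp so that residual skew cannot yield a negative age.
    age = max(0.0, now - reference)
    return age, skew_suspected
```

With a producer running 30 seconds fast, a naive `now - event_time` understates age by 30 seconds; the sketch falls back to ingestion time and raises a flag that can feed an alert.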

Cache-Induced Staleness

Cache-induced staleness extends the effective TTL beyond the freshness SLA when invalidation is delayed or missed. Consider a feature computed at T=0 with a 60 second SLA that is cached at T=30 with a 120 second cache TTL. The entry can be served until T=150, at which point its age is 150 seconds: an SLA violation even though feature materialization and caching each operated correctly. Mitigations include age-aware cache keys that include the computation timestamp, or piggybacking the feature's computation time onto the cache entry so clients can run their own freshness checks.
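The piggybacking approach can be sketched as a cache whose entries carry the feature's computation time, so a read is judged against the freshness SLA rather than against the time the entry was cached. `AgeAwareCache`, `CacheEntry`, and the injectable `clock` are hypothetical names for this illustration, not the API of any particular cache.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CacheEntry:
    value: float
    computed_at: float  # when the feature was materialized, not when it was cached


class AgeAwareCache:
    def __init__(self, clock: Callable[[], float] = time.time):
        self._entries: Dict[str, CacheEntry] = {}
        self._clock = clock

    def put(self, key: str, value: float, computed_at: float) -> None:
        self._entries[key] = CacheEntry(value, computed_at)

    def get(self, key: str, sla_seconds: float) -> Optional[float]:
        """Return the cached value only if its true age is within the SLA."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        age = self._clock() - entry.computed_at
        if age > sla_seconds:
            # Treat as a miss and evict, so the caller refetches a fresh value.
            del self._entries[key]
            return None
        return entry.value
```

In the scenario above, a read at T=50 is served from cache, while a read at T=150 is rejected because the age is measured from T=0, the computation time.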

Training-Serving Skew from Freshness

Training pipelines typically use complete batch data with no staleness, while serving uses streaming features with variable freshness. A model trained on perfectly fresh features may degrade when served with 30 second old data. One mitigation is to inject synthetic staleness during training: for a label at time T, read each feature as of T minus delta, where delta is sampled from the production freshness distribution. This teaches the model to be robust to stale inputs.
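As a rough sketch of synthetic staleness injection, the snippet below looks up a feature's value as of the label time minus a sampled delay. It assumes the feature history is available as parallel, time-sorted lists and that `observed_ages` is an empirical sample of feature ages logged in production. All function and parameter names are illustrative.

```python
import bisect
import random
from typing import List, Optional, Sequence


def value_as_of(timestamps: Sequence[float], values: Sequence[float], t: float) -> Optional[float]:
    """Latest value whose timestamp is <= t, or None if nothing existed yet."""
    idx = bisect.bisect_right(timestamps, t) - 1
    return values[idx] if idx >= 0 else None


def stale_training_value(
    timestamps: Sequence[float],
    values: Sequence[float],
    label_time: float,
    observed_ages: List[float],
    rng: random.Random,
) -> Optional[float]:
    """Read the feature as serving would have seen it, delayed by a sampled age."""
    # Sampling from logged production ages reproduces the freshness distribution
    # without assuming a parametric form for it.
    delta = rng.choice(observed_ages)
    return value_as_of(timestamps, values, label_time - delta)
```

Drawing delta from logged ages rather than a fixed constant exposes the model to the same spread of staleness it would meet in production, including the tail.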

Pipeline Lag Masking

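Aggregating lag metrics across all features hides per-feature problems. Consider a pipeline with 10 features where 9 have 5 second lag and 1 has 5 minute lag. With equal sample volume per feature, the aggregate median and even the aggregate p90 read 5 seconds, so a dashboard built on those percentiles masks the outlier; if the slow feature is low volume, higher percentiles hide it too. Per-feature freshness monitoring is essential to catch isolated degradation.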
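To make the masking effect concrete, here is a small sketch that computes a nearest-rank percentile both across all lag samples and per feature. The input shape, a list of `(feature_name, lag_seconds)` pairs, and the function names are assumptions for the example.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def nearest_rank_percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def per_feature_percentile(
    lag_samples: Iterable[Tuple[str, float]], pct: int
) -> Dict[str, float]:
    """Group lag samples by feature and compute the percentile for each group."""
    by_feature: Dict[str, List[float]] = defaultdict(list)
    for name, lag in lag_samples:
        by_feature[name].append(lag)
    return {name: nearest_rank_percentile(lags, pct) for name, lags in by_feature.items()}
```

With nine features at 5 seconds and one at 300 seconds, each contributing the same number of samples, the pooled p90 still reads 5 seconds while the per-feature view reports 300 seconds for the slow feature. Alerting on the maximum of the per-feature percentiles is one way to surface it.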

💡 Key Takeaways
Clock skew causes negative or understated age. If the server that timestamps features runs 30 seconds ahead, measured age is 30 seconds too low. Compute age server side, use monotonic clocks only for elapsed time within a host, and enforce strict NTP synchronization.
Multi-tier cache TTLs compound. Three cache layers, each with a 60 second TTL, create 180 seconds of worst case staleness. Bound each tier's TTL to (feature_sla / number_of_tiers) to stay within the SLA.
Training-serving skew from label leakage can inflate offline metrics by 10 to 20 AUC points (0.10 to 0.20) and then fail in production. Enforce as-of joins where features at time T use only data available before T minus operational delay.
Replication lag during peak load can leave features stale for geo-routed users. Monitor Log Sequence Number (LSN) or offset gaps between regions. When lag exceeds the SLA, route reads to the primary region despite the higher latency.
Hot keys from viral content cause write retries and partition hotspots. One DoorDash restaurant during dinner rush generated 50 orders per minute, overwhelming a single partition. Sharding counters 10 ways and merging on read resolved it.
Backfill storms can overwrite fresh online values with old data. Route backfills to separate stores or namespaces. Use version numbers and max age guards: only write to online store if backfill_timestamp > current_online_timestamp.
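The timestamp guard from the last takeaway can be sketched as a conditional write. This in-memory `OnlineStore` is a hypothetical stand-in; a real online store would need the compare and the write to happen atomically (for example through a conditional put), which this sketch does not attempt to model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OnlineRecord:
    value: float
    feature_timestamp: float  # time the feature value describes, not the write time


class OnlineStore:
    def __init__(self) -> None:
        self._records: Dict[str, OnlineRecord] = {}

    def write_if_newer(self, key: str, value: float, feature_timestamp: float) -> bool:
        """Apply the write only if it is strictly newer than what is stored."""
        current = self._records.get(key)
        if current is not None and feature_timestamp <= current.feature_timestamp:
            return False  # a backfill carrying older data is dropped
        self._records[key] = OnlineRecord(value, feature_timestamp)
        return True

    def read(self, key: str) -> Optional[OnlineRecord]:
        return self._records.get(key)
```

Routing both streaming and backfill writers through the same guarded write means a late backfill can fill gaps for keys with no value but cannot replace a fresher streaming value.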
📌 Interview Tips
1. Uber discovered clock skew when prediction quality degraded but freshness dashboards showed green. Root cause was feature timestamps from a server 45 seconds fast, causing stale features to pass age checks. Fix was centralized server side age calculation.
2. A LinkedIn feature for "profile views in last 24 hours" was trained using a simple join of labels and latest feature snapshots. Offline AUC was 0.83. Production AUC was 0.68 because training included views that happened after the label time. Switching to point in time joins fixed it.
3. DoorDash hit a hot key issue where one store generated 3000 updates per hour. Write retries caused freshness to degrade from p99 of 30 seconds to p99 of 5 minutes. Sharding the counter into 10 keys and summing on read reduced p99 to 45 seconds.
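The sharded counter from the last example can be sketched as follows. A plain dict stands in for the key-value store, and the key format, class name, and random shard selection are assumptions made for illustration rather than a description of any production system.

```python
import random
from typing import Dict, Optional


class ShardedCounter:
    """Spread increments for one hot key across several shard keys."""

    def __init__(
        self,
        store: Dict[str, int],
        key: str,
        num_shards: int = 10,
        rng: Optional[random.Random] = None,
    ) -> None:
        self._store = store
        self._key = key
        self._num_shards = num_shards
        self._rng = rng or random.Random()

    def _shard_key(self, shard: int) -> str:
        return f"{self._key}#shard{shard}"

    def increment(self, amount: int = 1) -> None:
        # Each write picks a random shard, so no single key takes every write.
        shard_key = self._shard_key(self._rng.randrange(self._num_shards))
        self._store[shard_key] = self._store.get(shard_key, 0) + amount

    def value(self) -> int:
        # Reads pay for the sharding: one lookup per shard, summed.
        return sum(
            self._store.get(self._shard_key(i), 0) for i in range(self._num_shards)
        )
```

The trade-off is that each read touches every shard, so this pattern suits counters that are written far more often than they are read.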