Failure Modes: Silent Staleness and Training-Serving Skew
Silent Staleness
Occurs when features appear fresh by pipeline metrics but are actually stale due to hidden issues. Clock skew between the host measuring "now" and the host that timestamped the features can produce negative or understated age calculations: if feature timestamps come from a server running 30 seconds ahead, measured age will be 30 seconds too low, letting stale features pass freshness checks. The fix is to compute age server-side using monotonic clocks where possible, store both event time and ingestion time, and enforce NTP discipline across the infrastructure.
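A minimal sketch of the skew-tolerant age check (the function name and fallback policy are illustrative, not a specific library's API): storing both event time and ingestion time lets the server detect a producer clock running ahead and fall back to the timestamp it stamped itself.

```python
# Hypothetical sketch: compute feature age on the server that enforces
# the freshness check, using both event time and ingestion time.
def feature_age_seconds(event_ts: float, ingest_ts: float, now: float) -> float:
    raw_age = now - event_ts
    if raw_age < 0:
        # Negative age means the producer's clock runs ahead of ours.
        # Fall back to ingestion time, which our own host stamped.
        return max(now - ingest_ts, 0.0)
    return raw_age

# Producer clock 30s ahead: the naive age would be -30s and would pass
# any freshness threshold. The fallback reports the 45s ingestion age.
now = 1_000.0
skewed_event_ts = now + 30.0   # timestamped by a fast clock
ingest_ts = now - 45.0         # when our pipeline actually received it
print(feature_age_seconds(skewed_event_ts, ingest_ts, now))  # 45.0
```

In production the `now` argument would come from a monotonic or NTP-disciplined clock on the checking host rather than being passed in explicitly.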
Cache-Induced Staleness
Extends the effective TTL beyond the freshness SLA when invalidation is delayed or missed. A feature computed at T=0 with a 60-second SLA, cached at T=30 with a 120-second cache TTL, may be served at T=150 at an age of 150 seconds, violating the SLA even though both feature materialization and caching operated correctly. Mitigations include age-aware cache keys that incorporate the computation timestamp, or piggybacking feature age onto cache entries so clients can run their own freshness checks.
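The piggybacking mitigation can be sketched as follows (the cache wrapper and names are hypothetical, and the 60-second SLA is taken from the example above): each entry carries its computation timestamp, so the reader enforces the feature SLA regardless of the cache's own TTL.

```python
FRESHNESS_SLA_S = 60.0  # assumed SLA from the example above

# Hypothetical in-process cache: key -> (value, computed_ts). The
# computation timestamp rides along with the value so freshness can
# be checked at read time, independent of the cache TTL.
cache: dict[str, tuple[float, float]] = {}

def put(key: str, value: float, computed_ts: float) -> None:
    cache[key] = (value, computed_ts)

def get_fresh(key: str, now: float):
    entry = cache.get(key)
    if entry is None:
        return None
    value, computed_ts = entry
    if now - computed_ts > FRESHNESS_SLA_S:
        return None  # cache hit, but stale by the feature SLA
    return value

# Feature computed at T=0, cached at T=30 with a 120s cache TTL.
put("user_ctr", 0.42, computed_ts=0.0)
print(get_fresh("user_ctr", now=45.0))   # 0.42 (age 45s within 60s SLA)
print(get_fresh("user_ctr", now=150.0))  # None (age 150s violates SLA)
```

The age-aware-key alternative would instead fold `computed_ts` (bucketed) into the cache key, so a stale entry simply misses rather than being filtered at read time.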
Training-Serving Skew from Freshness
Training pipelines typically use complete batch data with no staleness, while serving uses streaming features with variable freshness. A model trained on perfectly fresh features can degrade when served 30-second-old data. One mitigation injects synthetic staleness during training: randomly sample features as of time T - delta, where delta follows the production freshness distribution, teaching the model to be robust to stale inputs.
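A sketch of the staleness-injection step (the empirical age sample and function name are assumptions for illustration): draw delta from observed production feature ages and shift the training-time feature lookup back by that amount.

```python
import random

# Assumed empirical sample of production feature ages, in seconds.
production_ages_s = [1, 2, 5, 5, 10, 30, 30, 60]

def training_lookup_time(event_time: float, rng: random.Random) -> float:
    # Resample delta from the production freshness distribution and
    # look features up as of T - delta instead of exactly T.
    delta = rng.choice(production_ages_s)
    return event_time - delta

rng = random.Random(0)
# For a training example at T=1000, features are fetched as of T - delta,
# so the model sees the same staleness it will face at serving time.
lookup_ts = [training_lookup_time(1000.0, rng) for _ in range(3)]
print(lookup_ts)  # each value is 1000 minus a sampled production age
```

In a point-in-time-correct feature store, `training_lookup_time` would feed the as-of timestamp of the historical feature join; a smoothed or parametric fit of the age distribution works equally well as the sampling source.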
Pipeline Lag Masking
Aggregating lag metrics across all features hides per-feature problems. In a pipeline with 10 equally sampled features where 9 lag by 5 seconds and 1 lags by 5 minutes, the aggregate p90 is 5 seconds, because the outlier contributes only 10 percent of samples and sits just past the 90th percentile. Per-feature freshness monitoring is essential to catch isolated degradation.
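The masking arithmetic can be checked directly (assuming each feature contributes an equal number of lag samples; the nearest-rank percentile and feature names are illustrative): with one feature in ten at 300 s, the aggregate p90 is still 5 s, while per-feature percentiles surface the outlier.

```python
import math

def percentile(samples: list[float], q: float) -> float:
    # Nearest-rank percentile: value at rank ceil(q * N).
    ordered = sorted(samples)
    idx = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[idx]

# 9 features lag 5s; 1 feature lags 300s. Equal sample counts per feature.
lags = {f"feature_{i}": [5.0] * 100 for i in range(9)}
lags["feature_9"] = [300.0] * 100

all_samples = [s for v in lags.values() for s in v]
print(percentile(all_samples, 0.90))  # 5.0 -- aggregate hides the outlier

per_feature_p90 = {name: percentile(v, 0.90) for name, v in lags.items()}
print(max(per_feature_p90.values()))  # 300.0 -- per-feature view exposes it
```

Note that the aggregate only hides the outlier below the 90th percentile here; 10 percent of samples at 300 s push the aggregate p95 itself to 300 s, which is why the masking gets worse as the share of slow features shrinks.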