Per-Feature Age Distributions
Effective freshness monitoring requires tracking per-feature age distributions, not just pipeline success metrics. A batch job marked "succeeded" can still deliver stale features if upstream data was delayed or if the job processed only a subset of entities. Teams must emit histograms of feature age (p50, p95, p99) for each feature and alert when percentiles exceed soft or hard TTL thresholds. DoorDash monitors both end-to-end lag (event time to availability in the online store) and per-entity freshness to catch partial failures.
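A minimal sketch of this percentile tracking, assuming write timestamps are recorded alongside each feature value; the function names and TTL values are illustrative, not from any specific system:

```python
# Sketch of per-feature age percentile tracking against soft/hard TTLs.
# Names (age_percentiles, check_freshness) and thresholds are illustrative.
import time
from statistics import quantiles

def age_percentiles(write_timestamps, now=None):
    """Return (p50, p95, p99) feature age in seconds from write timestamps."""
    now = now if now is not None else time.time()
    ages = sorted(now - ts for ts in write_timestamps)
    cuts = quantiles(ages, n=100)  # 99 cut points; cuts[i] ~ (i+1)th percentile
    return cuts[49], cuts[94], cuts[98]

def check_freshness(p95_age_s, soft_ttl_s, hard_ttl_s):
    """Map a p95 age to an alert level against soft/hard TTL thresholds."""
    if p95_age_s >= hard_ttl_s:
        return "hard_ttl_violation"
    if p95_age_s >= soft_ttl_s:
        return "soft_ttl_warning"
    return "ok"

# Example: 100 entities written 0..990 seconds ago; soft TTL 10 min, hard 30 min.
now = time.time()
stamps = [now - 10 * i for i in range(100)]
p50, p95, p99 = age_percentiles(stamps, now=now)
print(check_freshness(p95, soft_ttl_s=600, hard_ttl_s=1800))  # → soft_ttl_warning
```

In practice these percentiles would be emitted as histogram metrics per feature (e.g. to a time-series store) rather than computed ad hoc, but the thresholding logic is the same.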
Staleness-Aware Serving
Staleness-aware serving implements graceful degradation through a fallback cascade. When a feature's age exceeds its soft TTL, the system logs a warning and either includes an "age" feature or downweights the feature's contribution. When age exceeds the hard TTL, the system falls back to a default value (population mean, last known good, or zero) and increments an alert counter. Netflix uses learned imputation, where the model predicts missing feature values from the features that are available.
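The cascade can be sketched as follows; the function name, the return shape, and the choice of default are illustrative assumptions, not a specific system's API:

```python
# Hedged sketch of a staleness-aware fallback cascade: warn past the soft
# TTL, substitute a default past the hard TTL. All names are illustrative.
import logging
import time

logger = logging.getLogger("serving")

def resolve_feature(value, written_at, soft_ttl_s, hard_ttl_s, default, now=None):
    """Return (feature_value, age_s, degraded) after applying the cascade."""
    now = now if now is not None else time.time()
    age = now - written_at
    if age > hard_ttl_s:
        # Hard TTL exceeded: serve the default (population mean, last known
        # good, or zero); the caller should also increment an alert counter.
        logger.error("hard TTL exceeded (age=%.0fs); serving default", age)
        return default, age, True
    if age > soft_ttl_s:
        # Soft TTL exceeded: keep the value but warn; callers may feed `age`
        # to the model as an input or downweight this feature's contribution.
        logger.warning("soft TTL exceeded (age=%.0fs)", age)
        return value, age, False
    return value, age, False
```

Returning the age alongside the value lets the model consume staleness as a signal rather than treating the feature as silently trustworthy.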
Freshness Alerts
Configure alerts to fire when p95 or p99 feature age exceeds the SLA threshold sustained for 5 to 15 minutes, which avoids flapping on transient spikes. Include both absolute staleness (the feature is 10 minutes old) and relative staleness (the feature is 2x older than its historical p95). Relative thresholds catch gradual degradation that absolute thresholds miss.
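A sketch of combining absolute and relative thresholds with a sustain window; the 2x relative factor comes from the text above, while the class name and sample-count-based sustain window are illustrative assumptions:

```python
# Sketch of sustained absolute + relative staleness alerting. An alert
# fires only when every sample in the sustain window breaches a limit,
# so transient spikes do not flap the alert.
from collections import deque

class FreshnessAlerter:
    def __init__(self, abs_limit_s, historical_p95_s, rel_factor=2.0,
                 sustain_samples=5):
        self.abs_limit_s = abs_limit_s
        self.rel_limit_s = historical_p95_s * rel_factor
        self.window = deque(maxlen=sustain_samples)

    def observe(self, p95_age_s):
        """Record one p95 age sample; return True when the breach is sustained."""
        breach = p95_age_s > self.abs_limit_s or p95_age_s > self.rel_limit_s
        self.window.append(breach)
        return len(self.window) == self.window.maxlen and all(self.window)

# With one sample per minute, sustain_samples=5 approximates a 5-minute window.
alerter = FreshnessAlerter(abs_limit_s=600, historical_p95_s=120)
fired = [alerter.observe(a) for a in (700, 700, 700, 700, 700)]
print(fired)  # → [False, False, False, False, True]
```

Note the relative limit can trip even when the absolute one does not: a 300-second age is under a 600-second SLA but more than 2x a 120-second historical p95, which is exactly the gradual-degradation case.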
Dashboard Design
Visualize feature freshness as a heatmap of (feature, time bucket) cells with color indicating age percentile; red cells indicate SLA violations. Drill down into per-entity age distributions to determine whether staleness is global (a pipeline issue) or localized (a hot key or partition issue).
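The data model behind such a heatmap can be sketched as a (feature, time bucket) grid; the 5-minute bucketing, the use of max age as a cheap stand-in for a high percentile, and the function names are all illustrative assumptions:

```python
# Sketch of the heatmap data model: a (feature, time-bucket) grid of ages,
# with cells above the SLA flagged "red" for drill-down. Illustrative only.
from collections import defaultdict

def freshness_grid(samples, bucket_s=300):
    """samples: iterable of (feature, ts, age_s) observations.
    Returns {(feature, bucket_index): worst_age_s}, using max age as a
    cheap stand-in for a high percentile within each bucket."""
    grid = defaultdict(float)
    for feature, ts, age_s in samples:
        key = (feature, int(ts // bucket_s))
        grid[key] = max(grid[key], age_s)
    return dict(grid)

def red_cells(grid, sla_s):
    """Cells violating the SLA, i.e. candidates for per-entity drill-down."""
    return {cell for cell, age in grid.items() if age > sla_s}
```

A charting layer would then map each cell's age to a color scale; the red set is also a natural input for the drill-down query that separates global from localized staleness.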
✓ Monitor feature age distributions, not just job success. A job can succeed while delivering features that are hours stale if upstream data was delayed. Track p50, p95, and p99 age per feature hourly.
✓ Canary monitoring catches issues invisible to batch dashboards. Uber runs synthetic prediction requests every minute for test entities and alerts if 3 consecutive requests show features older than the SLA.
✓ Smart fallbacks reduce error significantly. Uber experiments showed that falling back to 1-hour-old batch values when nearline features exceed their TTL reduces prediction mean absolute error (MAE) by 8 to 12% versus dropping the features.
✓ Training on artificially staled features reveals sensitivity. If offline AUC drops from 0.85 to 0.78 when features are 2x their target age, either tighten the freshness SLA or make the model more robust by including age as an input.
✓ Replication lag can make features stale for geo-routed traffic. LinkedIn monitors cross-region replication offsets and exposes lag as a freshness signal. If lag exceeds 5 minutes, reads go to the primary region despite higher latency.
✓ Backfill storms can overwrite fresh values with old data. Route backfills to separate namespaces and gate online replacement using version numbers and max-age guards to prevent hot-key eviction.
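The canary pattern from the checklist above (alert only after 3 consecutive stale synthetic requests) can be sketched as follows; `fetch_feature_age` is a hypothetical client call, and everything beyond the consecutive-threshold idea is an illustrative assumption:

```python
# Sketch of canary freshness checking: issue synthetic lookups for test
# entities each cycle, and alert only after N consecutive failing sweeps.
def canary_cycle(fetch_feature_age, entity_ids, sla_s):
    """One canary sweep: True if any test entity's feature age exceeds the SLA.
    fetch_feature_age is a hypothetical client call, not a real API."""
    return any(fetch_feature_age(e) > sla_s for e in entity_ids)

class ConsecutiveAlert:
    """Fire only after `threshold` consecutive failing sweeps (e.g. 3),
    so a single slow lookup does not page anyone."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streak = 0

    def record(self, stale):
        self.streak = self.streak + 1 if stale else 0
        return self.streak >= self.threshold
```

Run on a schedule (e.g. every minute), this catches staleness on the actual serving path, including replication lag that batch-pipeline dashboards never see.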
1. DoorDash detected a silent staleness bug in which store busy features appeared fresh (the job succeeded) but covered only 60% of entities due to upstream Kafka partition lag. Per-entity age monitoring caught the gap within 10 minutes.
2. Netflix trained two model variants: one with all features and one with only low-volatility features. When freshness SLAs are violated systemically (e.g., an upstream outage), traffic shifts to the robust variant, degrading recommendations slightly but preventing total failure.
3. LinkedIn's canary system requests features for 1,000 test profiles every 30 seconds. When replication lag spiked to 10 minutes during a datacenter issue, the canaries alerted before users noticed, and traffic was routed to the primary region.