Operational Failure Modes in Production Feature Stores
Freshness Regressions
Freshness regressions are insidious because they degrade model performance silently while p50 latency SLAs appear healthy. Streaming consumer lag caused by Kafka backpressure can leave features stale by minutes to hours, but if the feature store still responds quickly with outdated values, latency monitoring misses the issue. A recommendation model serving 1-hour-old activity counters instead of real-time data might lose 5% to 15% CTR while all dashboards show green. Mitigation requires explicit freshness SLOs that measure the age of the last update per entity per feature.
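A freshness SLO of this kind can be sketched as a per-feature check on the age of the last write. This is a minimal illustration, not a production monitor: the `FeatureRecord` type, the `freshness_report` function, and the per-feature `max_age_s` thresholds are all hypothetical names introduced here, and it assumes the store exposes a last-update timestamp for each entity and feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRecord:
    """Hypothetical view of one stored value: who, what, and when it was written."""
    entity_id: str
    feature: str
    updated_at: float  # epoch seconds of the last write


def freshness_report(records, now, max_age_s):
    """Return, per feature, the fraction of entities updated within the SLO.

    max_age_s maps feature name -> maximum tolerated age in seconds.
    """
    totals = {}
    fresh = {}
    for record in records:
        totals[record.feature] = totals.get(record.feature, 0) + 1
        age_s = now - record.updated_at
        if age_s <= max_age_s[record.feature]:
            fresh[record.feature] = fresh.get(record.feature, 0) + 1
    return {name: fresh.get(name, 0) / count for name, count in totals.items()}
```

Alerting on this ratio, rather than on read latency, is what surfaces a stalled consumer: the store keeps answering quickly, but the fraction of fresh entities falls.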
Hot Key Problems
Hot key problems emerge from power-law traffic distributions in which the top 0.1% of entities receive 50% or more of requests, overwhelming individual shards in distributed key-value stores. A viral video on TikTok can trigger cache stampedes in which thousands of concurrent requests miss the cache simultaneously, crushing the backing database. DoorDash handles this through request coalescing: buffer requests for the same key arriving within milliseconds, issue a single backend lookup, and broadcast the result to all waiters. Negative caching for missing entities with a short TTL prevents repeated lookups for nonexistent keys.
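The coalescing and negative-caching ideas can be sketched together with asyncio. This is an illustrative single-process version, not DoorDash's implementation: `CoalescingLoader`, its `fetch` callback, and the 5-second negative TTL are assumptions made for the example, and a backend result of `None` is taken to mean the entity does not exist.

```python
import asyncio


class CoalescingLoader:
    """Share one backend lookup among concurrent requests for the same key."""

    def __init__(self, fetch, negative_ttl_s=5.0):
        self._fetch = fetch              # async callable: key -> value or None
        self._inflight = {}              # key -> task for the lookup in progress
        self._missing_until = {}         # key -> loop time until which "missing" is cached
        self._negative_ttl_s = negative_ttl_s

    async def get(self, key):
        loop = asyncio.get_running_loop()
        # Negative cache: skip the backend while a recent miss is still valid.
        if self._missing_until.get(key, 0.0) > loop.time():
            return None
        task = self._inflight.get(key)
        if task is None:
            # First requester starts the lookup; later ones reuse the same task.
            task = loop.create_task(self._load(key))
            self._inflight[key] = task
        # shield() keeps one cancelled waiter from cancelling the shared lookup.
        return await asyncio.shield(task)

    async def _load(self, key):
        try:
            value = await self._fetch(key)
            if value is None:
                loop = asyncio.get_running_loop()
                self._missing_until[key] = loop.time() + self._negative_ttl_s
            return value
        finally:
            # Clear the slot so the next burst triggers a fresh lookup.
            self._inflight.pop(key, None)
```

With this shape, a burst of concurrent `get("video:1")` calls produces one call to `fetch`, and repeated requests for a nonexistent key reach the backend at most once per TTL window. A distributed deployment would also need coalescing per serving node or a shared lock, which this sketch leaves out.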
Partial Availability
When one feature store region goes down, or a subset of features becomes unavailable because of upstream pipeline failures, the serving path must degrade gracefully rather than fail hard. Models should be architected with learned defaults or imputation at serving time, so that prediction continues with reduced accuracy rather than timing out. Netflix maintains per-feature fallback values computed from recent population statistics, serving predictions with 2% to 5% accuracy degradation during regional outages rather than complete unavailability.
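Serving-time imputation with per-feature fallbacks might look like the sketch below. It is not Netflix's implementation: the function names are hypothetical, the median is one arbitrary choice of population statistic, and a value of `None` or an absent key is assumed to mean the feature could not be fetched.

```python
import statistics


def population_fallbacks(recent_values):
    """Compute one fallback per feature from recently observed values.

    recent_values maps feature name -> list of numeric values.
    The median is an illustrative choice; a mean or a learned default also works.
    """
    return {name: statistics.median(values) for name, values in recent_values.items()}


def resolve_features(requested, fetched, fallbacks):
    """Fill unavailable features with fallbacks and report which were imputed."""
    values = {}
    imputed = []
    for name in requested:
        value = fetched.get(name)
        if value is None:
            # Feature missing or unavailable: degrade instead of failing the request.
            value = fallbacks[name]
            imputed.append(name)
        values[name] = value
    return values, imputed
```

Returning the list of imputed features alongside the values lets the caller log degraded predictions, which makes the accuracy cost of an outage measurable after the fact.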
Consistency Gaps
Consistency gaps between online and offline stores manifest as experiment contamination and segment mismatches. Dual-write races or CDC lag cause online state to diverge temporarily from offline state by minutes to hours. A user assigned to the treatment segment based on an offline feature value might be served predictions using online features that still reflect the control segment, polluting experiment results.
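One way to detect this kind of contamination is to compare the segment implied by each store for a sample of entities. The sketch below assumes, purely for illustration, that segment assignment is a threshold on a single numeric feature; `segment` and `find_segment_mismatches` are hypothetical names, and real assignment logic is usually more involved.

```python
def segment(value, threshold):
    """Illustrative assignment rule: at or above the threshold means treatment."""
    return "treatment" if value >= threshold else "control"


def find_segment_mismatches(offline, online, threshold):
    """Return entity ids whose online and offline values imply different segments.

    offline and online map entity id -> feature value from each store.
    Entities absent from the online store are skipped, not counted as mismatches.
    """
    mismatched = []
    for entity_id, offline_value in offline.items():
        online_value = online.get(entity_id)
        if online_value is None:
            continue
        if segment(offline_value, threshold) != segment(online_value, threshold):
            mismatched.append(entity_id)
    return sorted(mismatched)
```

Entities flagged this way can be excluded from experiment analysis, or the mismatch rate can be tracked over time as a signal of dual-write or CDC lag.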