Online Feature Serving: Latency Budgets and Scale
Online feature serving sits on the critical path of inference: if the end-to-end prediction SLA is 100 milliseconds and model compute takes 25 to 40 milliseconds, only 10 to 25 milliseconds remain at p95 for feature fetches, network overhead, and request coalescing. Missing this budget causes timeouts, fallback to degraded models, and user-visible latency spikes. Serving 100K requests per second globally with 10 features per entity and a 95 percent cache hit rate still sends 5K requests per second to the online store; at 1 KB per feature bundle, that is roughly 5 MB per second per region plus replication overhead. Headroom for backfills and failures requires at least 2x capacity.
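To make the budget and capacity math concrete, here is a minimal Python sketch of the arithmetic above. The 50 ms allowance for routing, serialization, and post-processing is an assumption chosen so the result matches the 10 to 25 ms figure; every other constant is illustrative, not measured.

```python
# Back-of-envelope budget and capacity math; all constants are illustrative assumptions.

SLA_MS = 100                  # end-to-end prediction SLA
MODEL_COMPUTE_MS = (25, 40)   # p95 range spent in model compute
OTHER_OVERHEAD_MS = 50        # assumed routing, serialization, post-processing

low = SLA_MS - OTHER_OVERHEAD_MS - MODEL_COMPUTE_MS[1]   # worst-case model compute
high = SLA_MS - OTHER_OVERHEAD_MS - MODEL_COMPUTE_MS[0]  # best-case model compute
print(f"p95 feature-fetch budget: {low} to {high} ms")   # -> 10 to 25 ms

GLOBAL_RPS = 100_000
CACHE_HIT_RATE = 0.95
BUNDLE_BYTES = 1_024          # ~1 KB per entity feature bundle
HEADROOM = 2.0                # capacity multiplier for backfills and failures

store_rps = GLOBAL_RPS * (1 - CACHE_HIT_RATE)        # cache misses that hit the online store
store_mb_s = store_rps * BUNDLE_BYTES / 1_000_000    # payload only, excludes replication
print(f"online store load: {store_rps:,.0f} RPS, ~{store_mb_s:.1f} MB/s per region")
print(f"provision for at least {store_rps * HEADROOM:,.0f} RPS")
```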
The primary optimization is pre-materialization: compute and store feature bundles keyed by entity in a low-latency key-value store, co-located with inference services or deployed at edge points of presence. Netflix and Airbnb group features by entity to minimize multi-key fetches; a single lookup retrieves 50 to 200 features for a user or item, avoiding the N+1 query problem. Hot keys are a constant threat: a few entities, such as popular items or viral content, dominate traffic and create shard hotspots that inflate p99 latency. Mitigations include load-aware sharding, replicating hot partitions, per-key rate limiting, and lazy materialization with backpressure.
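A minimal sketch of entity-keyed bundle reads, under assumed names: `kv` stands in for any low-latency key-value client exposing a batched `mget`, and the `#rN` replica suffix is a hypothetical convention for spreading reads across replicated hot partitions.

```python
import random
from typing import Mapping, Sequence

HOT_KEY_REPLICAS = 4  # assumed number of copies kept for known-hot entities

def bundle_key(entity_type: str, entity_id: str, hot: bool = False) -> str:
    """One key holds the entire 50-200 feature bundle for an entity."""
    key = f"{entity_type}:{entity_id}"
    if hot:
        # Spread reads for popular/viral entities across replicated partitions.
        key += f"#r{random.randrange(HOT_KEY_REPLICAS)}"
    return key

def fetch_bundles(kv, entities: Sequence[tuple[str, str]],
                  hot_keys: set[str]) -> Mapping[tuple[str, str], dict]:
    """One batched lookup per request instead of N+1 per-feature gets."""
    keys = [bundle_key(t, i, hot=f"{t}:{i}" in hot_keys) for t, i in entities]
    bundles = kv.mget(keys)  # single round trip for every entity in the request
    return {entity: bundle or {} for entity, bundle in zip(entities, bundles)}
```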
Caching is deployed in tiers. A request-scoped cache deduplicates fetches within a single inference call. A process-level Least Recently Used (LRU) cache holds hot entities in memory with a Time to Live (TTL) aligned to feature freshness; pre-warm it for known hot entities before traffic spikes. A regional cache offloads the online store. At Uber scale, streaming feature updates must propagate with sub-minute freshness while maintaining exactly-once semantics and watermarking for late data, which adds operational complexity. Fallback policies are essential: if a fetch exceeds 20 milliseconds or fails, serve last-known-good features, use population priors, or switch to a simpler model variant. Log every fallback and track its impact on Click-Through Rate (CTR) or conversion metrics to quantify degradation.
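The first two cache tiers can be sketched as below, assuming feature bundles arrive as plain dictionaries; `fetch_fn` stands in for the regional cache or online store call, and the size and TTL defaults are illustrative rather than recommended values.

```python
import time
from collections import OrderedDict

class ProcessCache:
    """Process-level LRU with a TTL aligned to feature freshness."""

    def __init__(self, maxsize: int = 100_000, ttl_s: float = 60.0):
        self._data: OrderedDict[str, tuple[float, dict]] = OrderedDict()
        self.maxsize, self.ttl_s = maxsize, ttl_s

    def get(self, key: str):
        item = self._data.get(key)
        if item is None or time.monotonic() - item[0] > self.ttl_s:
            return None  # missing or stale
        self._data.move_to_end(key)  # LRU touch
        return item[1]

    def put(self, key: str, bundle: dict):
        self._data[key] = (time.monotonic(), bundle)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

def get_bundle(key: str, request_cache: dict, process_cache: ProcessCache, fetch_fn):
    """Look up through the tiers; fall through to the online store on miss."""
    if key in request_cache:                 # tier 1: request-scoped dedup
        return request_cache[key]
    bundle = process_cache.get(key)          # tier 2: process LRU with TTL
    if bundle is None:
        bundle = fetch_fn(key)               # tier 3+: regional cache / online store
        process_cache.put(key, bundle)
    request_cache[key] = bundle
    return bundle
```

A fresh `request_cache` dict would be created per inference call so deduplication never outlives the request, while the `ProcessCache` instance is shared across requests in the serving process.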
💡 Key Takeaways
• Latency budget constraint: a 100ms end-to-end SLA with 25 to 40ms of model compute leaves only 10 to 25ms at p95 for feature fetches; missing this causes timeouts and user-visible latency spikes
• Scale math: 100K RPS globally with 10 features per entity and a 95 percent cache hit rate sends 5K RPS to the online store at 1 KB per bundle, roughly 5 MB per second per region plus replication; requires 2x headroom
• Pre-materialization and entity coalescing: group 50 to 200 features by entity in a low-latency key-value store; a single lookup avoids the N+1 query problem; co-locate with inference services or edge POPs
• Hot-key mitigation: a few entities dominate traffic and create shard hotspots that inflate p99; use load-aware sharding, replicated hot partitions, per-key rate limiting, and lazy materialization with backpressure
• Tiered caching: a request-scoped cache deduplicates within an inference call, a process-level LRU uses a TTL aligned to feature freshness, a regional cache offloads the store; pre-warm known hot entities before spikes
• Fallback policies: if a fetch exceeds 20ms or fails, serve last-known-good features, use population priors, or switch to a simpler model; log fallbacks and track CTR or conversion impact to quantify degradation (see the sketch after this list)
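As a sketch of that fallback policy, the wrapper below bounds a fetch at the 20 ms cutoff and degrades to last-known-good features or hypothetical population priors, logging each fallback for offline impact analysis; `fetch_fn`, the prior values, and the pool size are assumptions.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("feature_fallback")
_pool = ThreadPoolExecutor(max_workers=32)    # bounds concurrent store fetches

FETCH_TIMEOUT_S = 0.020                       # the 20 ms cutoff from the bullet above
POPULATION_PRIORS = {"ctr_7d": 0.02}          # hypothetical prior feature values

def fetch_with_fallback(key: str, fetch_fn, last_known_good: dict) -> dict:
    """Bound the fetch; on timeout or error, degrade instead of failing the request."""
    try:
        return _pool.submit(fetch_fn, key).result(timeout=FETCH_TIMEOUT_S)
    except Exception as exc:  # includes concurrent.futures.TimeoutError
        # Log every fallback so CTR/conversion impact can be quantified offline.
        log.warning("feature_fallback key=%s reason=%s", key, type(exc).__name__)
        cached = last_known_good.get(key)
        return cached if cached is not None else dict(POPULATION_PRIORS)
```

One caveat of this timeout approach: the underlying fetch keeps running in its pool thread after the deadline passes, so the pool size bounds how much stale work can accumulate.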
📌 Examples
Airbnb search ranking targets sub-100ms end to end; it allocates low tens of milliseconds at p95 to feature retrieval by pre-materializing user and listing features and coalescing 50 to 200 features per entity into one fetch
Netflix serves single-digit to low-tens-of-milliseconds p95 feature lookups for personalization models by grouping features by user and caching at request, process, and regional tiers with TTLs aligned to freshness
Uber Michelangelo handles millions of events per minute for streaming feature updates with sub-minute freshness; it uses exactly-once semantics, watermarking for late data, and load-aware sharding to avoid hot-key spikes
LinkedIn Venice provides single-digit-millisecond online reads for feed ranking by pre-warming hot entities, replicating hot partitions, and falling back to last-known-good features if fetches exceed 15ms at p95