Online Feature Serving: Latency Budgets and Scale
Online feature serving sits on the critical path of inference: if the end-to-end prediction SLA is 100 milliseconds and model compute takes 25 to 40 milliseconds, only 10 to 25 milliseconds remain at p95 for feature fetches, network overhead, and request coalescing. Missing this budget causes timeouts, fallback to degraded models, and user-visible latency spikes. Serving 100K requests per second globally with 10 features per entity and a 95 percent cache hit rate still sends 5K requests per second to the online store; at 1 KB per feature bundle, that is roughly 5 MB per second per region plus replication overhead. Headroom for backfills and failures requires at least 2x capacity.
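To make the budget and capacity math concrete, here is a minimal Python sketch of the arithmetic above. The 50 ms allowance for routing, serialization, and post-processing is an assumption chosen so the result matches the 10 to 25 ms figure; every other constant is illustrative, not measured.

```python
# Back-of-envelope budget and capacity math; all constants are illustrative assumptions.

SLA_MS = 100                  # end-to-end prediction SLA
MODEL_COMPUTE_MS = (25, 40)   # p95 range spent in model compute
OTHER_OVERHEAD_MS = 50        # assumed routing, serialization, post-processing

low = SLA_MS - OTHER_OVERHEAD_MS - MODEL_COMPUTE_MS[1]   # worst-case model compute
high = SLA_MS - OTHER_OVERHEAD_MS - MODEL_COMPUTE_MS[0]  # best-case model compute
print(f"p95 feature-fetch budget: {low} to {high} ms")   # -> 10 to 25 ms

GLOBAL_RPS = 100_000
CACHE_HIT_RATE = 0.95
BUNDLE_BYTES = 1_024          # ~1 KB per entity feature bundle
HEADROOM = 2.0                # capacity multiplier for backfills and failures

store_rps = GLOBAL_RPS * (1 - CACHE_HIT_RATE)        # cache misses that hit the online store
store_mb_s = store_rps * BUNDLE_BYTES / 1_000_000    # payload only, excludes replication
print(f"online store load: {store_rps:,.0f} RPS, ~{store_mb_s:.1f} MB/s per region")
print(f"provision for at least {store_rps * HEADROOM:,.0f} RPS")
```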
The primary optimization is pre-materialization: compute and store feature bundles keyed by entity in a low-latency key-value store, co-located with inference services or deployed at edge points of presence. Netflix and Airbnb group features by entity to minimize multi-key fetches; a single lookup retrieves 50 to 200 features for a user or item, avoiding the N+1 query problem. Hot keys are a constant threat: a few entities, such as popular items or viral content, dominate traffic and create shard hotspots that inflate p99 latency. Mitigations include load-aware sharding, replicating hot partitions, per-key rate limiting, and lazy materialization with backpressure.
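A minimal sketch of entity-keyed bundle reads, under assumed names: `kv` stands in for any low-latency key-value client exposing a batched `mget`, and the `#rN` replica suffix is a hypothetical convention for spreading reads across replicated hot partitions.

```python
import random
from typing import Mapping, Sequence

HOT_KEY_REPLICAS = 4  # assumed number of copies kept for known-hot entities

def bundle_key(entity_type: str, entity_id: str, hot: bool = False) -> str:
    """One key holds the entire 50-200 feature bundle for an entity."""
    key = f"{entity_type}:{entity_id}"
    if hot:
        # Spread reads for popular/viral entities across replicated partitions.
        key += f"#r{random.randrange(HOT_KEY_REPLICAS)}"
    return key

def fetch_bundles(kv, entities: Sequence[tuple[str, str]],
                  hot_keys: set[str]) -> Mapping[tuple[str, str], dict]:
    """One batched lookup per request instead of N+1 per-feature gets."""
    keys = [bundle_key(t, i, hot=f"{t}:{i}" in hot_keys) for t, i in entities]
    bundles = kv.mget(keys)  # single round trip for every entity in the request
    return {entity: bundle or {} for entity, bundle in zip(entities, bundles)}
```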
Caching is deployed in tiers. A request-scoped cache deduplicates fetches within a single inference call. A process-level Least Recently Used (LRU) cache holds hot entities in memory with a Time to Live (TTL) aligned to feature freshness; pre-warm it for known hot entities before traffic spikes. A regional cache offloads the online store. At Uber scale, streaming feature updates must propagate with sub-minute freshness while maintaining exactly-once semantics and watermarking for late data, which adds operational complexity. Fallback policies are essential: if a fetch exceeds 20 milliseconds or fails, serve last-known-good features, use population priors, or switch to a simpler model variant. Log every fallback and track its impact on Click-Through Rate (CTR) or conversion metrics to quantify degradation.
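The first two cache tiers can be sketched as below, assuming feature bundles arrive as plain dictionaries; `fetch_fn` stands in for the regional cache or online store call, and the size and TTL defaults are illustrative rather than recommended values.

```python
import time
from collections import OrderedDict

class ProcessCache:
    """Process-level LRU with a TTL aligned to feature freshness."""

    def __init__(self, maxsize: int = 100_000, ttl_s: float = 60.0):
        self._data: OrderedDict[str, tuple[float, dict]] = OrderedDict()
        self.maxsize, self.ttl_s = maxsize, ttl_s

    def get(self, key: str):
        item = self._data.get(key)
        if item is None or time.monotonic() - item[0] > self.ttl_s:
            return None  # missing or stale
        self._data.move_to_end(key)  # LRU touch
        return item[1]

    def put(self, key: str, bundle: dict):
        self._data[key] = (time.monotonic(), bundle)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

def get_bundle(key: str, request_cache: dict, process_cache: ProcessCache, fetch_fn):
    """Look up through the tiers; fall through to the online store on miss."""
    if key in request_cache:                 # tier 1: request-scoped dedup
        return request_cache[key]
    bundle = process_cache.get(key)          # tier 2: process LRU with TTL
    if bundle is None:
        bundle = fetch_fn(key)               # tier 3+: regional cache / online store
        process_cache.put(key, bundle)
    request_cache[key] = bundle
    return bundle
```

A fresh `request_cache` dict would be created per inference call so deduplication never outlives the request, while the `ProcessCache` instance is shared across requests in the serving process.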
💡 Key Takeaways
• Latency budget constraint: a 100ms end-to-end SLA with 25 to 40ms of model compute leaves only 10 to 25ms at p95 for feature fetches; missing this causes timeouts and user-visible latency spikes
• Scale math: 100K RPS globally with 10 features per entity and a 95 percent cache hit rate sends 5K RPS to the online store at 1 KB per bundle, roughly 5 MB per second per region plus replication; requires 2x headroom
• Pre-materialization and entity coalescing: group 50 to 200 features by entity in a low-latency key-value store; a single lookup avoids the N+1 query problem; co-locate with inference services or edge POPs
• Hot-key mitigation: a few entities dominate traffic and create shard hotspots that inflate p99; use load-aware sharding, replicated hot partitions, per-key rate limiting, and lazy materialization with backpressure
• Tiered caching: a request-scoped cache deduplicates within an inference call, a process-level LRU uses a TTL aligned to feature freshness, a regional cache offloads the store; pre-warm known hot entities before spikes
• Fallback policies: if a fetch exceeds 20ms or fails, serve last-known-good features, use population priors, or switch to a simpler model; log fallbacks and track CTR or conversion impact to quantify degradation (see the sketch after this list)
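As a sketch of that fallback policy, the wrapper below bounds a fetch at the 20 ms cutoff and degrades to last-known-good features or hypothetical population priors, logging each fallback for offline impact analysis; `fetch_fn`, the prior values, and the pool size are assumptions.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("feature_fallback")
_pool = ThreadPoolExecutor(max_workers=32)    # bounds concurrent store fetches

FETCH_TIMEOUT_S = 0.020                       # the 20 ms cutoff from the bullet above
POPULATION_PRIORS = {"ctr_7d": 0.02}          # hypothetical prior feature values

def fetch_with_fallback(key: str, fetch_fn, last_known_good: dict) -> dict:
    """Bound the fetch; on timeout or error, degrade instead of failing the request."""
    try:
        return _pool.submit(fetch_fn, key).result(timeout=FETCH_TIMEOUT_S)
    except Exception as exc:  # includes concurrent.futures.TimeoutError
        # Log every fallback so CTR/conversion impact can be quantified offline.
        log.warning("feature_fallback key=%s reason=%s", key, type(exc).__name__)
        cached = last_known_good.get(key)
        return cached if cached is not None else dict(POPULATION_PRIORS)
```

One caveat of this timeout approach: the underlying fetch keeps running in its pool thread after the deadline passes, so the pool size bounds how much stale work can accumulate.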
📌 Examples
Airbnb search ranking targets sub-100ms end to end; it allocates low tens of milliseconds at p95 to feature retrieval by pre-materializing user and listing features and coalescing 50 to 200 features per entity into one fetch
Netflix serves single-digit to low-tens-of-milliseconds p95 feature lookups for personalization models by grouping features by user and caching at request, process, and regional tiers with TTLs aligned to freshness
Uber Michelangelo handles millions of events per minute for streaming feature updates with sub-minute freshness; it uses exactly-once semantics, watermarking for late data, and load-aware sharding to avoid hot-key spikes
LinkedIn Venice provides single-digit-millisecond online reads for feed ranking by pre-warming hot entities, replicating hot partitions, and falling back to last-known-good features if fetches exceed 15ms at p95