
Online Serving Architecture and Latency Budgets

Online feature serving must return tens to hundreds of features per entity within single-digit milliseconds at high Queries Per Second (QPS) to fit inference Service Level Agreements (SLAs). A typical ranking service fetches 50 features per entity at 20,000 requests per second, yielding 1 million feature reads per second. With a 50 millisecond end-to-end SLA, the feature budget is often 10 to 15 milliseconds at p99, leaving room for model inference and network hops. Netflix achieves sub-millisecond p50 latencies with EVCache, an in-memory cache tier deployed in the same region, serving millions of reads per second globally. Hopsworks RonDB benchmarks show hundreds of microseconds to low milliseconds at 100,000 to 1 million operations per second per cluster with co-location.

The serving path starts with co-location: place feature services in the same Availability Zone (AZ) as model servers to eliminate 5 to 15 millisecond cross-AZ penalties. Batch reads using multi-get APIs so that 50 keys arrive in one round trip instead of 50 serial requests, amortizing network overhead from roughly 50 milliseconds total to 5 milliseconds. Cache hot features in process or in a sidecar with a 10 to 30 second Time To Live (TTL) to absorb 80 to 95 percent of reads; this cuts key-value load by 10 times and keeps p50 latency under 1 millisecond on cached paths. For the remaining 5 to 20 percent of cache-miss traffic, a properly sharded regional key-value store handles reads in 3 to 8 milliseconds at p99.

Hot key mitigation is critical. Popular entities (trending content, global feeds) create hotspots that spike p99 latency or trigger throttling. Mitigations include salting keys with random suffixes to spread load across shards, per-entity rate limits to protect the store, pre-materializing aggregates for the top N entities, and short-TTL caching for viral keys. LinkedIn's Venice derived-data store uses read replicas and sharding strategies to handle millions of QPS for People You May Know features with single-digit millisecond p99.

Failure modes include TTL expiry causing silent fallback to default values (degrading model quality) and cross-region replication lag leading to stale reads. Aggressive 5 minute TTLs can cut cache hit rates below 70 percent, doubling key-value load and blowing the latency budget, while overly long 6 hour TTLs violate freshness Service Level Objectives (SLOs) for dynamic features. The mitigation is per-feature freshness SLOs with alerting when the age of the last update exceeds a threshold, plus A/B tests validating that default values do not silently degrade metrics by more than 1 to 2 percent.
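The read path above reduces to two tiers: an in-process cache for the hot path and a single batched round trip for misses. The sketch below is a minimal illustration under assumed names, not a specific feature-store API: the in-memory dict standing in for the regional key-value store, the `user:42` key layout, the default values, and the 20 second TTL are all hypothetical.

```python
import time

# Minimal sketch of the online read path: a short-TTL in-process cache in front
# of a regional key-value store reached via one batched multi-get. The store,
# key layout, and defaults here are hypothetical stand-ins.

CACHE_TTL_SECONDS = 20                       # 10-30 second TTL absorbs repeat reads
DEFAULTS = {"ctr_7d": 0.0, "views_1h": 0}    # fallback when a row is missing or expired

# Fake regional store; in production this is one batched RPC issued from a
# service co-located in the same AZ as the model servers.
_REGIONAL_STORE = {"user:42": {"ctr_7d": 0.031, "views_1h": 7}}

_cache: dict[str, tuple[float, dict]] = {}   # entity key -> (expiry timestamp, feature row)


def multi_get(keys: list[str]) -> dict[str, dict]:
    """One round trip for all requested keys (simulated here with a dict lookup)."""
    return {k: _REGIONAL_STORE[k] for k in keys if k in _REGIONAL_STORE}


def get_features(entity_keys: list[str]) -> dict[str, dict]:
    now = time.monotonic()
    rows: dict[str, dict] = {}
    misses: list[str] = []

    # 1) In-process cache: the sub-millisecond path that absorbs 80-95% of reads.
    for key in entity_keys:
        hit = _cache.get(key)
        if hit and hit[0] > now:
            rows[key] = hit[1]
        else:
            misses.append(key)

    # 2) One batched round trip for the 5-20% of reads that miss the cache,
    #    instead of one serial request per key.
    if misses:
        fetched = multi_get(misses)
        for key in misses:
            row = fetched.get(key, DEFAULTS)              # default fallback on missing rows
            _cache[key] = (now + CACHE_TTL_SECONDS, row)  # refresh the short-TTL cache
            rows[key] = row
    return rows


print(get_features(["user:42", "user:7"]))   # the second key falls back to DEFAULTS
```

The default fallback on the miss branch is exactly the silent-degradation risk called out in the failure-mode discussion, which is why it pairs with freshness monitoring and A/B checks.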
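Key salting, one of the hot-key mitigations above, trades a little write amplification for read fan-out. The sketch below is illustrative only; the `#i` suffix scheme, the salt count of 16, and the hot/cold flag are assumptions.

```python
import random

# Hypothetical key-salting scheme for hot entities: a trending entity's row is
# written under N salted copies so reads fan out across shards instead of
# hammering a single partition.

N_SALTS = 16   # copies kept for hot keys; cold keys keep a single row


def salted_write_keys(entity_key: str, hot: bool) -> list[str]:
    """All physical keys a writer must update for this entity."""
    if not hot:
        return [entity_key]
    return [f"{entity_key}#{i}" for i in range(N_SALTS)]


def salted_read_key(entity_key: str, hot: bool) -> str:
    """Pick one salted copy at random so read load spreads across shards."""
    if not hot:
        return entity_key
    return f"{entity_key}#{random.randrange(N_SALTS)}"


# A viral item's reads now land on 16 partitions instead of one.
print(salted_write_keys("item:trending-123", hot=True)[:3])
print(salted_read_key("item:trending-123", hot=True))
```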
💡 Key Takeaways
At 20,000 requests per second fetching 50 features each, you serve 1 million feature reads per second; with a 50 millisecond end-to-end Service Level Agreement, the feature budget is 10 to 15 milliseconds at p99, including network and serialization
Co-location in the same Availability Zone eliminates 5 to 15 millisecond cross-AZ penalties; batched reads with multi-get fetch 50 keys in one 5 millisecond round trip instead of 50 serial requests totaling 50 milliseconds
In-process or sidecar caches with a 10 to 30 second Time To Live absorb 80 to 95 percent of reads at sub-1-millisecond p50, reducing key-value load by 10 times; the remaining 5 to 20 percent of cache misses hit the regional key-value store at 3 to 8 milliseconds p99
Hot key mitigation: salting popular entity keys spreads load, per-entity rate limits prevent throttling, pre-materialization handles the top N entities, and short-TTL caching absorbs viral traffic spikes
Failure modes: aggressive 5 minute TTLs drop cache hit rates below 70 percent and double key-value load; stale features from expiry silently degrade model quality by 1 to 2 percent unless freshness Service Level Objectives are monitored and alerted on (a minimal freshness check is sketched after this list)
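The freshness SLO mitigation in the last takeaway can be expressed as a small per-feature check. The sketch below is a hypothetical illustration: the thresholds, feature names, and timestamp bookkeeping are assumptions; in practice the age of the last update would come from store metadata and violations would feed the alerting stack.

```python
import time

# Hypothetical per-feature freshness SLO check: flag any feature whose last
# update is older than its allowed age, so stale defaults do not silently
# degrade model quality.

FRESHNESS_SLO_SECONDS = {
    "views_1h": 15 * 60,       # dynamic counter: stale after 15 minutes
    "ctr_7d": 24 * 60 * 60,    # slow-moving aggregate: stale after a day
}


def check_freshness(last_updated: dict[str, float], now: float | None = None) -> list[str]:
    """Return the features whose age exceeds their freshness SLO."""
    now = time.time() if now is None else now
    violations = []
    for feature, slo in FRESHNESS_SLO_SECONDS.items():
        age = now - last_updated.get(feature, 0.0)   # missing timestamp counts as stale
        if age > slo:
            violations.append(f"{feature}: age {age:.0f}s exceeds SLO of {slo}s")
    return violations


# Example: views_1h last updated an hour ago -> violation; ctr_7d is still fresh.
now = time.time()
print(check_freshness({"views_1h": now - 3600, "ctr_7d": now - 1800}, now=now))
```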
📌 Examples
Netflix uses EVCache deployed in multiple regions as an in-memory online feature store, achieving sub-millisecond p50 and low single-digit millisecond p99 latencies while serving millions of reads per second for personalization features
LinkedIn Venice powers People You May Know features with read replicas and sharding to handle millions of Queries Per Second, maintaining single-digit millisecond p99 through region-local reads and hot key distribution