Online Serving Architecture and Latency Budgets
Latency Budget
Online feature serving must return tens to hundreds of features per entity within single-digit milliseconds at high QPS to fit inference SLAs. A typical ranking service fetches 50 features per entity at 20,000 requests per second, yielding 1 million feature reads per second. With a 50ms end-to-end SLA, the feature-fetch budget is often 10 to 15ms at p99, leaving the rest for model inference and network hops. Netflix achieves sub-millisecond p50 latencies by deploying EVCache in the same region as its services, serving millions of reads per second globally.
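The arithmetic behind those numbers is worth spelling out; the values below are the ones from the example above, with a 15ms feature budget taken as the upper end of the stated range:

```python
# Feature read volume: features per entity x request rate.
features_per_entity = 50
requests_per_second = 20_000
feature_reads_per_second = features_per_entity * requests_per_second  # 1,000,000

# Budget split: with a 50ms end-to-end SLA and a 15ms p99 feature budget,
# 35ms remains for model inference and network hops.
end_to_end_sla_ms = 50
feature_budget_ms = 15
remaining_ms = end_to_end_sla_ms - feature_budget_ms  # 35
```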
The Serving Path
The serving path starts with co-location: place feature services in the same availability zone as model servers to eliminate 5 to 15ms cross-AZ penalties. Batch reads with multi-get APIs to fetch 50 keys in one round trip instead of 50 serial requests, amortizing network overhead from roughly 50ms total down to 5ms. Cache hot features in process or in a sidecar with a 10 to 30 second TTL to absorb 80 to 95 percent of reads; this cuts key-value store load by roughly 10x and keeps p50 latency under 1ms on the cached path. The remaining 5 to 20 percent of traffic that misses the cache is served by the regional key-value store in 3 to 8ms at p99.
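A minimal sketch of this read path, assuming a hypothetical key-value client that exposes an `mget(keys)` batch API; the class and method names here are illustrative, not any particular store's interface:

```python
import time

class DictStore:
    """Stand-in for a regional key-value store; real stores expose a similar
    multi-get batch API, costing roughly one network round trip per batch."""
    def __init__(self, data):
        self.data = data
        self.mget_calls = 0  # counts round trips for illustration

    def mget(self, keys):
        self.mget_calls += 1
        return {k: self.data[k] for k in keys if k in self.data}

class FeatureReadPath:
    """In-process TTL cache in front of the store: cached keys are served
    locally, and all misses go out in a single batched multi-get."""
    def __init__(self, kv_store, ttl_seconds=20.0):
        self.kv = kv_store
        self.ttl = ttl_seconds
        self._cache = {}  # key -> (value, expires_at)

    def get_features(self, keys):
        now = time.monotonic()
        out, misses = {}, []
        for k in keys:
            hit = self._cache.get(k)
            if hit is not None and hit[1] > now:
                out[k] = hit[0]
            else:
                misses.append(k)
        if misses:
            fetched = self.kv.mget(misses)  # one round trip, not len(misses)
            expires = now + self.ttl
            for k, v in fetched.items():
                self._cache[k] = (v, expires)
                out[k] = v
        return out
```

A second read of the same keys inside the TTL window is served entirely from the local cache, so the store sees one batch call rather than one call per key.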
Hot Key Mitigation
Popular entities (trending content, global feeds) create hotspots that spike p99 latency or trigger throttling. Mitigations include salting keys with random suffixes to spread load across shards, per-entity rate limits to protect the store, pre-materializing aggregates for the top N entities, and short-TTL caching for viral keys. LinkedIn's Venice derived-data store uses read replicas and sharding strategies to serve millions of QPS for People You May Know features with single-digit-millisecond p99 latency.
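Key salting can be sketched as follows; the replica count and the `key#i` suffix format are illustrative assumptions, not any specific store's convention:

```python
import random

N_SALTS = 8  # illustrative: number of salted copies per hot key

def salted_write_keys(key, n_salts=N_SALTS):
    """Writers fan a hot key out to n_salts salted copies so that the copies
    land on different shards instead of one partition taking all the load."""
    return [f"{key}#{i}" for i in range(n_salts)]

def salted_read_key(key, n_salts=N_SALTS):
    """Each reader picks one salted copy at random, so per-shard read load
    on the hot key drops to roughly 1/n_salts of the total."""
    return f"{key}#{random.randrange(n_salts)}"

# Demo with a plain dict standing in for the store: writes land on all
# copies; any read key resolves to one of those copies.
store = {}
for k in salted_write_keys("trending:video123"):
    store[k] = {"views_1h": 1_000_000}
value = store[salted_read_key("trending:video123")]
```

The trade-off is write amplification: every update to the hot entity is written n_salts times, which is why salting is usually reserved for a small set of detected hot keys.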
Failure Modes
Two common failure modes are TTL expiry causing silent fallback to default values (degrading model quality) and cross-region replication lag leading to stale reads. Aggressive TTLs of 5 minutes may cut cache hit rates below 70 percent, doubling key-value load and blowing the latency budget; overly long TTLs of 6 hours violate freshness SLOs for dynamic features. The mitigation is per-feature freshness SLOs with alerting when the age of the last update exceeds its threshold.
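The per-feature freshness check can be sketched like this; the feature names and SLO values are hypothetical examples, not from any particular system:

```python
import time

# Hypothetical per-feature freshness SLOs, in seconds.
FRESHNESS_SLO_S = {
    "user_ctr_1h": 300,      # dynamic aggregate: alert if older than 5 minutes
    "user_country": 86_400,  # slow-changing profile field: a day is acceptable
}

def stale_features(last_update_ts, now=None):
    """Return the features whose age since last update exceeds their SLO.
    Features with no recorded update at all are also flagged, since a
    missing timestamp usually means the pipeline writing them is broken."""
    now = time.time() if now is None else now
    stale = []
    for feature, slo_s in FRESHNESS_SLO_S.items():
        ts = last_update_ts.get(feature)
        if ts is None or now - ts > slo_s:
            stale.append(feature)
    return stale
```

Wiring the output of a check like this into alerting turns the silent-default failure mode into a visible one: a feature falling back to defaults shows up as a breached freshness SLO rather than as an unexplained drop in model quality.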