Online Serving Architecture and Latency Budgets
Latency Budget
Online feature serving must return tens to hundreds of features per entity within single-digit milliseconds at high QPS to fit inference SLAs. A typical ranking service fetches 50 features per entity at 20,000 requests per second, yielding 1 million feature reads per second. With a 50ms end-to-end SLA, the feature-fetch budget is often 10 to 15ms at p99, leaving the rest for model inference and network hops. Netflix achieves sub-millisecond p50 latencies by deploying EVCache in the same region as its services, serving millions of reads per second globally.
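The arithmetic behind those numbers is worth spelling out; the values below are the ones from the example above, with a 15ms feature budget taken as the upper end of the stated range:

```python
# Feature read volume: features per entity x request rate.
features_per_entity = 50
requests_per_second = 20_000
feature_reads_per_second = features_per_entity * requests_per_second  # 1,000,000

# Budget split: with a 50ms end-to-end SLA and a 15ms p99 feature budget,
# 35ms remains for model inference and network hops.
end_to_end_sla_ms = 50
feature_budget_ms = 15
remaining_ms = end_to_end_sla_ms - feature_budget_ms  # 35
```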
The Serving Path
The serving path starts with co-location: place feature services in the same availability zone as model servers to eliminate 5 to 15ms cross-AZ penalties. Batch reads with multi-get APIs to fetch 50 keys in one round trip instead of 50 serial requests, amortizing network overhead from roughly 50ms total down to 5ms. Cache hot features in process or in a sidecar with a 10 to 30 second TTL to absorb 80 to 95 percent of reads; this cuts key-value store load by roughly 10x and keeps p50 latency under 1ms on the cached path. The remaining 5 to 20 percent of traffic that misses the cache is served by the regional key-value store in 3 to 8ms at p99.
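A minimal sketch of this read path, assuming a hypothetical key-value client that exposes an `mget(keys)` batch API; the class and method names here are illustrative, not any particular store's interface:

```python
import time

class DictStore:
    """Stand-in for a regional key-value store; real stores expose a similar
    multi-get batch API, costing roughly one network round trip per batch."""
    def __init__(self, data):
        self.data = data
        self.mget_calls = 0  # counts round trips for illustration

    def mget(self, keys):
        self.mget_calls += 1
        return {k: self.data[k] for k in keys if k in self.data}

class FeatureReadPath:
    """In-process TTL cache in front of the store: cached keys are served
    locally, and all misses go out in a single batched multi-get."""
    def __init__(self, kv_store, ttl_seconds=20.0):
        self.kv = kv_store
        self.ttl = ttl_seconds
        self._cache = {}  # key -> (value, expires_at)

    def get_features(self, keys):
        now = time.monotonic()
        out, misses = {}, []
        for k in keys:
            hit = self._cache.get(k)
            if hit is not None and hit[1] > now:
                out[k] = hit[0]
            else:
                misses.append(k)
        if misses:
            fetched = self.kv.mget(misses)  # one round trip, not len(misses)
            expires = now + self.ttl
            for k, v in fetched.items():
                self._cache[k] = (v, expires)
                out[k] = v
        return out
```

A second read of the same keys inside the TTL window is served entirely from the local cache, so the store sees one batch call rather than one call per key.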
Hot Key Mitigation
Popular entities (trending content, global feeds) create hotspots that spike p99 latency or trigger throttling. Mitigations include salting keys with random suffixes to spread load across shards, per-entity rate limits to protect the store, pre-materializing aggregates for the top N entities, and short-TTL caching for viral keys. LinkedIn's Venice derived-data store uses read replicas and sharding strategies to serve millions of QPS for People You May Know features with single-digit-millisecond p99 latency.
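Key salting can be sketched as follows; the replica count and the `key#i` suffix format are illustrative assumptions, not any specific store's convention:

```python
import random

N_SALTS = 8  # illustrative: number of salted copies per hot key

def salted_write_keys(key, n_salts=N_SALTS):
    """Writers fan a hot key out to n_salts salted copies so that the copies
    land on different shards instead of one partition taking all the load."""
    return [f"{key}#{i}" for i in range(n_salts)]

def salted_read_key(key, n_salts=N_SALTS):
    """Each reader picks one salted copy at random, so per-shard read load
    on the hot key drops to roughly 1/n_salts of the total."""
    return f"{key}#{random.randrange(n_salts)}"

# Demo with a plain dict standing in for the store: writes land on all
# copies; any read key resolves to one of those copies.
store = {}
for k in salted_write_keys("trending:video123"):
    store[k] = {"views_1h": 1_000_000}
value = store[salted_read_key("trending:video123")]
```

The trade-off is write amplification: every update to the hot entity is written n_salts times, which is why salting is usually reserved for a small set of detected hot keys.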
Failure Modes
Two common failure modes are TTL expiry causing silent fallback to default values (degrading model quality) and cross-region replication lag leading to stale reads. Aggressive TTLs of 5 minutes may cut cache hit rates below 70 percent, doubling key-value load and blowing the latency budget; overly long TTLs of 6 hours violate freshness SLOs for dynamic features. The mitigation is per-feature freshness SLOs with alerting when the age of the last update exceeds its threshold.
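The per-feature freshness check can be sketched like this; the feature names and SLO values are hypothetical examples, not from any particular system:

```python
import time

# Hypothetical per-feature freshness SLOs, in seconds.
FRESHNESS_SLO_S = {
    "user_ctr_1h": 300,      # dynamic aggregate: alert if older than 5 minutes
    "user_country": 86_400,  # slow-changing profile field: a day is acceptable
}

def stale_features(last_update_ts, now=None):
    """Return the features whose age since last update exceeds their SLO.
    Features with no recorded update at all are also flagged, since a
    missing timestamp usually means the pipeline writing them is broken."""
    now = time.time() if now is None else now
    stale = []
    for feature, slo_s in FRESHNESS_SLO_S.items():
        ts = last_update_ts.get(feature)
        if ts is None or now - ts > slo_s:
            stale.append(feature)
    return stale
```

Wiring the output of a check like this into alerting turns the silent-default failure mode into a visible one: a feature falling back to defaults shows up as a breached freshness SLO rather than as an unexplained drop in model quality.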