
Serving Flow: Assembly, Latency Budgets, and Caching

When a prediction request arrives, the model service resolves one or more entity keys and issues a batched multi-get against the online store. Ads and recommendation systems target end-to-end latencies under 100 ms, which typically leaves the feature store 5 to 20 ms at p99 to return the feature vector, with p50 under 5 ms. At 100,000 QPS and 2 KB per vector, read throughput is roughly 200 MB per second per region. Meeting this requires parallelized I/O, async or pipelined reads to keep compute busy, and a minimal number of remote calls, ideally a single batched request.

Caching is essential. Maintain a small in-process cache for ultra-hot keys, targeting a 95 percent hit rate, with short TTLs that match freshness requirements, typically 60 to 300 seconds. Add request coalescing to collapse the many same-key reads that arrive under high fan-in, reducing load on the backing store. For global services, deploy per-region online stores with independent write paths so the critical serving path never crosses regions.

Fallback strategies handle unavailability. If the online store is down or a key is missing, serve last-known values cached locally or safe defaults. Track a fallback-rate metric and set a Service Level Objective (SLO) for it, for example under 0.5 percent. Missing features silently degrade model quality, so monitor null rates and alert when they spike.

Real systems layer these caches: in-process for microsecond access, a regional distributed cache for single-digit milliseconds, and the online store as the source of truth. The sketches below show how the pieces compose.
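To make the read path concrete, here is a minimal asyncio sketch rather than any particular vendor's API: FeatureClient, FEATURE_TTL_S, and the store handle's batch_get method are assumed names. It combines the in-process TTL cache, request coalescing via shared futures, and a single batched remote read for the residual misses.

```python
import asyncio
import time

FEATURE_TTL_S = 120  # hypothetical TTL inside the 60-300 s freshness window

class FeatureClient:
    """In-process TTL cache plus request coalescing in front of one batched
    read. `store` is an assumed async online-store handle with batch_get()."""

    def __init__(self, store):
        self.store = store
        self.cache = {}     # key -> (expiry_ts, feature_vector)
        self.inflight = {}  # key -> asyncio.Future shared by concurrent readers

    async def get_features(self, keys):
        now = time.monotonic()
        out, misses = {}, []
        for k in keys:
            entry = self.cache.get(k)
            if entry and entry[0] > now:  # fresh in-process hit: microseconds
                out[k] = entry[1]
            else:
                misses.append(k)

        # Coalesce: misses already being fetched piggyback on the in-flight
        # future; only the remainder goes out as one batched remote request.
        waiting, to_fetch = {}, []
        loop = asyncio.get_running_loop()
        for k in misses:
            fut = self.inflight.get(k)
            if fut is None:
                fut = loop.create_future()
                self.inflight[k] = fut
                to_fetch.append(k)
            waiting[k] = fut

        if to_fetch:
            try:
                rows = await self.store.batch_get(to_fetch)  # one batched RPC
            except Exception:
                rows = {}  # outage: resolve waiters with None, handled below
            expiry = time.monotonic() + FEATURE_TTL_S
            for k in to_fetch:
                vec = rows.get(k)
                if vec is not None:
                    self.cache[k] = (expiry, vec)
                self.inflight.pop(k).set_result(vec)  # None signals a miss

        for k, fut in waiting.items():
            out[k] = await fut
        return out
```

A production client would also bound the cache (for example with an LRU) and put a regional distributed cache between the in-process layer and the store; both are omitted here to keep the coalescing logic readable.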
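The fallback path can wrap the same client. The sketch below, again with assumed names (DEFAULT_FEATURES, FallbackFeatureClient), serves last-known or default values when a key comes back empty and tracks the fallback rate so it can be exported as a metric and alerted against the 0.5 percent SLO.

```python
DEFAULT_FEATURES = {}       # hypothetical safe defaults for this model
SLO_FALLBACK_RATE = 0.005   # alert when the rate exceeds 0.5 percent

class FallbackFeatureClient(FeatureClient):
    """Extends the sketch above: serve last-known or default values when the
    store is down or a key is missing, and track the fallback rate."""

    def __init__(self, store):
        super().__init__(store)
        self.reads = 0
        self.fallbacks = 0

    async def get_features(self, keys):
        self.reads += len(keys)
        out = await super().get_features(keys)
        for k, vec in out.items():
            if vec is None:  # missing key or store outage
                stale = self.cache.get(k)  # expired entry = last-known value
                out[k] = stale[1] if stale else DEFAULT_FEATURES
                self.fallbacks += 1
        return out

    def fallback_rate(self):
        # Export as a gauge; page when it breaches SLO_FALLBACK_RATE.
        return self.fallbacks / max(self.reads, 1)
```

Because the base sketch never evicts cache entries, an expired entry doubles as the last-known value; with a bounded LRU cache, eviction would also discard last-known values, so real systems often keep a separate stale-value tier.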
💡 Key Takeaways
The feature store must return vectors in 5 to 20 ms at p99, with p50 under 5 ms, to fit end-to-end prediction budgets under 100 ms for ads and recommendations
At 100,000 QPS and 2 KB per vector, read throughput reaches 200 MB per second per region, requiring parallelized async I/O and minimizing remote calls to one batched request
In-process caching with a 95 percent hit rate and 60 to 300 second TTLs reduces load, with request coalescing collapsing duplicate same-key reads
Multi-layer caching provides microsecond in-process access, single-digit-millisecond regional caching, and the online store as the source of truth
Fall back to last-known or default values when keys are missing, with a fallback-rate SLO under 0.5 percent to avoid silent model-quality degradation
📌 Examples
Netflix uses regional caches and in-service storage to achieve p99 reads under tens of milliseconds at millions of QPS for personalization features
An ads-ranking system might cache the top 50,000 advertiser feature vectors in process with a 120-second TTL, achieving a 97 percent hit rate and sub-millisecond latency for hot keys
Uber's Michelangelo deploys per-region online stores with independent write paths, avoiding cross-region network hops that would add 50 to 150 ms to critical-path latency