
Multi-Tier Caching for ML Feature Stores and Embeddings

Caching in ML systems prevents repeated expensive operations such as embedding computation, feature fetches from slow storage, and model inference. A well-designed cache hierarchy can reduce median latency from 15 ms to 2 ms and P99 from 80 ms to 12 ms, directly determining whether you meet service level objectives (SLOs) at scale. The challenge is balancing hit rate, consistency, and operational complexity across multiple tiers.

Typical production architectures use three cache layers. Application-tier caches hold user embeddings and recent recommendation results with short time-to-live (TTL) values of 10 to 120 seconds, keyed by user ID and model version; this layer absorbs bursty repeated requests from the same user. Feature store caches maintain a write-through in-memory layer for hot features, guaranteeing read-after-write consistency, with TTLs tuned to freshness requirements, often 5 to 30 minutes; cold feature reads fall back to SSD-backed stores at 5 to 15 ms. Model or embedding caches store precomputed vectors for the top 50 million active users, refreshed every 5 minutes in batch, avoiding real-time computation for 99 percent of traffic.

The main strategies differ in consistency and performance. Cache-aside is simple: check the cache first, and on a miss fetch from the source and populate the cache. This minimizes write amplification but suffers cold-start misses and cache stampedes when hot keys expire. Write-through updates the cache and the source together, providing strong read-after-write semantics, but adds write-path latency and increases cache churn. Write-back batches writes in the cache and flushes them to the source asynchronously, reducing database write load but risking data loss on cache node failure.

In practice, Meta and Google report targeting hit rates above 95 percent for hot user and item features to control P99. For a feature store serving 120k QPS, a 95 percent hit rate at 2 ms cache latency and 5 percent misses at 12 ms gives a weighted average of roughly 2.5 ms, well within budget. Dropping to a 90 percent hit rate pushes the average to about 3 ms and risks breaching SLOs during traffic spikes. Monitor hit rate by key class, because user features and item features often show different access patterns, and tune TTLs independently.
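To make the tiered read path concrete, here is a minimal cache-aside sketch in Python. The TTLCache class, the tier names, the specific TTL values, and fetch_from_ssd_store are illustrative assumptions rather than any particular system's API; production deployments typically put the shared tiers in Redis or Memcached instead of process memory.

```python
import time

class TTLCache:
    """Minimal in-process TTL cache (illustrative only)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry_timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # drop expired entry
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)


# Two illustrative tiers: a short-TTL application cache in front of a
# longer-TTL feature-store cache (TTLs chosen from the ranges above).
app_cache = TTLCache(ttl_seconds=60)
feature_cache = TTLCache(ttl_seconds=600)

def fetch_from_ssd_store(key):
    """Placeholder for the slow (5-15 ms) SSD-backed feature read."""
    return {"feature_vector": [0.0] * 128}

def get_features(user_id: str, model_version: str):
    # Key by user ID and model version so a model rollout naturally
    # invalidates stale entries instead of requiring an explicit purge.
    key = f"{user_id}:{model_version}"

    # Tier 1: application cache (bursty repeats from the same user).
    value = app_cache.get(key)
    if value is not None:
        return value

    # Tier 2: feature-store cache for hot features.
    value = feature_cache.get(key)
    if value is None:
        # Cache-aside: on miss, read the source of truth and populate.
        value = fetch_from_ssd_store(key)
        feature_cache.put(key, value)

    app_cache.put(key, value)
    return value
```

For brevity this sketch fills both tiers cache-aside; the write-through feature-store layer described above would instead update the cache on every write to the source, trading write-path latency for read-after-write consistency.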
💡 Key Takeaways
Multi-tier caching with application, feature store, and embedding caches reduces median latency from 15 ms to 2 ms and P99 from 80 ms to 12 ms when hit rates exceed 95 percent.
Cache-aside minimizes write amplification and is simple, but it suffers cold-start misses and stampede risk. Write-through guarantees consistency but adds write latency. Write-back reduces database load but risks data loss on cache node failure.
Production systems target above 95 percent hit rate for hot features. Dropping to 90 percent raises the weighted average latency from roughly 2.5 ms to 3 ms and risks breaching SLOs during traffic spikes.
Use single-flight per key or request coalescing to prevent cache stampedes, where thousands of requests simultaneously miss a hot key and overwhelm the backend. Add jittered TTLs so keys do not expire simultaneously; a sketch of both appears after this list.
Precompute and cache embeddings for the top 50 million active users every 5 minutes to avoid real-time computation for 99 percent of requests. Cold users compute on demand with a 10 to 20 ms penalty.
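The following sketch shows per-key single flight and TTL jitter, assuming an in-process Python service. The SingleFlight class, the jittered_ttl helper, and the 10 percent jitter fraction are illustrative assumptions, not a specific library's API.

```python
import random
import threading

class SingleFlight:
    """Collapse concurrent misses on the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared entry dict with an Event

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # First caller for this key becomes the leader and does the work.
                entry = {"event": threading.Event()}
                self._inflight[key] = entry
                leader = True
            else:
                leader = False

        if leader:
            try:
                entry["value"] = fn()
            except Exception as exc:
                entry["error"] = exc  # propagate failures to waiters too
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                entry["event"].set()
        else:
            entry["event"].wait()  # followers reuse the leader's result

        if "error" in entry:
            raise entry["error"]
        return entry["value"]


def jittered_ttl(base_seconds: float, jitter_fraction: float = 0.1) -> float:
    """Spread expirations so hot keys loaded together do not all expire
    (and stampede the backend) at the same instant."""
    jitter = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-jitter, jitter)


# Usage: thousands of concurrent misses on a hot key trigger one fetch.
sf = SingleFlight()
features = sf.do("user:42:v7", lambda: {"feature_vector": [0.0] * 128})
cache_ttl = jittered_ttl(300)  # roughly 300 s, plus or minus 10 percent
```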
📌 Examples
Meta's FAISS-based retrieval precomputes user embeddings for active users and caches them with a 5 minute TTL. Cache hits return in 1 to 2 ms and misses compute in 10 to 20 ms, achieving a 99 percent hit rate and sub-5 ms P50.
Airbnb's feature store uses a write-through in-memory cache for user profile features with a 10 minute TTL. This ensures guests see updated preferences immediately after changes while keeping read latency under 3 ms for 96 percent of requests.
Amazon product search caches item embeddings and metadata in a distributed cache with Least Recently Used (LRU) eviction. During flash sales, the hit rate for hot products drops from 98 to 85 percent, causing P99 spikes until auto-scaling kicks in.