ML-Powered Search & Ranking › Scalability (Sharding, Caching, Approximate Search) · Medium · ~2 min

Multi-Tier Caching for Features and Embeddings

Definition
Multi-tier caching layers multiple cache levels—in-process, distributed, and persistent—to serve ML features and embeddings with sub-millisecond latency.

WHY MULTI-TIER CACHING

Single-tier caching fails at scale. An in-process cache is fast (microseconds) but limited by RAM. A distributed cache holds more data but adds 1-5ms of network latency. Persistent storage holds everything but takes 10-50ms. Multi-tier combines all three: check local first, then distributed, then storage. Hit rates compound: a 90% L1 hit rate plus a 90% L2 hit rate on the remaining misses leaves only 1% of requests reaching storage, a 99% combined hit rate.
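The compounding works on misses, not hits: each tier only sees the traffic the tier above it missed. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def effective_hit_rate(tier_rates):
    """Fraction of requests served before falling through all cache tiers.

    Each tier's hit rate applies only to the misses of the tiers above it,
    so the combined miss rate is the product of per-tier miss rates.
    """
    miss = 1.0
    for rate in tier_rates:
        miss *= (1.0 - rate)
    return 1.0 - miss

# 90% L1 and 90% L2: misses are 0.1 * 0.1 = 1%, so 99% served before storage.
print(round(effective_hit_rate([0.9, 0.9]), 4))  # 0.99
```

Adding a third 90%-hit tier would push the combined rate to 99.9%, which is why deep hierarchies pay off even when individual tiers are imperfect.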

CACHE TIER ARCHITECTURE

L1 (in-process): LRU in application memory. 100MB-1GB per instance. Latency: 10-100 microseconds. L2 (distributed): Redis cluster. 10GB-1TB shared. Latency: 1-5ms. L3 (persistent): Feature store. Unlimited. Latency: 10-50ms. Each tier 10-100x slower but 10-100x larger.

💡 Key Insight: Cache the right things at each tier. L1: hot user embeddings (active session). L2: warm users (active in the past hour). L3: everything else. Because active users generate most traffic, a 10% active user base can still mean 90% of requests hit L1.
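The read path described above (check L1, fall through to L2, then to the feature store, backfilling on the way out) can be sketched as follows. Everything here is illustrative: `LRUCache` stands in for the in-process tier, a plain dict stands in for a Redis client, and `l3_get` stands in for a feature-store lookup.

```python
from collections import OrderedDict

class LRUCache:
    """In-process L1 tier: bounded LRU over application memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

class MultiTierCache:
    """L1 in-process LRU, L2 shared cache (dict stand-in for Redis),
    L3 authoritative store (callable stand-in for a feature store)."""
    def __init__(self, l1_capacity, l2, l3_get):
        self.l1 = LRUCache(l1_capacity)
        self.l2 = l2
        self.l3_get = l3_get

    def get_embedding(self, key):
        value = self.l1.get(key)           # ~microseconds
        if value is not None:
            return value
        value = self.l2.get(key)           # ~1-5ms over the network
        if value is None:
            value = self.l3_get(key)       # ~10-50ms, authoritative
            self.l2[key] = value           # backfill L2 for other instances
        self.l1.put(key, value)            # backfill L1 for this instance
        return value
```

Backfilling on the way out is what makes hit rates compound: the first request for a user pays the full L3 cost, and subsequent requests in the same session stay in L1.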

INVALIDATION STRATEGIES

TTL-based: Expire after fixed time. Simple but may serve stale data. Event-driven: Invalidate on updates. Fresh but complex across tiers. Versioning: Version in cache key. New version = miss. Clean but increases key cardinality.
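The versioning strategy is easy to show concretely. A sketch, with a hypothetical key scheme (the format and the `EMBEDDING_VERSION` constant are assumptions, not a standard):

```python
def versioned_key(feature_name, version, entity_id):
    # Bumping the version makes every old entry unreachable (lazy
    # invalidation): stale keys are never read again and fall out of the
    # cache via TTL or LRU eviction instead of explicit deletes.
    return f"{feature_name}:v{version}:{entity_id}"

EMBEDDING_VERSION = 7  # bumped on each model retrain / feature backfill

key = versioned_key("user_emb", EMBEDDING_VERSION, "user_123")
print(key)  # user_emb:v7:user_123
```

The cost is key cardinality: during a version rollover the cache briefly holds both generations, so capacity planning must allow for roughly double the working set.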

⚠️ Key Trade-off: Higher hit rates reduce latency but increase staleness. A 1-hour TTL means 1-hour stale features. For recommendations, acceptable. For fraud detection, not.
💡 Key Takeaways
Three tiers: L1 in-process (microseconds), L2 distributed (1-5ms), L3 persistent (10-50ms)
Hit rates compound: 90% L1 × 90% L2 = 99% total before storage
Cache hot users in L1, warm in L2—10% active users means 90% L1 hits
📌 Interview Tips
1. Describe the three-tier architecture with concrete latency numbers
2. Mention the invalidation trade-off: TTL is simple, event-driven is fresh but complex