ML-Powered Search & Ranking › Scalability (Sharding, Caching, Approximate Search) · Medium · ~2 min

Multi-Tier Caching for Features and Embeddings

Definition
Multi-tier caching layers multiple cache levels—in-process, distributed, and persistent—to serve ML features and embeddings with sub-millisecond latency.

WHY MULTI-TIER CACHING

Single-tier caching fails at scale. An in-process cache is fast (microseconds) but limited by RAM. A distributed cache holds more data but adds 1-5ms of network latency. Persistent storage holds everything but takes 10-50ms. Multi-tier combines all three: check local first, then distributed, then storage. Hit rates compound: a 90% L1 hit rate plus a 90% L2 hit rate on the remaining misses leaves only 1% of requests reaching storage, a 99% combined hit rate.
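The compounding works on misses, not hits: each tier only sees the traffic the tier above it missed. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def effective_hit_rate(tier_rates):
    """Fraction of requests served before falling through all cache tiers.

    Each tier's hit rate applies only to the misses of the tiers above it,
    so the combined miss rate is the product of per-tier miss rates.
    """
    miss = 1.0
    for rate in tier_rates:
        miss *= (1.0 - rate)
    return 1.0 - miss

# 90% L1 and 90% L2: misses are 0.1 * 0.1 = 1%, so 99% served before storage.
print(round(effective_hit_rate([0.9, 0.9]), 4))  # 0.99
```

Adding a third 90%-hit tier would push the combined rate to 99.9%, which is why deep hierarchies pay off even when individual tiers are imperfect.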

CACHE TIER ARCHITECTURE

L1 (in-process): LRU in application memory. 100MB-1GB per instance. Latency: 10-100 microseconds. L2 (distributed): Redis cluster. 10GB-1TB shared. Latency: 1-5ms. L3 (persistent): Feature store. Unlimited. Latency: 10-50ms. Each tier 10-100x slower but 10-100x larger.

💡 Key Insight: Cache the right things at each tier. L1: hot user embeddings (active session). L2: warm users (active in the past hour). L3: everything else. Because active users generate most traffic, a 10% active user base can still mean 90% of requests hit L1.
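The read path described above (check L1, fall through to L2, then to the feature store, backfilling on the way out) can be sketched as follows. Everything here is illustrative: `LRUCache` stands in for the in-process tier, a plain dict stands in for a Redis client, and `l3_get` stands in for a feature-store lookup.

```python
from collections import OrderedDict

class LRUCache:
    """In-process L1 tier: bounded LRU over application memory."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

class MultiTierCache:
    """L1 in-process LRU, L2 shared cache (dict stand-in for Redis),
    L3 authoritative store (callable stand-in for a feature store)."""
    def __init__(self, l1_capacity, l2, l3_get):
        self.l1 = LRUCache(l1_capacity)
        self.l2 = l2
        self.l3_get = l3_get

    def get_embedding(self, key):
        value = self.l1.get(key)           # ~microseconds
        if value is not None:
            return value
        value = self.l2.get(key)           # ~1-5ms over the network
        if value is None:
            value = self.l3_get(key)       # ~10-50ms, authoritative
            self.l2[key] = value           # backfill L2 for other instances
        self.l1.put(key, value)            # backfill L1 for this instance
        return value
```

Backfilling on the way out is what makes hit rates compound: the first request for a user pays the full L3 cost, and subsequent requests in the same session stay in L1.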

INVALIDATION STRATEGIES

TTL-based: Expire after fixed time. Simple but may serve stale data. Event-driven: Invalidate on updates. Fresh but complex across tiers. Versioning: Version in cache key. New version = miss. Clean but increases key cardinality.
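The versioning strategy is easy to show concretely. A sketch, with a hypothetical key scheme (the format and the `EMBEDDING_VERSION` constant are assumptions, not a standard):

```python
def versioned_key(feature_name, version, entity_id):
    # Bumping the version makes every old entry unreachable (lazy
    # invalidation): stale keys are never read again and fall out of the
    # cache via TTL or LRU eviction instead of explicit deletes.
    return f"{feature_name}:v{version}:{entity_id}"

EMBEDDING_VERSION = 7  # bumped on each model retrain / feature backfill

key = versioned_key("user_emb", EMBEDDING_VERSION, "user_123")
print(key)  # user_emb:v7:user_123
```

The cost is key cardinality: during a version rollover the cache briefly holds both generations, so capacity planning must allow for roughly double the working set.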

⚠️ Key Trade-off: Higher hit rates reduce latency but increase staleness. A 1-hour TTL means 1-hour stale features. For recommendations, acceptable. For fraud detection, not.
💡 Key Takeaways
Three tiers: L1 in-process (microseconds), L2 distributed (1-5ms), L3 persistent (10-50ms)
Hit rates compound: 90% L1 × 90% L2 = 99% total before storage
Cache hot users in L1, warm in L2—10% active users means 90% L1 hits
📌 Interview Tips
1. Describe the three-tier architecture with concrete latency numbers
2. Mention the invalidation trade-off: TTL is simple, event-driven is fresh but complex