ML Model Optimization › Model Caching (Embedding Cache, Result Cache) · Easy · ⏱️ ~3 min

Embedding Cache: Reducing Repeated Vector Computation

WHAT IS AN EMBEDDING CACHE

Embedding caches store vector representations of inputs, saving the expensive forward pass through encoder models. When the same text, image, or audio appears again, retrieve the pre-computed embedding instead of re-running the encoder. Critical for systems where embedding computation dominates latency.

The numbers: text embedding models take 10-50ms per input. Image encoders take 50-200ms. Audio encoders take even longer. Caching embeddings drops this to sub-millisecond retrieval. For high-throughput systems processing millions of queries, the embedding cache hit rate directly impacts infrastructure costs.

EMBEDDING CACHE ARCHITECTURE

In-memory (Redis, Memcached): Fastest retrieval at sub-millisecond latency but limited capacity. A 768-dimensional float32 embedding is 3KB. One million cached embeddings requires 3GB RAM. Cost-effective for hot embeddings with high hit rates.
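The sizing math above is worth being able to reproduce. A minimal sketch, assuming raw float32 storage with no key or serialization overhead (real Redis entries carry extra per-key cost):

```python
def cache_ram_bytes(num_embeddings: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage needed; ignores key and metadata overhead."""
    return num_embeddings * dims * bytes_per_value

# One 768-dim float32 embedding: 768 * 4 = 3,072 bytes (~3KB)
one_embedding = cache_ram_bytes(1, 768)

# One million of them: ~3GB of RAM
one_million = cache_ram_bytes(1_000_000, 768)

print(f"1 embedding:   {one_embedding} bytes")
print(f"1M embeddings: {one_million / 1e9:.1f} GB")
```

Running the same arithmetic for your own model's dimensionality and expected key count is the fastest way to decide whether the hot tier fits in RAM.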

SSD-backed stores: 10-100x more capacity than RAM at slightly higher latency (1-5ms). Good for warm embeddings—frequently accessed but not hot enough to justify RAM cost.

Tiered architecture: Hot embeddings in memory, warm in SSD-backed store, cold recomputed on demand. Most production systems use this approach. Tier promotion/demotion based on access frequency.

PRE-COMPUTATION VS LAZY CACHING

Pre-computation: For known catalogs (products, documents), compute all embeddings offline before serving. Store in dedicated embedding store. Query-time embedding still needed for user queries, but catalog side is instant. Standard for search and recommendation systems with fixed item catalogs.
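An offline pre-computation job is usually a simple batch loop. A hedged sketch, where `encode_batch` and `store` are hypothetical stand-ins for your batched encoder and embedding store:

```python
def precompute_catalog(items, encode_batch, store, batch_size=256):
    """Embed every catalog item offline so the catalog side is instant at serving time."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        vectors = encode_batch([item["text"] for item in batch])
        for item, vec in zip(batch, vectors):
            store[item["id"]] = vec       # write pre-computed embedding
    return store
```

Batching amortizes per-call encoder overhead; only the user-query embedding remains on the request path.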

Lazy caching: For user-generated content, compute embedding on first access, cache for future requests. Set TTL based on content update frequency. User profile embeddings might cache for hours, trending content embeddings for minutes.
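Lazy caching with a per-entry TTL can be sketched with an in-process dict; a production deployment would typically use Redis with an expiry (the pattern is the same):

```python
import time

class LazyEmbeddingCache:
    """Compute an embedding on first access, then serve it from cache until the TTL expires."""

    def __init__(self, encode, ttl_seconds: float):
        self.encode = encode
        self.ttl = ttl_seconds            # tune to content update frequency
        self._store = {}                  # key -> (expires_at, embedding)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]               # fresh cached embedding
        vec = self.encode(key)            # first access or expired: recompute
        self._store[key] = (time.monotonic() + self.ttl, vec)
        return vec

# User profiles change slowly; trending content changes quickly.
profile_cache = LazyEmbeddingCache(encode=lambda k: [0.0], ttl_seconds=6 * 3600)
trending_cache = LazyEmbeddingCache(encode=lambda k: [0.0], ttl_seconds=300)
```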

EMBEDDING VERSIONING

When you update the embedding model, all cached embeddings become invalid. Old and new embeddings are not comparable—similarity scores are meaningless across versions. You must invalidate the entire cache or maintain parallel caches during migration.

✅ Best Practice: Include embedding model version in cache keys: {model_name}:{model_version}:{input_hash}. This prevents cross-version cache pollution automatically.
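The versioned key pattern above is a few lines to implement. A sketch assuming SHA-256 over a lightly normalized input, which keeps keys fixed-length and avoids putting raw text in the key space:

```python
import hashlib

def embedding_cache_key(model_name: str, model_version: str, text: str) -> str:
    """Build a {model_name}:{model_version}:{input_hash} cache key."""
    normalized = text.strip().lower()     # normalization policy is an assumption
    input_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{model_name}:{model_version}:{input_hash}"

old_key = embedding_cache_key("text-encoder", "v1", "hello world")
new_key = embedding_cache_key("text-encoder", "v2", "hello world")
# Different versions produce different keys, so a model upgrade
# never reads stale v1 embeddings.
```

Because the version is part of the key, rolling out a new model simply starts populating a fresh key space while old entries age out, with no explicit invalidation step.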
💡 Key Takeaways
- Embedding cache saves the expensive encoder forward pass: 10-200ms → sub-millisecond
- Storage math: 768-dim float32 = 3KB; 1M embeddings = 3GB RAM
- Pre-compute catalog embeddings offline; lazy-cache user query embeddings
- Model version changes invalidate all cached embeddings—version your cache keys
📌 Interview Tips
1. Calculate storage requirements: 1M embeddings × 768 dims × 4 bytes = 3GB. Shows you understand infrastructure sizing.
2. Describe the tiered caching architecture (hot in RAM, warm on SSD, cold recomputed) with access-frequency thresholds.