Embedding Cache: Reducing Repeated Vector Computation
WHAT IS AN EMBEDDING CACHE
Embedding caches store vector representations of inputs, saving the expensive forward pass through encoder models. When the same text, image, or audio appears again, the cached embedding is retrieved instead of re-running the encoder. Critical for systems where embedding computation dominates latency.
The numbers: text embedding models take 10-50ms per input. Image encoders take 50-200ms. Audio encoders even longer. Caching embeddings drops this to sub-millisecond retrieval. For high-throughput systems processing millions of queries, embedding cache hit rate directly impacts infrastructure costs.
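The pattern above is a get-or-compute lookup. A minimal sketch, assuming a dict-like cache backend and a placeholder encoder (the names `cached_embed` and `fake_encoder` are illustrative, not from any specific library):

```python
import hashlib

def cached_embed(text, cache, encoder):
    """Return the embedding for `text`, running the encoder only on a miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    vec = cache.get(key)
    if vec is None:
        vec = encoder(text)   # expensive forward pass: 10-50ms for text models
        cache[key] = vec      # later requests hit this in sub-millisecond time
    return vec

# Usage, with a plain dict standing in for Redis/Memcached:
cache = {}
fake_encoder = lambda s: [float(len(s))]   # stand-in for a real encoder model
v1 = cached_embed("hello", cache, fake_encoder)   # miss: runs the encoder
v2 = cached_embed("hello", cache, fake_encoder)   # hit: returns cached vector
```

In production the dict would be replaced by the tiered stores described in the next section.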
EMBEDDING CACHE ARCHITECTURE
In-memory (Redis, Memcached): Fastest retrieval at sub-millisecond latency but limited capacity. A 768-dimensional float32 embedding occupies 3KB (768 × 4 bytes). One million cached embeddings require roughly 3GB of RAM. Cost-effective for hot embeddings with high hit rates.
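The RAM figures above are simple arithmetic, worth making explicit when sizing a cache:

```python
# Back-of-envelope RAM sizing for an in-memory embedding cache.
dims = 768
bytes_per_float32 = 4
per_embedding = dims * bytes_per_float32          # 3,072 bytes, i.e. ~3 KB
total_gb = per_embedding * 1_000_000 / 1e9        # ~3.07 GB for one million vectors
```

Note this counts raw vector bytes only; key strings and store overhead add on top, so real deployments should budget some headroom.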
SSD-backed stores: 10-100x more capacity than RAM at slightly higher latency (1-5ms). Good for warm embeddings—frequently accessed but not hot enough to justify RAM cost.
Tiered architecture: Hot embeddings in memory, warm in SSD-backed store, cold recomputed on demand. Most production systems use this approach. Tier promotion/demotion based on access frequency.
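The promotion logic can be sketched with two dicts standing in for the RAM and SSD tiers; the class name, the access-count threshold, and the dict backends are all illustrative assumptions:

```python
from collections import defaultdict

class TieredCache:
    """Two-tier sketch: `hot` stands in for RAM (e.g. Redis), `warm` for an
    SSD-backed store. Entries are promoted after repeated warm-tier hits."""

    def __init__(self, promote_after=3):
        self.hot, self.warm = {}, {}
        self.hits = defaultdict(int)
        self.promote_after = promote_after

    def get(self, key):
        if key in self.hot:
            return self.hot[key]                    # sub-millisecond tier
        if key in self.warm:
            self.hits[key] += 1
            if self.hits[key] >= self.promote_after:
                self.hot[key] = self.warm.pop(key)  # promote warm -> hot
                return self.hot[key]
            return self.warm[key]                   # 1-5ms tier
        return None   # cold: caller recomputes the embedding and calls put()

    def put(self, key, vec):
        self.warm[key] = vec   # new entries start warm

tc = TieredCache()
tc.put("doc-1", [0.1, 0.2])
for _ in range(3):
    tc.get("doc-1")   # third access crosses the threshold and promotes
```

A real implementation would also demote hot entries that go quiet (e.g. via LRU eviction back to the warm tier), which this sketch omits.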
PRE-COMPUTATION VS LAZY CACHING
Pre-computation: For known catalogs (products, documents), compute all embeddings offline before serving. Store in dedicated embedding store. Query-time embedding still needed for user queries, but catalog side is instant. Standard for search and recommendation systems with fixed item catalogs.
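The offline pass is a straightforward batch loop; `precompute_catalog`, the item IDs, and the length-based stand-in encoder below are hypothetical names for illustration:

```python
def precompute_catalog(items, encoder, store):
    """Offline pass: embed every catalog item once, persist into `store`
    (any dict-like embedding store) before serving traffic."""
    for item_id, text in items.items():
        store[item_id] = encoder(text)
    return store

store = precompute_catalog(
    {"sku-1": "red shoe", "sku-2": "blue hat"},
    lambda s: [float(len(s))],   # stand-in for a real encoder model
    {},
)
```

At query time only the user query needs an encoder forward pass; every catalog lookup is a store read.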
Lazy caching: For user-generated content, compute embedding on first access, cache for future requests. Set TTL based on content update frequency. User profile embeddings might cache for hours, trending content embeddings for minutes.
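A lazy cache with per-entry TTLs can be sketched as follows; the class and method names are illustrative, with the TTL chosen per call to mirror the profile-vs-trending distinction above:

```python
import time

class LazyTTLCache:
    """Lazy embedding cache sketch: compute on first access, expire after
    `ttl` seconds so updated content gets re-embedded."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.entries = {}   # key -> (embedding, expiry_timestamp)

    def get(self, key, text, ttl):
        entry = self.entries.get(key)
        now = time.time()
        if entry is None or entry[1] < now:          # miss or expired
            vec = self.encoder(text)                 # compute on first access
            self.entries[key] = (vec, now + ttl)
            return vec
        return entry[0]                              # fresh: serve from cache

cache = LazyTTLCache(lambda s: [float(len(s))])      # stand-in encoder
v = cache.get("user:42", "profile text", ttl=3600)   # profile: cache for an hour
```

Trending-content keys would pass a TTL of minutes instead; the mechanism is identical.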
EMBEDDING VERSIONING
When you update the embedding model, all cached embeddings become invalid. Old and new embeddings are not comparable—similarity scores are meaningless across versions. You must invalidate the entire cache or maintain parallel caches during the migration.
Embed the model identity in the cache key: {model_name}:{model_version}:{input_hash}. Keys from different model versions never collide, which prevents cross-version cache pollution automatically.
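A key builder following that scheme, assuming SHA-256 for the input hash (the function name and hash truncation length are illustrative choices):

```python
import hashlib

def embedding_cache_key(model_name, model_version, text):
    """Build a {model_name}:{model_version}:{input_hash} cache key."""
    input_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{model_name}:{model_version}:{input_hash}"

k_old = embedding_cache_key("minilm", "v1", "hello")
k_new = embedding_cache_key("minilm", "v2", "hello")
# Same input, different model version -> distinct keys, so a v2 model
# can never read a stale v1 embedding.
```

During a migration, both versioned key spaces can live in the same store; deleting the old version's prefix retires the stale entries.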