Embedding Cache: Reducing Repeated Vector Computation

Embedding caches store the mapping from text to vectors, eliminating redundant computation when the same content is embedded repeatedly. This is particularly valuable in Retrieval Augmented Generation (RAG) systems, where documents and queries are embedded frequently.

The pattern appears at two points in the pipeline. At ingest time, document embeddings are computed once and stored with a version tag. Product descriptions at e-commerce sites, knowledge base articles in support systems, and video metadata at streaming platforms all remain stable for days or weeks; recomputing these vectors on every query wastes both API cost and latency. On the query side, popular questions recur, so caching query embeddings with a minute-level Time To Live (TTL) absorbs traffic bursts and removes 30 to 60 percent of embedding API calls in many production systems.

The cache key is a hash of the normalized text concatenated with the embedding model version. Normalization matters: whitespace differences, capitalization, or trailing punctuation should not fragment the cache. The model version is mandatory because upgrading from one embedding model to another changes vector dimensions or geometry, making old cached vectors incompatible.

A cache hit returns the vector in microseconds from memory, versus 10 to 50 milliseconds for an API call or local model inference. Meta-scale systems keep feature and embedding reads under 5 milliseconds at hundreds of millions of queries per second by maintaining massive in-memory caches. Pinterest caches embeddings for billions of pins and user profiles, enabling home feed generation with p99 vector search under 60 milliseconds.

The tradeoff is memory footprint. A 1536-dimensional float32 vector consumes 6 kilobytes, so caching 10 million embeddings requires 60 gigabytes of RAM. For static content this is worthwhile; for highly dynamic, personalized content, consider on-demand computation with request coalescing to avoid cache stampedes.
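A minimal sketch of this keying scheme, assuming SHA-256 over a normalized string; the exact normalization rules and the model tag format are illustrative assumptions, not a standard:

```python
import hashlib
import re

# Hypothetical model tag; bump it whenever the model or preprocessing changes.
MODEL_VERSION = "text_ada_002:v1"

def normalize(text: str) -> str:
    """Collapse whitespace, lowercase, and strip trailing punctuation so that
    trivially different inputs map to the same cache entry."""
    return re.sub(r"\s+", " ", text).strip().lower().rstrip(".!?")

def cache_key(text: str) -> str:
    """Key = model version + hash of normalized text. Embedding the model
    version in the key makes a model upgrade miss all stale entries."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    return f"{MODEL_VERSION}:{digest}"

# "How do I reset password?" and "how do i reset password" share one entry:
assert cache_key("How do I reset password?") == cache_key("  how do i reset password ")
```

The per-entry cost quoted above follows directly from the vector shape: a 1536-dimensional float32 vector is 1536 × 4 = 6,144 bytes, roughly 6 kilobytes per cached entry.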
💡 Key Takeaways
Document embeddings computed at ingest benefit from long TTL values, measured in weeks or months, for static content like product catalogs or knowledge bases. Invalidate explicitly on content updates rather than relying on a short TTL.
Query embeddings use minute-level TTLs to absorb burst traffic, since popular queries recur frequently within short windows. A cache with a 5-minute TTL can eliminate 40 to 60 percent of embedding API calls during peak hours (a minimal TTL cache is sketched after this list).
The cache key must include the embedding model version. Upgrading from a 768-dimensional model to a 1536-dimensional one, or changing preprocessing logic, invalidates all cached vectors. Use namespacing like text_ada_002:v1:hash(normalized_text).
Memory footprint scales with vector dimensions and cache size. A 1536-dimensional float32 vector is 6 kilobytes, so caching 10 million embeddings requires 60 gigabytes of RAM. Budget accordingly for hot content.
Normalization before hashing is critical. Strip whitespace, lowercase where semantics allow, and remove punctuation variations. Without normalization, "How do I reset password?" and "how do i reset password" create separate cache entries.
Request coalescing prevents a cache stampede on popular misses. When multiple concurrent requests need the same uncached embedding, deduplicate so only one computes while the others wait, then all share the result (see the single-flight sketch after this list).
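A minimal in-process sketch of the minute-level query cache described above, reusing cache_key from the earlier snippet; embed_fn stands in for whatever API or model call produces the vector (both names are illustrative):

```python
import time

class TTLEmbeddingCache:
    """Tiny in-process TTL cache for query embeddings. Production systems
    would more likely use Redis or Memcached; this only shows the shape."""

    def __init__(self, ttl_seconds: float = 300.0):  # 5-minute TTL for queries
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, list[float]]] = {}

    def get_or_compute(self, text: str, embed_fn) -> list[float]:
        key = cache_key(text)  # from the keying sketch above
        entry = self._store.get(key)
        if entry is not None and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # fresh hit: microseconds instead of a model call
        vector = embed_fn(text)  # miss or expired: 10-50 ms API/model call
        self._store[key] = (time.monotonic(), vector)
        return vector
```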
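And one way to implement the deduplication, sketched with asyncio; a production version would layer this under the cache above so that only genuine misses reach the model (embed_fn is again a placeholder):

```python
import asyncio

class CoalescingEmbedder:
    """Single-flight wrapper: concurrent requests for the same uncached text
    share one in-flight computation instead of stampeding the backend."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # async callable: text -> vector
        self._inflight: dict[str, asyncio.Task] = {}

    async def embed(self, text: str) -> list[float]:
        key = cache_key(text)
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key starts the real computation.
            task = asyncio.create_task(self._embed_fn(text))
            self._inflight[key] = task
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # Later callers simply await the same task and share its result
        # (or its exception, so failures are not silently duplicated).
        return await task
```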
📌 Examples
Pinterest caches embeddings for billions of pins, enabling home feed vector search with p99 latency under 60 milliseconds. Without caching, embedding computation would dominate feed generation time and cost.
A RAG system serving 5,000 queries per second uses a 10-million-entry embedding cache with a 5-minute TTL for queries and a persistent cache for 8 million documents. This reduces embedding API cost from $18K to $7K monthly and cuts p95 retrieval latency from 85ms to 35ms.
Meta feed ranking maintains feature and embedding caches in memory, achieving sub-5-millisecond p99 read latency at hundreds of millions of queries per second by precomputing and caching embeddings for all content before serving.