Three Layers of Model Caching: KV, Embedding, and Result
Model caching operates at three distinct layers that solve different bottlenecks in the inference stack. Understanding where each applies prevents confusion and enables proper optimization.
Key value (KV) cache lives inside the transformer during generation. When the model processes a prompt, it runs a prefill pass, then decodes token by token. Without caching, every new token would require recomputing attention over all previous tokens, creating quadratic cost over the sequence. The KV cache stores the attention keys and values for prior tokens, so each new token only computes attention against cached state and per token work stays linear. For a 2,000 token response, this changes the cost from catastrophic (tens of seconds per token) to manageable (3 to 10 milliseconds per token on modern GPUs). The tradeoff is memory: a 70 billion parameter model can consume tens of gigabytes of KV memory for a single 8,000 token sequence.
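A quick sketch of that memory arithmetic appears below; the layer count, head count, and head dimension are illustrative assumptions for a 70B class dense model with full multi-head attention, not the specs of any particular model.

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers are
# illustrative assumptions for a 70B-class dense model with full multi-head
# attention; models using grouped-query attention store far fewer KV heads.

def kv_cache_bytes(seq_len: int,
                   batch_size: int = 1,
                   num_layers: int = 80,
                   num_kv_heads: int = 64,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:   # 2 bytes for fp16/bf16
    # Keys and values (hence the factor of 2) are stored per layer,
    # per head, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * batch_size

if __name__ == "__main__":
    gib = kv_cache_bytes(seq_len=8_000) / 2**30
    print(f"KV cache for one 8,000 token sequence: ~{gib:.1f} GiB")  # ~19.5 GiB
```

At that scale, a handful of concurrent long sequences can dominate the GPU memory budget, which is the memory versus throughput tradeoff noted below.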
Embedding cache stores the mapping from text to vectors. Many systems compute embeddings for the same content repeatedly: product descriptions at Pinterest, FAQ entries in support systems, and recurring user queries all benefit from this layer. Instead of calling the embedding model each time, a hash of the normalized text plus the model version acts as the cache key. This reduces API costs, smooths throughput during peak traffic, and cuts latency by 30 to 60 percent in retrieval systems.

Result cache skips the model entirely by returning previous responses. Exact result cache uses the full prompt as the key. Semantic result cache uses prompt embeddings and approximate nearest neighbor search to reuse answers when similarity exceeds a threshold such as 0.85, trading a higher hit rate (an additional 10 to 25 percent) for the risk of false matches and stale answers.
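The sketch below shows both ideas in miniature, under simplifying assumptions: an in-memory dict stands in for whatever store a production system would use, embed_fn is a placeholder for the embedding model call, and a linear scan stands in for the approximate nearest neighbor index. Only the key construction and the 0.85 threshold come directly from the description above.

```python
import hashlib
from typing import Callable, Optional

import numpy as np

MODEL_VERSION = "embed-v1"        # assumed model identifier for the cache key
SIMILARITY_THRESHOLD = 0.85       # semantic reuse threshold from the text

embedding_cache: dict[str, np.ndarray] = {}
semantic_cache: list[tuple[np.ndarray, str]] = []   # (prompt vector, response)

def embedding_key(text: str) -> str:
    """Cache key = hash of normalized text plus embedding model version."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(f"{MODEL_VERSION}:{normalized}".encode()).hexdigest()

def get_embedding(text: str, embed_fn: Callable[[str], np.ndarray]) -> np.ndarray:
    """Return a cached vector if present, otherwise call the model and store it."""
    key = embedding_key(text)
    if key not in embedding_cache:
        embedding_cache[key] = embed_fn(text)    # the expensive embedding call
    return embedding_cache[key]

def semantic_lookup(prompt_vec: np.ndarray) -> Optional[str]:
    """Reuse a previous response if a cached prompt is similar enough.

    A linear scan stands in for a real approximate nearest neighbor index.
    """
    best_score, best_response = 0.0, None
    for vec, response in semantic_cache:
        score = float(np.dot(prompt_vec, vec) /
                      (np.linalg.norm(prompt_vec) * np.linalg.norm(vec)))
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None
```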
💡 Key Takeaways
• KV cache operates inside the transformer during generation, storing attention keys and values to avoid quadratic recomputation. Reduces per token cost from seconds to 3 to 10 milliseconds but consumes tens of gigabytes per long sequence on large models.
• Embedding cache stores text to vector mappings using normalized text hash plus model version as key. Eliminates 30 to 60 percent of embedding compute in retrieval systems, particularly valuable for static content like product descriptions or FAQs.
• Exact result cache uses full prompt as key, achieving 10 to 30 percent hit rates in enterprise chat due to repeated workflows. Returns responses in 0.3 to 10 milliseconds versus seconds for full model inference.
• Semantic result cache uses prompt embeddings and approximate nearest neighbor search with similarity thresholds like 0.85. Adds 10 to 25 percent hit rate beyond exact cache but introduces risks of false matches and stale answers.
• These layers compose in production Retrieval Augmented Generation (RAG) systems. Check exact result cache first (0.3 to 2ms), then semantic cache (5 to 20ms with vector search), then use embedding cache during retrieval (saves 10 to 50ms), finally invoke model with KV cache.
• Memory versus throughput tradeoff in KV cache: concurrent batch size is constrained by the GPU memory budget. A single 8,000 token sequence on a 70B parameter model can consume tens of gigabytes of KV memory, and at realistic batch sizes the total KV footprint can rival the model weights themselves.
📌 Examples
Pinterest runs billion scale vector search for home feed retrieval with p99 latency under 60 milliseconds by caching embeddings for popular pins and user profiles, avoiding recomputation on every feed load.
Meta feed ranking systems use massive memory caches to keep feature and embedding lookups under 5 milliseconds while serving hundreds of millions of reads per second, combining all three cache layers.
A typical RAG flow: canonicalize prompt, check exact cache (2ms), check semantic cache with HNSW index over 100M prompts (15ms at p95), run retrieval with cached embeddings (30ms), invoke model with KV cache (12 seconds for 2,000 tokens at 6ms per token).
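A schematic version of that flow is sketched below, with stub functions in place of the real vector index and model call; the ordering of the checks mirrors the flow above, and everything else is illustrative.

```python
# Schematic request path for a RAG service, assuming the lookup order described
# above. retrieve() and generate() are stubs standing in for vector search and
# the KV-cached model call.

exact_cache: dict[str, str] = {}

def canonicalize(prompt: str) -> str:
    """Normalize whitespace and casing so equivalent prompts share a key."""
    return " ".join(prompt.lower().split())

def retrieve(prompt: str) -> list[str]:
    return ["<retrieved passage>"]            # placeholder for vector search

def generate(prompt: str, context: list[str]) -> str:
    return f"<response to {prompt!r} using {len(context)} passages>"  # placeholder

def answer(prompt: str) -> str:
    canonical = canonicalize(prompt)

    # 1. Exact result cache: full canonical prompt as the key.
    if canonical in exact_cache:
        return exact_cache[canonical]

    # 2. Semantic result cache would be checked here (prompt embedding plus
    #    approximate nearest neighbor search, reuse above a 0.85 threshold).

    # 3. Retrieval (document embeddings served from the embedding cache where
    #    possible), then full generation relying on the KV cache inside the model.
    response = generate(canonical, retrieve(canonical))
    exact_cache[canonical] = response
    return response

print(answer("What is the KV cache?"))
print(answer("what is  the  KV cache?"))      # hits the exact cache after canonicalization
```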