Three Layers of Model Caching: KV, Embedding, and Result
THREE CACHE LAYERS
KV Cache: For transformer-based language models, stores the attention key-value pairs computed during generation. Each new token reuses prior tokens' cached KV states instead of recomputing the entire sequence. Enables a 10-100x speedup for long-sequence generation by turning O(n²) recomputation into O(n) incremental work.
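A minimal sketch of the idea with a toy single-head attention layer (not a real transformer; the names d, K_cache, and attend are illustrative): each step computes K and V only for the new token, appends them to the cache, and attends over everything cached so far.

```python
import numpy as np

d = 4  # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention for a single query vector
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache, V_cache = [], []
outputs = []
for t in range(5):                     # one "token" per step
    x = rng.standard_normal(d)         # stand-in for the new token's hidden state
    K_cache.append(x @ Wk)             # compute K, V for the NEW token only...
    V_cache.append(x @ Wv)
    q = x @ Wq
    # ...then attend over all cached K, V instead of re-encoding the prefix
    outputs.append(attend(q, np.stack(K_cache), np.stack(V_cache)))
```

Without the cache, step t would recompute K and V for all t prior tokens, which is where the O(n²) total work comes from.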
Embedding Cache: Stores computed embeddings (fixed-length vector representations) for entities like users, items, or documents. One embedding computation per entity per model version, reused across millions of requests. A user embedding computed once can serve all of that user's recommendations for hours.
Result Cache: Stores complete model outputs for specific inputs. If the exact same query appears again, return the cached result directly with no model computation at all. Works when queries repeat frequently, or when semantic similarity allows approximate matching.
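A minimal exact-match result cache might look like the following sketch: hash a normalized form of the query and return the stored output on repeats. The normalization step (lowercasing, collapsing whitespace) is an assumption about what counts as "the exact same query", and run_model is a placeholder for real inference.

```python
import hashlib

_results = {}
calls = {"model": 0}   # counts how often real inference runs

def _key(query):
    # cheap normalization so trivially different strings share a cache entry
    norm = " ".join(query.lower().split())
    return hashlib.sha256(norm.encode()).hexdigest()

def run_model(query):
    calls["model"] += 1                      # placeholder for expensive inference
    return f"answer for: {query.lower()}"

def cached_infer(query):
    k = _key(query)
    if k not in _results:                    # only a miss pays the model cost
        _results[k] = run_model(query)
    return _results[k]
```

Semantic (approximate) matching would replace the hash lookup with a nearest-neighbor search over query embeddings plus a similarity threshold; the structure is otherwise the same.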
WHEN CACHING HELPS
Caching value depends on two factors: repetition rate and per-inference cost. A search system with 10% exact query repetition and 50ms inference time saves roughly 5ms of average latency, since one request in ten skips inference entirely. A recommendation system where 80% of users are returning users can cache user embeddings, saving 30-40% of compute cost.
The math: at hit rate h, average latency is h × hit cost + (1 − h) × miss cost. With a 1ms cache hit and a 100ms miss, a 50% hit rate cuts average latency from 100ms to 50.5ms, and a 90% hit rate cuts it to 10.9ms. If the 1ms lookup is also paid on every miss, caching breaks even at roughly a 1% hit rate; below that, the lookup overhead costs more than the hits save.
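The figures above can be recomputed directly (the 1ms/100ms costs come from the text; folding the lookup overhead into the miss cost for the averages is a simplifying assumption):

```python
HIT_MS, MISS_MS, LOOKUP_MS = 1.0, 100.0, 1.0

def avg_latency(hit_rate, hit_ms=HIT_MS, miss_ms=MISS_MS):
    # hits cost 1 ms; misses cost the full 100 ms (lookup overhead folded in)
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# Break even when the savings on hits cover the lookup overhead paid on misses:
# h * hit + (1 - h) * (miss + lookup) = miss  =>  h = lookup / (miss + lookup - hit)
break_even = LOOKUP_MS / (MISS_MS + LOOKUP_MS - HIT_MS)   # 0.01, i.e. 1%
```

Evaluating avg_latency(0.5) gives 50.5ms and avg_latency(0.9) gives 10.9ms, matching the numbers in the text.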