
How Do PagedAttention and Prefix Caching Optimize Memory Management?

The Fragmentation Problem

PagedAttention manages KV cache memory by allocating it in fixed-size pages or blocks rather than contiguous arrays, solving the severe fragmentation problem in traditional approaches. When each request allocates a variable-length contiguous buffer for its KV cache, memory becomes fragmented with unusable gaps between allocations. As requests complete and new ones arrive with different sequence lengths, this fragmentation worsens until the system runs out of memory despite having sufficient total free space. Naive implementations report up to 80% memory waste from fragmentation. PagedAttention reduces waste to under 4%, enabling significantly higher concurrency.
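The failure mode above can be made concrete with a toy first-fit allocator (sizes are illustrative, not real GPU numbers): contiguous allocation can fail even when the total free memory is sufficient, while a paged allocator only cares about the total count of free pages.

```python
# Toy sketch: why contiguous KV buffers fragment but pages do not.

def contiguous_alloc(free_gaps, size):
    """First-fit over a list of (offset, length) gaps; True on success."""
    for i, (off, length) in enumerate(free_gaps):
        if length >= size:
            free_gaps[i] = (off + size, length - size)
            return True
    return False

# After several variable-length requests finish, memory is left as gaps:
gaps = [(0, 3), (10, 3), (20, 3)]           # 9 blocks free, but scattered
total_free = sum(length for _, length in gaps)

print(total_free >= 8)                      # True: enough memory in total
print(contiguous_alloc(gaps, 8))            # False: no single gap fits 8

# A paged allocator needs only 8 free pages anywhere in the pool:
free_pages = 9
print(free_pages >= 8)                      # True: the request fits
```

The paged case has no placement constraint at all, which is why its waste is bounded by at most one partially filled page per sequence.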

How PagedAttention Works

The page-based approach maintains a per-request list of page references, similar to virtual memory in operating systems. When a sequence grows by one token and needs more KV storage, the system allocates a new page from a global pool and appends it to that request's page list. Attention computation is modified to gather keys and values from potentially non-contiguous pages. This indirection adds minimal overhead, typically under 3% latency, while providing enormous flexibility. Pages can be shared across requests for beam search or prefix reuse, and freed pages are immediately available for other requests without defragmentation.
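The bookkeeping side of this scheme is small enough to sketch. The class below is illustrative (names and the `PAGE_SIZE` of 16 tokens are assumptions, not vLLM's actual API): a global free pool, a per-request page table, and allocation only on page boundaries.

```python
# Minimal sketch of page-based KV cache bookkeeping (illustrative API).

PAGE_SIZE = 16  # tokens per page; real block sizes are typically 8-32

class PagePool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # global pool of free page ids
        self.tables = {}                     # req_id -> list of page ids
        self.lengths = {}                    # req_id -> tokens stored

    def append_token(self, req_id):
        """Called per generated token; allocates only on a page boundary."""
        n = self.lengths.get(req_id, 0)
        if n % PAGE_SIZE == 0:               # current page full, or first token
            if not self.free:
                raise MemoryError("pool exhausted; preempt or queue request")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Freed pages return to the pool immediately; no defragmentation."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

pool = PagePool(num_pages=4)
for _ in range(20):                  # 20 tokens -> ceil(20 / 16) = 2 pages
    pool.append_token("req-A")
print(len(pool.tables["req-A"]))     # 2
pool.release("req-A")
print(len(pool.free))                # 4: pages are instantly reusable
```

The attention kernel then gathers K/V through the page table, which is the only part that touches the GPU; everything above is cheap host-side metadata.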

Prefix Caching

Prefix caching with radix tree matching identifies and shares common prompt prefixes across different requests. Many production workloads have repeated or similar prompts: system instructions, few-shot examples, or document preambles. Instead of storing identical KV cache data multiple times, the system maintains a radix tree over tokenized sequences. When a new request arrives, it traverses the tree to find the longest matching prefix, reuses those KV pages, and only computes attention for the novel suffix. Combined with LRU eviction and cache-aware scheduling that prioritizes requests with high cache hit rates, this can dramatically reduce redundant computation.
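The longest-prefix lookup can be sketched with a plain trie over token ids (a real radix tree compresses runs of tokens and maps nodes to whole pages rather than one page per token, so treat the structure below as a simplified illustration):

```python
# Hedged sketch of longest-prefix KV reuse via a trie over token ids.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.page = None   # KV page ref for this token (illustrative)

def insert(root, tokens, pages):
    """Record a cached sequence and its KV page references."""
    node = root
    for tok, page in zip(tokens, pages):
        node = node.children.setdefault(tok, TrieNode())
        node.page = page

def longest_prefix(root, tokens):
    """Return (matched length, reusable KV page refs) for a new request."""
    node, pages = root, []
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        pages.append(node.page)
    return len(pages), pages

root = TrieNode()
insert(root, [1, 2, 3, 4], ["p0", "p1", "p2", "p3"])   # cached request
n, reused = longest_prefix(root, [1, 2, 3, 9, 9])      # new request
print(n)       # 3 -> only the suffix [9, 9] needs fresh attention
```

Because matching walks token ids rather than raw strings, two prompts share cache only when they tokenize identically, which is exactly the property the failure-mode discussion below depends on.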

Failure Modes

The key failure mode is incorrect matching: prompts that differ only in whitespace, punctuation, or parameter values must be treated as distinct, or the wrong cached KV will be reused and generate incorrect results. Cache invalidation on model updates and TTL expiration are also critical; a stale cache can return outdated responses even after the model has been improved or fine-tuned.
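A minimal sketch of the pitfall, assuming a hypothetical cache-key scheme: keying on "cleaned" text collides prompts that actually tokenize differently, while keying on exact token ids plus a model-version tag (names below are made up for illustration) keeps distinct prompts distinct and supports invalidation on redeploy.

```python
# Why cache keys must be exact token sequences, not normalized text.

MODEL_VERSION = "m-2024-06"   # illustrative tag; bump on redeploy/fine-tune

def unsafe_key(prompt: str):
    return " ".join(prompt.split())          # BUG: collapses whitespace

def safe_key(token_ids, model_version=MODEL_VERSION):
    return (model_version, tuple(token_ids))

a = "Answer:  yes"   # double space -> may tokenize differently than b
b = "Answer: yes"
print(unsafe_key(a) == unsafe_key(b))        # True -> wrong KV reused

# Exact token ids (values here are made up) stay distinct:
print(safe_key([3, 41, 7]) == safe_key([3, 42, 7]))          # False
# And a version bump invalidates everything cached for the old model:
print(safe_key([3, 41, 7], "m-2024-01") == safe_key([3, 41, 7]))  # False
```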

💡 Key Takeaways
- PagedAttention allocates KV cache in fixed-size pages instead of contiguous arrays, reducing memory fragmentation from up to 80% waste in naive allocation down to under 4%
- Page-based management maintains per-request page lists and allocates from a global pool, enabling sharing across beam search branches and immediate reuse of freed pages without defragmentation
- Prefix caching with radix tree matching identifies common prompt prefixes across requests and shares their KV pages, avoiding redundant computation for repeated system instructions or conversation history
- Least Recently Used (LRU) eviction prioritizes keeping hot prefixes in cache while discarding cold entries, and cache-aware scheduling prioritizes requests with high expected hit rates to maximize reuse
- Production systems report lower latency for repeated prompts because only the unique suffix requires inference; multi-turn conversations reuse the full conversation-history prefix across turns
- The critical failure mode is incorrect cache key matching: prompts differing only in whitespace or parameters must be distinguished, or the wrong cached KV produces incorrect results; model updates require cache invalidation
📌 Interview Tips
1. Production serving systems built on PagedAttention (e.g. vLLM) report 3× to 4× more concurrent sessions before out of memory (OOM) compared to naive contiguous allocation, particularly benefiting workloads with heterogeneous sequence lengths
2. Prefix caching in multi-turn chatbots reuses the system instruction plus conversation-history prefix, computing attention only for the new user message and assistant response, reducing latency by 50% for turns after the first
3. A serving stack with 24 GB GPU memory and a 7B model (14 GB of weights) can fit 10 to 12 concurrent 2k-token sessions with PagedAttention versus only 3 to 4 with naive allocation, due to fragmentation
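The 24 GB / 7B sizing in the tips above checks out with back-of-envelope arithmetic, assuming a Llama-7B-like shape (32 layers, 32 KV heads, head dimension 128) and fp16 KV entries:

```python
# KV cache sizing: bytes per token = 2 (K and V) * layers * heads * dim * dtype.

layers, kv_heads, head_dim, bytes_per = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per
print(kv_per_token)                        # 524288 bytes = 0.5 MiB per token

session_tokens = 2048
session_bytes = session_tokens * kv_per_token          # 1 GiB per session
free_gib = 24 - 14                                     # left after weights
sessions = free_gib * (1 << 30) // session_bytes
print(sessions)                            # 10 concurrent 2k-token sessions
```

With naive contiguous allocation, fragmentation waste of 60 to 80% on that same 10 GiB budget leaves room for only 2 to 4 such sessions, which is where the 3 to 4 session figure comes from.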