How Do PagedAttention and Prefix Caching Optimize Memory Management?
The Fragmentation Problem
PagedAttention manages KV cache memory by allocating it in fixed-size pages (blocks) rather than contiguous arrays, solving the severe fragmentation problem of traditional approaches. When each request allocates a variable-length contiguous buffer for its KV cache, memory becomes fragmented with unusable gaps between allocations. As requests complete and new ones arrive with different sequence lengths, fragmentation worsens until the system runs out of memory despite having sufficient total free space. Naive implementations report up to 80% memory waste from fragmentation; PagedAttention reduces waste to under 4%, enabling significantly higher concurrency.
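The memory arithmetic above can be made concrete with a small sketch (illustrative numbers, not vLLM code): a contiguous allocator must reserve each request's maximum possible length up front, while a paged allocator holds only whole pages, wasting at most one partial page per request.

```python
# Compare KV-cache slots reserved by a contiguous allocator versus a paged
# allocator for the same set of in-flight sequences. PAGE_SIZE and
# MAX_SEQ_LEN are hypothetical values chosen for illustration.
PAGE_SIZE = 16          # tokens per KV page
MAX_SEQ_LEN = 2048      # contiguous allocator reserves this per request

def contiguous_slots(seq_lens):
    # Every request pins a full max-length buffer regardless of actual use.
    return len(seq_lens) * MAX_SEQ_LEN

def paged_slots(seq_lens):
    # Each request holds just enough whole pages for its current length,
    # so waste is at most PAGE_SIZE - 1 tokens per request.
    return sum(-(-n // PAGE_SIZE) * PAGE_SIZE for n in seq_lens)

seq_lens = [37, 512, 120, 9]          # current lengths of four requests
print(contiguous_slots(seq_lens))     # 8192 slots reserved
print(paged_slots(seq_lens))          # 704 slots reserved
print(sum(seq_lens))                  # 678 slots actually used
```

With these numbers the contiguous scheme reserves over 11x the memory actually in use, which is the headroom paged allocation converts into extra concurrent requests.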
How PagedAttention Works
The page-based approach maintains a per-request list of page references, analogous to page tables in operating-system virtual memory. When a sequence grows by one token and needs more KV storage, the system allocates a new page from a global pool and appends it to that request's page list. Attention computation is modified to gather keys and values from potentially non-contiguous pages. This indirection adds minimal overhead, typically under 3% added latency, while providing enormous flexibility: pages can be shared across requests for beam search or prefix reuse, and freed pages are immediately available to other requests without defragmentation.
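The bookkeeping described above can be sketched in a few lines. This is a minimal illustration with assumed names (`PagePool`, `Request`), not vLLM's actual API: a global free pool, a per-request block table, and allocation only when the last page fills up.

```python
# Sketch of PagedAttention-style page bookkeeping (illustrative, not vLLM).
PAGE_SIZE = 16  # tokens per KV page (hypothetical)

class PagePool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # indices of free physical pages

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, pages):
        # Freed pages are immediately reusable; no defragmentation needed.
        self.free.extend(pages)

class Request:
    def __init__(self, pool):
        self.pool = pool
        self.pages = []        # block table: logical position -> physical page
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new page only when the previous one is exactly full.
        if self.num_tokens % PAGE_SIZE == 0:
            self.pages.append(self.pool.alloc())
        self.num_tokens += 1

pool = PagePool(num_pages=8)
req = Request(pool)
for _ in range(33):            # 33 tokens -> 3 pages (16 + 16 + 1)
    req.append_token()
print(len(req.pages))          # 3
pool.release(req.pages)        # pages return to the pool instantly
```

The attention kernel then uses the block table to gather K/V from these scattered physical pages; sharing a page across requests is just two block tables referencing the same physical index (with reference counting before release).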
Prefix Caching
Prefix caching with radix-tree matching identifies and shares common prompt prefixes across different requests. Many production workloads have repeated or similar prompts: system instructions, few-shot examples, or document preambles. Instead of storing identical KV cache data multiple times, the system maintains a radix tree over tokenized sequences. When a new request arrives, it traverses the tree to find the longest matching prefix, reuses those KV pages, and computes attention only for the novel suffix. Combined with LRU eviction and cache-aware scheduling that prioritizes requests with high cache hit rates, this can dramatically reduce redundant computation.
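The lookup step can be sketched as a token-level trie (a real radix tree additionally compresses chains of single-child nodes, but the matching logic is the same). All names here are illustrative, and the KV handles are stubbed out:

```python
# Hedged sketch of longest-prefix matching over token IDs for KV reuse.
class TrieNode:
    def __init__(self):
        self.children = {}    # token id -> TrieNode
        self.kv_page = None   # handle to the cached KV page (stub here)

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        # Record a fully computed sequence so later requests can reuse it.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def longest_prefix(self, tokens):
        # Return how many leading tokens already have cached KV.
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])               # e.g. cached system prompt
matched = cache.longest_prefix([1, 2, 3, 9, 9])
print(matched)                              # 3: only tokens [9, 9] need compute
```

In a serving engine the matched nodes would carry references to the shared KV pages from the page pool, so a hit turns most of the prefill into page-table updates rather than attention computation.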
Failure Modes
The key failure mode is incorrect matching: prompts that differ only in whitespace, punctuation, or parameter values must be treated as distinct, or the wrong cached KV will be reused and produce incorrect results. Cache invalidation on model updates and TTL expiration are also critical; a stale cache can return outdated responses even after the model has been improved or fine-tuned.
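These guardrails amount to two rules: match on exact token IDs rather than normalized text, and key every cache entry by model version with a TTL. A minimal sketch, with illustrative names and token IDs:

```python
# Sketch of cache keying that avoids the failure modes above (illustrative).
import time

def cache_key(model_version, token_ids):
    # Exact token matching: prompts differing only in whitespace tokenize
    # to different IDs, so they never collide on the same key.
    return (model_version, tuple(token_ids))

class Entry:
    def __init__(self, kv_pages, ttl_s):
        self.kv_pages = kv_pages
        self.expires_at = time.monotonic() + ttl_s

def lookup(cache, model_version, token_ids):
    entry = cache.get(cache_key(model_version, token_ids))
    if entry is None or time.monotonic() > entry.expires_at:
        return None        # miss or expired: recompute, never serve stale KV
    return entry.kv_pages

cache = {cache_key("v2", (15496, 995)): Entry(kv_pages=["p0"], ttl_s=60.0)}
print(lookup(cache, "v2", (15496, 995)))   # ['p0']: exact hit
print(lookup(cache, "v3", (15496, 995)))   # None: model updated, key differs
```

Bumping the version string on every model update or fine-tune makes invalidation automatic, at the cost of a cold cache until the new prefixes are repopulated.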