LLMs · Systems

Efficient Memory Management for Large Language Model Serving with PagedAttention

The KV cache as pages, like an operating system.

An LLM serving system’s throughput is set by how many requests share the GPU’s memory at once. PagedAttention lays the KV cache out the way an operating system lays out RAM, fixed-size blocks plus a translation table, and packs roughly 96% of the reserved memory with real data instead of 20-38%.

Explaining the paperEfficient Memory Management for Large Language Model Serving with PagedAttentionKwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica · UC Berkeley · SOSP 2023 · arXiv:2309.06180 ↗

Serving a large language model cheaply is a memory-allocator problem before it is a kernel problem, and the allocator most servers ship is sixty years behind the operating systems they run on.

A served language model spends most of its life decoding one token at a time. Per request, the work is tiny: a few dozen matrix-vector products that read the model weights and the request’s key-value cache, and emit a single integer. The model weights are the same for every request, so once they are loaded the GPU is mostly waiting on memory rather than running out of arithmetic. The only way to get more throughput is to run more requests in parallel and amortize that weight load.

But each in-flight request carries its own state. For a Transformer that state is the KV cache: the keys and values produced by every previous token, stored so the next token’s attention can read them without recomputing. On OPT-13B a single token of KV cache is about 800 KB (2 for K and V, times 5120 for the hidden width, times 40 layers, times 2 bytes for fp16); a 2048-token sequence is 1.6 GB. Pile up enough requests and the KV cache, not the weights, sets the batch size.

PagedAttention, out of UC Berkeley, makes one simple observation. Existing LLM servers store the KV cache the way a deep-learning framework stores any tensor, a single contiguous slab per request, and that layout wastes most of the slab. Borrow virtual memory and paging from operating systems, store the cache in non-contiguous fixed-size blocks, and you reclaim almost all of it. The system they build on top, vLLM, runs OPT and LLaMA at 2 to 4 times the throughput of the previous best, on the same hardware, with the same accuracy.

This explainer covers why the KV cache dominates GPU memory, what the contiguous layout wastes, how paging maps onto attention, and what the resulting packing buys in throughput. The interactive figures run a small worked example for each piece.

The KV cache sets the batch size

The attention layer is what lets each token read the tokens before it. For an input position $i$ with hidden state $x_i$ , the layer first projects three linear views, the query, key, and value:

q_i = W_q\,x_i, \quad k_i = W_k\,x_i, \quad v_i = W_v\,x_i

(2)

The output at position $i$ is a softmax-weighted average over the value vectors at every earlier position:

a_{ij} = \frac{\exp(q_i^\top k_j / \sqrt{d})}{\sum_{t=1}^{i} \exp(q_i^\top k_t / \sqrt{d})}, \qquad o_i = \sum_{j=1}^{i} a_{ij}\,v_j

(3)

Because the output depends on every earlier $k_j$ and $v_j$ , the KV cache has to be kept around. Generation runs in two phases. The prompt phase sees the whole input at once and produces all the prompt’s K and V vectors in a single matrix-matrix multiply, fast on a GPU. The generation phase then takes one token, computes its lone new $k$ and $v$ , appends them to the cache, and runs equation (3) against the entire cache to produce one output. Then a new token. Then another. Each step is a thin matrix-vector product against a growing pile of bytes, which is why decoding is memory-bound: the arithmetic is small, the read is big.

For batching purposes the relevant number is the KV bytes per token, per layer. On OPT-13B that is $2 \times 5120 \times 2 = 20{,}480$ bytes per layer per token, multiplied by 40 layers to give about 800 KB per token; on a 2048-token sequence that grows to 1.6 GB. An A100 with 40 GB and 26 GB of weights (13B parameters at 2 bytes each) leaves about 12 GB for KV, room for only a handful of full-length requests at once. The figure below holds one request and lets you walk it forward token by token, so you can see how the read at each step grows in lockstep with the cache. As the cache row stretches and the bottom bar grows, the work per step is exactly this much memory traffic, every step.

The KV cache problem is the canonical memory bottleneck of LLM serving, and it is the same fact that motivates two sibling lines of work this site covers. The closest neighbor is multi-query attention, which cuts the cache itself by sharing one set of keys and values across heads (and grouped-query attention in the Mistral page is the modern compromise). PagedAttention takes the opposite angle: leave the cache the size it is, and lay it out so almost none of the reserved bytes are wasted.

Three shapes of waste

Why does the old layout waste anything? Because to make attention legal, deep-learning frameworks expect a tensor to live in one contiguous span of memory, and the server does not know up front how long a request will be. Prior servers (FasterTransformer, Orca) handled the unknown length by allocating, at request arrival, a contiguous slab big enough for the worst case: the model’s maximum sequence length, 2048 tokens for OPT. Then the request runs and only a fraction of the slab ever fills.

That leaves three distinct chunks of dead memory, and the figure below makes them visible. Drag either request’s length and watch the bands shift:

Figure 1 · three shapes of waste

req A generated2 tok

req B generated2 tok

Two requests each reserve a max-length KV slab up front. The teal cells are real prompt and generated KV. The amber dashed region at the right of each slab is internal fragmentation: bytes the request will eventually fill but does not need yet. The thin violet band between the two slabs is external fragmentation: bytes the buddy allocator (which hands out memory in power-of-two-sized chunks) left behind in the gap, unusable by anyone because no slab fits. The meter shows what fraction of the reserved memory is real KV.

The three names pin down distinct things: reservation is space the request will use eventually but has not filled yet, internal fragmentation is space inside the request’s slab that it will never fill (its actual output came in shorter than 2048), and external fragmentation is space between requests that the allocator could not pack tightly because the slabs come in different sizes. The paper measures all three across four systems on the ShareGPT trace and lays them out in figure 2 of the paper, which we redraw next.

Figure 2 · reserved memory, broken down

Average percentage of KV cache memory doing useful work, across four systems on ShareGPT. Orca (Max) reserves the full 2048-token slab for every request and uses only 20.4% of it. Orca (Pow2) rounds the slab to the next power of two and recovers some ground (26.8%). Orca (Oracle), the unachievable upper bound that assumes each request’s true output length is known in advance, still only hits 38.2%; the rest is allocator waste. vLLM uses 96.3%. Numbers are the paper’s, §6.2 / Fig 2.

Notice what the Oracle row implies. Even if a server somehow knew, at request arrival, exactly how many tokens each request would generate, the contiguous layout would still waste 60% of the KV memory. The allocator can only hand out slabs in sizes it can place contiguously, so even perfect length prediction cannot pack a heterogeneous workload.

The contiguous layout also cannot share. When two output samples from the same prompt would naturally use identical K and V vectors for the prompt portion, two contiguous slabs cannot point at the same bytes. The KV cache for the prompt gets duplicated per sample. The paper later measures up to 55% memory savings from sharing in beam search, savings the old layout simply cannot collect.

Borrow paging from the operating system

The same problem appears in operating systems, which solved it sixty years ago. A process running in an OS does not get a single contiguous block of physical RAM. Its memory is split into fixed-size pages, each page lives wherever the OS decides to put it, and a per-process page table translates the program’s contiguous-looking virtual addresses into the physical ones. Pages are allocated lazily, freed when the process is done with them, and shared between processes when the contents are identical (think a shared library). That machinery exists because OS designers, like LLM servers, had to manage many processes’ unpredictable memory in a fixed pool, and the contiguous layout broke for the same reasons.

PagedAttention transplants that machinery. A request’s KV cache becomes a list of fixed-size KV blocks; each block holds the K and V vectors for some small number of tokens (the paper picks 16 by default). The sequence sees a contiguous list of logical KV blocks; the GPU’s KV pool holds physical KV blocks at non-contiguous addresses; a block table per request maps logical to physical. New blocks are allocated only when the sequence’s last logical block is full and a new token needs space. Internal fragmentation drops to at most one block per request. External fragmentation disappears entirely, because every block is the same size and any free block fits any new allocation. The vocabulary translates cleanly: tokens are bytes, KV blocks are pages, requests are processes.

The translation buys two things at once. First, the memory accounting: the wasted bytes go away because allocation is on-demand and uniform. Second, sharing: because a physical block is identified by an address and a reference count, two logical blocks (in different sequences) can point at the same physical block as long as nobody writes into it. Read-only sharing makes parallel sampling and beam search efficient, and copy-on-write handles the case where a shared block is written, covered below.

PagedAttention: read the scattered blocks

Spreading the KV cache across blocks would normally break the attention kernel, which assumes its K and V tensors are contiguous so it can load them with one wide read. The paper’s name comes from the kernel change that makes that layout legal. Equation (3) is rewritten over blocks:

A_{ij} = \frac{\exp(q_i^\top K_j / \sqrt{d})}{\sum_{t=1}^{\lceil i/B \rceil} \exp(q_i^\top K_t / \sqrt{d})\,\mathbf{1}}, \qquad o_i = \sum_{j=1}^{\lceil i/B \rceil} V_j\,A_{ij}^\top

(4)

The symbols unpack as follows. The block size is $B$ , the count of blocks already filled for the query at position $i$ is $\lceil i/B \rceil$ , and $K_j = (k_{(j-1)B+1}, \dots, k_{jB})$ is the K matrix stacking the keys for the $j$ -th block. $A_{ij}$ is the row of attention weights this query puts on block $j$ , a length- $B$ vector. Equation (4) does the same mathematics as equation (3); it just groups the sum by block instead of by token, so the kernel can fetch one K block from its physical address, compute the partial scores, fetch the next, and so on.

The denominator is the only part that needs care. A softmax’s normalizer is a global sum over every score, so the kernel cannot finish the denominator until it has seen every block. The same running-max trick that powers FlashAttention works here: keep a running maximum and a running denominator, rescale the partial output every time the maximum bumps up. The output equals the contiguous-kernel result up to floating-point summation order, which is what the paper means by "no accuracy change."

The figure below shows one step of decoding in motion. The query is at the bottom; three of the request’s KV blocks live at different addresses in the physical pool above; the kernel walks the block table, fetches each block in turn, and folds its contribution into the running output:

Figure 3 · attention over scattered KV blocks

The current query token sits below. The three KV blocks belonging to this request occupy non-contiguous spots in the GPU’s physical pool. The kernel walks them in logical order, fetches

K_j, V_j

from each address, and folds the partial scores into a single output

o_i

. The math is eq (4), exact.

What changes about the kernel’s real work? Three things. The block table costs an extra indirection per block: the kernel reads where to go before reading what to compute. The kernel also has to handle blocks of varying validity (the last block may be partly empty), and it can no longer do a single large stride load because the addresses change between blocks. Section 7.1 measures the total cost: a 20 to 26% slower attention kernel than the highly-tuned FasterTransformer baseline. The end-to-end throughput improvement comes from packing more requests in, not from the attention being faster per call.

For grounding, here is the contiguous kernel that this replaces, and the paged version side by side, so the block-table loop is visible:

# attention with a contiguous K, V (the old way)
S = (q @ K.T) / sqrt(d)            # K shape: [seq_len, d]; q: [d]
A = softmax(S)                     # one row, all seq_len entries
o = A @ V                          # V shape: [seq_len, d]; o: [d]

# attention over BLOCKWISE K, V at non-contiguous addresses (eq 4)
o = zeros(d)                       # accumulator for this query token
denom = 0                          # running softmax denominator
m = -inf                           # running row-max for numeric safety
for j in block_table[req]:         # walk the LOGICAL blocks in order
    K_j, V_j = pool[j]             # fetch block j from wherever it sits
    s = (q @ K_j.T) / sqrt(d)      # [B] scores against this block
    m_new = max(m, max(s))
    p = exp(s - m_new)             # local softmax weights
    rescale = exp(m - m_new)
    o = rescale * o + p @ V_j      # fold in this block's contribution
    denom = rescale * denom + sum(p)
    m = m_new
return o / denom                   # exact softmax-weighted output

The block table, step by step

The block table is one list per request, of the same length as that request’s logical block count, mapping each logical index to a physical block id plus a fill counter (how many of the block’s $B$ token slots are written). Allocation rule: when a new token arrives and the last logical block has room, write into that block and bump its counter; when it is full, grab a fresh physical block from the pool’s free list and append a new logical entry to the table. Free rule: when the request finishes, return all its physical blocks to the pool.

The figure shows it running on a small example. The block size is 4, the prompt is "Four score and seven years ago our" (7 tokens), and the model decodes more tokens one at a time. Drag the decode step and watch the block table grow on the right and the new mappings fall into cells spread through the pool:

Figure 4 · block table in action

decode step9/16

Left, the request’s logical KV blocks, filled left-to-right with prompt (teal) then generated (amber) tokens. Right, the GPU’s physical pool; only the cells the block table maps to are live. The curved arrows are the block table, one per logical block. When a logical block fills up, the next decode step allocates a fresh physical block somewhere in the pool, never adjacent to the previous one, and a new arrow appears.

Three properties to notice as you drag. One, the pool fills only as the request needs space; if a request ends at 7 tokens it never reserves a 9th block. Two, the physical blocks the request uses sit wherever the pool has room; there is no attempt to keep them adjacent, because adjacency does not buy anything: the kernel in equation (4) handles non-contiguous addresses. Three, the last logical block is almost always partly empty (between 1 and $B-1$ wasted slots), so internal fragmentation drops to at most one block per request, a few percent at $B = 16$ for sequences of even modest length.

That bound is the reason the choice of block size matters. Small $B$ means almost no internal waste but more block-table entries to manage and more kernel calls per attention step. Large $B$ means fewer calls but more wasted space in the last block, and worse, less opportunity for sharing across sequences (two sequences with a shared prefix that splits mid-block cannot share the split block at all). The paper sweeps it in §7.2 and settles on $B = 16$ : large enough for the kernel to do useful work per call, small enough that the partly-empty tail and the un-shareable split-block hits are both modest.

Sharing: copy-on-write across sequences

Once the KV cache is a list of pointers, the layout permits something the contiguous version cannot do: two logical blocks, in different sequences, can point at the same physical block. The paper uses this in three settings that all look the same once you frame them as block-table tricks.

The simplest is parallel sampling: a user sends a single prompt and asks for several different samples. Standard practice is to run the prompt phase once and then fork the model’s state into $n$ sequences that diverge as each samples its own next token. With contiguous slabs the prompt’s KV cache has to be duplicated $n$ times; with paged blocks each sample’s logical blocks for the prompt portion point at the same physical blocks, with a reference count tracking how many sequences share each one.

Diverging at the next-token boundary is fine if it falls on a block boundary, because the new token needs a fresh block anyway. But it almost never does. The last prompt block is usually partly empty, so the first generated token must be written into a block that other samples are still reading. That is exactly the case virtual-memory systems solved with copy-on-write: the system intercepts the write, allocates a fresh physical block, copies the existing bytes into it, drops the reference count on the original, and redirects the writing sample’s logical pointer to the new copy. The other samples keep their reference to the original.

Figure 5 · copy-on-write on a shared prompt

generated step1/2

Two samples from one prompt. At step 0 both samples’ logical blocks point at the same two physical blocks (#7 and #1, reference count 2). At step 1 each sample writes a different next token. Sample B’s write triggers a copy-on-write: vLLM allocates physical block #3, copies the partial prompt block into it, drops #1’s reference count to 1, and B writes its new token into #3. Sample A is unaffected; its block #1 still has both prompt tokens and its own next token, exactly as before.

Beam search plays the same copy-on-write pattern out in a tree. A width- $k$ beam keeps the $k$ most probable candidate sequences at every step; many of them share long prefixes, and the prefixes drift apart and re-converge as candidates rise and fall. Section 4.4 details a beam-search example where the candidates form a directed graph of shared physical blocks, joined where the candidates agree and split where they differ, with the same copy-on-write rule whenever a write needs to land in a shared cell. The paper measures 37.6 to 55.2% memory saved on Alpaca traces from this single mechanism.

A subtler case is the shared prefix, an instruction or few-shot example glued to the front of every request in a workload. With paging, the server pre-computes the prefix’s KV blocks once and gives every arriving request a block table that starts with pointers to those frozen blocks (marked copy-on-write so nobody clobbers them). The translation example from §6.4 measures a 3.58x throughput improvement over Orca (Oracle) with a 5-shot, 341-token prefix, almost entirely from not recomputing the prefix’s KV cache for every request.

Preemption: when the pool runs out

Even with denser packing, the GPU’s KV pool is finite, so requests eventually saturate it. Two questions show up: which request to evict, and what to do with its blocks. vLLM’s scheduler is first-come-first-served and evicts the most recently arrived request first (so earlier requests are preserved). On the recovery side it offers two options and the paper measures both.

Swapping follows the classical OS recipe: copy the evicted blocks to host RAM through PCIe and copy them back when the request resumes. Recomputation throws the blocks away and regenerates them by replaying the prompt phase when the request restarts; this only works for KV cache, because the prompt is still available and the model is deterministic given the prompt, so a fresh prompt phase reconstructs the same K and V vectors. The two methods trade differently against block size. Recompute is roughly constant in block size (its cost is the prompt-phase compute), while swap gets cheaper with bigger blocks (one PCIe round-trip per block, so fewer trips with larger blocks). At the paper’s default $B = 16$ they sit within a few percent of each other on the OPT-13B ShareGPT trace (§7.3), so vLLM ships swapping but the choice is workload-dependent.

vLLM evicts at the granularity of a sequence group, not an individual block, because the sequences inside a group (parallel samples, beam candidates) share blocks. If you evict one candidate’s logical block without evicting its siblings, the reference count on the underlying physical block goes nowhere and no memory is freed.

What the packing buys

The headline measurement, figure 12a in the paper, is normalized latency (seconds per generated token, averaged over requests) against request rate on OPT-13B and ShareGPT. Each system runs flat at low load and then knees upward as the in-flight set saturates GPU memory. vLLM’s knee sits 2 to 3 times further to the right than the best Orca variant; the same A100 holds more requests at any given latency target.

Figure 6 · capacity curves

Normalized latency vs request rate on OPT-13B with the ShareGPT trace. FasterTransformer knees first, Orca progressively later as it gets more accurate length predictions, and vLLM later still. The curves are stylized but the knees and relative ordering match §6.2 of the paper. vLLM sustains 1.7 to 2.7× the rate of Orca (Oracle), and 2.7 to 8× the rate of Orca (Max), at the same target latency.

Same hardware, same model, same accuracy. The lift comes from running more requests at once: figure 13a in the paper measures the average concurrent batch as 30.4 for vLLM versus 7.0 to 13.6 for Orca, a 2.2 to 4.3 times wider batch.

The improvement compounds in workloads that share KV cache. Parallel sampling brings 6.1 to 9.8% raw memory savings from prompt sharing (across sample counts up to $n=6$ ); beam search brings 37.6 to 55.2% (across beam widths up to $k=6$ ). The beam-search throughput lift over Orca (Oracle) goes from 1.3x in basic sampling to 2.3x at beam width 6. And the shared-prefix translation workload, where the prefix changes once per workload but each request’s inputs differ, hits 3.58x over Orca (Oracle) at a 5-shot, 341-token prefix.

Two limits to be exact about. First, the attention kernel itself is 20-26% slower per call (§7.1), so vLLM breaks even or loses on workloads where the GPU memory already comfortably holds the batch. Figure 12f, OPT-175B on Alpaca with 8 A100-80GBs, shows the curves converging because there is enough KV memory that even Orca (Pow2) can hold a deep batch. Second, the throughput claim is against the 2023 baselines (Orca, FasterTransformer); newer serving systems have since picked up paged management as a default, so the appropriate baseline today is "any modern serving system" rather than the original Orca. The paper’s lasting contribution is the layout idea, not the absolute 2-4x against those particular baselines.

Stepping back, the authors recast an unsolved scheduling problem in LLM serving as an operating-system problem from the 1960s, and port the solution across with the right amount of LLM-specific care: the all-or-nothing sequence-group eviction, the recompute fallback that swapping does not have access to, the kernel rewrite to tolerate non-contiguous blocks. It sits beside two siblings on the same cache. Multi-query attention shrinks the cache by sharing keys and values across heads. FlashAttention keeps a single call’s working set on chip. PagedAttention manages the long-lived cache across calls. Modern serving systems run all three at once, which is why a fresh draft of the same problem looks nothing like the 2023 baselines vLLM was measured against.

Provenance Verified against primary literature

Kilburn et al. (1962)One-level storage: the original virtual-memory paper, reference [25] in the paper.

MQA (2019)The other branch of the KV-cache lineage: cut the cache itself instead of managing it better.

FlashAttention (2022)Tiling cuts attention’s on-chip working set; complementary to paging the cache.

Orca (2022)Iteration-level scheduling, the baseline vLLM compares against. Paging is the missing piece.

OPT-13B specs800 KB per token = 2 × 5120 × 40 × 2 bytes (K and V, hidden 5120, 40 layers, fp16). Verified against the OPT release.

correctionThe 2-4× throughput headline holds against Orca and FasterTransformer in the paper’s 2023 evaluation. The speedup does not hold at every scale: the kernel pays a 20-26% attention-kernel overhead (§7.1), so the speedup comes from packing more requests in the same memory, not from the kernel being faster per request. We also separate "paging" from "iteration-level scheduling"; vLLM does both, but the paper credits scheduling to Orca (§9).

Questions you might still have

Why does an LLM server’s throughput hinge on memory at all?
Decoding generates one token at a time, so a single request barely uses the GPU’s arithmetic. Batching many requests amortizes the model weights, but each request needs a private KV cache that grows with its length. The number of requests you can fit at once is set by KV memory, not FLOPs, and that batch size sets throughput.

Is the attention output identical to a normal kernel’s?
Yes. Eq (4) splits the softmax denominator and the value mixture across blocks but never approximates either. The denominator is accumulated with the online-softmax trick (a running max so partial scores stay numerically safe), and the final output equals the contiguous-kernel result up to floating-point summation order.

What does PagedAttention NOT speed up?
The attention kernel itself, slightly. Section 7.1 measures a 20-26% per-attention-call overhead from the extra block-table lookup. The end-to-end win comes from running more requests in the same memory; if your workload already fits comfortably (short prompts, big GPU) the gain shrinks toward compute-bound (Fig 12f, OPT-175B on Alpaca).

How does this relate to FlashAttention?
They solve different memory problems. FlashAttention keeps the per-call attention working set on-chip so a single forward pass moves less data through HBM. PagedAttention manages the long-lived KV cache across many requests so more of them fit in HBM at once. vLLM actually adopts FlashAttention’s kernel patterns for the inner loop while replacing how the cache is laid out across calls.

How does this compare to multi-query attention?
Multi-query attention attacks the SIZE of the KV cache by sharing one set of keys and values across heads. PagedAttention takes the cache as given and changes its layout. The two are complementary, and modern serving systems combine grouped-query attention (the MQA family, where several query heads share one set of keys and values, a middle ground between full multi-head attention and a single shared head) with paged management.

Footnotes & further reading

The paper: Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica, Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). The vLLM project is at github.com/vllm-project/vllm.
Sibling KV-cache paper: Multi-Query Attention (Shazeer 2019, arXiv:1911.02150) cuts the cache itself by sharing keys and values across heads. PagedAttention manages the cache as it is. Modern serving systems combine the two.
Complementary attention work: FlashAttention (Dao et al. 2022, arXiv:2205.14135) tiles a single attention call to keep its working set on-chip. vLLM’s kernel borrows that tiling idea while replacing how the KV cache is laid out across calls.
Grouped-query attention as the modern MHA / MQA compromise sits in the Mistral 7B explainer.
The OS analogy traces back to Kilburn, Edwards, Lanigan, Sumner, One-level storage system (IRE Transactions on Electronic Computers, 1962), reference [25] in the paper.
Orca, the paper’s primary baseline: Yu, Jeong, Kim, Kim, Chun, Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022). Iteration-level scheduling is its idea; vLLM keeps it and adds paging.

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.