Mistral 7B

A 7B model that beats a 13B, by fixing attention.

Mistral 7B is, under the hood, an ordinary Llama-style transformer. What makes it punch above its weight is a handful of changes to attention that make it cheap to run. This is an efficiency paper, and the savings all land in one place: the KV cache.

Explaining the paper: Mistral 7B · Jiang, Sablayrolles, Mensch, et al. · Mistral AI · 2023 · arXiv:2310.06825

How does a 7-billion-parameter model keep up with one nearly twice its size? Mostly by being cheaper to run.

In late 2023 a small team at Mistral AI released a 7-billion-parameter language model that did something the size charts said it should not. Mistral 7B beat Llama 2 13B, a model with almost twice as many parameters, on nearly every standard benchmark, and it matched or beat Llama 1 34B on reasoning, mathematics, and code. It shipped under an Apache 2.0 license, which meant anyone could use it for anything, and for a while it became the default open model people reached for.

The surprising part is how little of the paper is about the model itself. There is no secret training recipe in it, no exotic new layer, no claim about the data. Mistral 7B is a fairly standard transformer in the Llama lineage. It uses the same now-ordinary ingredients Llama popularized: rotary positional embeddings (RoPE), a SwiGLU feed-forward block, and RMSNorm. (The paper does not actually walk through those three. It treats them as settled and inherited, and so will we. They are confirmed by the reference implementation, not by the text.) What the paper actually details is narrower: how to make the model cheap to run.

That emphasis matters for the rest of this explainer. For a model you train once and then deploy, the bill that never stops arriving is the inference bill. You pay it on every token you ever generate, for every user, forever. And almost all of that cost, at the long context lengths people actually want, is memory traffic in the attention layer rather than arithmetic. The bottleneck is the reading and writing of something called the KV cache. Mistral 7B applies four ideas to that cache: grouped-query attention, a sliding window, a rolling buffer, and chunked pre-fill. Understand those and you have the paper.

The cost that matters is inference

A language model generates text one token at a time. To produce the next word it runs a full forward pass through all its layers (one trip through the network, input to output), appends the new token, and runs another full pass for the word after that. This is what autoregressive means: each token is conditioned on every token before it, so they have to be produced in sequence, one pass each.

Two things get loaded from memory on every one of those passes. The first is the model's weights, which is a fixed cost set by the parameter count. The second grows: the running memory of everything said so far, held in the attention layers. At a short prompt this second cost is nothing. At a long one it dominates, and it is the part Mistral attacks. The model has to be accurate and keep that growing cost small, so a 7B model can serve long contexts on hardware that would choke on a larger one.

Most of the architecture is unremarkable; the two lines that matter most are n_kv_heads and window_size. The full configuration, so the numbers later have somewhere to land:

dim          = 4096     # model width (d_model)
n_layers     = 32       # transformer blocks stacked
n_heads      = 32       # query heads
n_kv_heads   = 8        # key/value heads -> GQA, groups of 4
head_dim     = 128      # per-head width (32 x 128 = 4096)
hidden_dim   = 14336    # SwiGLU feed-forward inner width
window_size  = 4096     # sliding-window attention radius (W)
context_len  = 8192     # trained sequence length
vocab_size   = 32000    # SentencePiece tokens

The same number shows up twice for unrelated reasons. dim = 4096 is the model's width, the length of each token's vector. window_size = 4096 is how far back attention reaches. They are equal by coincidence and mean completely different things. And neither is the context length, which is $8192$.

Attention, and the KV cache

Start with what attention does on a single layer, because the cache falls straight out of it. Each token already lives as a vector (its embedding), and a layer projects that vector into three others: a query $q$, a key $k$, and a value $v$. To compute a token's output, you take its query, score it against the key of every token (a dot product, which is large when two vectors point the same way), turn those scores into weights with a softmax (which rescales a list of numbers into positive weights that sum to one), and return the weighted average of the values. That is the whole operation:

$$\text{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

The $\sqrt{d_k}$ in the denominator (with $d_k$ the dimension of a key) is just a scaling constant. Without it, in high dimensions the dot products grow large, the softmax puts almost all the weight on one token, and gradients vanish. Dividing by $\sqrt{d_k}$ keeps the scores around unit size. That detail is from the original Transformer paper, and it is the only piece of (1) we will lean on.
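As a rough illustration of (1) for a single query, here is a minimal pure-Python sketch. The vectors are made-up toy numbers and the function names `softmax` and `attend` are illustrative, not taken from the paper or its reference code.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for one query against lists of keys and values."""
    d_k = len(q)
    # Dot product of the query with every key, divided by sqrt(d_k) as in (1).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    # Weighted average of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, out

# Toy data (assumed for illustration): one query, three keys, three values.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = attend(q, keys, values)
print(weights, out)
```

The key pointing the same way as the query gets the largest weight, and the key pointing the opposite way gets the smallest.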

Consider what happens at generation. When the model produces token 100, it needs the keys and values of tokens 1 through 99 to attend to them. Those keys and values do not change when a new token arrives, so recomputing them every step would be pure waste. Instead they are computed once and kept. That store is the KV cache: every token's key and value, held in memory so each new token can attend to all of them without redoing the work. The query is the one piece never cached, and here is the asymmetry: each past token's key and value depend only on that token and never change, so they can be stored, while the query belongs to the new token being generated, fresh each step, used once against all the stored keys and then discarded.
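The bookkeeping can be sketched in a few lines. This is a toy under stated assumptions: `KVCache` and `project` are hypothetical names, and `project` is a stand-in for the learned key/value projections, not the model's real ones.

```python
class KVCache:
    """Toy per-layer cache: one (key, value) row per token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def project(x):
    # Hypothetical stand-in for the learned key and value projections.
    return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]

cache = KVCache()
rows_read = []  # how many cached rows each decode step has to read
tokens = [[0.1 * t, 1.0] for t in range(5)]  # five made-up token vectors

for x in tokens:
    k, v = project(x)             # computed once per token, never again
    cache.append(k, v)
    rows_read.append(len(cache))  # the new token attends over every stored row

print(rows_read)
```

Each step performs one projection but reads every row stored so far, which is the growing cost the rest of the section measures.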

The cache grows by one row per token, and every new token has to read all of it. Drag the position below and watch both happen at once: the cache climbing, and the current token reaching back across all of it.

Figure 1 · the growing KV cache

Each generated token caches its key and value so they are never recomputed. The current token then attends to the whole cache. So the store, and the memory read per step, climb with the sequence length. That climb is the cost Mistral is built to flatten.

How much memory is this really? Per token, per layer, the cache holds two vectors (a key and a value), one for each attention head, each $d_{\text{head}}$ numbers wide. Across the stack:

$$\text{cache per token} = \underbrace{2}_{K,\,V}\times n_{\text{kv}}\times d_{\text{head}}\times n_{\text{layers}} \tag{2}$$

Put in plain multi-head numbers (a key and value for all $32$ heads), that is:

$$2 \times 32 \times 128 \times 32 \approx 262{,}000 \text{ numbers per token}$$

about half a megabyte in 16-bit precision. A single 32,000-token context would then need roughly $16$ GB of cache, on top of the weights, for one sequence. That is the memory problem. And the worst of it is that generating each token does very little arithmetic (one new token's worth) while having to read that entire cache back from memory. Decoding is memory-bandwidth bound: the GPU spends its time waiting on memory, not computing. So shrinking the cache buys speed directly, and it lets more sequences share a GPU. Each of Mistral's ideas attacks a part of this cache: grouped-query attention shrinks $n_{\text{kv}}$, the sliding window and rolling buffer cap how many tokens are kept, and chunked pre-fill holds down the cost of filling it from a long prompt.
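The arithmetic above is easy to reproduce. A small sketch, using the configuration values listed earlier; the function name `kv_cache_bytes` and the 2-bytes-per-number default (16-bit precision) are my own framing of equation (2).

```python
def kv_cache_bytes(seq_len, n_kv_heads=32, head_dim=128, n_layers=32, bytes_per_number=2):
    """Bytes of KV cache for one sequence: equation (2) times length and precision."""
    numbers_per_token = 2 * n_kv_heads * head_dim * n_layers  # 2 = one key + one value
    return seq_len * numbers_per_token * bytes_per_number

per_token = kv_cache_bytes(1)          # plain multi-head, 16-bit
full_context = kv_cache_bytes(32_000)  # one 32,000-token sequence

print(per_token, "bytes per token")
print(round(full_context / 1e9, 1), "GB for 32,000 tokens")
```

Passing `n_kv_heads=8` gives the grouped-query figure, a quarter of the multi-head one.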

Check the bound yourself below: slide the cache length and the bytes and the FLOPs climb together, so the ratio between them never moves.

Figure 2 · why decoding waits on memory

One decode step, in the model's own numbers from (2), at 16-bit precision. Per cached token, attention does 32 q-heads × 512 × 32 layers ≈ 0.52 MFLOP and streams 2 × 32 × 128 × 32 × 2 B ≈ 0.52 MB, exactly 1 FLOP per byte. Every kernel has an arithmetic intensity (its FLOPs per byte of memory traffic); a GPU needs on the order of 150 FLOPs per byte before arithmetic, not memory, becomes the limit (a typical compute-to-bandwidth ratio, not a specific card's spec). Since this step does 1 FLOP per byte, far below 150, it is memory-bound at every cache size. GQA cuts the bytes 4x and lifts the intensity to 4, still far below the line. The weights are a further fixed read on top, every step.
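The FLOP-per-byte ratio in the caption can be checked with a short sketch. It assumes the caption's 512 is 4 × head_dim (a dot product with the key plus a weighted add of the value, two FLOPs per multiply-add); that reading, and the name `decode_intensity`, are assumptions rather than something the paper states.

```python
def decode_intensity(n_q_heads=32, n_kv_heads=32, head_dim=128, n_layers=32, bytes_per_number=2):
    """FLOPs per byte for the attention part of one decode step, per cached token."""
    # Per query head: q.k dot product (2 * head_dim FLOPs) plus the weighted
    # accumulation of the value (2 * head_dim FLOPs) = 4 * head_dim.
    flops = n_q_heads * 4 * head_dim * n_layers
    # One key and one value per KV head per layer are streamed from memory.
    bytes_read = 2 * n_kv_heads * head_dim * n_layers * bytes_per_number
    return flops / bytes_read

mha = decode_intensity(n_kv_heads=32)  # plain multi-head
gqa = decode_intensity(n_kv_heads=8)   # Mistral's grouped-query setting
print(mha, gqa)
```

The ratio does not depend on how many tokens are cached, since both the FLOPs and the bytes scale with cache length.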

Grouped-query attention: share the keys and values

The cache is dominated by keys and values stored separately for each head. The standard design, multi-head attention, gives every one of the 32 query heads its own key and value head. That is what makes the heads independent, and it is also what makes the cache big.

At the far end of the spectrum is multi-query attention (Shazeer, 2019, in a paper titled, with a wink at the Transformer, One Write-Head is All You Need). It keeps the 32 query heads but gives them all a single shared key and value head. The cache shrinks by a factor of 32, and decoding gets much faster, since the thing you reload each step is now tiny. The cost is some quality, because one shared key/value is a real constraint on what the heads can do.

Grouped-query attention (Ainslie et al., 2023) is the sensible middle. Divide the query heads into $G$ groups, and let each group share one key/value head. $G = 1$ is multi-query; $G = 32$ (one per head) is back to multi-head. Mistral picks $G = 8$: eight key/value heads for thirty-two query heads, so four query heads share each key/value head. The cache drops by a factor of four (the roughly 16 GB above becomes about 4 GB), while quality stays close to full multi-head.

There are 32 readers (the query heads). Multi-head gives each reader a private copy of the reference book. Multi-query makes all 32 share one copy. Grouped-query keeps eight copies, one per table of four. The reason sharing is cheap in quality lives in how the two sides differ: the queries are the side asking diverse questions, the keys and values are the reference being consulted, and many different questions can be asked of one reference, so shrinking the reference 4x shrinks the cache 4x while the questions stay as varied as ever. Toggle the three settings and watch the wiring collapse and the cache bar shrink:

Figure 3 · grouped-query attention

32 query heads wired to a smaller number of key/value heads. Multi-head keeps 32 (one each); multi-query collapses to a single shared one; Mistral's grouped-query keeps 8, so each is shared by 4 queries. The cache scales with the number of key/value heads, so 8 is a 4x smaller cache than 32.

Grouped-query attention is not multi-query attention. The query heads are untouched, all 32 of them. Only the keys and values are shared, and Mistral shares them in groups of four, for a 4x cache reduction. It is not the 32x reduction of multi-query, and calling Mistral's attention "multi-query" gets both the mechanism and the number wrong.
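The wiring described in this section can be written as a one-line index map. This sketch assumes contiguous grouping (query heads 0 to 3 share key/value head 0, and so on); the text only fixes the group size, so the exact assignment and the name `kv_head_for` are assumptions.

```python
def kv_head_for(query_head, n_heads=32, n_kv_heads=8):
    """Index of the key/value head a query head reads (contiguous groups assumed)."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size

# Collect which query heads share each key/value head in Mistral's setting.
groups = {}
for h in range(32):
    groups.setdefault(kv_head_for(h), []).append(h)

print(len(groups), "KV heads;", "KV head 0 serves query heads", groups[0])
```

Setting `n_kv_heads=1` maps every query head to the same key/value head (multi-query), and `n_kv_heads=32` maps each query head to its own (multi-head).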

Sliding-window attention: stop looking at everything

Grouped-query attention shrank the cache by a constant factor. It did nothing about the other problem: the cache still grows with the length of the sequence, because in ordinary attention every token attends to every earlier token. Double the context and you double the cache. The growth is linear in length per layer, and the compute is quadratic. That is the part that makes long contexts expensive.

Mistral borrows a move older than itself, from work on long-sequence transformers like Sparse Transformers and Longformer. Instead of letting a token attend to the entire past, let it attend only to a fixed window of the most recent $W$ tokens. Mistral uses $W = 4096$. Per layer, the work per token and the cache are now bounded by $W$, not by the sequence length. (The clean "linear in length, window $W$" framing is Longformer's; the Mistral paper only says vanilla attention is quadratic and that the window reduces cost.)

The obvious worry: if a token can only see 4096 back, hasn't the model lost long-range context? It hasn't lost the long-range view, and the reason is depth. Information does not have to cover the full distance in one layer. A token's output at layer 1 already summarizes its window; at layer 2 it attends to neighbors that have summarized their windows; and so on up the stack. Each layer extends the reach by another $W$. Information two windows away arrives second-hand, already digested into a neighbor's layer-1 summary, so $k$ layers buy $k$ windows of reach, which is the widening cone Figures 4 and 5 draw. This is how stacked convolutions build a wide receptive field (the span of input that can affect one output) out of small kernels. After $k$ layers a token can be influenced by tokens far outside any single window:

$$\text{reach after } k \text{ layers} \;\approx\; k\,W \;=\; 32 \times 4096 \;=\; 131{,}072 \text{ tokens} \tag{3}$$
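To see why (3) is an approximation and not an exact count, here is a small sketch that walks the reach down the stack. It assumes a window that includes the current token (so one layer reads back $W - 1$ earlier positions); that convention and the name `reach_after` are assumptions made for illustration.

```python
def reach_after(k_layers, window):
    """How many positions back the top layer's output can depend on.

    Assumes each layer lets position i read positions i-window+1 .. i
    (itself included), so one layer adds window-1 positions of reach.
    """
    earliest = 0  # offset of the earliest influencing token, relative to the output
    for _ in range(k_layers):
        earliest -= window - 1
    return -earliest

toy = reach_after(k_layers=3, window=4)       # small enough to check by hand
full = reach_after(k_layers=32, window=4096)  # Mistral's depth and window
print(toy, full, 32 * 4096)
```

Under this convention the exact figure sits slightly below $32 \times 4096$, which is why the headline number is best read as "about $kW$".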

Pick an output token and a window below, and compare the two regimes. Full attention floods all earlier tokens in a single layer. The window reaches only $W$ per layer, but the dependency cone widens as you go down the stack, until the deepest layer is reachable from far away:

Figure 4 · sliding window vs full attention

Rows are layers (input at the bottom, output at the top); columns are positions. With sliding-window attention each layer reaches only $W$ back, but stacking widens the cone by $W$ per layer, so a deep output depends on tokens far outside any single window. Full attention reaches everything in one layer, at the cost of an unbounded cache.

Figure 4 fixed the depth at six layers so the window mechanic was easy to read. The actual model stacks $32$. So push the depth itself and watch where the cone reaches: at $k = 1$ there is no propagation and only the literal $W$ tokens are in range; at $k = 32$ the cone has widened to the paper's headline number, $32 \times 4096 = 131{,}072$.

Figure 5 · the information cone: how depth buys reach

The cone of tokens that can in principle reach the target, drawn as a function of stack depth. The target sits at the bottom-right; each band above it is one layer; layer $\ell$ reaches $\ell\,W$ tokens back. Slide $k$ from 1 to 32 and the readout climbs from 4096 to 131,072. This is the theoretical maximum (it assumes each layer's attention is uniform within its window). Real models attenuate over distance, so the cone shows the upper-bound geometry, not what gets used in practice.

Be exact about that $131$K number. It is a theoretical receptive field, the farthest a signal could in principle travel through 32 stacked windows. It is not a context window you can fill. Mistral 7B was trained at a sequence length of 8192. And the window, $W = 4096$, is the per-layer attention radius, a different quantity from both the 8192 context and the 4096 model width that happens to share its digits.

Shrink the window to a toy $W = 4$ just to see it on a page (the real Mistral window is 4096; the arithmetic is identical, only the count changes), and number the tokens of a sentence: [0:the 1:fox 2:ran 3:past 4:the 5:dog]. Generate token 5, dog. Its window is the four most recent positions including itself, positions $2, 3, 4, 5$, so token 5 reads the keys and values of ran past the dog straight from the cache. Positions 0 and 1, the fox, fall outside the window: they are $5 - 0 = 5$ and $5 - 1 = 4$ steps back, both at or past the radius of 4, so this layer never reads them and the rolling buffer has already overwritten their slots. The output for dog is then a single softmax-weighted blend of just the four in-window values: a score against each of the four in-window keys, turned into weights that sum to one, with the fox contributing exactly zero at this layer. The information in those dropped tokens is not gone from the model; it reaches dog only second-hand, already folded into the layer-below summaries of the tokens that are in the window, which is the cone from Figure 4 doing its work.
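The toy example can be replayed directly. A minimal sketch, using the same six-token sentence and $W = 4$; the helper name `window_positions` is hypothetical.

```python
def window_positions(i, window):
    """Positions token i attends to in one layer: the most recent `window`, itself included."""
    start = max(0, i - window + 1)
    return list(range(start, i + 1))

tokens = ["the", "fox", "ran", "past", "the", "dog"]
visible = window_positions(5, window=4)
words = [tokens[p] for p in visible]
dropped = [tokens[p] for p in range(len(tokens)) if p not in visible]

print("in window:", visible, words)
print("outside:  ", dropped)
```

For early positions the `max(0, ...)` clamp simply returns everything seen so far, which matches the buffer filling up before it starts to wrap.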

The rolling buffer cache

The window also saves memory, and that saving follows from a single fact. If no token ever attends past $W$ positions back, then the cache never needs to hold more than $W$ entries. Everything older is dead weight, because no future token will ever read it.

So the cache is built as a fixed array of exactly $W$ slots, a rolling buffer. Token $i$ is written to slot $i \bmod W$. While $i < W$ the buffer fills up normally. Once $i$ reaches $W$, the write wraps back to the start and overwrites the oldest entry, which is exactly the one that just fell out of every token's window. The cache stops growing and sits at a fixed size forever:

$$\text{slot}(i) = i \bmod W \tag{4}$$

# decode step i, with sliding window W
k_i, v_i     = project(x_i)         # this token's key + value
cache[i % W] = (k_i, v_i)           # write, wrapping around
window       = recent(cache, W)     # the most recent W entries
scores       = q_i @ window.keys.T / sqrt(d_k)   # scaled as in (1)
attn         = softmax(scores) @ window.values
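A runnable toy version of the buffer in (4), with $W = 4$ so the wrap is visible. The class name `RollingKVCache` and the string placeholders for keys and values are illustrative assumptions; the reference implementation stores tensors and is organized differently.

```python
class RollingKVCache:
    """Fixed-size cache of `window` slots; token i is written to slot i % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window  # each slot holds (position, key, value)

    def write(self, i, k, v):
        # Past the first `window` tokens this overwrites the oldest entry.
        self.slots[i % self.window] = (i, k, v)

    def recent(self):
        """Entries currently held, oldest first."""
        held = [s for s in self.slots if s is not None]
        return sorted(held, key=lambda s: s[0])

cache = RollingKVCache(window=4)
for i in range(6):  # positions 0..5
    cache.write(i, k=f"k{i}", v=f"v{i}")

positions_by_slot = [s[0] for s in cache.slots]
positions_in_order = [s[0] for s in cache.recent()]
print(positions_by_slot, positions_in_order)
```

After six writes, positions 0 and 1 have been overwritten by 4 and 5, and the buffer still holds exactly four entries.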

Advance the position below and watch the write head wrap around at $W$, overwriting the oldest token while the size stays pinned:

Figure 6 · the rolling buffer

A fixed array of $W$ slots. Token $i$ writes to slot $i \bmod W$. Past $W$, the write head wraps and overwrites the oldest entry, which has already left every window, so the cache size never grows.

The paper measures the saving directly: on a 32k-token sequence the rolling buffer holds 4096 entries instead of roughly 32,000, an 8x cut in cache memory, with no measured loss in quality. Stacked on grouped-query attention, the two reductions cascade. A naive multi-head cache of about 16 GB becomes about 4 GB with grouped-query, and then about 0.5 GB once the window caps it.
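The cascade can be reproduced in a few lines. This sketch takes "32k" to mean 32 × 1024 tokens and reports sizes in GiB ($2^{30}$ bytes), both of which are assumptions chosen so the numbers come out round; the function names are hypothetical.

```python
def cache_entries(seq_len, window=None):
    """Tokens held per layer: everything seen, or at most `window` with a rolling buffer."""
    return seq_len if window is None else min(seq_len, window)

def cache_gib(seq_len, n_kv_heads, window=None, head_dim=128, n_layers=32, bytes_per_number=2):
    """KV cache size in GiB for one sequence, per equation (2) at 16-bit precision."""
    per_token = 2 * n_kv_heads * head_dim * n_layers * bytes_per_number
    return cache_entries(seq_len, window) * per_token / 2**30

seq_len = 32 * 1024  # "32k", assumed to mean 32,768 tokens
mha = cache_gib(seq_len, n_kv_heads=32)                    # naive multi-head
gqa = cache_gib(seq_len, n_kv_heads=8)                     # grouped-query
gqa_window = cache_gib(seq_len, n_kv_heads=8, window=4096)  # plus rolling buffer
print(mha, gqa, gqa_window)
```

Because `cache_entries` is capped at the window, the last figure stays the same for any sequence longer than 4096 tokens.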

Pre-fill and chunking

The last efficiency detail is about the prompt rather than the generation. Generating is slow because each token depends on the last, so they come out one at a time. But the prompt is different: you have all of it already. There is no reason to feed it through one token at a time. You run the entire prompt through in a single parallel pass and fill the cache from it in one shot. This phase is called pre-fill, and it is why the first generated token can take a moment while the rest stream out quickly.

For a very long prompt, even that single pass can blow the memory budget, so Mistral processes it in chunks the size of the window. What blows the budget is attention itself: a prompt run in one shot materializes attention over its full length at once, a spike that grows with its length, while a window-sized slab attending against the rolling cache keeps the peak bounded no matter how long the prompt is. Each chunk attends to itself, with a causal mask (each token sees itself and earlier tokens, never later ones), and to the cache the earlier chunks left behind. The attention mask for a chunk has three regions, and seeing them is the cleanest summary of how sliding-window attention actually runs. Step through the chunks below:

Figure 7 · pre-fill and chunking

The mask for the current chunk has three parts. It attends to itself with a causal mask (rightmost), to the in-window part of the cache with a sliding-window mask (center), and not at all to tokens that have scrolled out of the window (left). Words from the paper's own example.

That three-region mask is the full mechanism in one picture. A token looks at its own chunk up to itself (causal, so it never peeks ahead), at the slice of older tokens still inside its window (the rolling cache), and at nothing further back. The same rule that bounded the cache also bounds the work of reading a long prompt.
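The three regions fall out of two conditions applied together. Below is a minimal sketch with toy sizes (chunk size and window both 4, third chunk of a 12-token prompt); the function name `chunk_mask` and the self-inclusive window convention are assumptions, not the reference implementation.

```python
def chunk_mask(chunk_start, chunk_len, window):
    """Rows are the chunk's tokens (queries); columns are all positions up to the chunk's end.

    An entry is True where the query may attend: the key is not in the future
    (causal) and lies within the most recent `window` positions, query included.
    """
    total = chunk_start + chunk_len
    mask = []
    for q in range(chunk_start, total):
        row = [(k <= q) and (q - k < window) for k in range(total)]
        mask.append(row)
    return mask

# Toy sizes: third chunk of a 12-token prompt, chunk size = window = 4.
mask = chunk_mask(chunk_start=8, chunk_len=4, window=4)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed rows, the leftmost columns are all dots (scrolled out of the window), the middle columns show the sliding-window band over the cache, and the rightmost four columns form the causal triangle over the chunk itself.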

What it buys

That is the architecture. The benchmarks are why it mattered. Mistral re-ran every benchmark through its own evaluation pipeline (so the numbers differ slightly from the figures in the Llama 2 paper) and compared like for like. Against Llama 2 13B, a model nearly twice its size, Mistral 7B wins on eleven of the twelve listed benchmarks, and the margins on math and code are large. Pick a benchmark and compare the four models:

Figure 8 · the benchmark sweep

Mistral 7B against Llama 2 7B/13B and Code-Llama 7B, numbers verbatim from Table 2. It tops Llama 2 13B nearly everywhere, hugely on GSM8K (52.2 vs 34.3) and MATH (13.1 vs 6.0). The two exceptions show up directly: NaturalQuestions, and code, where the specialist Code-Llama edges it.

The abstract says Mistral beats Llama 2 13B "across all evaluated benchmarks." The paper's own Table 2 has one exception: NaturalQuestions, where Mistral scores 28.8 against Llama 2 13B's 29.0. The gap is 0.2 points, well inside evaluation noise, and the figure in the paper hides it by bucketing NaturalQuestions and TriviaQA together into one "world knowledge" category, where the TriviaQA win carries the bucket. It is still an exception to the word "all." Knowledge is Mistral's relative weak spot, which fits: storing facts takes parameters, and it has fewer of them.

Code is the other place that rewards a close look. Mistral's code scores (HumanEval 30.5, MBPP 47.5) are excellent for a general model, far above Llama 2 13B, but the dedicated Code-Llama 7B still edges it on both. The paper does not overreach here: Mistral "approaches" Code-Llama's coding performance "without sacrificing performance on non-code benchmarks."

To put the efficiency in one figure, Mistral measured an "equivalent model size": how big a Llama 2 would have to be to match Mistral on a given task. On reasoning, comprehension, and MMLU (Massive Multitask Language Understanding, a 57-subject knowledge-and-reasoning test), Mistral 7B performs like a Llama 2 of more than three times its size. On knowledge benchmarks the multiple drops to about 1.9, the parameter limit showing through again. The takeaway is the same either way: this is a small model behaving like a much larger one.

The release also included an instruction-tuned model, Mistral 7B Instruct: the base model trained a little further on examples of following instructions, using only public data and no proprietary tricks. It beats every other 7B chat model on MT-Bench, a benchmark that scores multi-turn chat answers out of ten (it scored 6.84), and is competitive with 13B chat models. On the Chatbot Arena leaderboard, which ranks models by head-to-head human votes, it landed at an Elo rating of 1031 (the chess-style relative score where higher wins more often), above Llama 2 13B Chat's 1012, and in a direct human comparison its answers were preferred 5020 times to Llama 2 13B Chat's 4143.

Scaling in three dimensions

The conclusion is one paragraph long, and it reframes what the exercise was for. For a few years the conversation about scale was effectively two-dimensional: how good a model is, against how much compute it took to train. The Chinchilla scaling laws are the clearest version of that view. For a fixed training budget, they give the best split between model size and training tokens (about 20 tokens per parameter) to reach the lowest loss.

Mistral's point is that this leaves out the axis that decides whether a model is usable in production: inference cost. Chinchilla does not optimize for it. A compute-optimal model is incidentally cheaper to run than an over-large one, but that is a by-product of the objective, not its goal. The real problem, the paper argues, is three-dimensional: model capability, training cost, and inference cost. You can spend more training compute on a smaller model than Chinchilla would call optimal, because the small model is then cheap to run on every token forever, while the extra training is a one-time bill. The sentence the paper leads with is the one to keep: "language models may compress knowledge more than what was previously thought."

This is what ties the architecture to the thesis. Grouped-query attention, the sliding window, the rolling buffer, and chunked pre-fill are all levers on the inference axis. They make a small, heavily trained model practical to serve, which is what made an Apache-licensed 7B model worth releasing. None of the four ideas was new on its own. The contribution is putting them together in a model that beats ones twice its size, and saying plainly that inference cost is a first-class scaling dimension. For a model you deploy, the deciding question is how cheap it is to run, and most of that cost is the attention cache.

Provenance: verified against primary literature

GQA (Ainslie et al., 2023): Groups of query heads share one key/value head. The middle ground between multi-head and multi-query attention.

MQA (Shazeer, 2019): A single shared key/value head. Decoding is memory-bandwidth bound, so a smaller cache is faster.

Local attention (Longformer, 2020): A window is linear in length, and the receptive field grows with depth, the way it does in stacked convolutions.

LLaMA family (Touvron et al., 2023): RoPE, SwiGLU, RMSNorm. Inherited by Mistral. The Mistral paper itself does not spell them out.

Chinchilla (Hoffmann et al., 2022): Compute-optimal training, about 20 tokens per parameter. The two-axis scaling Mistral extends to a third, inference.

Correction: The abstract says Mistral beats Llama 2 13B on every evaluated benchmark, but the paper’s own Table 2 puts it 0.2 points behind on NaturalQuestions (28.8 vs 29.0), the lone exception of twelve. We report the gap rather than repeat “all.”

Questions you might still have

If each layer only looks at 4096 tokens, how does Mistral use a longer context?
Depth. Stacking 32 windowed layers compounds the reach to about 32 x 4096 = 131K tokens, because each layer can pull in information another window deep. The window is a per-layer radius, not the context limit (the model is trained at length 8192).

Is grouped-query attention just multi-query attention?
No. Multi-query uses a single shared key/value head, shrinking the cache about 32x but costing some quality. Grouped-query uses several (Mistral keeps 8 for its 32 query heads), shrinking the cache 4x while staying close to full multi-head quality.

Does the sliding window throw away quality?
The paper reports the rolling buffer cuts cache memory 8x at 32k tokens with no measured quality drop, and Mistral still tops Llama 2 13B almost everywhere. Local attention plus depth recovers the long-range information a single wide layer would have carried.

Did Mistral invent any of this?
No single piece. Grouped-query attention, local/sliding-window attention, and the KV-cache tricks all predate it, and the rest of the model is Llama. The contribution is assembling them into a small model that is both strong and cheap, and showing it beats ones nearly twice its size.

Footnotes & further reading

The paper: Jiang, Sablayrolles, Mensch, et al., Mistral 7B (Mistral AI, 2023). Reference code.
Grouped-query attention: Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023). Multi-query attention: Shazeer, Fast Transformer Decoding (2019).
Sliding-window and local attention: Child et al., Generating Long Sequences with Sparse Transformers (2019), and Beltagy et al., Longformer (2020), source of the linear-cost and receptive-field framing.
The Llama-inherited ingredients (not detailed in the Mistral paper): Touvron et al., LLaMA (2023); RoPE (Su et al., 2021), SwiGLU (Shazeer, 2020), RMSNorm (Zhang & Sennrich, 2019).
Scaled dot-product attention and the $\sqrt{d_k}$ scaling: Vaswani et al., Attention Is All You Need (2017).
The two-axis scaling Mistral extends: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, 2022). The kernels behind the 2x speedup the paper reports for windowed attention over a vanilla attention baseline: FlashAttention (Dao et al., 2022) and xFormers.
