Mistral 7B
A 7B model that beats a 13B, by fixing attention.
Mistral 7B is, under the hood, an ordinary Llama-style transformer. What makes it punch above its weight is a handful of changes to attention that make it cheap to run. This is an efficiency paper, and the savings all land in one place: the KV cache.
How does a 7-billion-parameter model keep up with one nearly twice its size? Mostly by being cheaper to run.
In late 2023 a small team at Mistral AI released a 7-billion-parameter language model that did something the size charts said it should not. Mistral 7B beat Llama 2 13B, a model with almost twice as many parameters, on nearly every standard benchmark, and it matched or beat Llama 1 34B on reasoning, mathematics, and code. It shipped under an Apache 2.0 license, which meant anyone could use it for anything, and for a while it became the default open model people reached for.
The surprising part is how little of the paper is about the model itself. There is no secret training recipe in it, no exotic new layer, no claim about the data. Mistral 7B is a fairly standard transformer in the Llama lineage. It uses the same now-ordinary ingredients Llama popularized: rotary positional embeddings (RoPE), a SwiGLU feed-forward block, and RMSNorm. (The paper does not actually walk through those three. It treats them as settled and inherited, and so will we. They are confirmed by the reference implementation, not by the text.) What the paper actually details is narrower: how to make the model cheap to run.
That emphasis is the whole point, and it is worth saying plainly before any of the mechanism. For a model you train once and then deploy, the bill that never stops arriving is the inference bill. You pay it on every token you ever generate, for every user, forever. And almost all of that cost, at the long context lengths people actually want, is memory traffic in the attention layer rather than arithmetic. The bottleneck is the reading and writing of something called the KV cache. Mistral 7B is four ideas about that cache: grouped-query attention, a sliding window, a rolling buffer, and chunked pre-fill. Understand those and you have the paper.
The cost that matters is inference
A language model generates text one token at a time. To produce the next word it runs a full forward pass through all its layers (one trip through the network, input to output), appends the new token, and runs another full pass for the word after that. This is what autoregressive means: each token is conditioned on every token before it, so they have to be produced in sequence, one pass each.
Two things get loaded from memory on every one of those passes. The first is the model's weights, which is a fixed cost set by the parameter count. The second grows: the running memory of everything said so far, held in the attention layers. At a short prompt this second cost is nothing. At a long one it dominates, and it is the part Mistral attacks. The model has to be accurate and keep that growing cost small, so a 7B model can serve long contexts on hardware that would choke on a larger one.
Most of the architecture is unremarkable; the two lines that carry the paper are n_kv_heads and window_size. The full configuration, so the numbers later have somewhere to land:
dim = 4096 # model width (d_model)
n_layers = 32 # transformer blocks stacked
n_heads = 32 # query heads
n_kv_heads = 8 # key/value heads -> GQA, groups of 4
head_dim = 128 # per-head width (32 x 128 = 4096)
hidden_dim = 14336 # SwiGLU feed-forward inner width
window_size = 4096 # sliding-window attention radius (W)
context_len = 8192 # trained sequence length
vocab_size = 32000 # SentencePiece tokens
One trap to flag up front, because the same number shows up twice for unrelated reasons. dim = 4096 is the model's width, the length of each token's vector. window_size = 4096 is how far back attention reaches. They are equal by coincidence and mean completely different things. And neither is the context length, which is 8192. Keep the three apart and the rest is clear.
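As a sanity check on the configuration, the parameter count can be rebuilt from those numbers alone. This is a minimal sketch, and it rests on assumptions the article does not state: no bias terms, separate (untied) input and output embedding tables, and two RMSNorm weight vectors per block plus a final one. The function name `count_params` is made up for illustration.

```python
# Parameter count implied by the configuration above.
# Assumptions (not stated in the article): no biases, untied input/output
# embeddings, two RMSNorm weight vectors per block plus one final norm.
dim, n_layers, n_heads, n_kv_heads = 4096, 32, 32, 8
head_dim, hidden_dim, vocab_size = 128, 14336, 32000

def count_params() -> int:
    q_proj = dim * n_heads * head_dim           # queries: all 32 heads
    kv_proj = 2 * dim * n_kv_heads * head_dim   # keys + values: only 8 heads
    o_proj = n_heads * head_dim * dim           # attention output projection
    ffn = 3 * dim * hidden_dim                  # SwiGLU uses three matrices
    norms = 2 * dim                             # two RMSNorm weights per block
    per_layer = q_proj + kv_proj + o_proj + ffn + norms
    embeddings = 2 * vocab_size * dim           # input table + output head
    return n_layers * per_layer + embeddings + dim  # + final RMSNorm

total = count_params()
print(f"{total:,} parameters ({total / 1e9:.2f}B)")
```

Under those assumptions the feed-forward blocks hold most of the weights, and the key/value projections are a quarter the size of the query projection, which is the n_kv_heads line showing up in the weights as well as in the cache.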
Attention, and the KV cache
Start with what attention does on a single layer, because the cache falls straight out of it. Each token already lives as a vector (its embedding), and a layer projects that vector into three others: a query $q$, a key $k$, and a value $v$. To compute a token's output, you take its query, score it against the key of every token (a dot product, which is large when two vectors point the same way), turn those scores into weights with a softmax (which rescales a list of numbers into positive weights that sum to one), and return the weighted average of the values. That is the whole operation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{1}$$
The $\sqrt{d_k}$ in the denominator (with $d_k$ the dimension of a key) is just a scaling constant. Without it, in high dimensions the dot products grow large, the softmax puts almost all the weight on one token, and gradients vanish. Dividing by $\sqrt{d_k}$ keeps the scores around unit size. That detail is from the original Transformer paper, and it is the only piece of (1) we will lean on.
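The operation in (1) is short enough to write out. The sketch below uses NumPy and random stand-in vectors; the function names `softmax` and `attention` and the shapes are illustrative choices, not taken from the paper or the reference code.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    # q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # dot products, scaled to about unit size
    weights = softmax(scores)         # positive, each row sums to one
    return weights @ v, weights       # weighted average of the values

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 128))     # one query
k = rng.standard_normal((10, 128))    # ten keys
v = rng.standard_normal((10, 128))    # ten values
out, weights = attention(q, k, v)
print(out.shape, weights.sum())
```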
Now watch what happens at generation. When the model produces token 100, it needs the keys and values of tokens 1 through 99 to attend to them. Those keys and values do not change when a new token arrives, so recomputing them every step would be pure waste. Instead they are computed once and kept. That store is the KV cache: every token's key and value, held in memory so each new token can attend to all of them without redoing the work. The query is the one piece never cached, and the asymmetry is the point: each past token's key and value depend only on that token and never change, so they are computed once and kept, while the query belongs to the new token being generated, fresh each step, used once against all the stored keys and then discarded.
The cache grows by one row per token, and every new token has to read the whole thing.
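To make the caching argument concrete, here is a toy single-head decoder that produces each output twice: once by re-projecting every past token, and once by appending one key/value row to a cache. The sizes and the random projection matrices `Wq`, `Wk`, `Wv` are hypothetical; the only point is that the two routes agree.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 16                                    # toy head width (hypothetical)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
xs = rng.standard_normal((6, d))          # six token vectors

def recompute(t):
    # No cache: re-project every token up to t on every step.
    k, v = xs[: t + 1] @ Wk, xs[: t + 1] @ Wv
    q = xs[t] @ Wq
    return softmax(q @ k.T / np.sqrt(d)) @ v

k_cache, v_cache, cached_out = [], [], []
for t in range(len(xs)):
    # With a cache: project only the new token, append one row, read all rows.
    k_cache.append(xs[t] @ Wk)
    v_cache.append(xs[t] @ Wv)
    q = xs[t] @ Wq                        # the query is used once, never stored
    k, v = np.stack(k_cache), np.stack(v_cache)
    cached_out.append(softmax(q @ k.T / np.sqrt(d)) @ v)

assert all(np.allclose(cached_out[t], recompute(t)) for t in range(len(xs)))
print(len(k_cache), "rows cached after", len(xs), "tokens")
```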
How much memory is this really? Per token, per layer, the cache holds two vectors (a key and a value), one for each attention head, each $d_{\text{head}}$ numbers wide. Across the stack:
$$\text{cache bytes per token} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv}} \cdot d_{\text{head}} \cdot \text{bytes per number}$$
Put in plain multi-head numbers (a key and value for all heads), that is:
$$2 \times 32 \times 32 \times 128 \times 2\ \text{bytes} = 524{,}288\ \text{bytes per token},$$
about half a megabyte in 16-bit precision. A single 32,000-token context would then need roughly 16 GB of cache, on top of the weights, for one sequence. That is the wall. And the worst of it is that generating each token does very little arithmetic (one new token's worth) while having to read that entire cache back from memory. Decoding is memory-bandwidth bound: the GPU spends its time waiting on memory, not computing. So shrinking the cache is not a side optimization. It buys speed directly, and it lets more sequences share a GPU. Each of Mistral's ideas attacks a part of this cache: grouped-query attention shrinks $n_{\text{kv}}$, the number of key/value heads, the sliding window and rolling buffer cap how many tokens are kept, and chunked pre-fill holds down the cost of filling it from a long prompt.
The bound holds at every length: as the cache grows, the bytes read and the FLOPs performed per decode step climb together, so the ratio between them never moves.
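The arithmetic behind both claims fits in a few lines. This is a rough sketch, not a profiler: it counts only the attention step of decoding, treats every multiply-add as two FLOPs, and defaults to the plain multi-head numbers used above. The function names are made up for illustration.

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_number=2):
    # 2 = one key vector plus one value vector per head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_number

def decode_step_cost(cache_len, n_layers=32, n_heads=32, n_kv_heads=32,
                     head_dim=128, bytes_per_number=2):
    # Bytes: the whole cache is read once per generated token.
    bytes_read = cache_len * kv_bytes_per_token(
        n_layers, n_kv_heads, head_dim, bytes_per_number)
    # FLOPs (attention only): per query head, one multiply-add per cached
    # number for the scores and another for the weighted sum of values.
    flops = n_layers * n_heads * cache_len * head_dim * 2 * 2
    return bytes_read, flops

per_token = kv_bytes_per_token()
print(per_token)                     # 524288 bytes, about half a megabyte
print(32_000 * per_token / 1e9)      # about 16.8 GB for a 32,000-token context

for cache_len in (1_000, 8_000, 32_000):
    b, f = decode_step_cost(cache_len)
    print(cache_len, f / b)          # FLOPs per byte is the same at every length
```

With these multi-head numbers the ratio comes out at one FLOP per byte read, far below what a GPU needs to keep its arithmetic units busy, which is what memory-bandwidth bound means in practice.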
Grouped-query attention: share the keys and values
The cache is dominated by keys and values stored separately for each head. The standard design, multi-head attention, gives every one of the 32 query heads its own key and value head. That is what makes the heads independent, and it is also what makes the cache big.
At the far end of the spectrum is multi-query attention (Shazeer, 2019, in a paper titled, with a wink at the Transformer, One Write-Head is All You Need). It keeps the 32 query heads but gives them all a single shared key and value head. The cache shrinks by a factor of 32, and decoding gets much faster, since the thing you reload each step is now tiny. The cost is some quality, because one shared key/value is a real constraint on what the heads can do.
Grouped-query attention (Ainslie et al., 2023) is the sensible middle. Divide the query heads into $G$ groups, and let each group share one key/value head. $G = 1$ is multi-query; $G = 32$ (one per head) is back to multi-head. Mistral picks $G = 8$: eight key/value heads for thirty-two query heads, so four query heads share each key/value head. The cache drops by a factor of four (the 16 GB above becomes 4 GB), while quality stays close to full multi-head.
There are 32 readers (the query heads). Multi-head gives each reader a private copy of the reference book. Multi-query makes all 32 share one copy. Grouped-query keeps eight copies, one per table of four. The reason sharing is cheap in quality lives in how the two sides differ: the queries are the side asking diverse questions, the keys and values are the reference being consulted, and many different questions can be asked of one reference, so shrinking the reference 4x shrinks the cache 4x while the questions stay as varied as ever.
One thing to be careful about, since it is the most common way to misread this. Grouped-query attention is not multi-query attention. The query heads are untouched, all 32 of them. Only the keys and values are shared, and Mistral shares them in groups of four, for a 4x cache reduction. It is not the 32x reduction of multi-query, and calling Mistral's attention "multi-query" gets both the mechanism and the number wrong.
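The sharing rule is easy to see in code. The sketch below is a plain NumPy illustration, not the reference implementation: it keeps 32 query heads, stores only 8 key/value heads, and lets query head h read key/value head h // 4 by repeating each stored head four times at compute time. Shapes and names are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v):
    # q: (n_heads, T, d); k, v: (n_kv_heads, T, d), n_heads divisible by n_kv_heads
    n_heads, T, d = q.shape
    group = n_heads // k.shape[0]         # query heads per key/value head
    # Share by repetition: query head h reads key/value head h // group.
    k_full = np.repeat(k, group, axis=0)
    v_full = np.repeat(v, group, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)   # no peeking at later tokens
    return softmax(scores) @ v_full

rng = np.random.default_rng(0)
n_heads, n_kv_heads, T, d = 32, 8, 5, 128
q = rng.standard_normal((n_heads, T, d))
k = rng.standard_normal((n_kv_heads, T, d))
v = rng.standard_normal((n_kv_heads, T, d))
out = grouped_query_attention(q, k, v)
print(out.shape)                    # (32, 5, 128): every query head keeps its own output
print(k.size / (n_heads * T * d))   # 0.25: stored keys are a quarter of the multi-head size
```

Only `k` and `v` would live in the cache; the repeated copies exist just for the matrix product, so the memory saving is the full 4x even though the attention arithmetic is unchanged.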
Sliding-window attention: stop looking at everything
Grouped-query attention shrank the cache by a constant factor. It did nothing about the other problem: the cache still grows with the length of the sequence, because in ordinary attention every token attends to every earlier token. Double the context and you double the cache. The growth is linear in length per layer, and the compute is quadratic. That is the part that makes long contexts expensive.
The fix is older than Mistral and comes from work on long-sequence transformers like Sparse Transformers and Longformer. Instead of letting a token attend to the entire past, let it attend only to a fixed window of the most recent tokens. Mistral uses $W = 4096$. Per layer, the work and the cache are now bounded by $W$, not by the sequence length. (The clean "linear in length, window $W$" framing is Longformer's; the Mistral paper only says vanilla attention is quadratic and that the window reduces cost.)
The obvious worry: if a token can only see 4096 back, hasn't the model lost long-range context? It hasn't lost the long-range view, and the reason is depth. Information does not have to make the whole jump in one layer. A token's output at layer 1 already summarizes its window; at layer 2 it attends to neighbors that have summarized their windows; and so on up the stack. Each layer extends the reach by another $W$. Information two windows away arrives second-hand, already digested into a neighbor's layer-1 summary, so $k$ layers buy $k$ windows of reach, which is the widening cone Figures 4 and 5 draw. This is how stacked convolutions build a wide receptive field (the span of input that can affect one output) out of small kernels. After $k$ layers a token can be influenced by tokens far outside any single window:
$$\text{reach after } k \text{ layers} \approx k \times W$$
Figure 4 compares the two regimes for a chosen output token and window. Full attention floods the whole past in a single layer. The window reaches only $W$ per layer, but the dependency cone widens as you go down the stack, until the deepest layer is reachable from far away.
Figure 4 fixed the depth at six layers so the window mechanic was easy to read. The actual model stacks 32. Figure 5 varies the depth itself to show where the cone reaches: at $k = 1$ there is no propagation and only the literal $W = 4096$ tokens are in range; at $k = 32$ the cone has widened to the paper's headline number, $32 \times 4096 = 131{,}072$, or about 131K.
That 131K number is the most misread fact in the paper, so be exact about it. It is a theoretical receptive field, the farthest a signal could in principle travel through 32 stacked windows. It is not a context window you can fill. Mistral 7B was trained at a sequence length of 8192. And the window, $W = 4096$, is the per-layer attention radius, a different quantity from both the 8192 context and the 4096 model width that happens to share its digits.
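The receptive-field claim can be checked mechanically on a toy scale by composing the one-layer window mask with itself. The sketch below assumes the convention that a window of W counts the current token, so one layer reaches W - 1 positions back; under that convention the exact reach after k layers is k(W - 1), which the k x W figure rounds up. The helper names and the toy sizes are illustrative.

```python
import numpy as np

def window_mask(n_tokens: int, W: int) -> np.ndarray:
    # mask[i, j] is True when token i may attend to token j in one layer:
    # j is not in the future, and j is among the W most recent tokens.
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (i - j < W)

def reach_after(k_layers: int, n_tokens: int, W: int) -> int:
    # Compose the one-layer mask k times, then measure how far back the
    # last token's dependencies extend.
    one = window_mask(n_tokens, W).astype(np.int64)
    reach = np.eye(n_tokens, dtype=np.int64)
    for _ in range(k_layers):
        reach = np.minimum(reach @ one, 1)    # boolean matrix product
    last = n_tokens - 1
    earliest = int(np.argmax(reach[last] > 0))
    return last - earliest

W, n_tokens = 4, 64
for k in (1, 2, 6):
    print(k, reach_after(k, n_tokens, W))     # grows by W - 1 = 3 per layer
```

The same bookkeeping at full scale gives 32 x 4095 = 131,040 positions back, which is the roughly 131K receptive field. Nothing here says the model uses that reach well; it only shows the path exists.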
One step traced makes the whole rule concrete. Shrink the window to a toy $W = 4$ just to see it on a page (the real Mistral window is 4096; the arithmetic is identical, only the count changes), and number the tokens of a sentence: [0:the 1:fox 2:ran 3:past 4:the 5:dog]. Generate token 5, dog. Its window is the four most recent positions including itself, positions 2, 3, 4, 5, so token 5 reads the keys and values of ran past the dog straight from the cache. Positions 0 and 1, the fox, fall outside the window: they are 5 and 4 steps back, both at or past the radius of 4, so this layer never reads them and the rolling buffer has already overwritten their slots. The output for dog is then a single softmax-weighted blend of just the four in-window values, a score against each of the four in-window keys turned into weights that sum to one, with the fox contributing exactly zero at this layer. The information in those dropped tokens is not gone from the model. It reaches dog only second-hand, already folded into the layer-below summaries of the tokens that are in the window, which is the cone from Figure 4 doing its work.
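The same trace, written as a few lines of Python. The window rule is the one from the example; the attention scores are random stand-ins, since the point is only which positions receive weight and which receive exactly zero.

```python
import numpy as np

tokens = ["the", "fox", "ran", "past", "the", "dog"]
W = 4        # toy window from the example; the real model uses 4096
i = 5        # generating position 5, "dog"

visible = [j for j in range(len(tokens)) if j <= i and i - j < W]
dropped = [(j, i - j) for j in range(len(tokens)) if j <= i and i - j >= W]
print([tokens[j] for j in visible])    # ['ran', 'past', 'the', 'dog']
print(dropped)                         # [(0, 5), (1, 4)]: (position, steps back)

# Masked softmax: out-of-window scores become -inf, so their weight is exactly 0.
rng = np.random.default_rng(0)
scores = rng.standard_normal(len(tokens))     # stand-in scores (hypothetical)
in_window = np.isin(np.arange(len(tokens)), visible)
masked = np.where(in_window, scores, -np.inf)
weights = np.exp(masked - masked.max())
weights /= weights.sum()
print(weights[:2])                     # [0. 0.]: "the fox" contributes nothing at this layer
```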
The rolling buffer cache
The window also saves memory, and that saving follows from a single fact. If no token ever attends past $W$ positions back, then the cache never needs to hold more than $W$ entries. Everything older is dead weight, because no future token will ever read it.
So the cache is built as a fixed array of exactly $W$ slots, a rolling buffer. Token $i$ is written to slot $i \bmod W$. While $i < W$ the buffer fills up normally. Once $i$ reaches $W$, the write wraps back to the start and overwrites the oldest entry, which is exactly the one that just fell out of every token's window. The cache stops growing and sits at a fixed size forever:
# decode step i, with sliding window W
k_i, v_i = project(x_i) # this token's key + value
cache[i % W] = (k_i, v_i) # write, wrapping around
window = recent(cache, W) # the most recent W entries
attn = softmax(q_i @ window.keys.T / sqrt(head_dim)) @ window.values # scaled as in (1)
As the position advances, the write head wraps around at $i = W$, overwriting the oldest token while the size stays pinned.
The paper measures the saving directly: on a 32,000-token sequence the rolling buffer holds 4096 entries instead of 32,000, an 8x cut in cache memory, with no measured loss in quality. Stacked on grouped-query attention, the cascade is the headline of the whole architecture. A naive multi-head cache of about 16 GB becomes 4 GB with grouped-query, and then about 0.5 GB once the window caps it. The model did not get weaker. It got cheap to serve.
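A runnable version of the rolling buffer, for one attention head, might look like the sketch below. The class name `RollingKVCache`, its methods, and the toy sizes are hypothetical; the only rule taken from the text is that token i is written to slot i % W. The last few lines redo the memory cascade with the configuration numbers.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size key/value store for one head: token i lives in slot i % W."""

    def __init__(self, W: int, head_dim: int):
        self.W = W
        self.keys = np.zeros((W, head_dim))
        self.values = np.zeros((W, head_dim))
        self.positions = np.full(W, -1)       # which token each slot holds
        self.n_seen = 0

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        slot = self.n_seen % self.W           # wraps once n_seen reaches W
        self.keys[slot], self.values[slot] = k, v
        self.positions[slot] = self.n_seen
        self.n_seen += 1

    def window(self):
        # Return the stored entries in token order, oldest first.
        filled = np.flatnonzero(self.positions >= 0)
        order = filled[np.argsort(self.positions[filled])]
        return self.keys[order], self.values[order], self.positions[order]

cache = RollingKVCache(W=4, head_dim=8)
rng = np.random.default_rng(0)
for i in range(6):                            # the six-token toy sentence
    cache.append(rng.standard_normal(8), rng.standard_normal(8))

_, _, held = cache.window()
print(cache.positions)    # [4 5 2 3]: tokens 0 and 1 were overwritten in place
print(held)               # [2 3 4 5]: exactly the current window

# Cache size in GB: 16-bit numbers, 32 layers, head_dim 128.
def cache_gb(n_tokens, n_kv_heads):
    return n_tokens * 2 * 32 * n_kv_heads * 128 * 2 / 1e9

print(cache_gb(32_000, 32))   # about 16.8: multi-head, no window
print(cache_gb(32_000, 8))    # about 4.2: grouped-query
print(cache_gb(4_096, 8))     # about 0.54: grouped-query plus rolling buffer
```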
Pre-fill and chunking
One efficiency detail remains, and it is about the prompt rather than the generation. Generating is slow because each token depends on the last, so they come out one at a time. But the prompt is different: you have all of it already. There is no reason to feed it through one token at a time. You run the whole prompt through in a single parallel pass and fill the cache from it in one shot. This phase is called pre-fill, and it is why the first generated token can take a moment while the rest stream out quickly.
For a very long prompt, even that single pass can blow the memory budget, so Mistral processes it in chunks the size of the window. What blows the budget is attention itself: a prompt run in one shot materializes attention over the whole prompt at once, a spike that grows with its length, while a window-sized slab attending against the rolling cache keeps the peak bounded no matter how long the prompt is. Each chunk attends to itself, with a causal mask (each token sees itself and earlier tokens, never later ones), and to the cache the earlier chunks left behind. The attention mask for a chunk has three regions, and seeing them is the cleanest summary of how sliding-window attention actually runs.
That three-region mask is the whole mechanism in one picture. A token looks at its own chunk up to itself (causal, so it never peeks ahead), at the slice of older tokens still inside its window (the rolling cache), and at nothing further back. The same rule that bounded the cache also bounds the work of reading a long prompt.
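The three regions can be printed directly. This sketch builds the mask for one chunk of a longer prompt using the same window rule as before; the function names, the toy window of 4, and the chunk starting at position 8 are all illustrative choices.

```python
import numpy as np

def chunk_mask(start: int, chunk_len: int, W: int) -> np.ndarray:
    # Rows: queries in the current chunk (positions start .. start+chunk_len-1).
    # Columns: every key position from 0 up to the end of the chunk.
    i = np.arange(start, start + chunk_len)[:, None]
    j = np.arange(start + chunk_len)[None, :]
    return (j <= i) & (i - j < W)

def split_regions(start: int, chunk_len: int, W: int):
    mask = chunk_mask(start, chunk_len, W)
    cache_start = max(start - W, 0)
    past = mask[:, :cache_start]          # older than the rolling cache holds
    cache = mask[:, cache_start:start]    # tokens left behind by earlier chunks
    current = mask[:, start:]             # the chunk itself
    return past, cache, current

past, cache, current = split_regions(start=8, chunk_len=4, W=4)
print(past.any())              # False: nothing further back is ever read
print(cache.astype(int))       # sliding-window region over the cached tokens
print(current.astype(int))     # lower-triangular: causal within the chunk
```

Only the `cache` and `current` columns ever need to be materialized, so the attention matrix for a chunk is at most chunk_len by (W + chunk_len), whatever the prompt length.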
What it buys
That is the architecture. The benchmarks are why it mattered. Mistral re-ran every benchmark through its own evaluation pipeline (so the numbers differ slightly from the figures in the Llama 2 paper) and compared like for like. Against Llama 2 13B, a model nearly twice its size, Mistral 7B wins on eleven of the twelve listed benchmarks, and the margins on math and code are large.
One claim is worth correcting. The abstract says Mistral beats Llama 2 13B "across all evaluated benchmarks." The paper's own Table 2 has one exception: NaturalQuestions, where Mistral scores 28.8 against Llama 2 13B's 29.0. The gap is 0.2 points, well inside evaluation noise, and the figure in the paper quietly buries it by bucketing NaturalQuestions and TriviaQA together into one "world knowledge" category, where the TriviaQA win carries the bucket. It is still an exception to the word "all." Knowledge is Mistral's relative weak spot, which fits: storing facts takes parameters, and it has fewer of them.
Code is the other place to read carefully. Mistral's code scores (HumanEval 30.5, MBPP 47.5) are excellent for a general model, far above Llama 2 13B, but the dedicated Code-Llama 7B still edges it on both. The paper says this honestly: Mistral "approaches" Code-Llama's coding performance "without sacrificing performance on non-code benchmarks." Approaches, not beats. That restraint is exactly right.
To put the efficiency in one figure, Mistral measured an "equivalent model size": how big a Llama 2 would have to be to match Mistral on a given task. On reasoning, comprehension, and MMLU, Mistral 7B performs like a Llama 2 of more than three times its size. On knowledge benchmarks the multiple drops to about 1.9, the parameter limit showing through again. The headline is the same either way: this is a small model behaving like a much larger one.
The release also included an instruction-tuned model, Mistral 7B Instruct: the base model trained a little further on examples of following instructions, using only public data and no proprietary tricks. It beats every other 7B chat model on MT-Bench, a benchmark that scores multi-turn chat answers out of ten (it scored 6.84), and is competitive with 13B chat models. On the Chatbot Arena leaderboard, which ranks models by head-to-head human votes, it landed at an Elo rating of 1031, above Llama 2 13B Chat's 1012, and in a direct human comparison its answers were preferred 5020 times to Llama 2 13B Chat's 4143. A simple fine-tune of a small base model, beating a chat model twice its size.
Scaling in three dimensions
The conclusion is one paragraph long, and it reframes what the whole exercise was for. For a few years the conversation about scale was effectively two-dimensional: how good a model is, against how much compute it took to train. The Chinchilla scaling laws are the clearest version of that view. For a fixed training budget, they give the best split between model size and training tokens (about 20 tokens per parameter) to reach the lowest loss.
Mistral's point is that this leaves out the axis that decides whether a model is usable in production: inference cost. Chinchilla does not optimize for it. A compute-optimal model is incidentally cheaper to run than an over-large one, but that is a by-product of the objective, not its goal. The real problem, the paper argues, is three-dimensional: model capability, training cost, and inference cost. You can spend more training compute on a smaller model than Chinchilla would call optimal, because the small model is then cheap to run on every token forever, while the extra training is a one-time bill. The sentence the paper leads with is the one to keep: "language models may compress knowledge more than what was previously thought."
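The one-time bill versus the forever bill can be put in rough numbers. The sketch below leans on a common rule of thumb that is an assumption here, not something the Mistral paper states: training costs about 6 FLOPs per parameter per token, and generating costs about 2 FLOPs per parameter per token. The 20 tokens per parameter is the Chinchilla figure quoted above, and the function names are made up.

```python
# Rule-of-thumb cost model (an assumption, not from the Mistral paper):
# training is about 6 FLOPs per parameter per token, generation about 2.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def inference_flops(n_params: float, n_tokens: float) -> float:
    return 2 * n_params * n_tokens

TOKENS_PER_PARAM = 20                     # Chinchilla's approximate optimum

for n_params in (7e9, 13e9):
    optimal_tokens = TOKENS_PER_PARAM * n_params
    train = training_flops(n_params, optimal_tokens)
    per_token = inference_flops(n_params, 1)
    # Generated tokens after which inference has cost as much as training.
    break_even = train / per_token
    print(f"{n_params / 1e9:.0f}B: {optimal_tokens:.2e} training tokens, "
          f"{per_token:.1e} FLOPs per generated token, "
          f"break-even after {break_even:.2e} generated tokens")
```

Under these assumptions the inference bill overtakes the training bill once a model has generated about three times as many tokens as it was trained on, and every generated token on the 13B model costs 13/7 of what it costs on the 7B. That is the arithmetic behind spending extra training compute on the smaller model.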
This is what ties the architecture to the thesis. Grouped-query attention, the sliding window, the rolling buffer, and chunked pre-fill are all levers on the inference axis. They make a small, heavily trained model practical to serve, which is what made an Apache-licensed 7B model worth releasing. None of the four ideas was new on its own. The contribution is putting them together in a model that beats ones twice its size, and saying plainly that inference cost is a first-class scaling dimension. For a model you deploy, the deciding question is how cheap it is to run, and most of that cost is the attention cache.
Questions you might still have
If each layer only looks at 4096 tokens, how does Mistral use a longer context?
Depth. Stacking 32 windowed layers compounds the reach to about 32 x 4096 = 131K tokens, because each layer can pull in information another window deep. The window is a per-layer radius, not the context limit (the model is trained at length 8192).
Is grouped-query attention just multi-query attention?
No. Multi-query uses a single shared key/value head, shrinking the cache about 32x but costing some quality. Grouped-query uses several (Mistral keeps 8 for its 32 query heads), shrinking the cache 4x while staying close to full multi-head quality.
Does the sliding window throw away quality?
The paper reports the rolling buffer cuts cache memory 8x at 32k tokens with no measured quality drop, and Mistral still tops Llama 2 13B almost everywhere. Local attention plus depth recovers the long-range information a single wide layer would have carried.
Did Mistral invent any of this?
No single piece. Grouped-query attention, local/sliding-window attention, and the KV-cache tricks all predate it, and the rest of the model is Llama. The contribution is assembling them into a small model that is both strong and cheap, and showing it beats ones nearly twice its size.
Footnotes & further reading
- The paper: Jiang, Sablayrolles, Mensch, et al., Mistral 7B (Mistral AI, 2023). Reference code.
- Grouped-query attention: Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023). Multi-query attention: Shazeer, Fast Transformer Decoding (2019).
- Sliding-window and local attention: Child et al., Generating Long Sequences with Sparse Transformers (2019), and Beltagy et al., Longformer (2020), source of the linear-cost and receptive-field framing.
- The Llama-inherited ingredients (not detailed in the Mistral paper): Touvron et al., LLaMA (2023); RoPE (Su et al., 2021), SwiGLU (Shazeer, 2020), RMSNorm (Zhang & Sennrich, 2019).
- Scaled dot-product attention and the $\sqrt{d_k}$ scaling: Vaswani et al., Attention Is All You Need (2017).
- The two-axis scaling Mistral extends: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, 2022). The kernels behind the 2x speedup over vanilla attention that the paper reports for sliding-window attention: FlashAttention (Dao et al., 2022) and xFormers.