Verified · arXiv:2302.13971 · 30 min
LLMs · Architecture

LLaMA: Open and Efficient Foundation Language Models

Train a small model far too long, so it is cheap to run.

LLaMA introduced no new architecture. Its contribution is a bet about what to optimize: not the cost of training a model, but the cost of running it for the rest of its life. Push a small model far past the point a compute-optimal recipe would stop, and you get a model that is competitive and cheap to serve, trained on public data anyone can use.

Explaining the paper: LLaMA: Open and Efficient Foundation Language Models · Touvron, Lavril, Izacard, et al. · Meta AI · 2023 · arXiv:2302.13971

What if the right model to train is not the cheapest one to train, but a smaller one you train far longer?

By early 2023 the recipe for a large language model felt settled. Pick a parameter count, work out the compute-optimal number of training tokens from a scaling law, train once, ship. The scaling law everyone quoted was Chinchilla (Hoffmann et al., 2022), and it said something sharp: for a fixed training budget, you grow the model and the data together, roughly twenty training tokens for every parameter. By that measure the previous generation had been training models far too big on far too little text.

LLaMA, out of Meta AI, took that law seriously and then asked a different question with it. Chinchilla optimizes the cost of training, but a model you intend to deploy is trained once and then run millions of times, so the bill that matters is the lifetime one. The way to shrink that bill is to pick the smallest model that clears your quality bar and push it over the bar by training well past the point Chinchilla would have stopped.

That single change of objective is the whole paper. Everything else, the architecture and the training, is careful, well-documented borrowing: a standard Transformer with three modifications taken from models that came before, trained on a mixture of public datasets, with a couple of memory tricks to make it fit. The result is a family from 7B to 65B parameters where the 13B model clears GPT-3 (175B) on most benchmarks and the 65B trades blows with the best models of its day, all of it reproducible from public data.

The bet: train a small model too long

Start with what Chinchilla actually says, because the LLaMA bet is a careful departure from it, not a rejection. Chinchilla ran a large sweep and found that, to get the lowest loss out of a fixed amount of training compute, you should scale the model size $N$ and the number of training tokens $D$ in lockstep, about twenty tokens per parameter: a compute-optimal 10B model gets trained on roughly 200B tokens, and a model trained on much less than that for its size, the way GPT-3 was, is leaving performance on the table. The catch is that the compute behind a model has two halves that look nothing alike. Training happens once. Inference happens every time anyone uses the model, for as long as it is deployed, and its cost scales with the parameter count $N$. The rough bill is one number, with $T$ the total tokens the model will ever be asked to produce:

$$\text{cost} \;\approx\; \underbrace{6\,N\,D}_{\text{train once}} \;+\; \underbrace{2\,N\,T}_{\text{serve forever}} \tag{1}$$

The factors are FLOP counting. A forward pass through a model is about $2N$ arithmetic operations per token, and the backward pass costs about twice the forward, so a training token costs roughly three forward passes' worth, $6ND$ in total over $D$ tokens, while a served token costs the forward alone, $2N$. Chinchilla minimizes only the first term, holding the training budget fixed, but for a model you ship the second term grows without bound as people use it, and it depends on $N$ alone, not on how long you trained. That reframes the design problem: if two models reach the same quality, the smaller one is strictly better to deploy, because every query it answers is cheaper. The goal becomes the smallest model that hits the target, paid for with whatever training it takes to get it there.
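To make equation (1) concrete, here is a minimal sketch that evaluates both terms for two models and solves for the number of served tokens at which the smaller one becomes cheaper overall. The function names are illustrative, and the pair of models (including the assumption that they reach the same quality) is hypothetical, not a result from the paper.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Approximate lifetime cost from equation (1): 6*N*D to train, 2*N*T to serve."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens


def break_even_served_tokens(n_small, d_small, n_large, d_large):
    """Served tokens T at which the two lifetime bills are equal.

    Solves 6*Ns*Ds + 2*Ns*T = 6*Nl*Dl + 2*Nl*T for T. Returns 0.0 when the
    small model is already no more expensive to train.
    """
    extra_training = 6.0 * (n_small * d_small - n_large * d_large)
    saving_per_served_token = 2.0 * (n_large - n_small)
    return max(0.0, extra_training / saving_per_served_token)


# Hypothetical pair, ASSUMED (not measured) to reach the same loss:
# a 7B model trained on 1.0T tokens versus a 13B model trained on 0.4T tokens.
N_SMALL, D_SMALL = 7e9, 1.0e12
N_LARGE, D_LARGE = 13e9, 0.4e12

T_STAR = break_even_served_tokens(N_SMALL, D_SMALL, N_LARGE, D_LARGE)
print(f"small model is cheaper overall after {T_STAR:.2e} served tokens")
```

Below the break-even point the larger, shorter-trained model has the smaller bill; past it, every additional served token widens the gap in favor of the smaller model.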

That only works if you can keep buying quality by training longer, and the empirical fact that makes the whole bet pay off is that you can. LLaMA's training loss curves, redrawn from the paper's Figure 1, make the case: none of them has flattened. Even the 7B, after a trillion tokens, is still visibly descending. The tick on each curve marks where Chinchilla would have stopped it, at about twenty tokens per parameter, and the small models run far past that line:

Figure 1 · train longer, keep falling
Training loss keeps dropping as you feed each model more tokens, and none of the curves has flattened. The ticks mark each model's Chinchilla-optimal stop (≈20 tok/param). LLaMA runs the small models many times past it: the 7B sees about 150 tokens per parameter, the 65B about 21. A small model trained this long is cheap to serve and still good.

The numbers make the asymmetry concrete. The 7B was trained on 1.0T tokens, which is about 150 tokens per parameter, seven times past its compute-optimal point. The 13B saw 77 per parameter. The 33B and 65B were trained on 1.4T tokens, landing at roughly 43 and 21 tokens per parameter. So the 65B is close to compute-optimal, while the small models are deliberately, heavily overtrained: extra training compute, paid once, buying a smaller model, whose size you pay for on every query.
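The ratios quoted above are a one-line division, sketched here for the whole family. The exact parameter counts (6.7B, 13.0B, 32.5B, 65.2B) are the ones reported in the paper's model table; the text rounds them to the familiar names. The variable names are illustrative.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # the rule of thumb quoted in the text

# name -> (parameters, training tokens)
LLAMA_FAMILY = {
    "7B": (6.7e9, 1.0e12),
    "13B": (13.0e9, 1.0e12),
    "33B": (32.5e9, 1.4e12),
    "65B": (65.2e9, 1.4e12),
}


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


RATIOS = {name: tokens_per_param(n, d) for name, (n, d) in LLAMA_FAMILY.items()}

for name, ratio in RATIOS.items():
    overtrain = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name:>3}: {ratio:5.1f} tokens/param, {overtrain:.1f}x the Chinchilla ratio")
```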

None of this refutes Chinchilla. LLaMA is inspired by the Chinchilla scaling laws and uses them; it optimizes a different objective. Chinchilla answers "least loss per unit of training compute," LLaMA answers "least cost to reach a target and then deploy." The small overtrained models are worse per training FLOP than a compute-optimal model would be, exactly as Chinchilla predicts, and better per inference FLOP, the term Chinchilla never modeled. Later work made this formal by writing down the combined training-plus-inference objective and re-deriving the optimum, which comes out smaller and trained longer than Chinchilla's, as LLaMA did by hand.[2]

It helps to be exact about what each law is even measuring, because the loose reading of Chinchilla, that it tells you the right size to build, is the one to drop. Chinchilla's sweep fixes a target loss and asks the single question, what is the cheapest way to train a model that reaches it, and its answer counts only training FLOPs, the $6ND$ term, paid exactly once. Inference is nowhere in that objective, so a model Chinchilla calls compute-optimal is just the cheapest one to train to a given quality, not the cheapest one to live with. LLaMA fixes the same target loss and asks a second question Chinchilla never posed, what is the cheapest way to own a model that reaches it, and that bill is dominated by the $2NT$ term, the cost of every inference call summed over the model's whole deployed life. Those are two different minimizations of two different quantities. The compute-optimal model wins the first and a smaller, longer-trained one wins the second, so reading LLaMA as evidence that Chinchilla "trained too large" misses that the two are not competing for the same prize. Chinchilla answers its own question correctly; LLaMA just decided that question was the wrong one to ask about a model it intended to serve to millions of users.

Only public data

The other word in the title is "open," and it is doing real work. The strongest models of the time were trained on data nobody outside the lab could see: GPT-3, Chinchilla and PaLM list ingredients like "Books, 2TB" or "social media conversations" with no way to reproduce them. LLaMA's constraint was to use only publicly available, documented sources, so that the training set is something another researcher could in principle rebuild. That choice is the reason the model could be released and the reason an entire open ecosystem grew on top of it.

The mixture is about 1.4 trillion tokens, and it is mostly the open web. CommonCrawl and C4, two web crawls, are 82% of it between them, cleaned through a pipeline that deduplicates pages, throws out non-English text, and filters low-quality pages with a small classifier. The rest is curated: GitHub code under permissive licenses, Wikipedia in twenty languages, books, arXiv papers stripped down to their text, and Stack Exchange. The sources, by share and by how many times each is read:

Figure 2 · the public-data mixture
LLaMA's 1.4T-token mix, all public. Web crawl (CommonCrawl plus C4) is 82% of the tokens; the rest is curated. The lower bars are epochs: most sources are seen about once, but the two cleanest, Wikipedia and Books, are passed over more than twice, and GitHub less than once.

The detail worth pausing on is the epochs, the lower bars. The sampling proportion is how much of each batch comes from a source; the number of epochs is how many times the model actually reads that source over the run, which is its share of the 1.4T training tokens divided by how many tokens the source actually holds. A small but heavily weighted source gets read several times. Most of the data is seen roughly once. The exceptions are Wikipedia and Books, the cleanest, most information-dense text, which get repeated past two epochs because a token of Wikipedia is worth more than a token of raw web. GitHub, by contrast, is sampled at less than one epoch; the model spends its repeats where they count.
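The relationship between sampling share and epochs is a single ratio, sketched below. The two source sizes are hypothetical round numbers picked only to show how the same 4.5% share can mean more than two passes over a small source and less than one pass over a larger one; they are not the paper's token counts.

```python
TOTAL_TRAINING_TOKENS = 1.4e12  # size of the full run, from the text


def epochs(sampling_share: float, source_tokens: float,
           total_tokens: float = TOTAL_TRAINING_TOKENS) -> float:
    """Passes over a source = tokens drawn from it / tokens it contains."""
    return sampling_share * total_tokens / source_tokens


# Hypothetical sources with the same 4.5% share but different sizes.
small_clean = epochs(0.045, source_tokens=25e9)   # small, heavily weighted
large_code = epochs(0.045, source_tokens=100e9)   # four times larger, same weight

print(f"small source: {small_clean:.2f} epochs, large source: {large_code:.2f} epochs")
```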

Two small choices in the tokenizer matter later. LLaMA uses byte-pair encoding through SentencePiece, and it splits every number into individual digits, so "2023" becomes four tokens, not one. That forces the model to do arithmetic digit by digit rather than memorizing whole numbers as symbols, which tends to generalize better to numbers it never saw in training. Unknown characters fall back to raw bytes, so nothing is ever out of vocabulary.
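The two rules can be illustrated without a real tokenizer. The sketch below is not SentencePiece and performs no BPE merging; `split_digits` and `byte_fallback` are hypothetical helper names that only mimic the digit-splitting and byte-fallback behavior described above.

```python
import re


def split_digits(text: str) -> list[str]:
    """Pre-split text so every digit stands alone; other runs stay together."""
    return re.findall(r"\d|\D+", text)


def byte_fallback(char: str) -> list[str]:
    """Represent one unknown character as its UTF-8 bytes, one token per byte."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


pieces = split_digits("In 2023")      # the year becomes four single-digit pieces
fallback = byte_fallback("\u00e9")    # e-acute is two bytes in UTF-8

print(pieces)
print(fallback)
```

In a real tokenizer the non-digit runs would be broken down further by BPE; the point here is only that no merge can ever glue two digits into one symbol.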

One caveat the paper does not raise: "publicly available" meant reachable on the internet, not legally cleared. The Books portion drew on Books3, a corpus later shown to contain copyrighted books and the subject of takedowns and lawsuits. It is a useful reminder that "open data" and "licensed data" are not the same thing.

Normalize the input, by RMS

The architecture is a plain decoder-only Transformer with three swaps, and LLaMA invented none of them. Each was already in use: pre-normalization from GPT-3, the SwiGLU activation from PaLM, rotary position embeddings from GPT-Neo. The value is in the selection and the careful tuning, not in novelty. Take them one at a time. The first is about normalization, and it has two parts: where you normalize, and how.

Where, first. A Transformer sublayer is wrapped in a residual connection: the block computes something and adds it to its input. The original Transformer normalized the output of that addition, a layout called post-normalization, and it has a quirk: it will not train unless you ramp the learning rate up slowly over the first few thousand steps. Pre-normalization moves the normalizer inside the residual branch, normalizing the input to the block and leaving the residual path itself a clean, untouched identity. That small rearrangement is what lets deep Transformers train stably without the warmup babysitting, and it is the layout GPT-3 used and LLaMA inherits.
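The difference between the two layouts is where the normalizer sits relative to the addition, which a few lines make explicit. This is a minimal sketch: `normalize` is a stand-in normalizer with no learned parameters, and `silent` is a hypothetical sublayer that outputs zeros, used only to expose what happens on the residual path.

```python
import numpy as np


def normalize(x, eps=1e-6):
    """Stand-in normalizer: rescale to unit root mean square (no learned gain)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def post_norm_block(x, sublayer):
    """Original Transformer layout: normalize after the residual addition."""
    return normalize(x + sublayer(x))


def pre_norm_block(x, sublayer):
    """GPT-3 / LLaMA layout: normalize only the input to the branch."""
    return x + sublayer(normalize(x))


def silent(z):
    """A sublayer that contributes nothing, to expose the residual path."""
    return np.zeros_like(z)


x = np.array([3.0, -4.0, 12.0, 0.5])
pre_out = pre_norm_block(x, silent)     # x passes through untouched
post_out = post_norm_block(x, silent)   # x is rescaled even though the sublayer did nothing
```

With pre-normalization the input reaches the output unchanged whenever the branch is quiet, which is the clean identity path the paragraph describes; with post-normalization every block rewrites the signal.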

How, second. The usual normalizer is LayerNorm, which does two things to a vector: it subtracts the mean (re-centering) and divides by the standard deviation (re-scaling), then applies a learned gain and bias. RMSNorm keeps only the rescaling. It divides by the root mean square and applies a gain, with no mean subtraction and no bias:

$$\bar{x}_i = \frac{x_i}{\operatorname{RMS}(x)}\, g_i, \quad \operatorname{RMS}(x) = \sqrt{\tfrac{1}{n}\textstyle\sum_{j} x_j^2} \tag{2}$$

The bet behind RMSNorm is that the part of LayerNorm doing the work is the rescaling, and that the re-centering was along for the ride. The argument, which the RMSNorm authors support empirically rather than prove, is that bounded activation magnitudes are what keep a long chain of layers trainable, so once the root mean square is pinned and activations can neither blow up nor vanish, subtracting the mean buys little: re-centering only shifts every component by a common offset, a change the rescale mostly neutralizes anyway. Drop it and you save a pass over the vector and a parameter, with no measured loss in quality. The figure makes the mechanical difference visible. Add an offset to every component of the input and LayerNorm erases it completely, because subtracting the mean removes any constant shift. RMSNorm lets it through, because it never looks at the mean:

Figure 3 · RMSNorm drops the mean
The same vector under LayerNorm and RMSNorm. Add a constant to every component and LayerNorm's output does not budge, because it subtracts the mean. RMSNorm's output shifts with it, because it only divides by the root mean square. RMSNorm keeps what LayerNorm throws away, and costs one statistic less to compute.
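The same experiment can be run numerically. The sketch below implements both normalizers in NumPy and measures how far each output moves when a constant is added to every input component. The small `eps` inside the square root is a common numerical safeguard and an addition to equation (2); the example vector and offset are arbitrary.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-6):
    """Re-center and re-scale, then apply a learned gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias


def rms_norm(x, gain, eps=1e-6):
    """Equation (2): divide by the root mean square, then apply a gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain


x = np.array([0.5, -1.0, 2.0, 0.1])
gain, bias = np.ones(4), np.zeros(4)
offset = 0.8  # constant added to every component

# Largest change in any output component caused by the offset.
ln_moves = np.abs(layer_norm(x + offset, gain, bias) - layer_norm(x, gain, bias)).max()
rms_moves = np.abs(rms_norm(x + offset, gain) - rms_norm(x, gain)).max()

print(f"LayerNorm output moved by {ln_moves:.2e}, RMSNorm output by {rms_moves:.2e}")
```

LayerNorm's change is zero up to floating-point rounding, while RMSNorm's is clearly visible, which is the behavior the figure shows.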

The pattern recurs through the architecture: a standard component, stripped of the part that was not earning its keep. Nothing here is dramatic on its own. Stacked up across a model trained for weeks, the savings are real and the quality holds.

A smoother, gated activation

The second swap is inside the feed-forward network, the two-layer block that sits after attention in every Transformer. The standard version projects up, applies a ReLU, and projects back down. ReLU is a hard switch: it passes a value through if it is positive and zeroes it otherwise. LLaMA replaces this with SwiGLU, which changes two things at once: the shape of the nonlinearity, and the fact that there are now two paths that multiply.

The shape first. Instead of ReLU, SwiGLU uses Swish (also called SiLU), the function $z\,\sigma(z)$. It looks like a softened ReLU: smooth everywhere, and instead of a hard floor at zero it dips slightly negative around $z \approx -1.3$ before climbing. That smoothness gives cleaner gradients, and the small negative region lets a little signal through where ReLU would have killed it. The two shapes differ where it matters:

Figure 4 · ReLU versus Swish
ReLU is a hard switch: zero for negatives, the input itself for positives. Swish, $z\,\sigma(z)$, is a smooth version that dips a little below zero near $z \approx -1.3$ and is differentiable everywhere. It matches ReLU for large positive inputs.
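The location and depth of the dip are easy to check numerically. This is a minimal sketch that evaluates both functions on a fine grid; the grid range and resolution are arbitrary choices.

```python
import numpy as np


def relu(z):
    """Hard switch: zero for negatives, identity for positives."""
    return np.maximum(z, 0.0)


def swish(z):
    """Swish / SiLU: z * sigmoid(z), written as z / (1 + exp(-z))."""
    return z / (1.0 + np.exp(-z))


z = np.linspace(-6.0, 6.0, 12001)   # grid with step 0.001
values = swish(z)
z_min = z[np.argmin(values)]        # where the dip bottoms out
y_min = values.min()                # how deep the dip goes

print(f"Swish minimum near z = {z_min:.3f}, value {y_min:.4f}")
```

The minimum sits near z = -1.28 with a value of about -0.28, and for large positive inputs the two functions are numerically indistinguishable.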

The gating is the other half, the "GLU" in the name. Rather than running one projection through the activation, the block makes two projections of the input. One is passed through Swish and acts as a gate; the other is the value, and they are multiplied elementwise before the final projection down:

$$\operatorname{FFN}(x) = \big(\operatorname{Swish}(xW) \,\odot\, xV\big)\,W_2 \tag{3}$$

The gate decides, per coordinate, how much of the value to let through. That is more expressive than a single activation, and it costs a third weight matrix: a standard feed-forward block has two ($W$ up and $W_2$ down), SwiGLU has three ($W$, $V$, $W_2$). To keep the parameter count matched against a standard block, LLaMA shrinks the hidden width. A plain feed-forward block widens to $4d$; LLaMA's SwiGLU block widens only to $8d/3 \approx 2.67d$, the one number people most often get wrong: the hidden width is $8d/3$, not $4d$, and the two-thirds factor is exactly the discount that pays for the extra matrix.
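The parameter bookkeeping and the block itself fit in a short NumPy sketch. The width `d = 384` is an arbitrary toy value chosen so that 8d/3 is a whole number, the weights are random placeholders, and the argument names `w_gate`, `w_value`, `w_down` are illustrative labels for $W$, $V$, $W_2$ in equation (3).

```python
import numpy as np


def silu(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def swiglu_ffn(x, w_gate, w_value, w_down):
    """Equation (3): (Swish(x W) * (x V)) W2."""
    return (silu(x @ w_gate) * (x @ w_value)) @ w_down


d = 384                       # toy model width; 8d/3 is an integer for this value
hidden_plain = 4 * d          # standard two-matrix block
hidden_swiglu = 8 * d // 3    # three-matrix SwiGLU block

params_plain = 2 * d * hidden_plain       # up + down
params_swiglu = 3 * d * hidden_swiglu     # gate + value + down

rng = np.random.default_rng(0)
x = rng.normal(size=(5, d))                            # five token vectors
w_gate = rng.normal(size=(d, hidden_swiglu)) * 0.02
w_value = rng.normal(size=(d, hidden_swiglu)) * 0.02
w_down = rng.normal(size=(hidden_swiglu, d)) * 0.02
y = swiglu_ffn(x, w_gate, w_value, w_down)

print(params_plain, params_swiglu, y.shape)
```

Both blocks come to 8d² weights. When 8d/3 is not a whole number, an implementation has to round the hidden width, so the match is approximate rather than exact.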

That intuition, a learned per-coordinate gate being more flexible than a fixed nonlinearity, is a hand-wave, not a derivation; nobody can say why gating wins. The paper that introduced these variants tested a pile of them, found the gated ones came out ahead, and declined to explain it, attributing their success, in the author's words, "to divine benevolence." LLaMA uses it because it measures better, which is the only argument on offer.

Position as rotation

The third swap is how the model knows the order of the words, and it is the prettiest idea in the paper. Attention on its own is blind to order: it treats the input as a bag of tokens, so without extra information "dog bites man" and "man bites dog" look identical. The original Transformer fixed this by adding a fixed position vector to each token at the very bottom of the network. Rotary position embeddings, RoPE, encode position a different way: by rotating the query and key vectors.

The construction splits each query and key vector into two-dimensional pairs. For the token at position $m$, pair $i$ is rotated by an angle $m\,\theta_i$, where the per-pair frequency $\theta_i = 10000^{-2i/d}$ runs from fast for the first pairs to slow for the last:

$$R_m^{(i)} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \tag{4}$$

On its own that spins each vector by an amount fixed by its absolute position, which does not obviously help. What happens in the attention score is the point. Attention compares a query at position $m$ with a key at position $n$ through their dot product, and a rotation preserves lengths, so rotating the two vectors can change their dot product only through the angle between them. Turn the query by $m\theta$ and the key by $n\theta$, and the angle between them shifts by $(m-n)\theta$, the difference of the two turns and nothing else:

$$\langle R_m\,q,\; R_n\,k\rangle \;=\; \langle q,\; R_{\,n-m}\,k\rangle \tag{5}$$

So each vector is turned by its absolute position, yet the score reads out only the relative offset. The figure makes this physical. The two sliders rotate a query (teal) and a key (amber) on a dial. One slider sets their offset $m-n$, which opens or closes the gap and changes the score. The other shifts both positions together: the vectors spin, but the gap between them, and the score, do not move. Moving that second slider rotates everything while the score stays frozen:

Figure 5 · the score sees only the offset
RoPE turns the query and key by angles set by their positions. The attention score is the cosine of the angle between them, which is the content angle plus $(m-n)\,\theta$. Change the offset and the score moves. Shift both positions together and the vectors spin but the score holds: only the relative offset matters.
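The relative-offset property of equation (5) can be verified directly. The sketch below rotates adjacent coordinate pairs (0,1), (2,3), and so on; that pairing is one common convention and an assumption here, since some implementations pair coordinate i with coordinate i + d/2 instead. The vectors are random and the function names are illustrative.

```python
import numpy as np


def rope(vec, position, base=10000.0):
    """Rotate consecutive 2D pairs of `vec` by position * theta_i (equation 4)."""
    d = vec.shape[-1]
    pair_index = np.arange(d // 2)
    theta = base ** (-2.0 * pair_index / d)   # fast pairs first, slow pairs last
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    even, odd = vec[0::2], vec[1::2]          # the two coordinates of each pair
    out = np.empty_like(vec)
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out


def score(q, k, m, n):
    """Attention logit between a query at position m and a key at position n."""
    return float(rope(q, m) @ rope(k, n))


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

same_offset_a = score(q, k, m=5, n=2)        # offset 3
same_offset_b = score(q, k, m=105, n=102)    # offset 3, both positions shifted by 100
other_offset = score(q, k, m=5, n=4)         # offset 1
rhs_of_eq5 = float(q @ rope(k, 2 - 5))       # <q, R_{n-m} k> with m=5, n=2

print(same_offset_a, same_offset_b, other_offset)
```

The first two scores agree to rounding error, the third differs, and the left side of equation (5) matches its right side.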

Two things follow from the construction. Because the pairs turn at different frequencies, the pairs that turn fast encode fine, local position and the pairs that turn slowly encode coarse, long-range position, the same split a Fourier series uses. A fast pair distinguishes neighbors, a slow pair distinguishes paragraphs, and the geometric spread of $\theta_i$ covers every distance scale in between with some pair. The split into two-dimensional pairs is not decoration either: a 2D rotation is exactly the operation that makes the score in eq (5) read out only the offset $m-n$. Stack the pairs and two tokens far apart accumulate a large, scrambled set of relative angles, so their rotated dot product tends to shrink and distant attention naturally weakens. And because the rotation acts on the query and key directly, RoPE is applied inside every attention layer, not added once at the input the way the old sinusoidal encoding was. Position is baked into the comparison itself, wherever a comparison happens.

Those are the three pieces. Put back together, a LLaMA block is short to write: normalize the input by RMS, run attention with RoPE-rotated queries and keys under a causal mask, add the result back, normalize again, run the SwiGLU feed-forward, add again.

# one LLaMA transformer block: pre-norm, RoPE attention, SwiGLU FFN
h   = x + attention(rms_norm(x))     # normalize input, add result back
out = h + swiglu_ffn(rms_norm(h))    # same shape in, same shape out
#
# attention(z):
#   q, k, v = z @ Wq, z @ Wk, z @ Wv
#   q, k    = rope(q), rope(k)        # rotate by position, every layer
#   s       = softmax(causal(q @ k.T) / sqrt(d_head))
#   return (s @ v) @ Wo
#
# swiglu_ffn(z):
#   return (silu(z @ W) * (z @ V)) @ W2    # hidden width 8d/3

Paying for attention once

Training the 65B model took 2,048 A100 GPUs running for about 21 days. Two engineering choices made that tractable, and both are about memory, because on a large model memory, not arithmetic, is the wall you hit first.

The first is in attention. The naive way to compute attention builds the full $n \times n$ table of scores between every pair of positions, which for a long sequence is enormous and most of which you throw away. LLaMA uses a memory-efficient attention implementation (from the xformers library, in the FlashAttention family) that works in tiles, streaming keys and values past each query in small blocks and accumulating the softmax as it goes, so the full table never exists in memory at once. And because a language model is causal, a token may attend only to tokens before it, so the entire upper half of that table is masked out, and the blocks that fall entirely in that half contribute nothing and are skipped rather than computed. The figure is that triangle: the live lower triangle is real work, the dark upper half is work that is skipped entirely.

Figure 6 · half the grid is never computed
The attention grid: row $i$ is a query, column $j$ a key it might use. A causal model forbids attending to the future, so everything above the diagonal is masked. The efficient implementation never materializes the full grid, which removes the quadratic memory cost, and never computes the masked half, which cuts the attention arithmetic roughly in half.
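The tiling idea can be sketched in NumPy: process queries in blocks, stream key and value blocks past them while keeping a running softmax, and never visit blocks that lie entirely above the diagonal. This illustrates the technique only; it is not the xformers kernel LLaMA used, the block size is arbitrary, and a real implementation runs on the GPU with fused operations.

```python
import numpy as np


def naive_causal_attention(q, k, v):
    """Reference: build the full n x n score table, mask the future, softmax."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_causal_attention(q, k, v, block=4):
    """Stream key/value blocks past each query block with a running softmax."""
    n, d = q.shape
    out = np.zeros_like(v)
    for qs in range(0, n, block):
        qe = min(qs + block, n)
        q_blk = q[qs:qe]
        running_max = np.full(qe - qs, -np.inf)
        running_sum = np.zeros(qe - qs)
        acc = np.zeros((qe - qs, v.shape[1]))
        # Key blocks starting at or beyond qe are fully masked, so the loop stops before them.
        for ks in range(0, qe, block):
            ke = min(ks + block, n)
            s = q_blk @ k[ks:ke].T / np.sqrt(d)          # one small tile of scores
            rows = np.arange(qs, qe)[:, None]
            cols = np.arange(ks, ke)[None, :]
            s = np.where(cols <= rows, s, -np.inf)       # mask inside the diagonal tile
            new_max = np.maximum(running_max, s.max(axis=-1))
            rescale = np.exp(running_max - new_max)      # shrink what was accumulated so far
            p = np.exp(s - new_max[:, None])
            running_sum = running_sum * rescale + p.sum(axis=-1)
            acc = acc * rescale[:, None] + p @ v[ks:ke]
            running_max = new_max
        out[qs:qe] = acc / running_sum[:, None]
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
max_gap = np.abs(tiled_causal_attention(q, k, v) - naive_causal_attention(q, k, v)).max()
print(f"largest difference from the naive result: {max_gap:.2e}")
```

The largest score array that ever exists is one block-by-block tile, and the inner loop visits only tiles on or below the diagonal, yet the output matches the naive computation to rounding error.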

The second choice is activation checkpointing. To compute gradients, the backward pass needs the activations from the forward pass, and storing all of them costs memory in proportion to depth. The trick is two-sided: keep what is expensive to recompute and cheap to store, the outputs of the big matrix multiplies, and throw the rest away, rebuilding the cheap elementwise pieces around them on the fly during the backward pass for almost nothing. LLaMA went a step further than the usual framework support and hand-wrote the Transformer's backward pass, instead of relying on automatic differentiation, so it could control exactly which activations were saved and which were recomputed. That trades a little extra compute for a large cut in memory, which is the right trade when memory is the binding constraint. With these in place, plus model and sequence parallelism across the GPUs, the 65B sustained about 380 tokens per second per GPU.
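The throughput figure ties back to the training time quoted at the start of this section, and the arithmetic is short enough to check. This sketch uses only numbers already given in the text (1.4T tokens, 380 tokens per second per GPU, 2,048 GPUs) and assumes the throughput held constant for the whole run.

```python
TOKENS = 1.4e12                 # training tokens for the 65B run
TOKENS_PER_SEC_PER_GPU = 380    # sustained throughput quoted in the text
GPUS = 2048

seconds = TOKENS / (TOKENS_PER_SEC_PER_GPU * GPUS)
days = seconds / 86_400
gpu_hours = GPUS * seconds / 3_600

print(f"{days:.1f} days, {gpu_hours:,.0f} GPU-hours")
```

The result is a little under 21 days and just over a million GPU-hours, consistent with the wall-clock time quoted above.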

What a 13B buys you

The numbers land where the bet said they would. LLaMA-13B is at or above GPT-3 (175B) on most benchmarks while being roughly a thirteenth of the size, and the 65B is competitive with the best models of its day, Chinchilla-70B and PaLM-540B. The two LLaMA bars are teal:

Figure 7 · the family against the field
HellaSwag
Five models per benchmark. LLaMA-13B and 65B in teal; GPT-3 175B, Chinchilla-70B and PaLM-540B in amber (a dash where a paper did not report it). LLaMA-13B clears GPT-3 175B on most benchmarks; LLaMA-65B leads on several commonsense benchmarks but falls behind Chinchilla and PaLM on MMLU.

The 13B-beats-175B line is the one to read carefully, because it is true and also flattering. GPT-3, from 2020, was trained long before Chinchilla and is badly undertrained for its size: 175B parameters on only about 300B tokens, fewer than two tokens per parameter. LLaMA-13B, trained on a trillion tokens, is a much better model per parameter, so it clears a 175B model that was never trained properly. What it does not do is beat the well-trained models. That job falls to the 65B, and even it does not win cleanly. On commonsense reasoning the 65B edges Chinchilla-70B and PaLM-540B on most benchmarks; on the broad knowledge test MMLU it trails both, scoring 63.4 against Chinchilla's 67.5 and PaLM's 69.3. So the abstract's "competitive with the best" means exactly that, not "beats the best."

Where the 65B does reach the top is closed-book question answering: on NaturalQuestions and TriviaQA it sets the best numbers among the models compared, which is a strong result for a 65B model against 175B-to-540B competitors. And the practical headline still lands: because the models are small, the 13B runs inference on a single V100, and the 65B on a single high-memory GPU. That is the bet cashing out. You spent extra training to shrink the model, and now the model is small enough to actually use.

Where it falls short

The paper is unusually candid about the gaps, and they are worth keeping in view. The MMLU shortfall is the clearest. The authors attribute it to data: their books-and-papers slice is about 177GB, where Chinchilla, Gopher and PaLM trained on up to 2TB of books, and a broad academic exam rewards exactly that kind of text. The open-data constraint, which is the point of the project, has a cost on knowledge-heavy benchmarks.

On safety the results are sobering rather than reassuring. Measured toxicity rises with model size rather than falling, so the bigger LLaMA models generate more toxic text under the same prompts, not less. The model reproduces social biases from its web data, most sharply around religion, and on the TruthfulQA benchmark it is more truthful than GPT-3 but still wrong often enough to be a reliable source of confident falsehoods. None of this is unique to LLaMA. It is the standard inheritance of training on the open web, and the paper reports it plainly instead of burying it.

There is a real environmental cost too. Training the 65B alone took about a million GPU-hours and an estimated 173 tonnes of CO2-equivalent; the whole family came to roughly 1,000 tonnes. The argument the authors make for releasing the weights is partly this: if others can download the model instead of retraining it, the one-time cost is amortized rather than repeated. That is the same logic as the inference bet, applied to the training itself.

A short coda on what came next, because it is the real significance. A little instruction finetuning, a single run following the recipe of a contemporary paper, lifted the 65B's MMLU from 63.4 to 68.9; the base models had more to give than their raw benchmark numbers showed. More importantly, the weights got out, and the recipe was reproducible, so a whole generation of open models and tools grew directly on top of LLaMA. The narrow technical bet, train a small model far past compute-optimal so it is cheap to run, turned out to be exactly the property that makes a model worth building on. The thing that made LLaMA cheap to serve is the thing that made it the foundation everyone else stood on.

Provenance · Verified against primary literature
Chinchilla (2022): Compute-optimal scaling, ~20 tokens per parameter; LLaMA reuses it but optimizes inference cost.
RMSNorm (2019): Normalize by root mean square, dropping LayerNorm's mean subtraction and bias.
SwiGLU (2020): A gated feed-forward activation; LLaMA shrinks the hidden width to 8d/3 to match parameters.
RoFormer / RoPE (2021): Position by rotation, so the attention score reads out relative position.
Correction: Two readings to resist. LLaMA does not refute Chinchilla; it reuses the same law and optimizes a different objective (deployment cost, where inference dominates). And 'LLaMA-13B beats GPT-3 175B' beats a 2020, badly-undertrained baseline, not a compute-optimal model; the 65B only ties the best models and trails them on MMLU. We also fix the most-quoted detail: the SwiGLU hidden width is 8d/3, not 4d.

Questions you might still have

Did LLaMA prove Chinchilla wrong?
No. Chinchilla minimizes training compute and gives ~20 tokens per parameter. LLaMA reuses that law but minimizes the cost of deploying a model, where inference is paid on every query and scales with size. A small model trained far longer is worse per training-FLOP, exactly as Chinchilla says, but cheaper to serve, which is the term Chinchilla never modeled.

If RoPE rotates every vector by its absolute position, how does the score end up relative?
The attention score is a dot product, which depends only on the angle between the two vectors. Rotating the query by m·θ and the key by n·θ leaves a relative angle that depends only on m − n. Spin both together by the same amount and the gap, and the score, do not change.

Why does LLaMA-13B beat GPT-3 175B if it is so much smaller?
GPT-3 (2020) was trained before Chinchilla on under two tokens per parameter, badly undertrained for its size. LLaMA-13B saw 77 tokens per parameter. The 13B beats a weak 175B; it does not beat the well-trained models its own size or larger, which is the 65B’s job.

Why is the feed-forward width 8d/3 instead of 4d?
SwiGLU uses three weight matrices (gate, value, down) where a plain feed-forward block uses two. Shrinking the hidden width from 4d to 8d/3 ≈ 2.67d gives back the parameters the extra matrix would have cost, so the block matches a standard one in size.

Footnotes & further reading

  1. The paper: Touvron, Lavril, Izacard, Martinet, Lachaux, Lacroix, Rozière, Goyal, Hambro, Azhar, Rodriguez, Joulin, Grave, Lample, LLaMA: Open and Efficient Foundation Language Models (Meta AI, 2023). Code.
  2. The scaling law LLaMA departs from: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla). The inference-aware reconciliation came later: Sardana et al., Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws.
  3. Normalization: Zhang & Sennrich, Root Mean Square Layer Normalization (RMSNorm), and Xiong et al., On Layer Normalization in the Transformer Architecture (why pre-norm removes the warmup).
  4. The activation: Shazeer, GLU Variants Improve Transformer (SwiGLU, "divine benevolence"), building on Dauphin et al., Language Modeling with Gated Convolutional Networks (GLU).
  5. Rotary embeddings: Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding.
  6. Efficient training: Rabe & Staats, Self-attention Does Not Need O(n²) Memory, and Dao et al., FlashAttention; activation checkpointing from Chen et al., Training Deep Nets with Sublinear Memory Cost.