
Training Compute-Optimal Large Language Models

Most big models were trained on far too little data.

With a fixed compute budget you can build a bigger model or feed it more data, not both. For years the field spent almost everything on size. Chinchilla works out the right split: grow both together, roughly twenty tokens for every parameter.

Explaining the paper: Training Compute-Optimal Large Language Models · Hoffmann, Borgeaud, Mensch, Sifre, et al. · DeepMind · NeurIPS 2022 · arXiv:2203.15556

Two models can cost the same to train, and one comes out clearly better. What separates them is how the fixed budget was split between model size and data.

In 2020 a paper from OpenAI (Kaplan and colleagues) gave the field a recipe it followed for two years. Make the model bigger. Their scaling law said that when you get more compute, most of it should go into parameters and only a little into training data. So the models grew: GPT-3 at 175 billion parameters, Gopher at 280 billion, Megatron-Turing at 530 billion. And almost all of them were trained on about the same amount of text, around 300 billion tokens. Size went up by a factor of three. Data barely moved.

Chinchilla, from DeepMind, says that recipe was wrong, and backs it with evidence. The authors trained over 400 models, from 70 million to 16 billion parameters, across token counts from 5 billion to 500 billion (each model size paired with a range of token counts, as a fixed budget demands), and asked one question of all of them: given a fixed amount of compute, what is the best way to spend it? Their finding: model size and the number of training tokens should grow in lockstep. Double the model, double the data. The big models everyone was building were the right total cost and the wrong shape. They had too few tokens for their size.

Seeing why, and how you would ever measure such a thing, takes a few ideas. What compute actually is, and why it is six FLOPs per parameter per token. Why a fixed budget forces a trade. How a single experiment, run at one budget, reveals an optimal model size. How those experiments add up to a formula for the entire loss surface. And how that formula, minimized under the budget, gives the headline: scale both, equally.

Compute is the budget you are actually spending

The thing being rationed should be defined first. When people say a model cost some number of FLOPs, they mean floating-point operations: the raw count of multiplies and adds the hardware had to perform to train it. That number is what your money and your time buy.

For a Transformer this budget has a clean form. Training compute is, to a good approximation, six times the number of parameters times the number of training tokens:

C \approx 6\,N\,D

(1)

where $C$ is the total training FLOPs, $N$ is the parameter count, and $D$ is the number of training tokens. The factor of six is why the budget stays so simple. Pushing one token through the network touches every parameter roughly once, and each touch is a multiply and an add, so the forward pass costs about $2N$ FLOPs per token. Training also needs the backward pass, which computes gradients and costs about twice the forward pass, another $4N$ . The two-to-one ratio has a concrete source: where the forward pass does one matrix product per layer, the backward pass does two, one to get the gradient on that layer's weights and one to pass the error signal down to the layer below, and each of those is about a forward pass's worth of work. That is $2 + 4 = 6$ FLOPs per parameter per token. Multiply by $D$ tokens and you get $6ND$ .

Drag the two knobs below. The bar is the budget, split into its forward and backward parts, and the dashed line is the compute it took to train Gopher. A small model on a lot of data and a big model on a little data can sit at the exact same budget:

Figure 1 · the compute budget

params N: 71B · tokens D: 1.4T

Training compute is 2ND for the forward pass plus 4ND for the backward pass, totalling 6ND. The same budget can be reached by a big model on little data or a small model on lots of data, the same size-versus-data tradeoff the rest of the paper examines.

The $6ND$ rule ignores the part of attention that grows with sequence length, so it is an approximation, and the paper checks it: the authors also compute the exact FLOPs, counting embeddings, attention, and the rest, and find the two agree closely. For Gopher the exact figure is $5.76\times 10^{23}$ FLOPs; the $6ND$ estimate gives $5.04\times 10^{23}$ . Close enough that nothing downstream depends on which you use, and $6ND$ is the one you can do in your head.

# training compute, to a good approximation
def train_flops(N, D):          # N params, D tokens
    fwd = 2 * N * D             # 1 multiply + 1 add per param, per token
    bwd = 2 * fwd              # backward pass ~= 2x the forward pass
    return fwd + bwd           # = 6 * N * D

train_flops(70e9, 1.4e12)      # Chinchilla -> 5.88e23 FLOPs
train_flops(280e9, 300e9)      # Gopher     -> 5.04e23 FLOPs  (~same budget)

The same budget buys size or data, not both

Equation (1) is a constraint, and that constraint turns this into a real decision. Once your budget $C$ is fixed (you know how many chips you have and for how long), the product $N \cdot D$ is fixed too. So the moment you choose a model size, the number of tokens you can afford is decided for you:

D = \frac{C}{6\,N}

(2)

A bigger model leaves less budget for tokens, and more tokens force a smaller model; every choice slides along the same curve. The question is no longer how big a model you can afford. It is which point on this curve gives the lowest loss.
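A minimal sketch of equation (2), assuming the $6ND$ approximation: it walks a few model sizes along one budget and prints the tokens each can afford. The function name `tokens_for_budget` and the sample sizes are illustrative choices, not anything taken from the paper.

```python
# Sketch of equation (2): at a fixed budget C, choosing N fixes D.
# Assumes the C ~= 6*N*D approximation; the sizes below are illustrative.

def tokens_for_budget(C, N):
    """Tokens affordable for an N-parameter model under budget C (FLOPs)."""
    return C / (6 * N)

C = 5.76e23  # Gopher's training budget in FLOPs, as quoted in the text
for N in (10e9, 70e9, 280e9, 1e12):
    D = tokens_for_budget(C, N)
    print(f"N = {N:.1e} params -> D = {D:.2e} tokens, D/N = {D / N:.1f}")
```

Every row costs the same to train; only the tokens-per-parameter ratio changes, and that ratio is what the rest of the argument is about.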

There is a tension at both ends. A model that is too small never has the capacity to fit the data well, no matter how much of it you pour in; it underfits. A model that is too big spends its entire budget on parameters and is left with too few tokens to actually learn from; it is undertrained. The loss is bad for opposite reasons at the two extremes, which is the surest sign there is a sweet spot in the middle. The paper finds it three different ways. The cleanest to picture is the second.

Fix the budget, then find the valley

The experiment is called an IsoFLOP profile (iso meaning equal, so equal-FLOP). Pick one compute budget and hold it fixed. Now train a whole family of models at that budget: a tiny one (which, by equation (2), gets to see a huge number of tokens), a small one, a medium one, a large one (which can only afford a few tokens). Every model in the family cost the same to train. Plot each one's final loss against its size.

What comes out is a valley. On the left side the models are too small and underfit; on the right side they are too big for the budget and run out of tokens. A miniature makes the two failures vivid. A tiny model on a huge corpus has nothing left to learn with: its capacity fills and the rest of the text is wasted on it. A huge model on a sliver of text has nothing left to learn from: most of its parameters see too little data to converge. High loss at both ends, for opposite reasons, guarantees a bottom in between: a single model size that turned this exact budget into the lowest loss. That bottom is the compute-optimal model for that budget.

Drag the budget and watch the valley slide. Each curve is one fixed budget; the amber dots trace the bottom of each valley, the optimal size for each budget:

Figure 2 · IsoFLOP valleys

budget: 1.9e21 FLOPs

Each curve is one fixed compute budget. Too-small models underfit (left), too-big models run out of tokens (right), and the bottom of the valley is the compute-optimal size. As the budget grows the valley shifts to the right, and the optimum stays near twenty tokens per parameter.

As you drag, the valley really does have a bottom, and it is not at the edge; for every budget there is a definite best size, not just bigger is better. And as the budget grows the valley shifts to the right, toward larger models, but gently: the optimal size and the optimal token count grow at about the same rate. The little readout in the corner shows it plainly: at the bottom of every valley, the ratio of tokens to parameters stays near twenty.

The paper runs this for nine different budgets, fits a parabola to each valley to read off its minimum, and then asks how the optimal size and token count scale with the budget. It writes the answer as two power laws,

N_{\text{opt}} \propto C^{a}, \qquad D_{\text{opt}} \propto C^{b}

(3)

and from the IsoFLOP experiment it finds $a = 0.49$ and $b = 0.51$ . Both close to one half. That is the equal-scaling result, read straight off the valleys: when compute goes up, parameters and tokens should each go up by about the square root of the increase, which keeps their ratio fixed.
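The procedure can be sketched on synthetic data. This is an illustration under loud assumptions, not the paper's pipeline: the losses come from the parametric formula with the paper's printed constants rather than from real training runs, the budget range is an illustrative choice, and the parabola is the simplest three-point version. Because the data are synthetic, the recovered exponent lands near 0.45 rather than the 0.49 measured from real runs; the point is the method, not the number.

```python
# Sketch of the IsoFLOP procedure on SYNTHETIC losses (assumption:
# a parametric loss with the paper's printed constants stands in for real runs).
import math

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**ALPHA + B / D**BETA

def valley_bottom(C, step=0.1):
    """Optimal N at budget C, via a three-point parabola in log10(N)."""
    xs = [7 + step * k for k in range(51)]              # log10(N) from 7 to 12
    ys = [loss(10**x, C / (6 * 10**x)) for x in xs]     # same budget for every size
    i = min(range(1, len(xs) - 1), key=ys.__getitem__)  # lowest interior point
    y0, y1, y2 = ys[i - 1], ys[i], ys[i + 1]
    shift = 0.5 * step * (y0 - y2) / (y0 - 2 * y1 + y2)  # parabola vertex offset
    return 10 ** (xs[i] + shift)

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

budgets = [6e18 * (3e21 / 6e18) ** (k / 8) for k in range(9)]  # nine budgets
n_opt = [valley_bottom(C) for C in budgets]
a = slope([math.log10(C) for C in budgets], [math.log10(n) for n in n_opt])
b = 1 - a   # D_opt = C / (6 * N_opt), so the two exponents sum to one
print(f"a = {a:.2f}, b = {b:.2f}")
```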

A formula for the whole loss surface

The IsoFLOP experiment finds the optimum by brute force, one budget at a time. The third approach is more ambitious. It writes down a single formula that predicts the final loss for any model size and any number of tokens, fits it to all 400-plus runs at once, and then finds the optimum by calculus instead of by search. The formula is three terms:

L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

(4)

$E$ is the floor, the loss you could never beat even with an infinite model and infinite data: the irreducible entropy of language itself, the part that is unpredictable. The other two terms are regrets. The middle one, $A/N^{\alpha}$ , is the penalty for having a finite model, for being too small; a bigger $N$ shrinks it, because a larger network can represent more, but with diminishing returns set by the exponent $\alpha$ . The last, $B/D^{\beta}$ , is the penalty for finite data, for not having read enough; more tokens $D$ shrink it, because the model has seen more, again with diminishing returns set by $\beta$ .

Fitting (4) to the data (using a robust regression loss, one that downweights outliers, here the noisiest small runs, so they don't distort the fit) gives the constants. The paper reports

E = 1.69, \quad A = 406.4, \quad B = 410.7, \quad \alpha = 0.34, \quad \beta = 0.28

(5)

These are the numbers the paper printed, and a later replication found they do not quite reproduce the paper's own headline; we come back to that. The shape of (4) is the durable contribution, and the shape is what the figure below draws. It is the loss surface over the plane of model size and tokens, in log-log axes. Brighter teal is lower loss. The faint diagonals are constant-budget lines (along each one the product $N\cdot D$ is constant). The amber line is the compute-optimal frontier: for every loss level it marks the cheapest point that reaches it.

Figure 3 · the loss surface

budget: 1.0e22 FLOPs

The loss surface L(N,D) in log-log space; brighter teal is lower loss. Each faint diagonal is one constant-compute budget. The amber frontier threads the lowest-FLOP point on every loss level, and in these axes it is a straight line of slope one: a doubling of parameters calls for doubling the tokens.

The frontier being a straight line in log-log space is the equal-scaling result again, now as geometry. To see why it falls out of (4), minimize the loss along a budget line. Substitute the constraint $D = C/(6N)$ into (4), take the derivative with respect to $N$ , set it to zero, and the optimum comes out in closed form:

N_{\text{opt}}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{\text{opt}}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \quad G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}, \quad a = \frac{\beta}{\alpha+\beta},\ \ b = \frac{\alpha}{\alpha+\beta}

(6)

The exponents $a$ and $b$ always sum to one (since $C \approx 6ND$ ties them together), so the only question is whether they split evenly. They split evenly exactly when $\alpha = \beta$ , that is, when the loss falls off at the same rate whether you add parameters or add tokens. The paper finds $\alpha$ and $\beta$ close (0.34 and 0.28), so $a$ and $b$ land near one half. The reason to scale both equally is that the two regrets in equation (4) shrink at nearly the same speed, so the cheapest way to buy down loss is to pay both off at once.
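The step from (4) to (6) can be written out. The following is a sketch of that calculation under the $C = 6ND$ constraint; it also shows where the prefactor $G$ comes from.

```latex
\begin{aligned}
L(N) &= E + A\,N^{-\alpha} + B\left(\tfrac{C}{6}\right)^{-\beta} N^{\beta}
  && \text{substitute } D = C/(6N) \\
\frac{dL}{dN} &= -\alpha A\,N^{-\alpha-1}
  + \beta B\left(\tfrac{C}{6}\right)^{-\beta} N^{\beta-1} = 0 \\
N^{\alpha+\beta} &= \frac{\alpha A}{\beta B}\left(\tfrac{C}{6}\right)^{\beta} \\
N_{\text{opt}} &= \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
  \left(\tfrac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}
  = G\left(\tfrac{C}{6}\right)^{a} \\
D_{\text{opt}} &= \frac{C}{6\,N_{\text{opt}}}
  = G^{-1}\left(\tfrac{C}{6}\right)^{1-a}
  = G^{-1}\left(\tfrac{C}{6}\right)^{b}
\end{aligned}
```

The last line is why $a + b = 1$: once $N_{\text{opt}}$ is fixed, the budget leaves no freedom in $D_{\text{opt}}$.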

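# given a compute budget C, find the compute-optimal split
from collections import namedtuple

Point = namedtuple("Point", "N D loss")

# defaults are the constants printed in equation (5)
def optimal_split(C, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    best = None
    for k in range(501):                   # sweep model sizes, 1e7 to 1e12
        N = 10 ** (7 + k / 100)            # log-spaced, 100 points per decade
        D = C / (6 * N)                    # tokens this budget allows
        L = E + A / N**alpha + B / D**beta   # predicted final loss
        if best is None or L < best.loss:
            best = Point(N=N, D=D, loss=L)
    return best       # D/N is near 20 with the re-fit constants, higher with these defaults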

Scale both, equally

All three approaches (the training-curve envelope, the IsoFLOP valleys, and the parametric fit) agree to within a hair. As compute grows, model size and training tokens should grow in equal proportion. Stated as a rule of thumb, the compute-optimal number of training tokens is about twenty times the parameter count. The compute-optimal split for a 1B model is roughly 20B tokens; for a 70B model it is roughly 1.4T.

The figure below summarizes the disagreement with the earlier work. Kaplan's law put the exponents at $a = 0.73$ for size and $b = 0.27$ for data: pour most new compute into parameters. Chinchilla puts both near $0.5$ . The gap looks small written down and is enormous in practice. Drag the slider to spend more and more compute and watch the two policies diverge. Under Kaplan the model balloons and the tokens-per-parameter ratio collapses, leaving a giant model with far too few tokens. Under Chinchilla the ratio holds:

Figure 4 · two ways to spend more compute

more compute: 1k×

Both columns start from the same compute-optimal point and then spend a larger budget. Under Kaplan parameters race ahead of tokens, so the model has too few tokens for its size. Under Chinchilla parameters and tokens grow together, holding the ratio near twenty tokens per parameter.
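The divergence can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the starting point (1B parameters, 20B tokens) and the helper `scale_up` are illustrative, and the Chinchilla exponents are rounded to exactly one half.

```python
# Sketch: spend k times more compute under each scaling policy.
# The starting point and the helper name are illustrative assumptions.

def scale_up(N0, D0, k, a, b):
    """Grow compute by a factor k, with N ~ C^a and D ~ C^b."""
    return N0 * k**a, D0 * k**b

N0, D0 = 1e9, 20e9   # start at 20 tokens per parameter
k = 1000             # 1000x more compute

results = {}
for name, a, b in (("Kaplan", 0.73, 0.27), ("Chinchilla", 0.5, 0.5)):
    N, D = scale_up(N0, D0, k, a, b)
    results[name] = (N, D)
    print(f"{name:10s} N = {N:.2e}  D = {D:.2e}  tokens/param = {D / N:.2f}")
```

Both policies spend the same thousandfold budget, since the exponents sum to one in each case. Under the Kaplan exponents the ratio falls below one token per parameter; under equal exponents it stays at twenty.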

Why did the earlier study get a different answer? The paper is direct about it. Kaplan's models all used one fixed learning-rate schedule, tuned for a long run, and were then read off at shorter horizons. The schedule the field used is a cosine curve: the learning rate starts high and is dialed smoothly down to a small floor over the planned length of the run. Much of a run's final improvement arrives in that last stretch as the rate approaches the floor, so a run measured mid-decay (before the rate has been wound down) has simply not been given its ending yet, and its loss reads too high. The shortfall is largest for the shortest, most data-light runs, which makes them look worse than they are and nudges the conclusion toward bigger models. Chinchilla instead matches the schedule length to each run, and the bias goes away. The recipe everyone followed for two years rested on a measurement artifact.
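A small sketch makes the mechanism concrete. The function `cosine_lr`, the peak and floor values, and the step counts are all hypothetical; only the cosine shape comes from the text. It compares the learning rate at the moment of measurement under a schedule matched to the run and under one planned for a run four times longer.

```python
# Sketch of the schedule mismatch. All numbers here are illustrative
# assumptions; only the cosine shape is taken from the text.
import math

def cosine_lr(step, total_steps, peak=1.0, floor=0.1):
    """Cosine decay from `peak` to `floor` over `total_steps`, then flat."""
    progress = min(step / total_steps, 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

horizon = 10_000                                   # step at which loss is read off
matched = cosine_lr(horizon, total_steps=horizon)
too_long = cosine_lr(horizon, total_steps=4 * horizon)
print(f"matched schedule: lr = {matched:.3f}")
print(f"4x-long schedule: lr = {too_long:.3f}")
```

With the matched schedule the rate has reached its floor at the measurement point. With the longer schedule it is still most of the way up the curve, so the run is read before its final stretch of improvement.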

Figure 5 · the schedule artifact

measure at · Δ ≈ 0.38

A schematic of the artifact, not paper data; the shape carries the argument, not the exact numbers. The same run under a cosine schedule matched to its horizon decays fully and settles lower; under a schedule tuned for a much longer run and read mid-decay, the loss reads high, worst at the shortest horizons. Drag the measurement point; the gap closes as the horizon reaches the schedule's length.

Chinchilla vs Gopher, the controlled test

A scaling law is a prediction, and the way to test a prediction is to build the model it points to. The team took Gopher's exact compute budget, $5.76\times 10^{23}$ FLOPs, and worked out from their own analysis what shape of model it should buy. The compute-optimal shape turned out to be small: somewhere between 40 and 70 billion parameters, trained on well over a trillion tokens, not 280 billion parameters on 300 billion tokens. So they built it. A 70B model trained on 1.4 trillion tokens, the same total cost as Gopher, redistributed. They named it Chinchilla.

The budget check works out. Gopher is $6 \times 280\text{B} \times 300\text{B} \approx 5.0\times 10^{23}$ FLOPs by the $6ND$ rule; Chinchilla is $6 \times 70\text{B} \times 1.4\text{T} \approx 5.9\times 10^{23}$ . The same budget, give or take, split four-to-one the other way. Gopher spent it on size and read 300 billion tokens. Chinchilla is one quarter the size and read about 4.7 times as much text. Twenty tokens per parameter, almost exactly.

With both substituted into the loss formula, it predicts Chinchilla should win: about $1.94$ for Chinchilla against $1.99$ for Gopher. Lower loss for the smaller model, at equal cost. The experiment confirmed it. Chinchilla beat Gopher on the language-modeling loss and on the great majority of downstream tasks, and it beat GPT-3, AI21's 178-billion-parameter Jurassic-1, and the 530-billion-parameter Megatron-Turing too, all of which it is smaller than. On MMLU, a broad multiple-choice exam across academic subjects, Chinchilla reached 67.5% against Gopher's 60%, a jump of more than seven points from a model with a quarter of the parameters.
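The two predicted losses can be reproduced directly. A minimal sketch, assuming the parametric form of equation (4) with the constants printed in equation (5); the function name `predicted_loss` is an illustrative choice.

```python
# Sketch: plug both models into L(N, D) = E + A/N^alpha + B/D^beta,
# using the constants as printed in the paper.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N, D):
    return E + A / N**ALPHA + B / D**BETA

gopher = predicted_loss(280e9, 300e9)       # 280B params, 300B tokens
chinchilla = predicted_loss(70e9, 1.4e12)   # 70B params, 1.4T tokens
print(f"Gopher     {gopher:.2f}")       # 1.99
print(f"Chinchilla {chinchilla:.2f}")   # 1.94
```

Gopher's model-size penalty is the smaller of the two, but its data penalty is large enough to outweigh that advantage.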

And the smaller model keeps paying off. Inference cost scales with parameter count, so a model that is four times smaller is roughly four times cheaper to run, every single time it is used, forever.

What it changed, and the one table that was wrong

It moved the field's default. Before Chinchilla, more compute meant a bigger model. After Chinchilla, more compute meant a bigger model and more data, in step, and the twenty-to-one rule of thumb became the starting point for nearly every model that followed. The undertrained giants of 2021 look, in hindsight, like the last models built under a mistaken law.

There is one important wrinkle. In 2024 a replication (Besiroglu and colleagues at Epoch AI) re-extracted the data from the paper's own figures, re-fit the parametric model of equation (4), and found that the specific constants the paper printed in equation (5) do not actually reproduce its conclusion: minimized under the budget, those numbers imply roughly seventy tokens per parameter, not twenty, contradicting the paper's own first two approaches and the recipe it used to build Chinchilla. The reported confidence intervals were also implausibly tight, tight enough to need hundreds of thousands of runs rather than a few hundred. Re-fitting the data restored the familiar number, about twenty tokens per parameter, in line with everything else.

So the headline is solid and one table in the appendix is not. The two empirical approaches (the IsoFLOP valleys and the training-curve envelope) and the Chinchilla model itself all point to scaling both equally at about twenty to one. The parametric formula is a real and useful object; the particular constants the paper happened to print for it were off. There is a structural reason the constants are this touchy: equation (6) feeds $\alpha$ and $\beta$ through the ratios $a = \beta/(\alpha+\beta)$ and $b = \alpha/(\alpha+\beta)$ , and the two fitted exponents are close to each other, so the recommended split hangs on a small difference between two similar numbers, and a small fitting error in either one moves the implied tokens-per-parameter a long way. The figures in this piece use the corrected constants, which is why their valleys sit at twenty, not seventy. It is a good reminder that a paper's conclusion and any one of its fitted numbers are separate things, and both deserve checking.
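The touchiness is easy to see numerically. In this sketch the second value of $\beta$ (0.30) is a hypothetical nudge chosen for illustration, not the replication's re-fit value, and holding $A$ and $B$ fixed is a simplification, since a real re-fit would move them too. The prefactor $G = (\alpha A/\beta B)^{1/(\alpha+\beta)}$ is the one the closed-form minimization yields.

```python
# Sketch: how far the implied tokens-per-parameter moves when one exponent
# shifts slightly. beta = 0.30 is a HYPOTHETICAL perturbation, and A, B are
# held fixed for simplicity.

def tokens_per_param(C, A, B, alpha, beta):
    """D_opt / N_opt from the closed-form optimum under C = 6*N*D."""
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    return (C / 6) ** (b - a) / G**2

C = 5.76e23   # Gopher's budget, as quoted in the text
printed = tokens_per_param(C, 406.4, 410.7, 0.34, 0.28)
nudged = tokens_per_param(C, 406.4, 410.7, 0.34, 0.30)
print(f"beta = 0.28 -> {printed:.0f} tokens/param")
print(f"beta = 0.30 -> {nudged:.0f} tokens/param")
```

A change of 0.02 in one exponent moves the ratio by several times at this budget. Because $a \neq b$ whenever $\alpha \neq \beta$, the ratio also drifts with the budget, so any single tokens-per-parameter figure quoted for a set of constants depends on where it is evaluated.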

The authors name the limits plainly. The analysis rests on a power-law fit, and at the very largest budgets they see a slight bend in the curve, which hints the truly optimal models might be even smaller than the straight line predicts. There were only two runs at full scale, Chinchilla and Gopher, so the decisive test is a single A/B comparison, convincing but not a sweep. Every run saw each token roughly once, so nothing here speaks to what happens when you train for many passes over a smaller corpus. And the trillions of tokens required by the law have to come from somewhere, which turns data quality and data collection into first-class problems rather than afterthoughts.

The argument follows directly from the budget. Compute is six FLOPs per parameter per token, so a budget fixes the product of size and data. A fixed budget has a best split, visible as the bottom of an IsoFLOP valley. Fit the loss surface and that split has a closed form, with size and tokens carrying nearly equal exponents. So when more compute arrives, grow both, about twenty tokens for every parameter. The giants of 2021 were the right price and the wrong shape. They had too few tokens for their size.

Provenance · Verified against primary literature

Chinchilla (2022) · Hoffmann et al.: the three approaches, the 6ND budget, the parametric loss L = E + A/N^α + B/D^β, and the 70B / 1.4T Chinchilla model.

Kaplan et al. (2020) · The earlier scaling law this paper corrects: Nₒₚₜ ∝ C^0.73, Dₒₚₜ ∝ C^0.27 (scale mostly size).

Gopher (2021) · Rae et al.: the 280B model trained on 300B tokens that Chinchilla matches in compute and beats in quality.

Epoch AI replication (2024) · Besiroglu et al. re-fit Approach 3 from the paper’s own figure data; the figures here use their corrected constants.

Correction · The paper’s Approach 3 prints fitted constants (A=406.4, B=410.7, α=0.34, β=0.28) whose closed-form optimum implies roughly 70 tokens per parameter, contradicting its own Approaches 1 and 2 and the 20-to-1 rule used to build Chinchilla. A 2024 replication (Epoch AI, arXiv:2404.10102) re-fit the data and recovered about 20. The figures here use the corrected constants so they match the headline; the formula in the prose shows the published numbers, with this discrepancy called out.

Questions you might still have

If the answer is "20 tokens per parameter," why train any model that isn’t exactly compute-optimal?
Because compute-optimal only counts training. A model is also paid for at inference, every time it is used. A smaller model trained past its compute-optimal point is more expensive to train but cheaper to serve forever after, which is why production models are now routinely trained well beyond 20 tokens per parameter.

Where does the factor of 6 in C = 6ND come from?
Two FLOPs (a multiply and an add) per parameter per token on the forward pass, and the backward pass costs about twice the forward, so 2 + 4 = 6. It ignores attention’s sequence-length term, which is small for these model sizes, so it is an approximation the paper checks against an exact count.

Does the famous parametric formula actually reproduce the 20-to-1 rule?
Not with the constants the paper printed. A 2024 replication showed those numbers imply roughly 70 tokens per parameter, at odds with the paper’s own first two methods. Re-fitting the data restores about 20. The headline conclusion holds; one set of fitted constants in the paper does not.

Footnotes & further reading

The paper: Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, et al., Training Compute-Optimal Large Language Models (DeepMind, NeurIPS 2022).
The earlier scaling law this paper revises: Kaplan, McCandlish, et al., Scaling Laws for Neural Language Models (OpenAI, 2020), which recommended putting most new compute into model size.
Gopher, the 280B baseline Chinchilla matches in compute and beats in quality: Rae et al., Scaling Language Models: Methods, Analysis & Insights from Training Gopher.
The replication that re-fit Approach 3 and flagged the inconsistent constants: Besiroglu, Erdil, Barnett, You, Chinchilla Scaling: A Replication Attempt (Epoch AI, 2024). The corrected constants used in the figures here come from this work.
Why production models are now trained well past the compute-optimal point: Sardana et al., Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws.
MMLU, the multiple-choice exam benchmark Chinchilla set a record on: Hendrycks et al., Measuring Massive Multitask Language Understanding.
