Verified · arXiv:2203.15556 · 19 min
Scaling · LLMs

Training Compute-Optimal Large Language Models

Most big models were trained far too small.

With a fixed compute budget you can build a bigger model or feed it more data, not both. For years the field spent almost everything on size. Chinchilla works out the right split, and the answer is to grow both together, roughly twenty tokens for every parameter.

Explaining the paper: Training Compute-Optimal Large Language Models · Hoffmann, Borgeaud, Mensch, Sifre, et al. · DeepMind · NeurIPS 2022 · arXiv:2203.15556

You have a fixed pile of compute. Do you buy a bigger model, or more data to train it on?

In 2020 a paper from OpenAI (Kaplan and colleagues) gave the field a recipe it followed for two years. Make the model bigger. Their scaling law said that when you get more compute, most of it should go into parameters and only a little into training data. So the models grew: GPT-3 at 175 billion parameters, Gopher at 280 billion, Megatron-Turing at 530 billion. And almost all of them were trained on about the same amount of text, around 300 billion tokens. Size went up by a factor of three. Data barely moved.

Chinchilla, from DeepMind, says that recipe was wrong, and it says so with receipts. The authors trained over 400 models, from 70 million to 16 billion parameters, across token counts from 5 billion to 500 billion, and asked one question of all of them: given a fixed amount of compute, what is the best way to spend it? The answer is that model size and the number of training tokens should grow in lockstep. Double the model, double the data. The big models everyone was building were the right total cost and the wrong shape. They were starved for data.

To see why, and to see how you would ever measure such a thing, we build a short tower. What compute actually is, and why it is six FLOPs per parameter per token. Why a fixed budget forces a trade. How a single experiment, run at one budget, reveals an optimal model size. How those experiments add up to a formula for the whole loss surface. And how that formula, minimized under the budget, gives the headline: scale both, equally.

Compute is the budget you are actually spending

Start with the thing being rationed. When people say a model cost some number of FLOPs, they mean floating-point operations: the raw count of multiplies and adds the hardware had to perform to train it. That number is what your money and your time buy. It is the budget.

The useful fact is that for a Transformer this budget has a clean form. Training compute is, to a good approximation, six times the number of parameters times the number of training tokens:

$$C \approx 6\,N\,D \tag{1}$$

where $C$ is the total training FLOPs, $N$ is the parameter count, and $D$ is the number of training tokens. The factor of six is worth pinning down, because it is the whole reason the budget is so simple. Pushing one token through the network touches every parameter roughly once, and each touch is a multiply and an add, so the forward pass costs about $2N$ FLOPs per token. Training also needs the backward pass, which computes gradients and costs about twice the forward pass, another $4N$. That is $2 + 4 = 6$ FLOPs per parameter per token. Multiply by $D$ tokens and you get $6ND$.

Drag the two knobs below. The bar is the budget, split into its forward and backward parts, and the dashed line is the compute it took to train Gopher. Notice that a small model on a lot of data and a big model on a little data can sit at the exact same budget:

Figure 1 · the compute budget
Training compute is $2ND$ for the forward pass plus $4ND$ for the backward pass, totalling $6ND$. The same budget can be reached by a big model on little data or a small model on lots of data, which is exactly the trade the rest of the paper is about.

One caveat the paper is careful about. The $6ND$ rule ignores the part of attention that grows with sequence length, so it is an approximation. The authors also compute the exact FLOPs, counting embeddings, attention, and the rest, and find the two agree closely. For Gopher the exact figure is $5.76\times 10^{23}$ FLOPs; the $6ND$ estimate gives $5.04\times 10^{23}$. Close enough that nothing downstream depends on which you use, and $6ND$ is the one you can do in your head.

```python
# training compute, to a good approximation
def train_flops(N, D):          # N params, D tokens
    fwd = 2 * N * D             # 1 multiply + 1 add per param, per token
    bwd = 2 * fwd               # backward pass ~= 2x the forward pass
    return fwd + bwd            # = 6 * N * D

train_flops(70e9, 1.4e12)       # Chinchilla -> 5.88e23 FLOPs
train_flops(280e9, 300e9)       # Gopher     -> 5.04e23 FLOPs  (~same budget)
```

The same budget buys size or data, not both

Equation (1) is a constraint, and a constraint is what makes this a real decision. Once your budget $C$ is fixed (you know how many chips you have and for how long), the product $N \cdot D$ is fixed too. So the moment you choose a model size, the number of tokens you can afford is decided for you:

$$D = \frac{C}{6\,N} \tag{2}$$

Spend on a bigger model and there is less budget left for tokens, so you train it on less text. Spend on more tokens and you have to shrink the model to pay for them. Every choice slides along the same curve. The question is no longer how big a model can I afford, it is which point on this curve gives the lowest loss.
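As a rough illustration of equation (2), the sketch below fixes one budget and lists how many tokens each model size can afford. The function name `tokens_for_budget` and the sample model sizes are illustrative choices, not taken from the paper; the budget is the exact Gopher figure quoted earlier.

```python
def tokens_for_budget(C, N):
    """Tokens affordable at budget C (FLOPs) for an N-parameter model: D = C / (6N)."""
    return C / (6 * N)

C = 5.76e23  # Gopher's training budget in FLOPs
for N in (10e9, 70e9, 280e9, 1e12):  # illustrative model sizes
    D = tokens_for_budget(C, N)
    print(f"N = {N:.0e} params -> D = {D:.2e} tokens ({D / N:.1f} tokens per parameter)")
```

Every row costs the same to train; only the shape changes.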

And there is a genuine tension at both ends. A model that is too small never has the capacity to fit the data well, no matter how much of it you pour in; it underfits. A model that is too big eats the whole budget on parameters and is left with too few tokens to actually learn from; it is undertrained. The loss is bad for opposite reasons at the two extremes, which is the surest sign there is a sweet spot in the middle. The paper finds it three different ways. The cleanest to picture is the second.

Fix the budget, then find the valley

Here is the experiment, called an IsoFLOP profile (iso meaning equal, so equal-FLOP). Pick one compute budget and hold it fixed. Now train a whole family of models at that budget: a tiny one (which, by equation (2), gets to see a huge number of tokens), a small one, a medium one, a large one (which can only afford a few tokens). Every model in the family cost the same to train. Plot each one's final loss against its size.

What comes out is a valley. On the left side the models are too small and underfit; the loss is high. On the right side the models are too big for the budget, run out of tokens, and the loss is high again. In between there is a clear bottom, a single model size that turned this exact budget into the lowest loss. That bottom is the compute-optimal model for that budget.

Drag the budget and watch the valley slide. Each curve is one fixed budget; the amber dots trace the bottom of each valley, the optimal size for each budget:

Figure 2 · IsoFLOP valleys
Each curve is one fixed compute budget. Too-small models underfit (left), too-big models starve for tokens (right), and the bottom of the valley is the compute-optimal size. As the budget grows the valley shifts toward larger models, and the optimum stays near twenty tokens per parameter.

Two things are worth noticing as you drag. First, the valley really does have a bottom, and it is not at the edge; for every budget there is a definite best size, not just bigger is better. Second, as the budget grows the whole valley marches to the right, toward larger models, but it does so gently. The optimal size grows, the optimal token count grows, and they grow at about the same rate. The little readout in the corner says it plainly: at the bottom of every valley, the ratio of tokens to parameters stays near twenty.

The paper runs this for nine different budgets, fits a parabola to each valley to read off its minimum, and then asks how the optimal size and token count scale with the budget. It writes the answer as two power laws,

$$N_{\text{opt}} \propto C^{a}, \qquad D_{\text{opt}} \propto C^{b} \tag{3}$$

and from the IsoFLOP experiment it finds $a = 0.49$ and $b = 0.51$. Both close to one half. That is the equal-scaling result, read straight off the valleys: when compute goes up, parameters and tokens should each go up by about the square root of the increase, which keeps their ratio fixed.
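The fitting procedure can be sketched in a few lines. The paper's raw runs are not reproduced here, so this sketch generates synthetic losses from the parametric formula the paper fits in its third approach, using the constants it printed; that substitution, the budget range, and the helper names `loss` and `valley_minimum` are all assumptions made for illustration. The recovered exponents therefore reflect those constants, not the measured 0.49 and 0.51.

```python
import numpy as np

# Assumption: synthetic losses from the paper's parametric form with its printed constants.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**ALPHA + B / D**BETA

def valley_minimum(C):
    """Fit a parabola in log10(N) around the bottom of one IsoFLOP profile; return its vertex."""
    log_n = np.linspace(6.0, 13.0, 701)             # log10 of candidate model sizes
    losses = loss(10**log_n, C / (6 * 10**log_n))   # every model costs exactly C
    i = int(np.argmin(losses))
    window = slice(max(i - 50, 0), i + 51)          # about half a decade either side
    c2, c1, _ = np.polyfit(log_n[window], losses[window], 2)
    return 10 ** (-c1 / (2 * c2))                   # vertex of the fitted parabola

budgets = np.logspace(np.log10(6e18), np.log10(3e21), 9)   # nine illustrative budgets
n_opt = np.array([valley_minimum(C) for C in budgets])
d_opt = budgets / (6 * n_opt)

# Power-law exponents are the slopes of straight-line fits in log-log space.
a = np.polyfit(np.log10(budgets), np.log10(n_opt), 1)[0]
b = np.polyfit(np.log10(budgets), np.log10(d_opt), 1)[0]
print(f"a = {a:.2f}, b = {b:.2f}")
```

Because $D_{\text{opt}}$ is computed from the budget constraint, the two fitted slopes sum to one by construction.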

A formula for the whole loss surface

The IsoFLOP experiment finds the optimum by brute force, one budget at a time. The third approach is more ambitious. It writes down a single formula that predicts the final loss for any model size and any number of tokens, fits it to all 400-plus runs at once, and then finds the optimum by calculus instead of by search. The formula is three terms:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \tag{4}$$

Read each piece. $E$ is the loss you could never beat even with an infinite model and infinite data: the irreducible entropy of language itself, the part that is genuinely unpredictable. The middle term $A/N^{\alpha}$ is the penalty for having a finite model; a bigger $N$ shrinks it, because a larger network can represent more, but with diminishing returns set by the exponent $\alpha$. The last term $B/D^{\beta}$ is the penalty for finite data; more tokens $D$ shrink it, because the model has seen more, again with diminishing returns set by $\beta$. Loss is the floor plus two regrets, one for being too small and one for not having read enough.

Fitting (4) to the data (with a robust loss that downweights the noisiest small runs) gives the constants. The paper reports

$$E = 1.69, \quad A = 406.4, \quad B = 410.7, \quad \alpha = 0.34, \quad \beta = 0.28 \tag{5}$$

These are the numbers the paper printed, and they are worth treating with a little care; we come back to them in a moment, because a later replication found they do not quite reproduce the paper's own headline. The shape of (4) is the durable contribution, and the shape is what the figure below draws. It is the loss surface over the plane of model size and tokens, in log-log axes. Brighter teal is lower loss. The faint diagonals are constant-budget lines (each one is an $N\cdot D = \text{const}$ trade from the last section). The amber line is the compute-optimal frontier: for every loss level it marks the cheapest point that reaches it.

Figure 3 · the loss surface
The loss surface $L(N,D)$ in log-log space; brighter teal is lower loss. Each faint diagonal is one constant-compute budget. The amber frontier threads the lowest-FLOP point on every loss level, and in these axes it is a straight line of slope close to one: a doubling of parameters wants a doubling of tokens.

The frontier being a straight line of slope close to one in log-log space is the equal-scaling result again, now as geometry. To see why it falls out of (4), minimize the loss along a budget line. Substitute the constraint $D = C/(6N)$ into (4), take the derivative with respect to $N$, set it to zero, and the optimum comes out in closed form:

$$N_{\text{opt}}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{\text{opt}}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \quad a = \frac{\beta}{\alpha+\beta},\ \ b = \frac{\alpha}{\alpha+\beta} \tag{6}$$

Here $G = \left(\alpha A / \beta B\right)^{1/(\alpha+\beta)}$ is a constant fixed by the fit. The exponents $a$ and $b$ always sum to one (since $C \approx 6ND$ ties them together), so the only question is whether they split evenly. They split evenly when $\alpha = \beta$, and nearly evenly when the two are close, that is, when the loss falls off at about the same rate whether you add parameters or add tokens. The paper finds $\alpha$ and $\beta$ close (0.34 and 0.28), so $a$ and $b$ land near one half. The reason to scale both equally is that the two regrets in equation (4) shrink at nearly the same speed, so the cheapest way to buy down loss is to pay both off at once.
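The intermediate steps are short enough to write out. This is a sketch of the standard derivation, treating $C = 6ND$ as exact rather than approximate:

```latex
\begin{aligned}
L(N) &= E + A\,N^{-\alpha} + B\left(\frac{6N}{C}\right)^{\beta}
  && \text{substitute } D = \frac{C}{6N} \text{ into (4)} \\
\frac{dL}{dN} &= -\alpha A\,N^{-\alpha-1} + \beta B\left(\frac{6}{C}\right)^{\beta} N^{\beta-1} = 0
  && \text{first-order condition} \\
N^{\alpha+\beta} &= \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta}
  && \text{collect powers of } N \\
N_{\text{opt}} &= G\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
  \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \\
D_{\text{opt}} &= \frac{C}{6\,N_{\text{opt}}} = G^{-1}\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}
  && \text{back through the constraint}
\end{aligned}
```

The exponent on $N_{\text{opt}}$ carries $\beta$ and the exponent on $D_{\text{opt}}$ carries $\alpha$: the slower the data penalty decays, the more of the budget goes to tokens.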

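```python
# constants as printed in equation (5); the corrected re-fit uses different values
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

# given a compute budget C, find the compute-optimal split by brute-force sweep
def optimal_split(C, steps=2000):
    best = None
    for i in range(steps + 1):
        N = 10 ** (7 + 5 * i / steps)          # sweep model sizes, 1e7 to 1e12, log-spaced
        D = C / (6 * N)                        # tokens this budget allows
        L = E + A / N**alpha + B / D**beta     # predicted final loss, equation (4)
        if best is None or L < best[2]:
            best = (N, D, L)
    return best   # (N, D, loss); D/N lands near 20 only with the corrected constants
```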

Scale both, equally

All three approaches (the training-curve envelope, the IsoFLOP valleys, and the parametric fit) agree to within a hair. As compute grows, model size and training tokens should grow in equal proportion. Stated as a rule of thumb, the compute-optimal number of training tokens is about twenty times the parameter count. A 1B model wants roughly 20B tokens; a 70B model wants roughly 1.4T.

This is the whole disagreement with the earlier work in one picture. Kaplan's law put the exponents at $a = 0.73$ for size and $b = 0.27$ for data: pour most new compute into parameters. Chinchilla puts both near $0.5$. The gap looks small written down and is enormous in practice. Drag the slider to spend more and more compute and watch the two policies diverge. Under Kaplan the model balloons and the tokens-per-parameter ratio collapses, leaving a giant model starved for data. Under Chinchilla the ratio holds:

Figure 4 · two ways to spend more compute
Both columns start from the same compute-optimal point and then spend a larger budget. Under Kaplan parameters race ahead of tokens, so the model starves for data. Under Chinchilla parameters and tokens grow together, holding the ratio near twenty tokens per parameter.
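The divergence is easy to reproduce numerically. The sketch below applies the two sets of exponents to the same starting point; the starting point (1B parameters, 20B tokens), the 1000x multiplier, and the helper name `scale_up` are illustrative assumptions.

```python
def scale_up(N0, D0, k, a, b):
    """Scale a starting point (N0, D0) to k times the compute, with N ~ C^a and D ~ C^b."""
    return N0 * k**a, D0 * k**b

N0, D0 = 1e9, 20e9   # hypothetical starting point at 20 tokens per parameter
k = 1000             # spend 1000x more compute

for name, a, b in (("Kaplan", 0.73, 0.27), ("Chinchilla", 0.5, 0.5)):
    N, D = scale_up(N0, D0, k, a, b)
    print(f"{name:10s} N = {N:.2e}  D = {D:.2e}  tokens/param = {D / N:.2f}")
```

Both policies spend the same total compute, since each pair of exponents sums to one; they differ only in how the product $N \cdot D$ is split.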

Why did the earlier study get a different answer? The paper is direct about it. Kaplan's models all used one fixed learning-rate schedule, tuned for a long run, and were then read off at shorter horizons. A cosine schedule that has not finished decaying gives a loss estimate that is too high, which makes shorter (more data-light) runs look worse than they are, which nudges the conclusion toward bigger models. Chinchilla instead matches the schedule length to each run, and the bias goes away. The recipe everyone followed for two years rested on a measurement artifact.
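A small sketch shows the size of the mismatch. It assumes a plain cosine decay to one tenth of the peak learning rate and a hypothetical 100,000-step horizon; the function `cosine_lr` is written for this illustration and is not taken from either paper's code.

```python
import math

def cosine_lr(step, horizon, peak=1.0, floor_frac=0.1):
    """Cosine decay from `peak` to `floor_frac * peak` over `horizon` steps."""
    floor = floor_frac * peak
    progress = min(step / horizon, 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

long_horizon = 100_000   # schedule tuned for a long run (hypothetical)
for stop in (25_000, 50_000, 100_000):
    mismatched = cosine_lr(stop, long_horizon)   # read off early from the long schedule
    matched = cosine_lr(stop, stop)              # schedule sized to this run
    print(f"stop at {stop:>7,}: lr = {mismatched:.2f} of peak (mismatched) vs {matched:.2f} (matched)")
```

A run stopped a quarter of the way through the long schedule is still training at most of its peak learning rate, so its loss has not settled the way a run with a matched schedule would have.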

Chinchilla vs Gopher, the controlled test

A scaling law is a prediction, and the right way to test a prediction is to bet on it. The team took Gopher's exact compute budget, $5.76\times 10^{23}$ FLOPs, and asked their own analysis what shape of model it should buy. The answer came back small: somewhere between 40 and 70 billion parameters, trained on well over a trillion tokens, not 280 billion parameters on 300 billion tokens. So they built it. A 70B model trained on 1.4 trillion tokens, the same total cost as Gopher, redistributed. They named it Chinchilla.

Run the budget check. Gopher is $6 \times 280\text{B} \times 300\text{B} \approx 5.0\times 10^{23}$ FLOPs by the $6ND$ rule; Chinchilla is $6 \times 70\text{B} \times 1.4\text{T} \approx 5.9\times 10^{23}$. The same budget, give or take, split four-to-one the other way. Gopher spent it on size and read 300 billion tokens. Chinchilla is one quarter the size and read about 4.7 times as much text. Twenty tokens per parameter, almost exactly.

Plug both into the loss formula and it predicts Chinchilla should win: about $1.94$ for Chinchilla against $1.99$ for Gopher. Lower loss for the smaller model, at equal cost. The experiment confirmed it. Chinchilla beat Gopher on the language-modeling loss and on the great majority of downstream tasks, and it beat GPT-3, Jurassic-1, and the 530-billion-parameter Megatron-Turing too, all of which it is smaller than. On MMLU, a broad multiple-choice exam across academic subjects, Chinchilla reached 67.5% against Gopher's 60%, a jump of more than seven points from a model with a quarter of the parameters.
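The two predicted losses can be checked directly from equation (4). This sketch uses the constants as printed in equation (5), which is what the 1.94 and 1.99 figures correspond to; the helper name `loss_terms` is an illustrative choice.

```python
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28  # constants as printed in equation (5)

def loss_terms(N, D):
    """Return (floor, model-size penalty, data penalty) from equation (4)."""
    return E, A / N**ALPHA, B / D**BETA

for name, N, D in (("Chinchilla", 70e9, 1.4e12), ("Gopher", 280e9, 300e9)):
    floor, size_term, data_term = loss_terms(N, D)
    total = floor + size_term + data_term
    print(f"{name:10s} size penalty = {size_term:.3f}  data penalty = {data_term:.3f}  loss = {total:.2f}")
```

Splitting the loss into its terms shows where the gap comes from: Gopher's data penalty is by far the larger of its two regrets, which is "starved for data" stated in numbers.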

And the smaller model keeps paying off. Inference cost scales with parameter count, so a model that is four times smaller is roughly four times cheaper to run, every single time it is used, forever. The compute you save by not over-sizing the model is not a one-time training discount. It is a permanent tax cut on every query.
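To put rough numbers on this, the sketch below reuses the forward-pass estimate of about $2N$ FLOPs per token for inference and adds it to the $6ND$ training cost. The served-token volume is a hypothetical figure chosen only for illustration, and the estimate ignores attention's sequence-length term and any serving overheads.

```python
def train_flops(N, D):
    return 6 * N * D                 # forward plus backward pass

def inference_flops(N, tokens_served):
    return 2 * N * tokens_served     # forward pass only, about 2N FLOPs per token

tokens_served = 1e12                 # hypothetical lifetime inference volume
for name, N, D in (("Gopher", 280e9, 300e9), ("Chinchilla", 70e9, 1.4e12)):
    train = train_flops(N, D)
    serve = inference_flops(N, tokens_served)
    print(f"{name:10s} train = {train:.2e}  serve = {serve:.2e}  total = {train + serve:.2e}")
```

The training costs are close, but the serving cost differs by the full factor of four and keeps growing with every token served.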

So what does it actually do

It moved the whole field's default. Before Chinchilla, more compute meant a bigger model. After Chinchilla, more compute meant a bigger model and more data, in step, and the twenty-to-one rule of thumb became the starting point for nearly every model that followed. The undertrained giants of 2021 look, in hindsight, like the last models built under a mistaken law.

There is a sharp twist worth being honest about, because it is a clean lesson in reading papers carefully. In 2024 a replication (Besiroglu and colleagues at Epoch AI) re-extracted the data from the paper's own figures and re-fit the parametric model of equation (4). They found that the specific constants the paper printed in equation (5) do not actually reproduce its conclusion: minimized under the budget, those numbers imply roughly seventy tokens per parameter, not twenty, contradicting the paper's own first two approaches and the recipe it used to build Chinchilla. They also noted the reported confidence intervals were implausibly tight, tight enough to need hundreds of thousands of runs rather than a few hundred. Re-fitting the data restored the familiar number, about twenty tokens per parameter, in line with everything else.
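The inconsistency can be checked from the closed form in equation (6). The sketch below plugs in the constants as printed in equation (5), rounded as they appear there; the function name `closed_form_split` and the sample budgets are illustrative. Because $a$ and $b$ are not exactly equal under these constants, the implied ratio drifts with the budget, so treat any single number as approximate.

```python
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28   # constants as printed in equation (5)

def closed_form_split(C):
    """Compute-optimal (N, D) from equation (6) under the budget C = 6ND."""
    a = BETA / (ALPHA + BETA)
    b = ALPHA / (ALPHA + BETA)
    G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    return G * (C / 6) ** a, (C / 6) ** b / G

for C in (1e21, 1e22, 5.76e23):   # illustrative budgets; the last is Gopher's
    N, D = closed_form_split(C)
    print(f"C = {C:.2e}: N = {N:.2e}, D = {D:.2e}, tokens per parameter = {D / N:.0f}")
```

At every budget in this range the printed constants put the optimum at several times twenty tokens per parameter, which is the discrepancy the replication flagged.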

So the headline is solid and one table in the appendix is not. The two empirical approaches (the IsoFLOP valleys and the training-curve envelope) and the Chinchilla model itself all point to scaling both equally at about twenty to one. The parametric formula is a real and useful object; the particular constants the paper happened to print for it were off. The figures in this piece use the corrected constants, which is why their valleys sit at twenty, not seventy. It is a good reminder that a paper's conclusion and any one of its fitted numbers are separate things, and both deserve checking.

The limits the authors name are honest ones. The whole analysis rests on a power-law fit, and at the very largest budgets they see a slight bend in the curve, which hints the truly optimal models might be even smaller than the straight line predicts. There were only two runs at full scale, Chinchilla and Gopher, so the decisive test is a single A/B comparison, convincing but not a sweep. Every run saw each token roughly once, so nothing here speaks to what happens when you train for many passes over a smaller corpus. And the trillions of tokens the law calls for have to come from somewhere, which turns data quality and data collection into first-class problems rather than afterthoughts.

Step back and the argument is four moves long. Compute is six FLOPs per parameter per token, so a budget fixes the product of size and data. A fixed budget has a best split, visible as the bottom of an IsoFLOP valley. Fit the loss surface and that best split has a closed form, with size and tokens carrying nearly equal exponents. So when you get more compute, grow both, about twenty tokens for every parameter. The giants were the right price and the wrong shape. They were hungry.

Provenance · Verified against primary literature
Chinchilla (2022) · Hoffmann et al.: the three approaches, the 6ND budget, the parametric loss L = E + A/N^α + B/D^β, and the 70B / 1.4T Chinchilla model.
Kaplan et al. (2020) · The earlier scaling law this paper corrects: N_opt ∝ C^0.73, D_opt ∝ C^0.27 (scale mostly size).
Gopher (2021) · Rae et al.: the 280B model trained on 300B tokens that Chinchilla matches in compute and beats in quality.
Epoch AI replication (2024) · Besiroglu et al. re-fit Approach 3 from the paper’s own figure data: the figures use their corrected constants.
Correction · The paper’s Approach 3 prints fitted constants (A=406.4, B=410.7, α=0.34, β=0.28) whose closed-form optimum implies roughly 70 tokens per parameter, contradicting its own Approaches 1 and 2 and the 20-to-1 rule used to build Chinchilla. A 2024 replication (Epoch AI, arXiv:2404.10102) re-fit the data and recovered about 20. The figures here use the corrected constants so they match the headline; the formula in the prose shows the published numbers, with this discrepancy called out.

Questions you might still have

If the answer is "20 tokens per parameter," why train any model that isn’t exactly compute-optimal?
Because compute-optimal only counts training. A model is also paid for at inference, every time it is used. A smaller model trained past its compute-optimal point is more expensive to train but cheaper to serve forever after, which is why production models are now routinely trained well beyond 20 tokens per parameter.

Where does the factor of 6 in C = 6ND come from?
Two FLOPs (a multiply and an add) per parameter per token on the forward pass, and the backward pass costs about twice the forward, so 2 + 4 = 6. It ignores attention’s sequence-length term, which is small for these model sizes, so it is an approximation the paper checks against an exact count.

Does the famous parametric formula actually reproduce the 20-to-1 rule?
Not with the constants the paper printed. A 2024 replication showed those numbers imply roughly 70 tokens per parameter, at odds with the paper’s own first two methods. Re-fitting the data restores about 20. The headline conclusion holds; one set of fitted constants in the paper does not.

Footnotes & further reading

  1. The paper: Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, et al., Training Compute-Optimal Large Language Models (DeepMind, NeurIPS 2022).
  2. The earlier scaling law this paper revises: Kaplan, McCandlish, et al., Scaling Laws for Neural Language Models (OpenAI, 2020), which recommended putting most new compute into model size.
  3. Gopher, the 280B baseline Chinchilla matches in compute and beats in quality: Rae et al., Scaling Language Models: Methods, Analysis & Insights from Training Gopher.
  4. The replication that re-fit Approach 3 and flagged the inconsistent constants: Besiroglu, Erdil, Barnett, You, Chinchilla Scaling: A Replication Attempt (Epoch AI, 2024). The corrected constants used in the figures here come from this work.
  5. Why production models are now trained well past the compute-optimal point: Sardana et al., Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws.
  6. MMLU, the multiple-choice exam benchmark Chinchilla set a record on: Hendrycks et al., Measuring Massive Multitask Language Understanding.