Verified · arXiv:2001.08361 · 20 min
Scaling · LLMs

Scaling Laws for Neural Language Models

Bigger model, more data, more compute: the loss drops on a curve you can predict.

Train a few hundred Transformers across a million-fold range of sizes and the test loss does not wander. It traces clean power laws in model size, data, and compute. The same equations say how to spend a compute budget, and where the whole approach must eventually break.

Explaining the paper: Scaling Laws for Neural Language Models · Kaplan, McCandlish, et al. · OpenAI · 2020 · arXiv:2001.08361

How much better does a language model get if you make it twice as big? For once there is a number.

Before this paper, scaling a model up was an act of faith. You added parameters, or data, or GPUs, and hoped the loss came down. Kaplan and colleagues at OpenAI turned that hope into measurement. They trained Transformers across more than six orders of magnitude in size, on datasets spanning three, with compute budgets spanning eight, and found that the held-out loss falls along smooth, straight lines. The lines have equations.

The headline is suspiciously clean. The cross-entropy loss (the model's average surprise on text it has not seen, measured in nats) depends on three quantities: the number of parameters $N$, the number of training tokens $D$, and the compute $C$ spent. Hold two of them out of the way and the loss is a power law in the third. A power law is a straight line on log-log axes, and these lines stay straight across the entire range tested, with no sign of bending at the top end.

That regularity is the whole paper, and it pays off twice. It lets you predict: fit the line on small models and read off what a model a hundred times larger will do, before you build it. And it lets you allocate: with the loss written as a function of size, data, and compute, you can ask which split of a fixed budget gives the lowest loss. The answer Kaplan got, spend it on size and train briefly, shaped two years of model building. It was also, in its specifics, wrong, and the last section shows how.

The argument is a short tower, built in order. What a power law is and why one shows up here. Why only the scale matters and the shape barely does. How data caps the size you can use. Why bigger models learn faster. How to spend a compute budget. And where the lines have to give out.

A power law is a straight line, if you look right

A power law says one quantity is a fixed power of another, $L = (X_c/X)^{\alpha}$. Take the logarithm of both sides and it becomes $\log L = \alpha\log X_c - \alpha\log X$, the equation of a straight line with slope $-\alpha$. So on axes that plot both quantities by their logarithms (log-log axes), a power law is exactly a line, and $\alpha$ is how steeply it falls.

This matters because the quantities here span orders of magnitude. On ordinary linear axes a relationship over a billion-fold range is unreadable: the loss plunges at the far left and then crawls along a flat tail, and you cannot tell a power law from a dozen other shapes. Switch the axes to log-log, plot the same loss numbers, and they snap into a line whose slope you can measure. Drag the slider to morph the axes from linear to log-log, for each of the three quantities, and watch it straighten:

Figure 1 · the power law
The same loss law on two kinds of axes. On linear axes it plunges then flattens into an unreadable tail. Slide to log-log and it becomes a straight line whose slope is $-\alpha$. The dashed amber chord is the straight target; the gap closes as the curve straightens. Each of $L(N)$, $L(D)$, $L(C)$ is its own power law.
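The straightening can also be checked numerically. The sketch below is an illustration rather than the paper's fitting procedure: it generates noise-free losses from the parameter law the paper fits ($\alpha_N \approx 0.076$, $N_c \approx 8.8\times 10^{13}$), then recovers both constants with an ordinary least-squares line through the logarithms. The model sizes and the helper `fit_loglog_slope` are hypothetical, chosen only for the example.

```python
import math

# Constants of the parameter law quoted in the text (Kaplan et al., WebText2).
ALPHA_N, N_C = 0.076, 8.8e13

def loss_from_params(n):
    """L(N) = (N_c / N) ** alpha_N, in nats per token."""
    return (N_C / n) ** ALPHA_N

def fit_loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical model sizes, chosen only to span six orders of magnitude.
sizes = [10 ** k for k in range(3, 10)]
losses = [loss_from_params(n) for n in sizes]

slope, intercept = fit_loglog_slope(sizes, losses)
alpha_hat = -slope                          # the line's slope is -alpha
n_c_hat = math.exp(intercept / alpha_hat)   # intercept = alpha * log(N_c)
print(f"alpha = {alpha_hat:.3f}, N_c = {n_c_hat:.2e}")
```

With real training runs the points carry noise and the fit would only be approximate; here the data are exact, so the line returns the constants it was built from.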

The three laws the paper fits all share one shape. Loss against the number of (non-embedding) parameters $N$:

$$L(N) = \left(N_c/N\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076,\ \ N_c \approx 8.8\times 10^{13} \tag{1}$$

Loss against the dataset size $D$ in tokens:

$$L(D) = \left(D_c/D\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095,\ \ D_c \approx 5.4\times 10^{13} \tag{2}$$

And loss against the compute $C$ used, when that compute is spent efficiently:

$$L(C_{\min}) = \left(C_c^{\min}/C_{\min}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.050,\ \ C_c^{\min} \approx 3.1\times 10^{8} \tag{3}$$

The subscript on $C_{\min}$ marks a fine point we return to: it is the compute a run would use at a small batch size, where none is wasted on parallelism, and it tracks the raw compute $C$ closely. The constants $N_c$, $D_c$, $C_c^{\min}$ depend on the vocabulary and tokenizer (these are the values for the WebText2 corpus, with $C_c^{\min}$ quoted in PF-days) and carry no deep meaning. The exponents are the content, and they are small, which is the sobering part. $\alpha_N \approx 0.076$ means that doubling the model multiplies the loss by $2^{-\alpha_N}$, about 0.95: a five percent relative drop for twice the parameters, and each further doubling buys the same five percent, steady in ratio, shrinking in absolute terms. That is what a power law is. Five percent sounds tiny, but it compounds: ten doublings, a model a thousand times bigger, multiply out to roughly 0.59 of the original loss. The gain per doubling never jumps and never dies, which is the promise and the limit of scaling in one fact.
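The doubling arithmetic is short enough to check directly. This is a minimal sketch using only the exponent quoted above; the helper name `loss_ratio` is made up for the example.

```python
ALPHA_N = 0.076  # parameter exponent quoted in the text

def loss_ratio(size_factor, alpha=ALPHA_N):
    """Factor by which L(N) changes when N is multiplied by size_factor."""
    return size_factor ** (-alpha)

per_doubling = loss_ratio(2)          # about 0.95
ten_doublings = loss_ratio(2 ** 10)   # about 0.59
print(f"one doubling: {per_doubling:.2f}, ten doublings: {ten_doublings:.2f}")
```

Ten doublings give the same factor as one doubling applied ten times, which is the "steady in ratio" property stated above.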

Cross-entropy in nats is the vertical axis on every figure here: the model's average log-perplexity, its surprise at the next token. A loss of $L$ nats corresponds to a perplexity of $e^{L}$, as if the model were choosing uniformly among that many equally likely tokens. A nat is the natural-log version of a bit (one nat is about 1.44 bits).
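Both unit conversions are one-liners. The function names below are illustrative and not taken from the paper.

```python
import math

def nats_to_bits(nats):
    """One nat is 1 / ln(2) bits, about 1.44."""
    return nats / math.log(2)

def perplexity(loss_nats):
    """Effective number of equally likely next tokens at this loss."""
    return math.exp(loss_nats)

print(f"1 nat = {nats_to_bits(1.0):.2f} bits")
print(f"2 nats -> perplexity {perplexity(2.0):.1f}")
```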

What matters is scale, not shape

Notice what is missing from those equations. Depth. Width. The number of attention heads. The aspect ratio. None of them appears. The authors varied all of these at a fixed parameter count and found the loss moves by a few percent at most. A $(n_{\text{layer}}, d_{\text{model}}) = (6, 4288)$ model lands within 3% of a $(48, 1600)$ model of nearly the same parameter count. Across the shapes they tried, the aspect ratio varies by a factor of forty with only a few percent of movement in the loss. Almost all of the architecture you might fuss over washes out, and what is left is one number: how many parameters there are.

That number needs care. $N$ counts only the non-embedding parameters: the attention and feed-forward weights, not the token-embedding or positional tables. For the standard Transformer shape this is

$$N \approx 12\,n_{\text{layer}}\,d_{\text{model}}^{2} \tag{4}$$

The decision to drop the embeddings is deliberate. They scale differently from the rest of the network, and excluding them makes the $L(N)$ line cleaner and consistent across depths. (The choice returns at the very end, as one of the reasons the compute-optimal recipe came out skewed.)
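Equation (4) makes it easy to check that the two shapes compared earlier really do have similar sizes. The sketch assumes the standard shape the approximation is stated for; `non_embedding_params` is a name made up here.

```python
def non_embedding_params(n_layer, d_model):
    """Equation (4): N is roughly 12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2

wide = non_embedding_params(6, 4288)     # shallow and wide
deep = non_embedding_params(48, 1600)    # deep and narrow
print(f"wide: {wide:,}  deep: {deep:,}  ratio: {wide / deep:.2f}")
```

The two counts come out at roughly 1.3 and 1.5 billion, close enough that the 3% loss gap reflects shape far more than size.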

The third quantity, compute, also has a clean form. Pushing one token through the network costs about $2N$ floating-point operations on the forward pass, one multiply and one add per parameter, and the backward pass that produces gradients costs about twice that. So training touches roughly $6N$ operations per token. Over batches of size $B$ (counted in tokens) for $S$ steps:

$$C \approx 6\,N\,B\,S \qquad (\approx 6N \text{ FLOPs per token}) \tag{5}$$

Compute is quoted in petaflop-days. One PF-day is $10^{15}$ operations per second sustained for a day, about $8.6\times 10^{19}$ operations. With these three definitions pinned down, the laws are something you can actually compute with:

# the fitted scaling laws (Kaplan et al., WebText2, loss in nats)
aN, Nc = 0.076, 8.8e13         # params: L(N) = (Nc / N) ** aN
aD, Dc = 0.095, 5.4e13         # data:   L(D) = (Dc / D) ** aD

def loss_from_params(N):       # N = non-embedding parameters
    return (Nc / N) ** aN      # a straight line on log-log axes

loss_from_params(1.5e9)        # ~1.5B params -> about 2.3 nats/token
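The compute rule in equation (5) can be sketched the same way. The batch size and step count below are illustrative assumptions, not a configuration taken from the text; only the $6NBS$ rule and the PF-day definition come from it.

```python
PF_DAY = 1e15 * 24 * 3600   # operations in one petaflop-day, about 8.64e19

def training_flops(n_params, batch_tokens, steps):
    """Equation (5): C is roughly 6 * N * B * S."""
    return 6 * n_params * batch_tokens * steps

def to_pf_days(flops):
    return flops / PF_DAY

# Hypothetical run: 1.5B non-embedding parameters, a batch of 2**19 tokens,
# 250,000 optimizer steps. These settings are assumptions for illustration.
flops = training_flops(1.5e9, 2 ** 19, 250_000)
print(f"{flops:.2e} FLOPs = {to_pf_days(flops):.1f} PF-days")
```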

Data sets a ceiling on the size you can use

The three laws above each hold the other two quantities out of the way. The interesting questions start when two move together. Take parameters and data. If you have a fixed amount of text and keep growing the model, at some point the model is large enough to memorize the quirks of that particular text rather than the language behind it. The test loss stops improving and then gets worse. That is overfitting, and the paper pins down when it starts.

The two effects combine into one equation, the early-stopped loss as a function of both $N$ and $D$ (a single joint fit to the two-variable runs, so its constants come out a touch different from the standalone laws above):

$$L(N, D) = \left[\, \left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D} \,\right]^{\alpha_D} \tag{6}$$

Read the two terms inside the bracket. The first is the model's own limit, the $(N_c/N)$ power from before. The second, $D_c/D$, is the data's limit. Whichever is larger dominates. With plenty of data the first term wins and the loss rides the clean $L(N)$ line. When the model outgrows the data the second term takes over and the loss flattens onto a floor set by $D$ alone. Drag the dataset size and watch each curve peel off the infinite-data line and level out:

Figure 2 · overfitting
Loss versus model size, on log-log axes. The dashed line is the infinite-data power law. Each fixed dataset follows it at small sizes, then bends to a floor once the model outgrows the data. More data pushes the overfitting onset to larger models. The penalty is governed by the ratio $N^{0.74}/D$.
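The bending in Figure 2 can be reproduced from equation (6) alone. The sketch uses the joint-fit constants as I recall them from the paper's table of fits ($\alpha_N = 0.076$, $\alpha_D = 0.103$, $N_c = 6.4\times 10^{13}$, $D_c = 1.8\times 10^{13}$). Treat the $N_c$ and $D_c$ values as assumptions of this example and check them against the paper before relying on them. The one-billion-token dataset is hypothetical.

```python
# Joint-fit constants for equation (6). N_C and D_C are assumed values,
# recalled from the paper's table of fits and not quoted in this article.
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def joint_loss(n_params, n_tokens):
    """Equation (6): early-stopped loss for model size N and dataset size D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D

def infinite_data_loss(n_params):
    """The D -> infinity limit of equation (6): a pure power law in N."""
    return (N_C / n_params) ** ALPHA_N

TOKENS = 1e9  # hypothetical fixed dataset size
for n in (1e6, 1e8, 1e10, 1e12):
    print(f"N = {n:.0e}: finite data {joint_loss(n, TOKENS):.3f}, "
          f"infinite data {infinite_data_loss(n):.3f}")
```

At small $N$ the two columns agree closely; as $N$ grows, the finite-data loss stops falling and settles near $(D_c/D)^{\alpha_D}$, the floor set by the data alone.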

The crossover happens when the two terms are comparable, which the paper summarizes with one ratio: overfitting is governed by $N^{0.74}/D$. The exponent 0.74 is $\alpha_N/\alpha_D$ from the joint fit, which gives a slightly steeper $\alpha_D \approx 0.103$, so the ratio is 0.74 rather than the 0.80 you would get from dividing the rounded headline numbers. As a rule of thumb:

$$D \gtrsim \left(5\times 10^{3}\right) N^{0.74} \tag{7}$$

To keep overfitting below the noise from random seeds (about 0.02 nats), the data should grow as the 0.74 power of the model. That sublinear exponent is the useful part. Grow the model by a factor of eight and you need only about $8^{0.74}$, roughly 4.7, times as much data to stay safe, which the paper rounds to "about five." Data has to grow with the model, but slower than it. On the 22-billion-token WebText2 set the authors used, this means models below about a billion parameters train with no real overfitting, and only the very largest begin to feel the ceiling.

It is worth running the laws once end to end on a concrete model, because that is when the predictive power stops being abstract. Suppose you are about to train a model with one billion non-embedding parameters, $N = 10^{9}$. Equation (1) tells you the loss to expect before you spend a single GPU-hour. Plug it in: $L(N) = (N_c/N)^{\alpha_N}$ with $N_c = 8.8\times 10^{13}$ and $\alpha_N = 0.076$. The ratio $N_c/N$ is $8.8\times 10^{13}/10^{9} = 8.8\times 10^{4}$, and raising that to the $0.076$ power gives about $2.38$ nats per token. That is a prediction with a unit you can feel: a loss of $2.38$ nats is a perplexity of $e^{2.38}$, near $10.8$, so the trained model will be about as unsure as if it were guessing uniformly among eleven equally likely next tokens.

The second law tells you how much text that model needs before the prediction holds. The loss of $2.38$ nats is the infinite-data value; it is only reachable if the dataset is large enough that the data term in (6) stays out of the way. Rule (7) sets the threshold: $D \gtrsim (5\times 10^{3})\, N^{0.74}$. For $N = 10^{9}$ that is $5\times 10^{3} \times (10^{9})^{0.74}$. The exponent first: $(10^{9})^{0.74} = 10^{6.66}$, about $4.6\times 10^{6}$, and multiplying by $5\times 10^{3}$ gives roughly $2.3\times 10^{10}$, about twenty-three billion tokens. That lands right at the size of the WebText2 corpus the paper used, which is exactly why a one-billion-parameter model is the rough edge of where overfitting starts to bite on that dataset. Train it on fewer tokens and the loss settles above $2.38$ nats on a data-set floor instead of riding the clean $L(N)$ line; train it on more and you are paying for data the model is not yet large enough to need.
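The worked example above fits in a few lines. This is a sketch of the arithmetic only, using equation (1) and rule (7) as quoted; the function names are illustrative.

```python
import math

ALPHA_N, N_C = 0.076, 8.8e13   # equation (1)

def predicted_loss(n_params):
    """Infinite-data loss from equation (1), in nats per token."""
    return (N_C / n_params) ** ALPHA_N

def min_tokens(n_params):
    """Rule (7): dataset size that keeps the overfitting penalty small."""
    return 5e3 * n_params ** 0.74

n = 1e9
loss = predicted_loss(n)                     # about 2.38 nats
ppl = math.exp(loss)                         # about 10.8
tokens = min_tokens(n)                       # about 2.3e10
growth = min_tokens(8 * n) / min_tokens(n)   # about 4.7 for an 8x larger model

print(f"loss {loss:.2f} nats, perplexity {ppl:.1f}")
print(f"tokens needed {tokens:.2e}, data growth for 8x size {growth:.1f}x")
```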

Bigger models learn faster

There is a second pairing worth watching: model size against training time. The natural worry about a huge model is that it must be slower to train, with all those parameters to fit. The data says the reverse. Larger models reach any given loss in fewer optimization steps, and from fewer tokens. They are more sample-efficient.

The learning curve has the same two-term shape, loss as a function of size and the number of steps $S$:

$$L(N, S) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S}\right)^{\alpha_S}, \qquad \alpha_S \approx 0.76,\ \ S_c \approx 2.1\times 10^{3} \tag{8}$$

The first term is the converged floor, lower for a bigger model. The second is a training-time decay that is the same for everyone. So every model is descending the same slope toward its own floor, and a bigger model's floor sits lower, so its whole curve sits lower. Pick any target loss and the bigger model crosses it first. Drag the target and watch where each size gets there:

Figure 3 · sample efficiency
Learning curves, loss versus training steps, one per model size (log-log). Bigger models sit on lower curves and cross any target loss in fewer steps. A model whose floor is already above the target never reaches it, no matter how long you train.

One detail the figure makes concrete: a model whose floor is already above your target never reaches it, no matter how long it trains. A small model is not a slow road to a good loss; past a point it is no road at all. This is the seed of the compute argument. If big models both end up better and get there in fewer steps, then for a fixed amount of compute you may do better training a big model briefly than a small one to convergence.
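Equation (8) can be inverted to ask how many steps a given size needs to hit a target loss. The sketch below reuses the $N_c$ and $\alpha_N$ of equation (1) for the floor term, which is an assumption of this example (the paper's own fit to the learning curves uses slightly different constants). The target of 3.0 nats and the three sizes are hypothetical.

```python
import math

ALPHA_N, N_C = 0.076, 8.8e13   # assumed: floor term reuses equation (1)
ALPHA_S, S_C = 0.76, 2.1e3     # equation (8)

def floor_loss(n_params):
    """Converged loss of a model of this size (first term of equation (8))."""
    return (N_C / n_params) ** ALPHA_N

def steps_to_reach(target, n_params):
    """Steps S at which L(N, S) equals target; infinite if the floor is higher."""
    gap = target - floor_loss(n_params)
    if gap <= 0:
        return math.inf            # this model never reaches the target
    return S_C / gap ** (1 / ALPHA_S)

TARGET = 3.0  # hypothetical target loss in nats
for n in (1e6, 1e8, 1e9):
    print(f"N = {n:.0e}: floor {floor_loss(n):.2f}, "
          f"steps to {TARGET} nats: {steps_to_reach(TARGET, n):,.0f}")
```

The smallest model returns infinity because its floor sits above the target, and the largest needs the fewest steps, matching the two observations in the text.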

Spend it on size, and stop early

Now put compute in charge. You have a fixed budget $C$, and by $C \approx 6NBS$ that budget is a product: a bigger model spends more per step, so it takes fewer steps for the same compute. The question is which model size turns a given budget into the lowest loss.

Picture every model size as a curve of loss against compute as it trains. It descends, then flattens at its converged floor. Lay all those curves on top of one another and trace the lower edge. That lower envelope is the best loss reachable at each compute level, the compute-efficient frontier:

Figure 4 · the compute frontier
Loss versus compute (log-log). Each dim curve is one model size training over time, flattening at its converged loss. The amber envelope is the compute-efficient frontier. It touches each curve while the model is still improving, before convergence, so the optimal move is to train a big model and stop early. $N^{*} \propto C^{0.73}$.

The frontier has a feature that decides everything. It touches each model's curve while that curve is still descending, before it has flattened. At the moment a model becomes the compute-optimal choice, it is not yet converged, and the right move is to stop it there and put the next slice of compute into a larger model instead. Training to convergence wastes compute that a bigger model would have spent better. The reason is marginal value: near convergence each extra step buys almost nothing, so the last stretch of training is the worst compute you can spend. It is cheaper to abandon the run while it is still improving and put the FLOPs into a bigger model, which is why compute-optimal training stops every model early.
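A toy version of the envelope shows the same behaviour. It is a sketch under loud assumptions, not the paper's procedure: it reuses equation (8) with the constants of equation (1) for the floor, and it holds the batch size fixed at a hypothetical $2^{19}$ tokens, whereas the paper adjusts the batch to the critical batch size and works with $C_{\min}$. With a fixed batch the optimal size in this toy grows roughly as $C^{\alpha_S/(\alpha_N+\alpha_S)} \approx C^{0.91}$, so it should not be expected to reproduce the paper's fitted exponent.

```python
ALPHA_N, N_C = 0.076, 8.8e13    # assumed: floor term reuses equation (1)
ALPHA_S, S_C = 0.76, 2.1e3      # equation (8)
BATCH_TOKENS = 2 ** 19          # hypothetical fixed batch size, in tokens
PF_DAY = 8.64e19                # operations in one petaflop-day

def loss_at_compute(n_params, pf_days):
    """Loss of one model size after spending pf_days, using C = 6*N*B*S."""
    steps = pf_days * PF_DAY / (6 * n_params * BATCH_TOKENS)
    return (N_C / n_params) ** ALPHA_N + (S_C / steps) ** ALPHA_S

# Candidate sizes from 1e6 to 1e12 on a log grid (illustrative).
SIZES = [10 ** (6 + k / 4) for k in range(25)]

def best_size(pf_days):
    """Model size on the lower envelope of the loss curves at this budget."""
    return min(SIZES, key=lambda n: loss_at_compute(n, pf_days))

for budget in (0.01, 1.0, 100.0):
    n_star = best_size(budget)
    floor = (N_C / n_star) ** ALPHA_N
    print(f"{budget:>6} PF-days: N* = {n_star:.1e}, "
          f"loss = {loss_at_compute(n_star, budget):.3f}, floor = {floor:.3f}")
```

At every budget the winning model's loss sits clearly above its own floor: the best choice is a model that has not converged, which is the feature of the frontier described above.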

Fitting the frontier gives the allocation. The optimal model size grows as

$$N(C_{\min}) \propto C_{\min}^{\,0.73} \tag{9}$$

while the batch size grows as $C^{0.24}$ (tracking the critical batch size, the point past which more parallelism stops buying speed) and the number of steps as $C^{0.03}$, which is nearly flat. (The data, $D = B\cdot S$, therefore grows only as about $C^{0.27}$.) Put it together: as you get more compute, pour almost all of it into a bigger model, grow the batch to match, and barely train any longer.

# how Kaplan says to spend 10x more compute (exponents, eq 6.1-6.2)
factor = 10
N_up = factor ** 0.73          # model size  -> 5.4x bigger
B_up = factor ** 0.24          # batch size  -> 1.7x
S_up = factor ** 0.03          # train steps -> 1.07x  (almost flat)
# nearly all of it buys a bigger model; you barely train any longer

This was the paper's practical headline, the part the field followed, and the part that came out skewed. The mechanism is sound: train a big model along the frontier and stop it early. The exponent 0.73 is the number later work corrected, and the last section traces how.

The wall, and a conjecture

The compute law contains a contradiction the authors are careful to flag. Compute-optimal training grows the data slowly, as about $C^{0.27}$. But the loss the compute law predicts falls as $C^{-0.05}$, faster than the loss floor that little data can support: with $L(D) \propto D^{-0.095}$ and $D \propto C^{0.27}$, that floor falls only as about $C^{-0.03}$. Two power laws with different slopes have to cross.

Beyond the crossing the compute law would predict a loss lower than the data on hand can possibly deliver, which is impossible. So the laws must break down at or before that point. The crossing sits far past anything tested:

$$C^{*} \sim 10^{4}\ \text{PF-days}, \quad N^{*} \sim 10^{12}, \quad D^{*} \sim 10^{12}, \quad L^{*} \sim 1.7\ \text{nats/token} \tag{10}$$
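The crossing can be written out in general form. The prefactors $a$ and $b$ below are hypothetical placeholders for the two fitted constants, not values taken from the paper. Write the compute trend as $a\,C^{-p}$ and the data-limited floor as $b\,C^{-q}$, with $p > q > 0$ and $a > b$ so that the trend starts above the floor.

```latex
\begin{aligned}
a\,C^{-p} &= b\,C^{-q} \\
C^{\,p-q} &= \frac{a}{b} \\
C^{*} &= \left(\frac{a}{b}\right)^{1/(p-q)}, \qquad L^{*} = a\,\left(C^{*}\right)^{-p}
\end{aligned}
```

With $p \approx 0.05$ and $q \approx 0.03$ the exponent $1/(p-q)$ is about 50, so a shift of a few percent in the ratio $a/b$ moves $C^{*}$ by roughly an order of magnitude. That sensitivity may be one way to see why the location of the crossing is so loosely pinned down.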

Slide the compute up and watch the predicted compute trend dive toward the achievable data floor until the two meet at the wall:

Figure 5 · the wall
The compute trend falls faster (slope −0.05) than the data-limited floor (slope −0.03), so it would cross below it at about $10^{4}$ PF-days and $L^{*} \sim 1.7$ nats. Past the crossing the prediction is impossible, so the laws break. The authors conjecture $L^{*}$ is a rough estimate of the entropy of language.

The authors stress that these numbers are uncertain by an order of magnitude in either direction. Then they offer a conjecture, and label it as one. Perhaps the crossing is where a model has extracted everything reliable in the text, so $L^{*} \approx 1.7$ nats per token would be a rough estimate of the entropy of natural language itself, the floor no amount of scale can beat. It is a guess, and a good one.

What it changed, and what it got wrong

The paper reset how the field reasoned about scale. Before it, scaling was a hunch. After it, you fit a line and extrapolated. GPT-3, a few months later, was in large part a bet on these curves: build a model more than a hundred times bigger than GPT-2 and trust the loss to keep falling. It did. The authors reach for the ideal-gas law as an analogy, and it fits: simple macroscopic relations that hold regardless of most of the microscopic detail.

The allocation advice is where the story turns. Pour new compute into model size, barely train longer: the prescription came out of the same fits that had just been vindicated at GPT-3 scale, so the field followed it, and GPT-3, Gopher, and Megatron-Turing were all built to its shape. Then in 2022 DeepMind's Chinchilla redid the measurement and found that size and data should grow about equally, each as roughly $C^{0.5}$, near twenty tokens per parameter, not the size-heavy $C^{0.73}$ this paper reported. The power laws themselves were never in dispute; the disagreement was only about how to divide a budget. But the budget split was exactly what the field had taken from the paper, and the models built on Kaplan's recipe were the right total cost and too large for the data they were shown.
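The two recipes can be put side by side. The sketch below is a rule-of-thumb illustration, not either paper's fitting code: it combines $C \approx 6ND$ (equation (5) with $D = B\cdot S$) with the roughly twenty tokens per parameter quoted above, and compares how each recipe spends ten times more compute. The budget of $10^{23}$ FLOPs is hypothetical.

```python
import math

def equal_split(flops, tokens_per_param=20.0):
    """Solve C = 6*N*D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

def growth_for(compute_factor, size_exp, data_exp):
    """How much model size and data grow when compute grows by compute_factor."""
    return compute_factor ** size_exp, compute_factor ** data_exp

kaplan = growth_for(10, 0.73, 0.27)       # size about 5.4x, data about 1.9x
chinchilla = growth_for(10, 0.5, 0.5)     # size and data each about 3.2x

n, d = equal_split(1e23)                  # hypothetical budget, in FLOPs
print(f"Kaplan 10x: size x{kaplan[0]:.1f}, data x{kaplan[1]:.1f}")
print(f"Chinchilla 10x: size x{chinchilla[0]:.1f}, data x{chinchilla[1]:.1f}")
print(f"1e23 FLOPs at 20 tokens/param: N = {n:.2e}, D = {d:.2e}")
```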

Chinchilla's own hypothesis for the gap was a single bug: a fixed cosine learning-rate schedule, read off at the wrong horizons. That is the story most people repeat, and it is only part of it. Two 2024 replications took the gap apart. Pearce and Song found the largest single cause was bookkeeping: Kaplan counted non-embedding parameters, the choice that made the $L(N)$ line clean, and at the small scales where these fits live the embedding and output-head terms are a sizable fraction, so the fitted exponent came out too steep. Porian and colleagues ranked the rest: omitting the decoder-head compute, a warmup held fixed at 3000 steps that was too long for the smallest models, and not re-tuning the optimizer per scale. They reproduce Chinchilla's near-0.5 exponent even with a constant learning rate, which rules the schedule out as the essential cause. None of it touches the power laws; the error lived in the small-scale accounting done before the extrapolation.

What survives is the shape. Loss is a power law in scale. The power laws compose into a joint law over size and data. The joint law tells you how to spend a budget, and the same law shows where the whole approach must stop. The specific compute-optimal exponent was an artifact of how the smallest models were measured; the discovery underneath it, that the loss of a language model is a smooth, predictable function of scale, is one of the load-bearing facts of the last decade.

Provenance · Verified against primary literature
Kaplan et al. (2020): The power laws L(N), L(D), L(C_min), the joint L(N,D), the 6N compute rule, and the size-heavy compute-optimal Nₒₚₜ ∝ C^0.73, all on WebText2.
Chinchilla (2022): Hoffmann et al. re-measured the compute-optimal split as roughly equal, Nₒₚₜ ∝ C^~0.5 and Dₒₚₜ ∝ C^~0.5 (about 20 tokens/param), correcting this paper.
Pearce & Song (2024), arXiv:2406.12907: Counting non-embedding (not total) parameters at small scale is the single largest cause of Kaplan’s steeper exponent.
Porian et al. (2024), arXiv:2406.19146: Ranks the rest (decoder-head FLOPs, over-long warmup, per-scale optimizer tuning); the learning-rate decay is the weakest lever.
Correction: This paper’s compute-optimal prescription (pour most new compute into model size, Nₒₚₜ ∝ C^0.73, and stop well before convergence) was overturned by Chinchilla (2022), which found size and data should scale about equally, each ∝ C^~0.5. The power laws themselves hold; only the optimal split was off. The cause is methodological and multi-factor, not one bug: counting non-embedding instead of total parameters at small scale (Pearce & Song, 2406.12907), an over-long fixed warmup, omitting the decoder-head FLOPs, and not re-tuning the optimizer per scale (Porian et al., 2406.19146). The popular "it was just the cosine learning-rate schedule" story is Chinchilla’s own hypothesis and is incomplete: Porian et al. match Chinchilla even with a constant learning rate.

Questions you might still have

Are these laws still believed, now that Chinchilla exists?
The power laws are, yes. Loss does fall as a smooth power law in size, data, and compute, and that has held up across many later models. What Chinchilla overturned was only this paper’s compute-optimal split (Nₒₚₜ ∝ C^0.73). The right split is closer to equal, about 20 tokens per parameter. Same curves, different advice on how to ride them.

Why count only non-embedding parameters?
Because the embedding and positional tables scale differently from the rest of the network, and including them bends the L(N) line and makes it depend on depth. Dropping them gives a cleaner law. The catch: at the small model sizes these fits live on, the embedding and output-head terms are a large fraction, and this very choice is part of why the compute-optimal exponent came out too steep.

If the loss keeps falling as a power law, does the model keep getting better forever?
No, for two reasons. A power law means diminishing returns: each further factor of ten in compute buys a smaller and smaller absolute drop. And the paper’s own §6.3 shows the laws must break around L* ~ 1.7 nats per token, where the compute trend would demand more from the data than a single pass can give. The gains keep coming and keep getting smaller, and there is a floor.

Loss is in "nats." What does a loss of 2 actually mean?
Nats are the natural-log version of bits (one nat ≈ 1.44 bits). A cross-entropy of L nats is a perplexity of e^L: the model is, on average, as unsure as if it were guessing uniformly among that many equally likely next tokens. A loss of 2 nats is a perplexity near 7.4.

Footnotes & further reading

  1. The paper: Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei, Scaling Laws for Neural Language Models (OpenAI, 2020).
  2. The follow-up that corrected the compute-optimal split: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, DeepMind 2022). We have an explainer of Chinchilla too.
  3. The two reconciliations of the Kaplan / Chinchilla gap: Pearce & Song, Reconciling Kaplan and Chinchilla Scaling Laws (2024), and Porian, Wortsman, Jitsev, Schmidt, Carmon, Resolving Discrepancies in Compute-Optimal Scaling of Language Models (NeurIPS 2024).
  4. The WebText2 corpus and the reversible byte-pair tokenizer come from the GPT-2 work: Radford et al., Language Models are Unsupervised Multitask Learners (2019).
  5. The bet these curves enabled: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020), and our GPT-3 explainer.