Scaling Laws for Neural Language Models

Bigger model, more data, more compute: the loss drops on a curve you can predict.

Train a few hundred Transformers across a million-fold range of sizes and the test loss is not erratic. It follows clean power laws in model size, data, and compute. The same equations say how to spend a compute budget, and where the approach must eventually break.

Explaining the paper: Scaling Laws for Neural Language Models. Kaplan, McCandlish, et al. · OpenAI · 2020 · arXiv:2001.08361

How much better does a language model get if you make it twice as big? For once there is a number.

Before this paper, scaling a model up was an act of faith. You added parameters, or data, or GPUs, and hoped the loss came down. Kaplan and colleagues at OpenAI turned that hope into measurement. They trained Transformers across more than six orders of magnitude in size, two orders of magnitude in dataset size, and eight in compute budget, and found that the held-out loss (the loss on text the model was not trained on) falls along smooth, straight lines. The lines have equations.

The headline is suspiciously neat. The cross-entropy loss (the model's average surprise on text it has not seen, measured in nats) depends on three quantities: the number of parameters $N$ , the number of training tokens $D$ , and the compute $C$ spent. Hold two of them out of the way and the loss is a power law in the third. A power law is a straight line on log-log axes, and these lines stay straight across the entire range tested, with no sign of bending at the top end.

That regularity is the paper's one finding, and it is useful in two ways. It lets you predict: fit the line on small models and read off what a model a hundred times larger will do, before you build it. And it lets you allocate: with the loss written as a function of size, data, and compute, you can ask which split of a fixed budget gives the lowest loss. The answer Kaplan got, spend it on size and train briefly, shaped two years of model building. It was also, in its specifics, wrong, and the last section shows how.

We take the argument in order. What a power law is and why one shows up here. Why only the scale matters and the shape barely does. How data caps the size you can use. Why bigger models learn faster. How to spend a compute budget. And where the lines have to give out.

A power law is a straight line, if you look right

A power law says one quantity is a fixed power of another, $L = (X_c/X)^{\alpha}$ . Take the logarithm of both sides and it becomes $\log L = \alpha\log X_c - \alpha\log X$ , the equation of a straight line with slope $-\alpha$ . So on axes that plot both quantities by their logarithms (log-log axes), a power law is exactly a line, and $\alpha$ is how steeply it falls.

This matters because the quantities here span orders of magnitude. On ordinary linear axes a relationship over a billion-fold range is unreadable: the loss drops steeply at the far left and then declines very slowly along a flat tail, and you cannot tell a power law from a dozen other shapes. Switch the axes to log-log, plot the same loss numbers, and they snap into a line whose slope you can measure. Drag the slider to morph the axes from linear to log-log, for each of the three quantities, and watch it straighten:

Figure 1 · the power law

The same loss law on two kinds of axes. On linear axes it plunges then flattens into an unreadable tail. Slide to log-log and it becomes a straight line whose slope is $-\alpha$. The dashed amber chord is the straight target; the gap closes as the curve straightens. Each of $L(N)$, $L(D)$, $L(C)$ is its own power law.
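The straight-line reading can be checked numerically. The sketch below is a minimal illustration, not the paper's fitting procedure: it generates noise-free losses from the $L(N)$ law using the constants quoted in this article, fits a line in log-log space, and reads $\alpha$ off the slope. The grid of model sizes is a made-up assumption.

```python
import numpy as np

ALPHA_N, N_C = 0.076, 8.8e13            # constants of the L(N) law quoted in the text

n_grid = np.logspace(3, 9, 25)          # hypothetical model sizes, 1e3 to 1e9 parameters
loss = (N_C / n_grid) ** ALPHA_N        # exact power law, no noise added

# Fit log L = intercept + slope * log N. np.polyfit returns the highest
# degree first, so the pair is (slope, intercept).
slope, intercept = np.polyfit(np.log(n_grid), np.log(loss), deg=1)

alpha_fit = -slope                          # the slope on log-log axes is -alpha
n_c_fit = np.exp(intercept / alpha_fit)     # because intercept = alpha * log(N_c)

print(f"fitted alpha = {alpha_fit:.3f}, fitted N_c = {n_c_fit:.2e}")
```

With real training runs the points carry noise, so the recovered slope would come with an error bar; here the fit is exact by construction.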

The three laws the paper fits all share one shape. Loss against the number of (non-embedding) parameters $N$ :

$$L(N) = \left(N_c/N\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076,\ \ N_c \approx 8.8\times 10^{13} \tag{1}$$

Loss against the dataset size $D$ in tokens:

$$L(D) = \left(D_c/D\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095,\ \ D_c \approx 5.4\times 10^{13} \tag{2}$$

And loss against the compute $C$ used, when that compute is spent efficiently:

$$L(C_{\min}) = \left(C_c^{\min}/C_{\min}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.050,\ \ C_c^{\min} \approx 3.1\times 10^{8} \tag{3}$$

The subscript on $C_{\min}$ marks a fine point about batch-size efficiency: it is the compute a run would use at a batch size well below the critical batch size, where every example contributes fully (above the critical batch size, extra parallelism shortens wall-clock time but buys less loss reduction per example processed), and it tracks the raw compute $C$ closely. The constants $N_c$, $D_c$, $C_c^{\min}$ depend on the vocabulary and tokenizer (these are the values for the WebText2 corpus) and carry no deep meaning. The exponents carry the meaning, and they are small. $\alpha_N \approx 0.076$ means that doubling the model multiplies the loss by $2^{-\alpha_N}$, about 0.95: a five percent relative drop for twice the parameters, and each further doubling buys the same five percent, steady in ratio, shrinking in absolute terms. That is what a power law is. Five percent sounds tiny, but it compounds: ten doublings, a model a thousand times bigger, multiply out to roughly 0.59 of the original loss. The gain per doubling never jumps and never reaches zero, which both promises continued gains and bounds them.
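The per-doubling arithmetic is short enough to check directly. This is a minimal sketch using only the $\alpha_N$ quoted above; the helper name `loss_ratio` is made up for illustration.

```python
ALPHA_N = 0.076                       # exponent of the L(N) law from the text

def loss_ratio(size_factor, alpha=ALPHA_N):
    """Factor by which the loss is multiplied when the model grows by size_factor."""
    return size_factor ** (-alpha)

per_doubling = loss_ratio(2)          # about 0.95: a five percent relative drop
ten_doublings = loss_ratio(2 ** 10)   # about 0.59 for a model 1024x larger

# Ratios compound: ten separate doublings equal one 1024x jump.
assert abs(per_doubling ** 10 - ten_doublings) < 1e-12

print(round(per_doubling, 3), round(ten_doublings, 3))
```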

Cross-entropy in nats is the vertical axis on every figure here: the model's average log-perplexity, its surprise at the next token. A loss of $L$ nats corresponds to a perplexity of $e^{L}$ , as if the model were choosing uniformly among that many equally likely tokens. A nat is the natural-log version of a bit (one nat is about 1.44 bits).
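For readers who think in bits or perplexity, the conversions are one line each. A minimal sketch; the function names are made up for illustration.

```python
import math

def perplexity(loss_nats):
    """Effective number of equally likely next tokens for a loss in nats."""
    return math.exp(loss_nats)

def nats_to_bits(loss_nats):
    """Same quantity measured with base-2 logarithms."""
    return loss_nats / math.log(2)

print(nats_to_bits(1.0))   # about 1.44 bits in one nat
print(perplexity(2.0))     # about 7.4
```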

What matters is scale, not shape

Several familiar quantities are missing from those equations. Depth. Width. The number of attention heads. The aspect ratio. None of them appears. The authors varied all of these at a fixed parameter count and found the loss moves by a few percent at most. A $(n_{\text{layer}}, d_{\text{model}}) = (6, 4288)$ model lands within 3% of a $(48, 1600)$ model of nearly the same parameter count. Across the shapes they tried, the aspect ratio varies by a factor of forty with only a few percent of movement in the loss. Almost all of the architecture washes out, and what remains is one number: how many parameters there are.

That number needs care. $N$ counts only the non-embedding parameters: the attention and feed-forward weights, not the token-embedding or positional tables. For the standard Transformer shape this is

$$N \approx 12\,n_{\text{layer}}\,d_{\text{model}}^{2} \tag{4}$$

The decision to drop the embeddings is deliberate. They scale differently from the rest of the network, and excluding them makes the $L(N)$ line straighter and consistent across depths. (This choice is one of the reasons the compute-optimal recipe came out skewed.)
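Equation (4) makes the "nearly the same parameter count" claim easy to check for the two shapes mentioned above. A minimal sketch, assuming the standard shape the approximation is stated for (feed-forward width four times $d_{\text{model}}$, attention width equal to $d_{\text{model}}$); the function name is illustrative.

```python
def non_embedding_params(n_layer, d_model):
    """Approximate non-embedding parameter count, N ~ 12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2

shallow_wide = non_embedding_params(6, 4288)    # 1,323,859,968
deep_narrow = non_embedding_params(48, 1600)    # 1,474,560,000

print(f"{shallow_wide:,} vs {deep_narrow:,}")
print(f"ratio = {shallow_wide / deep_narrow:.2f}")
```

The two counts differ by about ten percent while the depth differs by a factor of eight, which is the sense in which shape barely matters at fixed $N$.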

The third quantity, compute, also has a clean form. Pushing one token through the network costs about $2N$ floating-point operations on the forward pass, one multiply and one add per parameter, and the backward pass that produces gradients costs about twice that. So training touches roughly $6N$ operations per token. Over batches of size $B$ for $S$ steps:

$$C \approx 6\,N\,B\,S \qquad \text{(about } 6N \text{ FLOPs per token, with } B \text{ counted in tokens)} \tag{5}$$

Compute is quoted in petaflop-days. One PF-day is $10^{15}$ operations per second sustained for a day, about $8.6\times 10^{19}$ operations. With these three definitions pinned down, you can compute the laws directly:

# the fitted scaling laws (Kaplan et al., WebText2, loss in nats)
aN, Nc = 0.076, 8.8e13         # params: L(N) = (Nc / N) ** aN
aD, Dc = 0.095, 5.4e13         # data:   L(D) = (Dc / D) ** aD

def loss_from_params(N):       # N = non-embedding parameters
    return (Nc / N) ** aN      # a straight line on log-log axes

loss_from_params(1.5e9)        # ~1.5B params -> about 2.3 nats/token
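The compute definition can be sketched the same way. The batch size and step count below are illustrative assumptions, not values taken from this article; only the $6NBS$ rule and the PF-day conversion come from the text.

```python
PF_DAY_FLOPS = 1e15 * 86_400          # 1e15 operations per second for one day

def training_compute_pf_days(n_params, batch_tokens, steps):
    """C ~ 6 * N * B * S, converted from FLOPs to petaflop-days."""
    flops = 6 * n_params * batch_tokens * steps
    return flops / PF_DAY_FLOPS

# Illustrative run: 1.5B non-embedding parameters,
# batches of 2**19 tokens, 250,000 steps (both assumed).
run = training_compute_pf_days(n_params=1.5e9, batch_tokens=2**19, steps=250_000)
print(f"{run:.1f} PF-days")           # about 13.7
```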

Data sets a ceiling on the size you can use

The three laws above each hold the other two quantities out of the way. The interesting questions start when two move together. Take parameters and data. If you have a fixed amount of text and keep growing the model, at some point the model is large enough to memorize the quirks of that particular text rather than the language behind it. The test loss stops improving and then gets worse. That is overfitting, and the paper pins down when it starts.

The two effects combine into one equation, the early-stopped loss as a function of both $N$ and $D$ (a single joint fit to the two-variable runs, so its constants come out a touch different from the standalone laws above):

$$L(N, D) = \left[\, \left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D} \,\right]^{\alpha_D} \tag{6}$$

The two terms inside the bracket each have a meaning. The first is the model's own limit, the $(N_c/N)$ power from before. The second, $D_c/D$ , is the data's limit. Whichever is larger dominates. With plenty of data the first term is larger and the loss follows the $L(N)$ line. When the model outgrows the data the second term is larger and the loss flattens onto a floor set by $D$ alone. Drag the dataset size and watch each curve peel off the infinite-data line and level out:

Figure 2 · overfitting

Loss versus model size, on log-log axes. The dashed line is the infinite-data power law. Each fixed dataset follows it at small sizes, then bends to a floor once the model outgrows the data. More data pushes the overfitting onset to larger models. The penalty is governed by the ratio $N^{0.74}/D$.
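Equation (6) can be sketched as a function to see both limits. The exponents are the joint-fit values ($\alpha_N \approx 0.076$, $\alpha_D \approx 0.103$). The joint-fit $N_c$ and $D_c$ are not given in this article; the values below are what I recall from the paper's joint fit and should be treated as assumptions to verify against it.

```python
ALPHA_N, ALPHA_D = 0.076, 0.103       # joint-fit exponents
N_C, D_C = 6.4e13, 1.8e13             # ASSUMED joint-fit constants, not stated in the text

def loss_nd(n_params, n_tokens):
    """Early-stopped loss L(N, D) from equation (6), in nats per token."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D

def data_floor(n_tokens):
    """Loss floor set by the dataset alone, reached when the model term vanishes."""
    return (D_C / n_tokens) ** ALPHA_D

# With a fixed dataset, growing the model stops helping once the data term dominates.
for n in (1e6, 1e8, 1e10, 1e12):
    print(f"N={n:.0e}  L(N, D=1e9)={loss_nd(n, 1e9):.3f}  floor={data_floor(1e9):.3f}")
```

The printed losses fall with $N$ at first and then level out just above the floor, which is the bend each fixed-dataset curve makes in the figure.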

The crossover happens when the two terms are comparable, which the paper summarizes with one ratio: overfitting is governed by $N^{0.74}/D$ . The exponent 0.74 is $\alpha_N/\alpha_D$ from the joint fit, which gives a slightly steeper $\alpha_D \approx 0.103$ , so the ratio is 0.74 rather than the 0.80 you would get from dividing the rounded headline numbers. As a rule of thumb:

$$D \gtrsim \left(5\times 10^{3}\right) N^{0.74} \tag{7}$$

To keep overfitting below the noise from random seeds (the loss wobble of about 0.02 nats you would see just by re-running the same model with a different random initialization), the data should grow as the 0.74 power of the model. That sublinear exponent is the useful part. Grow the model by a factor of eight and you need only about $8^{0.74}$ , roughly 4.7, times as much data to stay safe, which the paper rounds to "about five." Data has to grow with the model, but slower than it. On the 22-billion-token WebText2 set the authors used, this means models below about a billion parameters train with no real overfitting, and only the very largest begin to feel the ceiling.

Worked end to end on a concrete model, the laws look like this. Suppose you are about to train a model with one billion non-embedding parameters, $N = 10^{9}$ . Equation (1) tells you the loss to expect before you spend a single GPU-hour. Plug it in: $L(N) = (N_c/N)^{\alpha_N}$ with $N_c = 8.8\times 10^{13}$ and $\alpha_N = 0.076$ . The ratio $N_c/N$ is $8.8\times 10^{13}/10^{9} = 8.8\times 10^{4}$ , and raising that to the $0.076$ power gives about $2.38$ nats per token. That is a prediction with a concrete, interpretable unit: a loss of $2.38$ nats is a perplexity of $e^{2.38}$ , near $10.8$ , so the trained model will be about as unsure as if it were guessing uniformly among eleven equally likely next tokens.

The second law tells you how much text that model needs before the prediction holds. The loss of $2.38$ nats is the infinite-data value; it is only reachable if the dataset is large enough that the data term in (6) stays out of the way. Rule (7) sets the threshold: $D \gtrsim (5\times 10^{3})\, N^{0.74}$ . For $N = 10^{9}$ that is $5\times 10^{3} \times (10^{9})^{0.74}$ . The exponent first: $(10^{9})^{0.74} = 10^{6.66}$ , about $4.6\times 10^{6}$ , and multiplying by $5\times 10^{3}$ gives roughly $2.3\times 10^{10}$ , about twenty-three billion tokens. That lands right at the size of the WebText2 corpus the paper used, which is exactly why a one-billion-parameter model is the rough edge of where overfitting starts to bite on that dataset. Train it on fewer tokens and the loss settles above $2.38$ nats on a data-set floor instead of following the $L(N)$ line; train it on more and you are paying for data the model is not yet large enough to need.
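The worked example above reduces to a few lines. This sketch only re-runs the arithmetic from equations (1) and (7) with the constants quoted in the text; the function names are illustrative.

```python
import math

ALPHA_N, N_C = 0.076, 8.8e13          # standalone L(N) constants from equation (1)

def loss_from_params(n_params):
    return (N_C / n_params) ** ALPHA_N

def min_tokens(n_params):
    """Rule of thumb (7): tokens needed to keep overfitting below seed noise."""
    return 5e3 * n_params ** 0.74

n = 1e9
predicted_loss = loss_from_params(n)                   # about 2.38 nats/token
ppl = math.exp(predicted_loss)                         # about 10.8
tokens_needed = min_tokens(n)                          # about 2.3e10 tokens
data_growth_for_8x = min_tokens(8 * n) / min_tokens(n) # about 4.7

print(f"loss={predicted_loss:.2f}  perplexity={ppl:.1f}")
print(f"tokens needed={tokens_needed:.2e}  data growth for 8x model={data_growth_for_8x:.2f}")
```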

Bigger models learn faster

There is a second pairing: model size against training time. The natural worry about a huge model is that it must be slower to train, with all those parameters to fit. The data says the reverse. Larger models reach any given loss in fewer optimization steps, and from fewer tokens. They are more sample-efficient.

The learning curve has the same two-term shape, loss as a function of size and the number of steps $S$ :

$$L(N, S) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S}\right)^{\alpha_S}, \qquad \alpha_S \approx 0.76,\ \ S_c \approx 2.1\times 10^{3} \tag{8}$$

The first term is the converged floor, lower for a bigger model. The second is a training-time decay that is the same for everyone. So every model is descending the same slope toward its own floor, and a bigger model's floor sits lower, so its entire curve sits lower. Pick any target loss and the bigger model crosses it first. Drag the target and watch where each size gets there:

Figure 3 · sample efficiency

Learning curves, loss versus training steps, one per model size (log-log). Bigger models sit on lower curves and cross any target loss in fewer steps. A model whose floor is already above the target never reaches it, no matter how long you train.

Past a point, a small model is not on a slow road to a good loss; it is on no road at all. Its converged floor simply sits above the target, and no amount of training lowers it. This observation leads into the compute argument. If big models both end up better and get there in fewer steps, then for a fixed amount of compute you may do better training a big model briefly than a small one to convergence.
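Equation (8) can be inverted to ask how many steps a given size needs to hit a target loss. A minimal sketch: it reuses the standalone $N_c$ and $\alpha_N$ from equation (1) as a simplifying assumption (the paper's fit for this equation may use slightly different constants), and the target of 3.43 nats is just an example value.

```python
import math

ALPHA_N, N_C = 0.076, 8.8e13          # assumed: standalone L(N) constants reused here
ALPHA_S, S_C = 0.76, 2.1e3            # training-time term from equation (8)

def converged_floor(n_params):
    return (N_C / n_params) ** ALPHA_N

def loss_ns(n_params, steps):
    """Learning curve L(N, S) from equation (8)."""
    return converged_floor(n_params) + (S_C / steps) ** ALPHA_S

def steps_to_reach(target, n_params):
    """Steps needed to hit `target`, or infinity if the floor sits above it."""
    gap = target - converged_floor(n_params)
    if gap <= 0:
        return math.inf
    # Solve (S_c / S) ** alpha_S = gap for S.
    return S_C * gap ** (-1.0 / ALPHA_S)

for n in (1e6, 1e8, 1e9):
    print(f"N={n:.0e}  floor={converged_floor(n):.2f}  steps to 3.43 nats={steps_to_reach(3.43, n):.3g}")
```

Under these constants the smallest model never gets there, and the largest needs far fewer steps than the middle one.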

Spend it on size, and stop early

Compute is the binding constraint in practice. You have a fixed budget $C$ , and by $C \approx 6NBS$ that budget is a product: a bigger model spends more per step, so it takes fewer steps for the same compute. The question is which model size turns a given budget into the lowest loss.

Every model size traces a curve of loss against compute as it trains. It descends, then flattens at its converged floor. Laying all those curves on top of one another, the lower edge is the best loss reachable at each compute level, the compute-efficient frontier:

Figure 4 · the compute frontier

Loss versus compute (log-log). Each dim curve is one model size training over time, flattening at its converged loss. The amber envelope is the compute-efficient frontier. It touches each curve while the model is still improving, before convergence, so the optimal move is to train a big model and stop early. The optimal size along the frontier scales as $N^{*} \propto C^{0.73}$.

The frontier touches each model's curve while that curve is still descending, before it has flattened. At the moment a model becomes the compute-optimal choice, it is not yet converged, and the right move is to stop it there and put the next slice of compute into a larger model instead. Training to convergence wastes compute that a bigger model would have spent better. Marginal value drives the allocation: near convergence each extra step buys almost nothing, so the last stretch of training yields the least loss reduction per FLOP. It is cheaper to abandon the run while it is still improving and put the FLOPs into a bigger model, which is why compute-optimal training stops every model early.
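The envelope argument can be illustrated with a toy grid search. This is a sketch under loud simplifying assumptions, not the paper's procedure: the batch size is held fixed at an assumed $2^{19}$ tokens, the learning curve uses the $L(N, S)$ form with the standalone $N_c$ and $\alpha_N$, and there is no critical-batch adjustment. Because of that, the toy's optimal size grows roughly as $C^{0.9}$ rather than the paper's fitted 0.73; what it does reproduce is the mechanism, that the best model for a budget is stopped well above its converged floor.

```python
import numpy as np

ALPHA_N, N_C = 0.076, 8.8e13          # assumed: standalone L(N) constants
ALPHA_S, S_C = 0.76, 2.1e3            # training-time term of L(N, S)
BATCH_TOKENS = 2 ** 19                # ASSUMED fixed batch size, in tokens
PF_DAY_FLOPS = 1e15 * 86_400

sizes = np.logspace(6, 12, 601)       # candidate model sizes to search over

def best_size(budget_pf_days):
    """Return (N, loss, converged floor) of the size that minimizes loss at this budget."""
    flops = budget_pf_days * PF_DAY_FLOPS
    steps = flops / (6 * sizes * BATCH_TOKENS)   # from C = 6 * N * B * S
    floors = (N_C / sizes) ** ALPHA_N
    losses = floors + (S_C / steps) ** ALPHA_S
    i = int(np.argmin(losses))
    return sizes[i], losses[i], floors[i]

for budget in (1.0, 100.0):
    n_opt, loss_opt, floor_opt = best_size(budget)
    print(f"{budget:6.1f} PF-days: N*={n_opt:.2e}  loss={loss_opt:.3f}  "
          f"floor={floor_opt:.3f}  ({loss_opt / floor_opt - 1:.0%} above floor)")
```

In this toy the chosen model sits about ten percent above its own floor at every budget: the remaining training would cost more compute than switching to a larger model.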

Fitting the frontier gives the allocation. The optimal model size grows as

$$N(C_{\min}) \propto C_{\min}^{\,0.73} \tag{9}$$

while the batch size grows as $C^{0.24}$ (tracking the critical batch size, the point past which more parallelism stops buying speed) and the number of steps as $C^{0.03}$ , which is nearly flat. (The data, $D = B\cdot S$ , therefore grows only as about $C^{0.27}$ .) Put it together: as you get more compute, pour almost all of it into a bigger model, grow the batch to match, and barely train any longer.

# how Kaplan says to spend 10x more compute (exponents, eq 6.1-6.2)
factor = 10
N_up = factor ** 0.73          # model size  -> 5.4x bigger
B_up = factor ** 0.24          # batch size  -> 1.7x
S_up = factor ** 0.03          # train steps -> 1.07x  (almost flat)
# nearly all of it buys a bigger model; you barely train any longer

This was the paper's practical headline, the part the field followed, and the part that came out skewed. The mechanism is sound: train a big model along the frontier and stop it early. The exponent 0.73 is the number later work corrected, and later replications traced the cause to small-scale measurement.

Where the laws break, and a conjecture

The compute law contains a contradiction the authors are careful to flag. Compute-optimal training grows the data slowly, as about $C^{0.27}$ . But the loss the compute law predicts falls as $C^{-0.05}$ , faster than the loss floor that little data can support, which falls only as about $C^{-0.03}$ . Two power laws with different slopes have to cross.

Beyond the crossing the compute law would predict a loss lower than the data on hand can possibly deliver, which is impossible. So the laws must break down at or before that point. The crossing sits far past anything tested:

$$C^{*} \sim 10^{4}\ \text{PF-days}, \quad N^{*} \sim 10^{12}\ \text{parameters}, \quad D^{*} \sim 10^{12}\ \text{tokens}, \quad L^{*} \sim 1.7\ \text{nats/token} \tag{10}$$
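Two power laws with different slopes cross at a point you can solve for in closed form. The sketch below takes the compute law from equation (3) as given. The prefactor of the data-limited floor is not stated in this article, so it is back-solved here so that the crossing lands at the quoted $10^{4}$ PF-days; that value is an assumption for illustration, not a fitted constant.

```python
S_COMPUTE, S_DATA = 0.050, 0.03            # slopes quoted in the text
A_COMPUTE = (3.1e8) ** S_COMPUTE           # prefactor of equation (3), C in PF-days

def crossing(a1, s1, a2, s2):
    """Compute C at which a1 * C**-s1 equals a2 * C**-s2."""
    return (a1 / a2) ** (1.0 / (s1 - s2))

# ASSUMED floor prefactor, chosen so the lines meet at 1e4 PF-days.
a_data = A_COMPUTE * 1e4 ** (S_DATA - S_COMPUTE)

c_star = crossing(A_COMPUTE, S_COMPUTE, a_data, S_DATA)
loss_star = A_COMPUTE * c_star ** (-S_COMPUTE)
print(f"crossing at {c_star:.3g} PF-days, loss {loss_star:.2f} nats")

# Sensitivity: raise the floor by five percent and the crossing moves by over 10x.
c_shifted = crossing(A_COMPUTE, S_COMPUTE, 1.05 * a_data, S_DATA)
print(f"with a 5% higher floor: {c_shifted:.3g} PF-days")
```

The loss at the crossing comes out near 1.7 nats directly from equation (3). Because the slopes differ by only 0.02, a small shift in either line moves the crossing by an order of magnitude.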

Slide the compute up and watch the predicted compute trend dive toward the achievable data floor until the two meet, the point past which the laws cannot hold:

Figure 5 · where the laws break

The compute trend falls faster (slope −0.05) than the data-limited floor (slope −0.03), so it would cross below it at about $10^{4}$ PF-days and $L^{*} \sim 1.7$ nats. Past the crossing the prediction is impossible, so the laws break. The authors conjecture $L^{*}$ is a rough estimate of the entropy of language.

The authors stress that these numbers are uncertain by an order of magnitude in either direction. Then they offer a conjecture, and label it as one. Perhaps at the crossing a model has extracted everything reliable in the text, so $L^{*} \approx 1.7$ nats per token would be a rough estimate of the entropy of natural language itself, the floor no amount of scale can beat.

What it changed, and what it got wrong

The paper reset how the field reasoned about scale. Before it, scaling was a hunch. After it, you fit a line and extrapolated. GPT-3, a few months later, was in large part a bet on these curves: build a model a hundred times bigger than anything before and trust the loss to keep falling. It did. The authors reach for the ideal-gas law as an analogy, and it fits: just as pressure, volume, and temperature obey one simple relation no matter how the individual molecules move, loss obeys one simple relation in size, data, and compute regardless of most architectural detail.

Later work corrected the allocation advice. Pour new compute into model size, barely train longer: the prescription came out of the same fits that had just been vindicated at GPT-3 scale, so the field followed it, and the large models of the era (GPT-3, DeepMind's Gopher, and Microsoft/NVIDIA's Megatron-Turing) were all built to its shape. Then in 2022 DeepMind's Chinchilla redid the measurement and found that size and data should grow about equally, each as roughly $C^{0.5}$ , near twenty tokens per parameter, not the size-heavy $C^{0.73}$ this paper reported. The power laws themselves were never in dispute; the disagreement was only about how to divide a budget. But the budget split was exactly what the field had taken from the paper, and the models built on Kaplan's recipe cost about the right total but were too large for the data they were shown.
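The two recipes can be put side by side. This sketch uses only exponents and the twenty-tokens-per-parameter figure quoted in this article, plus $C \approx 6ND$ (equation (5) with $D = B\cdot S$); the function names and the example budget of $10^{23}$ FLOPs are made up for illustration.

```python
def kaplan_scaling(compute_factor):
    """Growth in (model size, data) for a given growth in compute, per this paper."""
    n_up = compute_factor ** 0.73
    b_up = compute_factor ** 0.24
    s_up = compute_factor ** 0.03
    return n_up, b_up * s_up              # data = batch * steps, so about C**0.27

def chinchilla_scaling(compute_factor):
    """Growth in (model size, data) when both scale as roughly C**0.5."""
    return compute_factor ** 0.5, compute_factor ** 0.5

def chinchilla_split(flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n_params = (flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

print("10x compute, Kaplan:     N x%.2f, D x%.2f" % kaplan_scaling(10))
print("10x compute, Chinchilla: N x%.2f, D x%.2f" % chinchilla_scaling(10))
print("1e23 FLOPs, 20 tokens/param: N=%.2e, D=%.2e" % chinchilla_split(1e23))
```

For the same tenfold budget, the first recipe grows the model about 5.4 times and the data under 2 times; the second grows both about 3.2 times.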

Chinchilla's own hypothesis for the gap was a single bug: a fixed cosine learning-rate schedule (the learning rate decays to near zero only at a preset final step, so a checkpoint taken partway through a run that was tuned to end much later still has a high learning rate and a loss that reads artificially high), read off at the wrong horizons. That explanation is only part of the story. Two 2024 replications took the gap apart. Pearce and Song found the largest single cause was bookkeeping: Kaplan counted non-embedding parameters, the choice that made the $L(N)$ line well-behaved, and at the small scales where these fits live the embedding and output-head terms are a sizable fraction, so the fitted exponent came out too steep. Porian and colleagues ranked the rest: omitting the decoder-head compute, a warmup held fixed at 3000 steps that was too long for the smallest models, and not re-tuning the optimizer per scale. They reproduce Chinchilla's near-0.5 exponent even with a constant learning rate, which rules the schedule out as the essential cause. None of it touches the power laws; the error lived in the small-scale accounting done before the extrapolation.
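The schedule effect in Chinchilla's hypothesis is easy to see numerically. A minimal sketch of a plain cosine decay, with the peak normalized to 1, no warmup, and hypothetical horizons; it is not the schedule code of either paper.

```python
import math

def cosine_lr(step, total_steps, peak=1.0, floor=0.0):
    """Cosine decay from `peak` at step 0 to `floor` at `total_steps`."""
    progress = min(step / total_steps, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

checkpoint = 10_000
matched = cosine_lr(checkpoint, total_steps=10_000)      # schedule ends at the checkpoint
mismatched = cosine_lr(checkpoint, total_steps=100_000)  # schedule tuned to end 10x later

print(f"matched horizon:    lr = {matched:.3f}")     # fully decayed
print(f"mismatched horizon: lr = {mismatched:.3f}")  # still close to the peak
```

At the same step, the run on the longer schedule is still training at nearly the peak learning rate, so its loss at that checkpoint reads higher than a run whose schedule was set to finish there.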

What survives is the shape. Loss falls as a power law in scale, those laws compose into a joint law over size and data, and that joint law both tells you how to divide a budget and shows where the curves have to give out. The specific compute-optimal exponent was an artifact of how the smallest models were measured. The discovery underneath it, that the loss of a language model is a smooth, predictable function of scale, is one of the load-bearing facts of the last decade.

Provenance: verified against primary literature

Kaplan et al. (2020): The power laws L(N), L(D), L(C_min), the joint L(N,D), the 6N compute rule, and the size-heavy compute-optimal Nₒₚₜ ∝ C^0.73, all on WebText2.

Chinchilla (2022): Hoffmann et al. re-measured the compute-optimal split as roughly equal, Nₒₚₜ ∝ C^~0.5 and Dₒₚₜ ∝ C^~0.5 (about 20 tokens/param), correcting this paper.

Pearce & Song (2024), arXiv:2406.12907: counting non-embedding (not total) parameters at small scale is the single largest cause of Kaplan’s steeper exponent.

Porian et al. (2024), arXiv:2406.19146: ranks the rest (decoder-head FLOPs, over-long warmup, per-scale optimizer tuning); the learning-rate decay is the weakest lever.

Correction: This paper’s compute-optimal prescription (pour most new compute into model size, Nₒₚₜ ∝ C^0.73, and stop well before convergence) was overturned by Chinchilla (2022), which found size and data should scale about equally, each ∝ C^~0.5. The power laws themselves hold; only the optimal split was off. The cause is methodological and multi-factor, not one bug: counting non-embedding instead of total parameters at small scale (Pearce & Song, 2406.12907), an over-long fixed warmup, omitting the decoder-head FLOPs, and not re-tuning the optimizer per scale (Porian et al., 2406.19146). The popular "it was just the cosine learning-rate schedule" story is Chinchilla’s own hypothesis and is incomplete: Porian et al. match Chinchilla even with a constant learning rate.

Questions you might still have

Are these laws still believed, now that Chinchilla exists?
The power laws are, yes. Loss does fall as a smooth power law in size, data, and compute, and that has held up across many later models. What Chinchilla overturned was only this paper’s compute-optimal split (Nₒₚₜ ∝ C^0.73). The right split is closer to equal, about 20 tokens per parameter. Same curves, different advice on how to ride them.

Why count only non-embedding parameters?
Because the embedding and positional tables scale differently from the rest of the network, and including them bends the L(N) line and makes it depend on depth. Dropping them gives a tighter law. There is a cost: at the small model sizes these fits live on, the embedding and output-head terms are a large fraction, and this very choice is part of why the compute-optimal exponent came out too steep.

If the loss keeps falling as a power law, does the model keep getting better forever?
No, for two reasons. A power law means diminishing returns: each further factor of ten in compute buys a smaller and smaller absolute drop. And the paper’s own §6.3 shows the laws must break around L* ~ 1.7 nats per token, where the compute trend would demand more from the data than a single pass can give. The gains keep coming and keep getting smaller, and there is a floor.

Loss is in "nats." What does a loss of 2 actually mean?
Nats are the natural-log version of bits (one nat ≈ 1.44 bits). A cross-entropy of L nats is a perplexity of e^L: the model is, on average, as unsure as if it were guessing uniformly among that many equally likely next tokens. A loss of 2 nats is a perplexity near 7.4.

Footnotes & further reading

The paper: Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, Amodei, Scaling Laws for Neural Language Models (OpenAI, 2020).
The follow-up that corrected the compute-optimal split: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, DeepMind 2022). We have an explainer of Chinchilla too.
The two reconciliations of the Kaplan / Chinchilla gap: Pearce & Song, Reconciling Kaplan and Chinchilla Scaling Laws (2024), and Porian, Wortsman, Jitsev, Schmidt, Carmon, Resolving Discrepancies in Compute-Optimal Scaling of Language Models (NeurIPS 2024).
The WebText2 corpus and the reversible byte-pair tokenizer come from the GPT-2 work: Radford et al., Language Models are Unsupervised Multitask Learners (2019).
The bet these curves enabled: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020), and our GPT-3 explainer.
