Verified · arXiv:2005.14165 · 20 min
LLMs · Scaling

Language Models are Few-Shot Learners

Make it big enough and the prompt becomes the program.

GPT-3 is a 175-billion-parameter next-token predictor. Show it a few examples in the prompt and it does a new task, with no training. The surprise is that this ability barely exists in small models and arrives with scale.

Explaining the paper: Language Models are Few-Shot Learners · Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, et al. (OpenAI) · NeurIPS 2020 (Best Paper) · arXiv:2005.14165

What if adapting a language model to a new task meant writing a prompt instead of training a model?

By 2020 the recipe for a new language task was settled. Take a pretrained model, collect thousands to hundreds of thousands of labeled examples for your task, and fine-tune: keep training until the weights drift to fit it. It worked well. It also meant a fresh labeled dataset and a fresh model for every task you cared about, and the paper notes a subtler cost. A big model fine-tuned on a narrow dataset can latch onto quirks of that dataset and look better on the benchmark than it really is, because the comparison was never quite fair.

GPT-3 (from OpenAI) takes a different route. There is one model. You never change its weights. To do a task you write a prompt: a few worked examples followed by the question you want answered, all as plain text. The model reads it and continues it. The paper calls this in-context learning, and the headline finding is that it gets dramatically better as the model gets bigger. A small model barely does it. A 175-billion-parameter model does it well enough to rival fine-tuned systems on some tasks, having seen the task only in its prompt.

To see why that is surprising and how it works, we build up in order: what a language model computes, what it means to learn from the prompt, why scale was the lever, and what the giant could and could not do.

The cost of a model per task

Start with the thing GPT-3 is reacting against. The dominant paradigm, pre-train then fine-tune, gives you a task-agnostic architecture and then specializes it with gradient descent on task-specific data. The paper lays out three reasons it grates.

First, practicality. There is a wide range of useful language tasks, from fixing grammar to critiquing a short story, and most of them have no large labeled dataset sitting around. Collecting one for every new task is expensive, and you do it again each time.

Second, generalization. The more expressive the model and the narrower the fine-tuning set, the more room there is to exploit spurious correlations that hold on the benchmark and nowhere else. A model that scores at human level on a dataset may be much worse on the actual underlying task.

Third, the comparison to people. A human picks up a new language task from a short instruction or a couple of examples. We do not hand someone ten thousand labeled sentences to teach them to spot sarcasm. If the goal is broadly useful systems, the fine-tuning recipe is a strange fit for how the target ability is supposed to look.

GPT-3's answer is to push everything into the prompt and leave the weights alone. To understand what that buys, we first need to be precise about what the model is.

A language model predicts the next token

Strip away the mystique and a language model does one thing: given a stretch of text, it predicts what comes next. Text is first chopped into tokens (sub-word pieces, via byte-level byte-pair encoding inherited from GPT-2), so a sequence is a list of token ids $x_1, x_2, \dots, x_T$. The model is autoregressive: it factorizes the probability of the whole sequence into a product of next-token probabilities, each conditioned on everything before it.

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right) \tag{1}$$

Read it left to right. To score a sentence, score the first token, then the second given the first, then the third given the first two, and multiply. Each factor $p_\theta(x_t \mid x_{<t})$ is a full probability distribution over the vocabulary. The network emits one raw score per vocabulary token (a logit), and a softmax turns those scores into probabilities that sum to one. Training maximizes this likelihood on a giant pile of text, which is the same as minimizing the average cross-entropy, the negative log-probability the model assigned to each true next token:

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \tag{2}$$

That quantity is measured in nats per token. A loss near $\log V$ (for vocabulary size $V$) means the model is guessing uniformly at random. Driving it down means the model puts more of its probability on the token that actually came next. There is no task here and no labels beyond the text itself. The objective is "predict the next token," full stop.
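As a small numeric check of equations (1) and (2), the sketch below uses made-up per-token probabilities (not model outputs) to show that the sequence probability is the product of the factors, that the average loss is the mean negative log-probability, and that uniform guessing costs log V nats. The vocabulary size 50,257 is GPT-2's byte-level BPE vocabulary, used here as an assumption for illustration.

```python
import math

# Hypothetical probabilities a model assigned to each true next token
# (illustrative values, not taken from any real model).
token_probs = [0.20, 0.05, 0.34, 0.60, 0.10, 0.45]

# Equation (1): the sequence probability is the product of the factors.
seq_prob = math.prod(token_probs)

# Equation (2): average negative log-probability, in nats per token.
loss = -sum(math.log(p) for p in token_probs) / len(token_probs)

# The two views agree: exp(-T * loss) recovers the sequence probability.
assert math.isclose(math.exp(-len(token_probs) * loss), seq_prob)

# Uniform guessing over V tokens costs exactly log V nats per token.
V = 50257  # assumed: GPT-2's byte-level BPE vocabulary size
uniform_loss = math.log(V)

print(f"sequence probability: {seq_prob:.3e}")
print(f"average loss: {loss:.3f} nats/token")
print(f"uniform baseline: {uniform_loss:.3f} nats/token")
```

Any trained model should sit well below the uniform baseline; the distance between the two is what training buys.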

Figure 1 · next-token prediction
The model reads the prefix and emits a distribution over the next token. The loss at each step is the negative log-probability of the true next token (highlighted in teal). As the sentence advances, the prefix grows by one token. Training does one thing: make the teal bar taller, everywhere, across the whole training corpus.

Stripped to its core, the entire training objective is a few lines:

# the autoregressive objective: maximize the next-token likelihood
import numpy as np

def next_token_loss(model, x):
    """Average cross-entropy (nats per token) of a token-id sequence x.

    `model(prefix)` is assumed to return one logit per vocabulary token.
    """
    T = len(x)
    loss = 0.0
    for t in range(1, T):
        logits = np.asarray(model(x[:t]), dtype=float)  # read the prefix x[0..t-1]
        log_p = logits - np.logaddexp.reduce(logits)    # log-softmax over the vocabulary
        loss += -log_p[x[t]]                            # penalize surprise at the true next token
    return loss / (T - 1)                               # T - 1 predictions; x[0] has no prefix

One detail to keep. The model conditions on a fixed-length window of recent tokens, called the context. GPT-3's context window is $n_\text{ctx} = 2048$ tokens for every model size. Anything you want the model to use, instructions, examples, the question, has to fit inside those 2048 tokens. That constraint is about to become load-bearing.

In-context learning: the prompt is the program

The rest of the paper turns on one move. Because the model continues whatever text you give it, you can specify a task inside the text and let the forward pass do the rest. Want translation? Write a few lines of the form English: ... French: ..., then a final English line with the French left blank, and let the model complete it. No gradients, no fine-tuning. The demonstrations are conditioning, not training data.

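The paper names points on a spectrum by how many demonstrations $K$ the prompt contains: zero-shot ($K = 0$), one-shot ($K = 1$), and few-shot ($K$ typically between 10 and 100, as many as fit in the window).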

All three are the same operation: build a string, run one forward pass, read the completion. The only thing that changes is how much of the 2048-token window is spent on examples. This is a different thing from fine-tuning, where you would run backpropagation and change $\theta$. In-context learning never touches $\theta$.

Figure 2 · in-context learning
A few-shot prompt is K demonstrations followed by the final query, packed into the fixed 2048-token window and read in one forward pass. As K increases, accuracy (illustrative) climbs fast then flattens, the shape the paper reports, until the window fills and there is no room for more examples.

The whole mechanism, in code, is string concatenation:

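# few-shot prompt = K demonstrations, then the query (no completion)
def build_prompt(demos, query_ctx, K):
    """Concatenate K (context, completion) pairs, then the unanswered query."""
    prompt = ""
    for ctx, completion in demos[:K]:        # K examples drawn from the task
        prompt += ctx + " " + completion + "\n"
    return prompt + query_ctx + " "           # the example we want answered

# `model.generate` stands for any text-completion call; forward pass only, weights frozen
answer = model.generate(build_prompt(demos, query_ctx, K))
# K = 0 is zero-shot, K = 1 is one-shot, K in 10..100 is few-shot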
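
Because every demonstration spends part of the 2048-token window, K is capped by prompt length in practice. The sketch below packs as many demonstrations as fit under a token budget. It assumes a crude whitespace word count as a stand-in for a real tokenizer (byte-level BPE would give different counts), and `count_tokens`, `pack_demos`, and the `reserve` parameter are hypothetical names introduced for illustration.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def pack_demos(demos, query_ctx, n_ctx=2048, reserve=64):
    """Build a few-shot prompt with as many demos as fit in the window.

    `reserve` holds back room for the generated completion (assumed value).
    Returns the prompt and the number of demonstrations K actually used.
    """
    budget = n_ctx - reserve - count_tokens(query_ctx)
    lines, used = [], 0
    for ctx, completion in demos:
        line = f"{ctx} {completion}"
        cost = count_tokens(line)
        if used + cost > budget:
            break  # window is full; stop adding examples
        lines.append(line)
        used += cost
    lines.append(query_ctx + " ")
    return "\n".join(lines), len(lines) - 1

demos = [("English: cheese French:", "fromage"),
         ("English: sea otter French:", "loutre de mer"),
         ("English: peppermint French:", "menthe poivrée")]
# A deliberately tiny window so the cap is visible.
prompt, k = pack_demos(demos, "English: plush giraffe French:", n_ctx=16, reserve=4)
print(k)
print(prompt)
```

With the tiny 16-token window only the first demonstration fits; with the default 2048-token window all three do.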

Why would this work at all? The framing the paper offers is meta-learning. During pre-training, predicting the next token across hundreds of billions of tokens forces the model to pick up a broad set of skills and patterns. Among those patterns are tasks that recur inside a single passage, like a list of translations or a Q-and-A format. The model learns to recognize and continue such patterns. At test time, your prompt is one more such pattern, and the model continues it. The paper calls the slow accumulation of skills during pre-training the outer loop, and the fast recognition-and-continuation inside a single forward pass the inner loop, the part they name in-context learning.

Why scale: loss follows a power law

In-context learning was not new. GPT-2 had shown flickers of it the year before, but its results were far behind fine-tuning. GPT-3 bets the ability grows with scale rather than staying a small-model curiosity. The reason to believe that comes from a separate line of work on how loss behaves as you scale up.

Kaplan et al. (2020) found that the validation loss of a Transformer language model falls as a clean power law in scale. Hold data and compute generous, and loss versus the number of (non-embedding) parameters $N$ obeys

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8\times 10^{13} \tag{3}$$

and an analogous law holds for training compute $C$, with $L(C) = (C_c/C)^{\alpha_C}$, $\alpha_C \approx 0.050$, $C_c \approx 3.1\times 10^8$ petaflop/s-days. A power law is a straight line on log-log axes. The exponents are small, so loss falls slowly, but it keeps falling, predictably, with no sign of a wall. GPT-3's first result is that this line holds for two more orders of magnitude of compute past where Kaplan fit it, with only slight deviation.
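
To make the "straight line on log-log axes" concrete, this sketch evaluates the parameter law of equation (3) with the constants quoted above and checks that the slope in log-log space equals the negative exponent. Plugging GPT-3's total parameter counts in for N is an approximation made here for illustration, since Kaplan's N excludes embedding parameters, so the printed losses are indicative only.

```python
import math

ALPHA_N = 0.076   # exponent quoted above (Kaplan et al.)
N_C = 8.8e13      # constant N_c quoted above

def loss_from_params(n):
    """Equation (3): L(N) = (N_c / N) ** alpha_N, in nats per token."""
    return (N_C / n) ** ALPHA_N

# Parameter counts of the eight GPT-3 sizes (total, used as a rough proxy for N).
sizes = [125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9]
for n in sizes:
    print(f"N = {n:9.2e}  predicted loss = {loss_from_params(n):.3f}")

# On log-log axes the law is a line: slope = d log L / d log N = -alpha_N.
n1, n2 = sizes[0], sizes[-1]
slope = (math.log(loss_from_params(n2)) - math.log(loss_from_params(n1))) / (
    math.log(n2) - math.log(n1)
)
print(f"log-log slope: {slope:.3f}")

# A 10x increase in N multiplies the loss by 10 ** -alpha_N (about 0.84).
print(f"loss ratio per 10x parameters: {10 ** -ALPHA_N:.3f}")
```

The small exponent is why each tenfold increase in parameters trims the loss by only about 16%, and why the trend is still worth chasing: it does not flatten.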

Figure 3 · smooth scaling
Validation loss versus training compute, on log-log axes, where a power law is a straight line. The eight GPT-3 sizes sit on the dashed power-law line (Kaplan's fitted compute law, α_C ≈ 0.050), up to the 175B model at about 3.6k petaflop/s-days. Loss keeps dropping along the same line across four orders of magnitude of compute. Real per-model compute is from the paper's Table D.1.

Why does a loss curve matter for tasks? Because lower next-token loss tracks better performance on real language tasks. The paper's wager is that since in-context learning means absorbing skills into the weights and then deploying them at inference, it should ride the same smooth trend. Spend more compute, get lower loss, get a better in-context learner. The only way to test that was to build the biggest model anyone had built and measure.

The bet: few-shot ability emerges with scale

Across the 42 benchmarks the paper aggregates, accuracy in all three settings rises with model size, which is expected. The shape is the telling part. The few-shot curve climbs faster than the zero-shot curve, so the gap between them widens as the model grows. A small model gains almost nothing from having examples in its prompt. A large model gains a lot. That widening gap is the paper's evidence that larger models are better at learning in context, not only better across the board.

Figure 4 · few-shot emerges with scale
Aggregate accuracy over 42 benchmarks versus model size, for few-shot, one-shot, and zero-shot. All three rise, but few-shot rises fastest, so the gap between the few-shot (teal) and zero-shot (gray) curves widens, to about 13 points at 175B in this illustration. The curves follow the shape of the paper's Figure 1.3; the values are illustrative.

Concretely, on closed-book TriviaQA the full model scores 64.3% zero-shot, 68.0% one-shot, and 71.2% few-shot, the last of which beat the fine-tuned state of the art in the same closed-book setting at the time. On CoQA it reaches 81.5 / 84.0 / 85.0 F1 across the three settings. The examples in the prompt are worth real points, and they are worth more the bigger the model.
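
The size of the in-context gain can be read straight off these numbers. A minimal sketch that tabulates the improvement over zero-shot, using only the figures quoted above:

```python
# Scores quoted above for GPT-3 175B: (zero-shot, one-shot, few-shot).
results = {
    "TriviaQA (accuracy %)": (64.3, 68.0, 71.2),
    "CoQA (F1)": (81.5, 84.0, 85.0),
}

gains = {}
for task, (zero, one, few) in results.items():
    # Points gained purely from demonstrations in the prompt; weights are unchanged.
    gains[task] = round(few - zero, 1)
    print(f"{task}: one-shot +{one - zero:.1f}, few-shot +{few - zero:.1f}")
```

On both tasks the first demonstration buys the largest single step, and further demonstrations add less.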

The model: 175 billion parameters

The architecture is deliberately boring. GPT-3 reuses GPT-2's Transformer almost unchanged: the same pre-normalization, the same modified initialization, the same reversible byte-level tokenizer. The one structural change is that GPT-3 alternates dense attention with locally-banded sparse attention in its layers, following the Sparse Transformer. The point of the paper is scale, not a new architecture, so they hold the design fixed and turn the size dial.

They train eight models spanning three orders of magnitude, from 125 million to 175 billion parameters, so they can watch behavior as a function of size. The widths and depths grow together. The feed-forward layer is always four times the model width, $d_\text{ff} = 4\,d_\text{model}$, and the per-head dimension ranges from 64 in the smallest models to 128 in the largest. Every model uses the same 2048-token context window and trains on the same 300 billion tokens.

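| Model | Params | Layers | d_model | Heads | Batch size (tokens) |
|---|---|---|---|---|---|
| Small | 125M | 12 | 768 | 12 | 0.5M |
| Medium | 350M | 24 | 1024 | 16 | 0.5M |
| Large | 760M | 24 | 1536 | 16 | 0.5M |
| XL | 1.3B | 24 | 2048 | 24 | 1M |
| 2.7B | 2.7B | 32 | 2560 | 32 | 1M |
| 6.7B | 6.7B | 32 | 4096 | 32 | 2M |
| 13B | 13B | 40 | 5120 | 40 | 2M |
| 175B | 175B | 96 | 12288 | 96 | 3.2M |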
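
The table's parameter counts can be roughly recovered from depth and width alone. With the feed-forward layer four times the model width, each layer holds about 4·d_model² attention weights and 8·d_model² feed-forward weights, so N ≈ 12 · layers · d_model². This is a standard approximation rather than the paper's own accounting; it ignores embeddings, biases, and layer norms, which is why it undershoots the smallest models most.

```python
# (name, reported params, layers, d_model) from the table above.
MODELS = [
    ("Small", 125e6, 12, 768),
    ("Medium", 350e6, 24, 1024),
    ("Large", 760e6, 24, 1536),
    ("XL", 1.3e9, 24, 2048),
    ("2.7B", 2.7e9, 32, 2560),
    ("6.7B", 6.7e9, 32, 4096),
    ("13B", 13e9, 40, 5120),
    ("175B", 175e9, 96, 12288),
]

def approx_params(n_layers, d_model):
    """Non-embedding estimate: 4*d^2 (attention) + 8*d^2 (feed-forward) per layer."""
    return 12 * n_layers * d_model ** 2

for name, reported, n_layers, d_model in MODELS:
    est = approx_params(n_layers, d_model)
    print(f"{name:>6}: estimated {est / 1e9:7.2f}B  reported {reported / 1e9:7.2f}B")
```

For the largest model the estimate lands within about 1% of the reported 175 billion.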

The full model has 96 layers, a width of 12,288, and 96 attention heads of dimension 128. Its training cost, from the paper's own accounting, was about 3,640 petaflop/s-days (one petaflop/s-day is a day of computing at $10^{15}$ operations per second), which is roughly $3.1\times 10^{23}$ floating-point operations. That figure has a clean back-of-envelope form: training takes about 6 flops per parameter per token,

$$C \approx 6\,N D = 6 \times (175\times 10^9) \times (300\times 10^9) \approx 3.15\times 10^{23}\ \text{flops} \tag{4}$$

where $N$ is the number of parameters and $D$ is the number of training tokens. The factor of 6 is 2 flops per parameter for the multiply-and-add in the forward pass, times 3 because the backward pass costs about twice as much as the forward pass. One detail the paper flags about its own data: the 300 billion training tokens were not sampled in proportion to corpus size. Higher-quality sources were upweighted, so filtered Common Crawl (410 billion tokens) was seen less than once on average while Wikipedia (3 billion tokens) was seen about 3.4 times.
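
Equation (4) and the petaflop/s-day figure can be checked against each other in a few lines. This sketch applies the 6ND rule of thumb and converts flops to petaflop/s-days; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
PFLOP_PER_SECOND = 1e15  # one petaflop/s

def training_flops(n_params, n_tokens):
    """Rule of thumb: about 6 flops per parameter per training token."""
    return 6 * n_params * n_tokens

def to_pf_days(flops):
    """Convert a flop count to petaflop/s-days."""
    return flops / (PFLOP_PER_SECOND * SECONDS_PER_DAY)

flops = training_flops(175e9, 300e9)
print(f"{flops:.2e} flops")
print(f"{to_pf_days(flops):,.0f} PF-days")
```

The estimate comes out within a fraction of a percent of the roughly 3,640 petaflop/s-days the paper reports.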

What it actually does

The most striking results are the synthetic tasks the authors invented specifically to test on-the-fly reasoning, things very unlikely to sit verbatim in the training data. Arithmetic is the cleanest. They asked GPT-3 questions like Q: What is 48 plus 76? A: and scored exact-match on 2,000 random instances per task.

In the few-shot setting the full 175B model gets 100% on 2-digit addition, 98.9% on 2-digit subtraction, and 80.4% on 3-digit addition. Accuracy falls as the digits grow (about 25 to 27% on 4-digit, 9 to 10% on 5-digit) and 2-digit multiplication sits at 29.2%. Memorization is an unlikely explanation: as a spot check, the authors searched the training set for the 3-digit test problems and found matches for under 1% of them, and the model's wrong answers look like arithmetic slips, such as forgetting to carry a 1.
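
The evaluation protocol is simple enough to sketch: sample random problems, format each as a question, and count exact string matches. In the sketch below the model is replaced by two hypothetical stand-in solvers, one correct and one that drops every carry, to show how a carry-forgetting error pattern appears in exact-match accuracy. The sampling range and helper names are assumptions for illustration, not the paper's code.

```python
import random

def make_problems(n_digits, n=2000, seed=0):
    """Sample n random addition problems with operands below 10**n_digits."""
    rng = random.Random(seed)
    hi = 10 ** n_digits - 1
    return [(rng.randint(0, hi), rng.randint(0, hi)) for _ in range(n)]

def format_prompt(a, b):
    """Question format quoted in the text."""
    return f"Q: What is {a} plus {b}? A:"

def perfect_solver(a, b):
    return str(a + b)

def no_carry_solver(a, b):
    """Adds digit by digit but forgets every carry (a deliberately flawed stand-in)."""
    digits, place = [], 1
    while a >= place or b >= place:
        digits.append(str((a // place % 10 + b // place % 10) % 10))
        place *= 10
    return "".join(reversed(digits)).lstrip("0") or "0"

def exact_match(solver, problems):
    """Fraction of problems where the answer string equals the true sum."""
    hits = sum(solver(a, b) == str(a + b) for a, b in problems)
    return hits / len(problems)

problems = make_problems(n_digits=2)
print(format_prompt(*problems[0]))
print("perfect:", exact_match(perfect_solver, problems))
print("no-carry:", exact_match(no_carry_solver, problems))
```

Exact match gives no partial credit, so a solver that only fails when a carry is needed still loses every such problem outright.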

Figure 5 · arithmetic by setting
GPT-3 175B on the 10-task arithmetic battery, the exact accuracies from Table 3.9. Two-digit addition and subtraction are near-perfect in few-shot and three-digit is strong; harder operations are far lower. Moving from zero-shot to few-shot lifts the whole profile, with no change to the weights, only the prompt.

The arithmetic also shows the scale effect at its sharpest. There is a large jump from the second-largest model (13B) to the full 175B. The 13B model solves 2-digit addition and subtraction only about half the time and scores under 10% on everything else. The capability appears near the top of the size ladder, which is exactly the pattern the paper is pointing at.

The same story repeats elsewhere. GPT-3 does word unscrambling, uses a freshly-defined nonsense word in a sentence after seeing it once, and translates, with one-shot and few-shot well ahead of zero-shot. And it generates news articles people cannot reliably flag as machine-written. In the paper's human evaluation, participants distinguished GPT-3 175B's ~200-word articles from real ones with mean accuracy of about 52%, where 50% is chance, down from about 86% on a deliberately weak control model. Longer ~500-word articles gave the same result, still near chance.

Contamination and limits

A model trained on much of the web raises an obvious worry: maybe it does well on a benchmark because the benchmark was in the training data. The paper takes this seriously and runs a systematic contamination study, building a "clean" version of each benchmark with train-test overlaps removed and re-scoring on it. They are candid that a bug in the original filtering let some overlaps into the training data, and that retraining the model to fix it was too expensive. For most benchmarks the clean-subset performance barely moved, so contamination had little effect. A few were flagged: results on PIQA and Winograd carry an asterisk in the paper, and several Wikipedia language modeling benchmarks were dropped entirely because they were almost fully contained in the training set.
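
One common way to detect such overlap is to look for long word n-grams shared between a benchmark example and the training text. The sketch below is a minimal version of that idea, not the paper's pipeline: the choice n = 13, the whitespace tokenization, and the function names are assumptions made for illustration.

```python
def word_ngrams(text, n):
    """Set of all length-n word windows in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def split_clean_dirty(benchmark, training_docs, n=13):
    """Partition benchmark examples by whether they share any n-gram with training."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= word_ngrams(doc, n)
    clean, dirty = [], []
    for example in benchmark:
        overlaps = word_ngrams(example, n) & train_grams
        (dirty if overlaps else clean).append(example)
    return clean, dirty

# Toy data: the first benchmark item repeats a 13-word span of the training text.
training_docs = ["the quick brown fox jumps over the lazy dog near the old river bank today"]
benchmark = [
    "quick brown fox jumps over the lazy dog near the old river bank today indeed",
    "a completely different sentence that shares no long span with the training text at all",
]
clean, dirty = split_clean_dirty(benchmark, training_docs)
print(len(clean), len(dirty))
```

Re-scoring the model on the clean list and comparing with the score on the full benchmark is the comparison the paragraph describes.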

The limitations are real and the authors list them plainly. Generated text still loses coherence and contradicts itself over long passages. Some task types stay near chance even few-shot, in particular "comparison" tasks like deciding whether one sentence entails another (ANLI) or whether a word is used the same way in two sentences (WiC). The authors suspect the autoregressive, one-direction-only objective hurts here, since these tasks reward looking back and forth, and conjecture a bidirectional model at this scale would do better when fine-tuned. The model is expensive to run, its sample efficiency during pre-training is poor (it sees far more text than a human ever does), and a fundamental open question remains: whether few-shot prompting teaches a genuinely new task at inference time or just surfaces one already learned during pre-training. The paper does not claim to know.

Set the caveats aside and the through-line is simple. There is no new architecture and no clever training objective. There is the same next-token prediction, run at a scale nobody had tried, and an ability that small models lack shows up at the top of the curve. Adapting the model stopped meaning "collect a dataset and train" and started meaning "write a prompt." That shift, more than the parameter count, is what the years since have been built on.

Provenance · Verified against primary literature
GPT-3 (2020) · Brown et al.: the 175B model, the zero/one/few-shot setup, Table 2.1 (8 model sizes), Table 3.9 (arithmetic), the contamination study, and the news-article human evaluation.
Scaling laws (2020) · Kaplan et al.: the power laws GPT-3 extends. We use their verified compute-law constants α_C ≈ 0.050, C_c ≈ 3.1e8 PF-days.
GPT-2 (2019) · Radford et al.: the architecture GPT-3 reuses (pre-normalization, modified init, byte-level BPE) and the in-context-learning idea GPT-3 scales up.
Sparse Transformer (2019) · Child et al.: the alternating dense / locally-banded sparse attention GPT-3 uses in its layers.
Correction · Table 2.1 lists d_model = 5140 for GPT-3 13B, which is not divisible by its 40 heads. The intended value is 5120 (40 heads × 128 d_head), an apparent typo in the published table. The prose also reports 80.2% on 3-digit addition while Table 3.9 lists 80.4%; we follow the table.

Questions you might still have

Does few-shot learning actually update the weights?
No. Nothing about the model changes between zero-shot and few-shot. The demonstrations are just more tokens in the prompt, read in the same forward pass. The paper calls the weight-changing version fine-tuning and does not do it for GPT-3.

So is the model learning the task, or recognizing one it already saw?
The paper is honest that it cannot tell. Few-shot ability lives on a spectrum from recognizing a task seen in pre-training to adapting to a new one. Word-unscrambling and nonsense-word tasks look learned on the spot; translation must have been learned in pre-training. Which end a given task sits on is open.

Why does making the model bigger help so much?
Pre-training loss follows a smooth power law in scale (Kaplan et al.), and lower loss tracks better downstream performance. GPT-3 extends that line two orders of magnitude with little deviation. In-context learning rides the same trend: the few-shot curve climbs faster than zero-shot, so the gap between them widens as the model grows.

Footnotes & further reading

  1. The paper: Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, et al. (OpenAI), Language Models are Few-Shot Learners (NeurIPS 2020, Best Paper). Model sizes are Table 2.1; arithmetic is Table 3.9; per-model compute is Table D.1.
  2. The scaling laws GPT-3 extends: Kaplan, McCandlish, et al., Scaling Laws for Neural Language Models. The constants α_N ≈ 0.076, N_c ≈ 8.8e13 and α_C ≈ 0.050, C_c ≈ 3.1e8 PF-days are from its abstract and Section 1.
  3. The architecture GPT-3 reuses: Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2), which also introduced the in-context-learning idea GPT-3 scales.
  4. The sparse attention pattern: Child, Gray, Radford, Sutskever, Generating Long Sequences with Sparse Transformers.
  5. The Transformer GPT-2 and GPT-3 are built on: Vaswani et al., Attention Is All You Need.
  6. The compute-optimal follow-up that revisits Kaplan's exponents: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla), which argues GPT-3 was undertrained for its size.