Language Models are Few-Shot Learners

Make it big enough and a few examples in the prompt teach it a new task.

GPT-3 is a 175-billion-parameter next-token predictor. Show it a few examples in the prompt and it does a new task, with no training. Notably, this ability barely exists in small models and arrives with scale.

Explaining the paper: Language Models are Few-Shot Learners. Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, et al. (OpenAI) · NeurIPS 2020 (Best Paper) · arXiv:2005.14165

Adapting a model to a new task used to mean training it. Here it means writing a prompt.

By 2020 the recipe for a new language task was settled. Take a pretrained model, collect thousands to hundreds of thousands of labeled examples for your task, and fine-tune: keep training until the weights drift to fit it. It worked well. It also meant a fresh labeled dataset and a fresh model for every task you cared about, and the paper notes a subtler cost. A big model fine-tuned on a narrow dataset can latch onto quirks of that dataset and look better on the benchmark than it really is, because the comparison was never quite fair.

GPT-3 (from OpenAI) takes a different route. There is one model. You never change its weights. To do a task you write a prompt: a few worked examples followed by the question you want answered, all as plain text. The model reads it and continues it. The paper calls this in-context learning, and it gets dramatically better as the model gets bigger. A small model barely does it. A 175-billion-parameter model does it well enough to rival fine-tuned systems on some tasks, having seen the task only in its prompt.

The rest of the page builds that up: what a language model computes, what it means to learn from the prompt, why scale was the lever, and what the giant could and could not do.

The cost of a model per task

The dominant recipe, pre-train then fine-tune, gives you a task-agnostic architecture and then specializes it with gradient descent on task-specific data. The paper lays out three problems with it.

First, practicality. There is a wide range of useful language tasks, from fixing grammar to critiquing a short story, and most of them have no large labeled dataset sitting around. Collecting one for every new task is expensive, and you do it again each time.

Second, generalization. The more expressive the model and the narrower the fine-tuning set, the more room there is to exploit spurious correlations that hold on the benchmark and nowhere else. A model that scores at human level on a dataset may be much worse on the actual underlying task.

Third, the comparison to people. A human picks up a new language task from a short instruction or a couple of examples. We do not hand someone ten thousand labeled sentences to teach them to spot sarcasm. If the goal is broadly useful systems, the fine-tuning recipe is a strange fit for how the target ability is supposed to look.

GPT-3 pushes everything into the prompt and leaves the weights alone. To understand what that buys, we first need to be precise about what the model is.

A language model predicts the next token

At its core a language model does one thing: given a stretch of text, it predicts what comes next. Text is first chopped into tokens (sub-word pieces, via byte-level byte-pair encoding inherited from GPT-2), so a sequence is a list of token ids $x_1, x_2, \dots, x_T$. The model is autoregressive: it factorizes the probability of the whole sequence into a product of next-token probabilities, each conditioned on everything before it.

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right) \tag{1}$$

Read it left to right. To score a sentence, score the first token, then the second given the first, then the third given the first two, and multiply. Each factor $p_\theta(x_t \mid x_{<t})$ is a full probability distribution over the vocabulary. The network emits one raw score per vocabulary token (a logit), and a softmax turns those scores into probabilities that sum to one. Training maximizes this likelihood on a giant pile of text, which is the same as minimizing the average cross-entropy, the negative log-probability the model assigned to each true next token:

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \tag{2}$$

That quantity is measured in nats per token. A loss near $\log V$ (for vocabulary size $V$) means the model is guessing uniformly at random. Driving it down means the model puts more of its probability on the token that actually came next. There is no task here and no labels beyond the text itself.
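A quick numeric check of the uniform-guessing claim, as a sketch. The vocabulary size of 50,257 is the GPT-2 byte-level BPE vocabulary and is an assumption here, since the text above does not state $V$; the helper name `cross_entropy` is illustrative.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability (in nats) assigned to the true next token."""
    return -math.log(probs[target])

V = 50257                            # assumed vocabulary size (GPT-2 byte-level BPE)
uniform = [1.0 / V] * V              # a model that guesses uniformly at random
uniform_loss = cross_entropy(uniform, target=123)   # equals log V, about 10.8 nats

# A model that puts half its probability on the true token pays log 2 nats.
confident = [0.5 / (V - 1)] * V
confident[123] = 0.5
confident_loss = cross_entropy(confident, target=123)

print(f"uniform: {uniform_loss:.3f} nats, confident: {confident_loss:.3f} nats")
```

The target index 123 is arbitrary: under a uniform distribution every token costs the same.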

Figure 1 · next-token prediction

The model reads the prefix and emits a distribution over the next token. The true next token is in teal; the loss is its negative log-probability, shown below the bars. Step through the sentence and watch the prefix grow. Training does one thing: make the teal bar taller, everywhere, across hundreds of billions of tokens.

The training objective, in code, is a few lines:

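# the autoregressive objective: maximize the next-token likelihood
# x is a list of token ids [x_1, ..., x_T], stored 0-indexed as x[0..T-1];
# model(prefix) is assumed to return one logit per vocabulary token
import numpy as np

def next_token_loss(model, x):
    T, loss = len(x), 0.0
    for t in range(1, T):                     # x[0] has no prefix, so it is not scored
        logits = np.asarray(model(x[:t]), dtype=float)   # read the prefix x[0..t-1]
        log_p = logits - logits.max()                    # numerically stable log-softmax
        log_p -= np.log(np.exp(log_p).sum())             # over the whole vocabulary
        loss -= log_p[x[t]]                   # penalize surprise at the true next token
    return loss / (T - 1)                     # average cross-entropy (nats per token)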

The model conditions on a fixed-length window of recent tokens, called the context. GPT-3's context window is $n_\text{ctx} = 2048$ tokens for every model size. Anything you want the model to use (instructions, examples, the question) has to fit inside those 2048 tokens. Each demonstration you add leaves fewer tokens for the instructions and the question, which is what caps how many examples a few-shot prompt can hold.

In-context learning: specify the task in the prompt

The rest of the paper depends on one mechanism. Because the model continues whatever text you give it, you can specify a task inside the text and let the forward pass do the rest. Want translation? Write a few lines of the form English: ... French: ..., then a final English line with the French left blank, and let the model complete it. No gradients, no fine-tuning. The demonstrations are conditioning, not training data.

The paper names points on a spectrum by how many demonstrations $K$ the prompt contains:

Zero-shot ($K=0$): only a natural-language instruction, no examples.
One-shot ($K=1$): an instruction plus a single demonstration. This is closest to how you would brief a person.
Few-shot ($K$ in the tens): as many demonstrations as fit the window, typically 10 to 100.

All three are the same operation: build a string, run one forward pass, read the completion. The only thing that changes is how much of the 2048-token window is spent on examples. This is a different thing from fine-tuning, where you would run backpropagation and change $\theta$. In-context learning never touches $\theta$.

Made concrete, a one-shot English-to-French prompt is this string, with one demonstration pair and then a new source line left for the model to finish:

Translate English to French:

English: sea otter
French: loutre de mer

English: cheese
French:

The model reads it and continues with fromage, because the format makes that the expected next token. The example has a cost: the instruction line, the demonstration pair, and the new source line all spend tokens from the same fixed 2048. This little prompt is only a couple dozen tokens, so the budget is barely touched. In a real few-shot prompt, though, the demonstrations consume most of the tokens, and longer ones (a paragraph of context, a worked solution) add up fast, which sets the ceiling on how many demonstrations the prompt can hold.
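The budget arithmetic can be sketched in a few lines. Everything here other than the 2048-token window is an assumption for illustration: `count_tokens` is a crude whitespace counter standing in for a real byte-level BPE tokenizer (which would usually report more tokens), and `reserve_for_answer` is a hypothetical allowance for the completion.

```python
N_CTX = 2048  # GPT-3 context window, in tokens

def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def max_demonstrations(instruction, demo, query, reserve_for_answer=32):
    """Largest K such that instruction + K demos + query + answer fit in N_CTX."""
    fixed = count_tokens(instruction) + count_tokens(query) + reserve_for_answer
    per_demo = count_tokens(demo)
    return max(0, (N_CTX - fixed) // per_demo)

instruction = "Translate English to French:"
demo = "English: sea otter\nFrench: loutre de mer"
query = "English: cheese\nFrench:"

# Short demonstrations: hundreds fit.
k_short = max_demonstrations(instruction, demo, query)

# Demonstrations padded with a 200-word passage: only a handful fit.
long_demo = demo + " " + "word " * 200
k_long = max_demonstrations(instruction, long_demo, query)

print(k_short, k_long)
```

Under this rough count the short translation pairs would fit by the hundreds, while paragraph-length demonstrations cap out in the single digits, which matches the point above that long examples are what exhaust the window.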

Figure 2 · in-context learning

A few-shot prompt is K demonstrations followed by the final query, packed into the fixed 2048-token window and read in one forward pass. Drag K. Accuracy (illustrative) climbs fast then flattens, the shape the paper reports, until the window fills and there is no room for more examples.

The mechanism, in code, is string concatenation:

# few-shot prompt = K demonstrations, then the query (no completion)
def build_prompt(demos, query_ctx, K):
    prompt = ""
    for ctx, completion in demos[:K]:          # K examples drawn from the task
        prompt += ctx + " " + completion + "\n"
    return prompt + query_ctx + " "            # the example we want answered

def few_shot_answer(model, demos, query_ctx, K):
    # model.generate stands in for whatever sampling call the model exposes
    return model.generate(build_prompt(demos, query_ctx, K))   # forward pass only; weights frozen
# K = 0 is zero-shot, K = 1 is one-shot, K in 10..100 is few-shot

Why would this work at all? The paper frames it as meta-learning, but the underlying reason is simple. Web-scale text contains endless repeated formats (Q-and-A pairs, lists, translations, code), and predicting the next token inside such a passage forces the model to infer the format from the first few items, because the next token depends on it. Pre-training therefore builds a broad set of skills and patterns, among them the habit of recognizing a format and continuing it. At test time, a prompt with examples is one more format, defined on the spot, and the model continues it. The paper calls the slow accumulation of skills during pre-training the outer loop, and the fast recognition-and-continuation inside a single forward pass the inner loop; the inner loop is what it names in-context learning.

That framing also offers a story for why the ability grows with scale, and the paper presents it as an interpretation rather than a proven mechanism. Pre-training on hundreds of billions of tokens shows the model the same formats over and over. A large model has the capacity to internalize enough of them that a few examples pin down which one it is looking at, so it recognizes the format and continues it. A small model predicts tokens fluently but never built that library, so the same few examples leave it guessing. On this reading the few-shot gain is not the model learning a new task in the prompt; it is a big model matching the prompt to a format it already knows, which is why the gain barely shows up until the model is large. The paper is careful not to claim this as established, in keeping with its general hedge about what emerges with scale.

Why scale: loss follows a power law

In-context learning was not new. GPT-2 had shown flickers of it the year before, but the results were far behind fine-tuning. GPT-3 bets that the ability grows with scale rather than staying a small-model curiosity. The reason to believe that comes from a separate line of work on how loss behaves as you scale up.

Kaplan et al. (2020) found that the validation loss of a Transformer language model falls as a clean power law in scale. Given generous data and compute, loss versus the number of (non-embedding) parameters $N$ obeys

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076,\quad N_c \approx 8.8\times 10^{13} \tag{3}$$

and an analogous law holds for compute along Kaplan's compute-efficient frontier, $L(C) = (C_c/C)^{\alpha_C}$, with $\alpha_C \approx 0.050$ and $C_c \approx 3.1\times 10^8$ petaflop/s-days. A power law is a straight line on log-log axes. The exponents are small, so loss falls slowly, but it keeps falling, predictably, with no sign of a wall. The first thing GPT-3 shows is that this line holds for two more orders of magnitude of compute past where Kaplan fit it, with only slight deviation.
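To get a feel for how slowly a power law with these exponents falls, here is a small sketch that evaluates both laws with the constants quoted above. One caveat is an assumption of this sketch: $N$ in equation (3) counts non-embedding parameters, and the headline model sizes are used here as a rough stand-in.

```python
import math

ALPHA_N, N_C = 0.076, 8.8e13     # parameter law constants from equation (3)
ALPHA_C, C_C = 0.050, 3.1e8      # compute law constants, C in petaflop/s-days

def loss_from_params(n):
    """L(N) = (N_c / N) ** alpha_N, in nats per token."""
    return (N_C / n) ** ALPHA_N

def loss_from_compute(c):
    """L(C) = (C_c / C) ** alpha_C, with C in petaflop/s-days."""
    return (C_C / c) ** ALPHA_C

sizes = [125e6, 1.3e9, 13e9, 175e9]          # headline parameter counts (approximation)
losses = [loss_from_params(n) for n in sizes]

# On log-log axes the curve is a straight line whose slope is -alpha_N.
slope = (math.log(losses[-1]) - math.log(losses[0])) / (
    math.log(sizes[-1]) - math.log(sizes[0])
)

# Each 10x increase in parameters multiplies the loss by 10 ** -alpha_N (about 0.84).
ratio_per_decade = 10 ** -ALPHA_N

for n, l in zip(sizes, losses):
    print(f"N = {n:.3g}: predicted loss {l:.3f}")
print(f"log-log slope {slope:.3f}, loss ratio per 10x params {ratio_per_decade:.3f}")
print(f"compute law at 3640 PF-days: {loss_from_compute(3640):.3f}")
```

A tenfold increase in parameters buys only about a 16% reduction in predicted loss, which is why the argument for scale rests on the line continuing rather than on any single step being large.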

Figure 3 · smooth scaling

Validation loss versus training compute, on log-log axes, where a power law is a straight line. The eight GPT-3 sizes sit on the dashed power-law line (Kaplan's fitted compute law, α_C ≈ 0.050). Drag along the ladder. Loss keeps dropping along the same line across four orders of magnitude of compute. Real per-model compute is from the paper's Table D.1. Toggle loss + downstream to overlay aggregate few-shot ability against the same compute, redrawn qualitatively from the paper's 42-benchmark aggregate (the Figure 1.3 family, normalized to share the canvas). The two climb together, which is why lower loss tracks better downstream ability.

Why does a loss curve matter for tasks? Because the loss penalizes every gap in grammar, facts, and structure: each one costs probability on some next token somewhere. Pushing the loss down therefore forces broader internal competence, and few-shot tasks draw on that same competence. The overlay in Figure 3, toggled to loss + downstream, shows the two climbing together against the same compute. In-context learning depends on skills absorbed into the weights during pre-training and deployed at inference, so the paper's expectation is that it follows the same smooth trend. The only way to test that was to build the biggest model anyone had built and measure.

Few-shot ability emerges with scale

Across the 42 benchmarks the paper aggregates, accuracy in all three settings rises with model size, which is expected. The few-shot curve climbs faster than the zero-shot curve, so the gap between them widens as the model grows. A small model gains almost nothing from having examples in its prompt. A large model gains a lot. That widening gap shows that larger models are better at learning in context, not only better across the board.

Figure 4 · few-shot emerges with scale

Aggregate accuracy over 42 benchmarks versus model size, for few-shot, one-shot, and zero-shot. Drag the slider. All three rise, but few-shot rises fastest, so the teal-to-gray gap widens. The curves follow the shape of the paper's Figure 1.3; the values are illustrative.

Concretely, on closed-book TriviaQA the full model scores 64.3% zero-shot, 68.0% one-shot, and 71.2% few-shot, the last of which beat the fine-tuned state of the art in the same closed-book setting at the time. On CoQA it reaches 81.5 / 84.0 / 85.0 F1 across the three settings. The examples in the prompt are worth real points, and they are worth more the bigger the model.

The model: 175 billion parameters

The architecture is deliberately conventional. GPT-3 reuses GPT-2's Transformer almost unchanged: the same pre-normalization, the same modified initialization, the same reversible byte-level tokenizer. The one structural change is that GPT-3 alternates dense attention with locally-banded sparse attention in its layers, following the Sparse Transformer. The paper is about scale, not a new architecture, so they hold the design fixed and turn the size dial.

They train eight models spanning three orders of magnitude, from 125 million to 175 billion parameters, so they can watch behavior as a function of size. The widths and depths grow together. The feed-forward layer is always four times the model width, $d_\text{ff} = 4\,d_\text{model}$, and the per-head dimension ranges from 64 to 128 across the lineup (128 in most of the larger models). Every model uses the same 2048-token context window and trains on the same 300 billion tokens.

Model	params	layers	d_model	heads	batch (tokens)
Small	125M	12	768	12	0.5M
Medium	350M	24	1024	16	0.5M
Large	760M	24	1536	16	0.5M
XL	1.3B	24	2048	24	1M
2.7B	2.7B	32	2560	32	1M
6.7B	6.7B	32	4096	32	2M
13B	13B	40	5120	40	2M
175B	175B	96	12288	96	3.2M
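The parameter column can be roughly recovered from the layer count and width alone. The sketch below uses the common approximation of $12\,d_\text{model}^2$ weights per layer (four attention projections plus a feed-forward block with $d_\text{ff} = 4\,d_\text{model}$). It deliberately ignores embeddings, biases, and layer norms, so it is an estimate under that assumption and not the paper's own accounting.

```python
# (layers, d_model) for each model, copied from the table above
CONFIGS = {
    "Small": (12, 768),
    "Medium": (24, 1024),
    "Large": (24, 1536),
    "XL": (24, 2048),
    "2.7B": (32, 2560),
    "6.7B": (32, 4096),
    "13B": (40, 5120),
    "175B": (96, 12288),
}

def approx_params(n_layers, d_model):
    """Non-embedding weight count: attention plus feed-forward, per layer."""
    attention = 4 * d_model * d_model          # query, key, value, output projections
    feed_forward = 2 * d_model * (4 * d_model) # two matrices with d_ff = 4 * d_model
    return n_layers * (attention + feed_forward)

estimates = {name: approx_params(*cfg) for name, cfg in CONFIGS.items()}

for name, n in estimates.items():
    print(f"{name:>6}: ~{n / 1e9:.2f}B non-embedding parameters")
```

For the full model this gives about 174 billion, close to the headline 175 billion. The estimate undershoots most for the small models, where the token embedding matrix is a large share of the total.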

The full model has 96 layers, a width of 12,288, and 96 attention heads of dimension 128. Its training cost, from the paper's own accounting, was about 3,640 petaflop/s-days (one petaflop/s-day is a day of computing at $10^{15}$ operations per second), which is roughly $3.1\times 10^{23}$ floating-point operations. That figure has a clean back-of-envelope form: training takes about $6$ flops per parameter per token,

$$C \approx 6\,N D = 6 \times (175\times 10^9) \times (300\times 10^9) \approx 3.15\times 10^{23}\ \text{flops} \tag{4}$$

where $N$ is parameters and $D$ is training tokens. The factor of 6 is 2 flops for the multiply-and-add each parameter performs in the forward pass, times 3 because the backward pass costs roughly twice the forward pass, so forward plus backward is about three forward passes. The paper also flags a quirk of its own data: the 300 billion training tokens were not sampled in proportion to corpus size. Higher-quality sources were upweighted, so filtered Common Crawl (410 billion tokens) was seen less than once on average while Wikipedia (3 billion tokens) was seen about 3.4 times.
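The back-of-envelope estimate and the unit conversion are easy to check. This sketch uses only numbers quoted above; the variable names are illustrative.

```python
N = 175e9                 # parameters
D = 300e9                 # training tokens
SECONDS_PER_DAY = 86_400
PETAFLOP = 1e15           # operations per second in one petaflop/s

flops = 6 * N * D                               # equation (4)
pf_days = flops / (PETAFLOP * SECONDS_PER_DAY)  # convert to petaflop/s-days

# Share of the 300B training tokens drawn from Wikipedia:
# a 3-billion-token corpus seen about 3.4 times.
wikipedia_tokens_seen = 3e9 * 3.4
wikipedia_share = wikipedia_tokens_seen / D

print(f"{flops:.3g} flops = {pf_days:,.0f} petaflop/s-days")
print(f"Wikipedia share of training tokens: {wikipedia_share:.1%}")
```

The estimate lands within a fraction of a percent of the roughly 3,640 petaflop/s-days the paper reports, and the Wikipedia arithmetic shows how a corpus making up a few percent of the training mix can still be repeated several times.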

On-the-fly reasoning, tested

The most striking results are the synthetic tasks the authors invented specifically to test on-the-fly reasoning, things very unlikely to sit verbatim in the training data. Arithmetic is the clearest example. They asked GPT-3 questions like Q: What is 48 plus 76? A: and scored exact-match on 2,000 random instances per task.

In the few-shot setting the full 175B model gets 100% on 2-digit addition, 98.9% on 2-digit subtraction, and 80.4% on 3-digit addition. Accuracy falls as the digits grow (about 25 to 27% on 4-digit, 9 to 10% on 5-digit) and 2-digit multiplication sits at 29.2%. Memorization is an unlikely explanation: they searched the training set for the test problems and found matches for under 1% of them, and the model's wrong answers look like arithmetic slips, such as forgetting to carry a 1.
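The evaluation protocol is simple enough to sketch: generate random problems in the quoted question format and score exact string match. This is an illustration under stated assumptions, not the paper's harness. The operand range, the `minus` wording, the seed, and the `oracle` function (a perfect calculator standing in for a model) are all choices made for this example.

```python
import random

def make_problems(n_digits, op, count=2000, seed=0):
    """Random arithmetic prompts with gold answers, e.g. 'Q: What is 48 plus 76? A:'."""
    rng = random.Random(seed)
    hi = 10 ** n_digits - 1
    problems = []
    for _ in range(count):
        a, b = rng.randint(0, hi), rng.randint(0, hi)
        gold = a + b if op == "plus" else a - b
        problems.append((f"Q: What is {a} {op} {b}? A:", str(gold)))
    return problems

def exact_match_accuracy(problems, answer_fn):
    """Fraction of prompts where the model's text equals the gold answer exactly."""
    correct = sum(answer_fn(prompt).strip() == gold for prompt, gold in problems)
    return correct / len(problems)

def oracle(prompt):
    """Hypothetical stand-in for a model: parses the prompt and computes the answer."""
    _, _, _, a, op, b, _ = prompt.split()      # Q: What is <a> <op> <b>? A:
    a, b = int(a), int(b.rstrip("?"))
    return str(a + b if op == "plus" else a - b)

addition = make_problems(2, "plus")
subtraction = make_problems(2, "minus")
print(exact_match_accuracy(addition, oracle))
print(exact_match_accuracy(subtraction, oracle))
```

To score a real model, `answer_fn` would wrap a call that prepends the few-shot demonstrations and returns the generated completion. Exact match is strict by design, so a single dropped carry counts as a miss.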

Figure 5 · arithmetic by setting

GPT-3 175B on the 10-task arithmetic battery, the exact accuracies from Table 3.9. Two-digit addition and subtraction are near-perfect in few-shot and three-digit addition reaches 80.4%; harder operations are far lower. Drag the slider from zero-shot to few-shot and every task lifts, with no change to the weights, only the prompt.

The arithmetic also shows the scale effect at its sharpest. There is a large jump from the second-largest model (13B) to the full 175B. The 13B model solves 2-digit addition and subtraction only about half the time and scores under 10% on everything else. The capability appears near the top of the size ladder, which is exactly the pattern the paper is pointing at. The paper reports the cliff without a mechanism and does not adjudicate one; an interpretation, not a finding, is capacity: the smaller models never develop the internal procedure for multi-digit arithmetic at all, while the largest one does, which would make the ability look binary even though the loss curve underneath is smooth.

The same story repeats elsewhere. GPT-3 does word unscrambling, uses a freshly-defined nonsense word in a sentence after seeing it once, and translates, with one-shot and few-shot well ahead of zero-shot. And it generates news articles people cannot reliably flag as machine-written. In the paper's human evaluation, participants distinguished GPT-3 175B's ~200-word articles from real ones with mean accuracy of about 52%, where 50% is chance, down from about 86% on a deliberately weak control model. Longer ~500-word articles gave the same result, still near chance.

Contamination and limits

A model trained on much of the web raises an obvious worry: maybe it does well on a benchmark because the benchmark was in the training data. The paper takes this seriously and runs a systematic contamination study, building a "clean" version of each benchmark with train-test overlaps removed and re-scoring on it. They are candid that a filtering bug let some overlaps through and that retraining the model to fix it was too expensive. For most benchmarks the clean-subset performance barely moved, so contamination had little effect. A few were flagged: results on PIQA and Winograd carry an asterisk in the paper, and several Wikipedia language modeling benchmarks were dropped entirely because they were almost fully contained in the training set.
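The idea behind a clean subset can be sketched with word n-gram overlap: a test example counts as "dirty" if any of its n-grams also appears in the training text. This is a minimal illustration, not the paper's procedure. The default `n=13`, the lowercasing, and the whitespace tokenization are assumptions of this sketch, and a real study over a web-scale corpus would need a far more memory-efficient index than a Python set.

```python
def ngrams(text, n):
    """Set of word n-grams in a text, lowercased and split on whitespace."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def split_clean_dirty(test_examples, training_docs, n=13):
    """Partition test examples by whether any n-gram also occurs in training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    clean, dirty = [], []
    for example in test_examples:
        if ngrams(example, n) & train_grams:
            dirty.append(example)
        else:
            clean.append(example)
    return clean, dirty

# Toy data with a short n so the overlap is visible.
training_docs = ["the quick brown fox jumps over the lazy dog"]
test_examples = [
    "a quick brown fox appeared",                # shares "quick brown fox"
    "completely unrelated sentence here today",  # shares nothing
]
clean, dirty = split_clean_dirty(test_examples, training_docs, n=3)
print("clean:", clean)
print("dirty:", dirty)
```

Re-scoring the model on `clean` alone and comparing against the full benchmark is what tells you whether the overlap actually inflated the number.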

The authors list the limitations plainly. Generated text still loses coherence and contradicts itself over long passages, and some task types stay near chance even few-shot, in particular "comparison" tasks like deciding whether one sentence entails another (ANLI) or whether a word is used the same way in two sentences (WiC). The authors suspect the autoregressive, one-direction-only objective hurts here, since these tasks reward looking back and forth, and conjecture a bidirectional model at this scale would do better when fine-tuned. The model is expensive to run. Its sample efficiency during pre-training is poor: it sees far more text than a human ever does. And a fundamental open question remains: does few-shot prompting teach a genuinely new task at inference time, or surface one already learned during pre-training? The paper does not claim to know.

After the caveats, the core is small. There is no new architecture and no new training objective. There is the same next-token prediction, run at a scale nobody had tried, and an ability that small models lack shows up at the top of the curve. Adapting the model stopped meaning "collect a dataset and train" and started meaning "write a prompt." The years since have been built on that shift more than on the parameter count.

Provenance: verified against primary literature

GPT-3 (2020), Brown et al.: the 175B model, the zero/one/few-shot setup, Table 2.1 (8 model sizes), Table 3.9 (arithmetic), the contamination study, and the news-article human evaluation.

Scaling laws (2020), Kaplan et al.: the power laws GPT-3 extends. We use their verified compute-efficient-frontier constants α_C ≈ 0.050, C_c ≈ 3.1e8 PF-days.

GPT-2 (2019), Radford et al.: the architecture GPT-3 reuses (pre-normalization, modified init, byte-level BPE) and the in-context-learning idea GPT-3 scales up.

Sparse Transformer (2019), Child et al.: the alternating dense / locally-banded sparse attention GPT-3 uses in its layers.

Correction: Table 2.1 lists d_model = 5140 for GPT-3 13B, which is not divisible by its 40 heads. The intended value is 5120 (40 heads × 128 d_head), an apparent typo in the published table. The prose also reports 80.2% on 3-digit addition while Table 3.9 lists 80.4%; we follow the table.

Questions you might still have

Does few-shot learning actually update the weights?
No. Nothing about the model changes between zero-shot and few-shot. The demonstrations are more tokens in the prompt, read in the same forward pass. The paper calls the weight-changing version fine-tuning and does not do it for GPT-3.

So is the model learning the task, or recognizing one it already saw?
The paper cannot tell, and says so. Few-shot ability lives on a spectrum from recognizing a task seen in pre-training to adapting to a new one. Word-unscrambling and nonsense-word tasks look learned on the spot; translation must have been learned in pre-training. Which end a given task sits on is open.

Why does making the model bigger help so much?
Pre-training loss follows a smooth power law in scale (Kaplan et al.), and lower loss tracks better downstream performance. GPT-3 extends that line two orders of magnitude with little deviation. In-context learning follows the same trend: the few-shot curve climbs faster than zero-shot, so the gap between them widens as the model grows.

Footnotes & further reading

The paper: Brown, Mann, Ryder, Subbiah, Kaplan, Dhariwal, et al. (OpenAI), Language Models are Few-Shot Learners (NeurIPS 2020, Best Paper). Model sizes are Table 2.1; arithmetic is Table 3.9; per-model compute is Table D.1.
The scaling laws GPT-3 extends: Kaplan, McCandlish, et al., Scaling Laws for Neural Language Models. The constants α_N ≈ 0.076, N_c ≈ 8.8e13 and α_C ≈ 0.050, C_c ≈ 3.1e8 PF-days are from its abstract and Section 1.
The architecture GPT-3 reuses: Radford et al., Language Models are Unsupervised Multitask Learners (GPT-2), which also introduced the in-context-learning idea GPT-3 scales.
The sparse attention pattern: Child, Gray, Radford, Sutskever, Generating Long Sequences with Sparse Transformers.
The Transformer GPT-2 and GPT-3 are built on: Vaswani et al., Attention Is All You Need.
The compute-optimal follow-up that revisits Kaplan's exponents: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla), which argues GPT-3 was undertrained for its size.
