Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Show a model the steps, and reasoning emerges.

A 540-billion-parameter model, shown eight worked examples and trained on none of the test data, beats a specialist finetuned on the full dataset. Making the model write its steps out loud is enough on its own, and it only starts working once the model is large enough to follow along.

Explaining the paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · Wei, Wang, Schuurmans, Bosma, et al. · Google Research · NeurIPS 2022 · arXiv:2201.11903

What got a model that could not reason to reason was not a bigger model or more training. It was eight examples that showed it how to think out loud.

For a few years the recipe for a better language model was simple: make it bigger. Scaling bought a lot, but it did not buy multi-step reasoning. A model that could draft a sonnet would still trip over a two-step word problem a ten-year-old can do. And the usual lever, scale, barely helped: on arithmetic and logic the accuracy-versus-size curve was nearly flat. You could multiply the parameter count and watch the math scores sit exactly where they were.

This paper, out of Google in January 2022, found the missing piece, and it was not in the model. It was in the prompt. You take a large model off the shelf, change nothing about its weights, and instead of showing it example questions paired with bare answers, you show it example questions paired with the reasoning that leads to each answer. The model, continuing the pattern, writes out its own reasoning on the next question, and the accuracy jumps. On the hardest math benchmark in the paper, it more than tripled.

The headline result sounds wrong at first. GSM8K is a set of grade-school math word problems (8,500 of them, each taking two to eight steps). The prior record on it belonged to a 175-billion-parameter model finetuned on the entire training set and paired with a separate trained verifier that re-ranked a hundred candidate answers. That system scored about 55%. A 540-billion-parameter model called PaLM, given eight worked examples in its prompt and trained on none of GSM8K, scored 56.9%.

The authors call the trick chain-of-thought prompting. A chain of thought is the series of intermediate steps a person would say out loud while solving the problem. The paper is the careful version of "we tried showing the model the working, and it helped a lot, but only past a certain size, and here is everything we did to rule out the boring explanations." Five ideas explain why it works, and why it is stranger than it first looks: what in-context learning is and why reasoning was its blind spot, what a chain of thought actually changes about the prompt, the deflationary explanations and how the paper kills them, why none of it works until the model is huge, and what the result does and does not tell us about whether the model is "reasoning" at all.

Specifying the task in the prompt

Chain-of-thought is built on one thing. A large language model is, mechanically, a next-token predictor: you feed it a string, it returns a probability distribution over the next token, you pick one, append it, and repeat. Nothing in that loop says "trained for task X." So how do you get it to do a specific task?
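The loop described above can be sketched in a few lines. This is a toy, not a real model: `toy_model` is a hypothetical hand-written lookup table standing in for the network, and `greedy_decode` always picks the single most likely next token.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model: given the tokens so far,
# return a probability distribution over the next token.
NextTokenFn = Callable[[List[str]], Dict[str, float]]


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """A hand-written table keyed on the last token, purely for illustration."""
    table = {
        "5": {"+": 0.9, "<eos>": 0.1},
        "+": {"6": 1.0},
        "6": {"=": 0.8, "<eos>": 0.2},
        "=": {"11": 0.7, "12": 0.3},
        "11": {"<eos>": 1.0},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})


def greedy_decode(model: NextTokenFn, prompt: List[str],
                  max_new_tokens: int = 20) -> List[str]:
    """Predict a distribution, pick the top token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = model(tokens)
        next_token = max(dist, key=dist.get)  # most likely continuation
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the new token is input to the next step
    return tokens


print(greedy_decode(toy_model, ["5"]))  # ['5', '+', '6', '=', '11']
```

Nothing in the loop names a task. Whatever the model does is determined by the tokens already in the sequence, which is why the prompt is the only lever available without touching the weights.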

GPT-3 (Brown et al., 2020) popularized the answer this paper rides on: in-context learning, also called few-shot prompting. You do not finetune. You write a few demonstrations of the task directly into the prompt, an input paired with its output, then the real input, and let the model complete it. Zero-shot means no demonstrations, one-shot means one, and few-shot means as many as fit in the context window, typically ten to a hundred. The crucial part: no weights change. The demonstrations are just tokens the model reads in a single forward pass, and it infers the pattern from them on the fly.

This is a genuinely good deal, and it is why "prompting" became a field. But it had a conspicuous blind spot. On tasks that need several reasoning steps (arithmetic word problems, logic, symbolic manipulation), few-shot prompting worked poorly, and, more tellingly, it often did not improve much as the model grew. Most capabilities have a friendly upward scaling curve: bigger model, better score. Reasoning did not. You could pour in parameters and the math accuracy would barely move (this is what Rae and colleagues reported for the 280B Gopher model). That flat curve is the problem this paper starts from, and it is half of the surprise that comes later.

The other half is why nobody had a cheap fix. You could train the reasoning in: collect thousands of problems with worked solutions and finetune, which is exactly what the GSM8K authors did. That works, but it is expensive, specific to one dataset, and it throws away the "one model, many tasks" property that made prompting attractive in the first place. The appeal of what comes next is that it keeps that property.

Chain of thought: show the work

The entire method is one change to the prompt. In standard few-shot prompting, each demonstration is a pair: a question, then the answer. Chain-of-thought prompting makes the demonstration a triple: a question, then the reasoning steps, then the answer. That is all of it. You show the model not just what the answer is but how you got there, and you trust it to imitate the how. The order is load-bearing: the model generates left to right, so everything it writes becomes context for what it writes next. Putting the reasoning before the answer means the answer tokens are produced while the model can literally read its own working; if the answer came first, it would have to commit before doing any of the work.

Figure 1 · the method, standard versus chain of thought

The only thing that changes is the example's answer. Demonstrate a bare answer (standard) and on a new question the model blurts a bare answer, which on a problem that needed steps comes out wrong (27). Demonstrate the working (chain of thought) and it writes the working too, and lands 9. Same model, same question, different example.

A chain of thought is a series of short, natural-language intermediate steps that lead to the final answer. The paper opens with a worked example. The question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? The chain: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11. It reads like a solution, and it is one. The authors call it a chain of thought to stress that it mimics a step-by-step thought process rather than a polished write-up. (Note that real solutions usually come after the answer; a chain of thought comes before it, and that ordering matters.)

The model imitates the demonstrated pattern. The model has seen, in its prompt, that an "A:" is followed by a few sentences of arithmetic and then "The answer is N." So when it reaches the new question's "A:", it does the same: it writes sentences of arithmetic, and only then commits to a number. The cafeteria problem in Figure 1 shows what that changes. With a bare-answer demonstration, the model pattern-matches "produce a number" and blurts 27, which is wrong. With a worked demonstration, it writes 23 minus 20 is 3, they bought 6 more, so 3 + 6 = 9, and answers 9. Same model, same question.
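Because every demonstration ends with "The answer is N.", the final number can be read off a completion mechanically. The sketch below is one plausible way to do that; the regular expression and the name `extract_answer` are assumptions for illustration, not the paper's evaluation code.

```python
import re
from typing import Optional

# Matches the number that follows the fixed phrase used in the exemplars.
# Allows a leading minus sign, thousands separators, and a decimal part.
ANSWER_RE = re.compile(r"The answer is\s+(-?[\d,]*\.?\d+)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the first number after 'The answer is', or None if absent."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")


standard = "The answer is 27."
with_chain = "23 - 20 = 3. 3 + 6 = 9. The answer is 9."

print(extract_answer(standard))    # 27
print(extract_answer(with_chain))  # 9
```

The parser is the same for both prompting styles. Only the text in front of the phrase differs, and that text is what the model gets to read before it commits to a number.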

A few specifics, because they matter for trusting the result. The exemplars were written by hand, eight of them, and the same eight were used for every arithmetic benchmark except the multiple-choice AQuA (which got four). They were not tuned: the authors say plainly that these particular exemplars did not undergo prompt engineering, and they later measure how much the exact wording matters (it does not matter much). Decoding is greedy, meaning the model takes its single most likely continuation, with no sampling and no voting. That last detail dates the paper. Sampling many chains and keeping the majority answer ("self-consistency") is a later improvement, and so is the zero-shot version that drops the examples and just appends "Let's think step by step." Both came afterward, and both are different papers.

The artifact itself is the text fed to the model. Standard prompting:

# standard few-shot prompt: each example is question, then bare answer
Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?
A: The answer is 11.

Q: The cafeteria had 23 apples, used 20, and bought 6 more. How many?
A:        # the model continues here -> "The answer is 27."   (wrong)

and the chain-of-thought version, identical except the demonstrated answer shows its work:

# chain-of-thought prompt: question, the working, THEN the answer
Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?
A: Roger started with 5. 2 cans of 3 is 6 balls. 5 + 6 = 11.
   The answer is 11.

Q: The cafeteria had 23 apples, used 20, and bought 6 more. How many?
A:        # continues -> "23 - 20 = 3. 3 + 6 = 9. The answer is 9."

The model reads everything above the final "A:" and continues from there. Change those few demonstration lines and you change what it writes for every question that follows.
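Both prompts can be generated from the same demonstrations, which makes the one-line difference explicit. This is a minimal sketch: the `Demo` class and the function names are hypothetical, and the layout simply mirrors the two listings above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    question: str
    chain: str   # the worked reasoning, in plain language
    answer: str  # the final answer only


def format_demo(demo: Demo, with_chain: bool) -> str:
    """Render one demonstration as a pair (standard) or a triple (chain of thought)."""
    final = f"The answer is {demo.answer}."
    body = f"{demo.chain} {final}" if with_chain else final
    return f"Q: {demo.question}\nA: {body}"


def build_prompt(demos: List[Demo], question: str, with_chain: bool) -> str:
    """Demonstrations first, then the new question with an open 'A:' to continue."""
    blocks = [format_demo(d, with_chain) for d in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


demos = [
    Demo(
        question="Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?",
        chain="Roger started with 5. 2 cans of 3 is 6 balls. 5 + 6 = 11.",
        answer="11",
    )
]
new_question = "The cafeteria had 23 apples, used 20, and bought 6 more. How many?"

print(build_prompt(demos, new_question, with_chain=False))
print("-----")
print(build_prompt(demos, new_question, with_chain=True))
```

The `with_chain` flag is the entire method. The model, the question, and the decoding stay fixed.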

Why writing it out could help

It is worth being suspicious before crediting the chain. Several reasons this might work are plausible, and at least three of them turn out to be wrong, or beside the point.

The intuitive story the authors offer has a few parts, and they are reasons to like the method rather than evidence that the chain does the work. A chain of thought lets the model break a multi-step problem into pieces and handle them one at a time, so a harder problem can get more intermediate steps (and so more of the model's computation) than an easy one. It gives a readable window into how the model reached its answer, useful for finding where the reasoning went wrong (they add, carefully, that a plausible-looking chain does not prove the model actually computed that way). And because the steps are just language, the same trick applies to anything you can reason about in words, not only to arithmetic.

Those are nice properties. They are not yet a reason to believe the chain itself matters. There are at least three deflationary explanations that would make the gains far less interesting, and each is a hypothesis to be tested, not a conclusion:

Maybe it is the equation. A worked solution contains the arithmetic expression. Perhaps all the model needs is to be nudged into writing 5 + 6 before answering, and the prose around it is decoration.
Maybe it is the extra compute. A chain is more tokens, and generating more tokens means more sequential computation before the answer commits. Perhaps any filler of the same length would do.
Maybe it is retrieval. Perhaps writing anything before the answer helps the model warm up and surface relevant knowledge, and the specific reasoning is incidental.

Each hypothesis, if true, would deflate the result in a different way: if it is the equation, chain-of-thought is a formatting trick; if it is the extra compute, any filler tokens should work just as well; if it is retrieval, the reasoning is decoration on top of memorization. The finding holds only if all three are false, and the ablations show that they are. But there is a caveat to handle first.

Reasoning emerges with scale

Chain-of-thought prompting does not work on small models. It is worse than useless on them: below roughly 10 billion parameters, adding a chain of thought tends to lower accuracy. The small models produce chains that are fluent, grammatical, and wrong, confidently stated reasoning that reaches a false conclusion, and they end up scoring below the model that just guessed.

Figure 2 · an emergent ability

GSM8K solve rate against model size. Standard prompting barely climbs. Chain of thought is flat or worse for small models, then turns sharply upward past ~100B. Only PaLM 540B clears the prior best (a finetuned 175B model plus a verifier, 55%); GPT-3 at 175B gets close. The gap between the two curves is negative for small models and enormous for large ones, reaching +39.0 points at 540B. Numbers are the paper's Table 2.

The benefit appears only with scale, and arrives fast when it does. On GSM8K, PaLM at 8B scores 4.9% with standard prompting and 4.1% with a chain of thought, slightly worse. At 62B it is 9.6% against 29.9%. At 540B it is 17.9% against 56.9%. The standard-prompting curve stays near the floor; the chain-of-thought curve is flat-then-vertical. GPT-3 does the same thing: chain of thought hurts at 350M and at 6.7B, then at 175B it flips a 15.6% into a 46.9%.
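The size of the effect is easier to see as a gain in points and as a ratio. The short script below does that arithmetic on the figures quoted in the paragraph above; the helper names are made up for this sketch.

```python
# (standard %, chain-of-thought %) on GSM8K, as quoted in the text above.
RESULTS = {
    "PaLM 8B": (4.9, 4.1),
    "PaLM 62B": (9.6, 29.9),
    "PaLM 540B": (17.9, 56.9),
    "GPT-3 175B": (15.6, 46.9),
}


def gain_points(standard: float, chain: float) -> float:
    """Absolute change in solve rate, in percentage points."""
    return chain - standard


def gain_ratio(standard: float, chain: float) -> float:
    """Chain-of-thought solve rate as a multiple of the standard one."""
    return chain / standard


for name, (std, cot) in RESULTS.items():
    print(f"{name:>10}: {std:5.1f} -> {cot:5.1f}  "
          f"gain {gain_points(std, cot):+5.1f} pts  "
          f"ratio {gain_ratio(std, cot):.2f}x")
```

The 8B row comes out negative, and the 540B row comes out at 39.0 points, a little over three times the standard-prompting score.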

This pattern has a name. An ability is emergent (Wei and colleagues, in a companion paper) if it is absent in small models and present in large ones, in a way you could not have predicted by extrapolating the small models' performance. The curve stays near random until some critical scale, then rises sharply, a qualitative change that looks like a phase transition. Chain-of-thought reasoning on math word problems is one of that paper's examples, emerging around $10^{23}$ training FLOPs, very roughly 100 billion parameters.

Two qualifications. The "around 100 billion" figure is specific to these reasoning tasks, not a universal law. The companion paper is explicit that the scale at which an ability emerges depends on the task and the model, ranging from a few billion to several hundred billion across different abilities. And "emergent" here is an empirical description, not a mechanism. Nobody claims a switch flips at 100B. What the data says is narrow and strong: for these problems, the chain-of-thought gain is roughly zero or negative until the model is very large, and then it is big.

The size of the gain also tracks the difficulty of the problem, which is a good sign that something real is going on. The improvement is largest exactly where the baseline is worst. GSM8K, the hardest set with the lowest standard-prompting score, more than doubled for the biggest GPT and PaLM models. On SingleOp, the easiest subset (a single arithmetic operation, which large models already handle), chains of thought barely help or slightly hurt. You do not get a windfall on problems that did not need steps. You get it on the problems that did.

Isolating what actually matters

The three deflationary explanations get tested directly here. The setup is clean: one model (LaMDA 137B), the same problems, and three alternative prompts, each built to supply one suspected ingredient of a chain of thought without the rest.

Figure 3 · the ablations

Four lookalike prompts and the real one, all on LaMDA 137B. The real one is the chain: reasoning in plain language, then the answer. On GSM8K only chain of thought clears the standard baseline. Equation only, dots, and answer-first all sit at or below it, which kills the "it is just the equation / just more tokens / just a warm-up" stories. On MAWPS equation-only helps too, because there the equation is easy to read off the question.

Equation only. Prompt the model to emit just the mathematical expression (the 5 + 6 = 11 with none of the words) and then the answer. If the gain were really about getting the equation onto the page, this should recover most of it. On GSM8K it does not: 5.4%, actually below the 6.5% standard baseline. The problems are too tangled to compress straight into an equation without the natural-language steps that set it up. (On the easier MAWPS benchmark, equation-only does help, 50.1% against a 43.2% baseline, because once you have the equation those problems are trivial. So equation-only lands in between, useful when the translation is easy and useless when it is hard. The full chain is the only variant that handles the hard cases.)

Variable compute only. This tests the worry that the win comes from spending more tokens. So prompt the model to output a row of dots, the same number of characters the equation would have, then the answer. Pure compute, the same token budget, zero content. Result: 6.4%, indistinguishable from the 6.5% baseline. Spending more tokens is not the mechanism.

Reasoning after the answer. This tests the worry that the chain merely warms the model up and helps it surface knowledge, with the reasoning incidental. So flip the order: have the model give the answer first and the reasoning afterward. Now the chain cannot inform the answer, since the answer already came out. Result: 6.1%, again at baseline. The reasoning has to come before the answer, where the answer can depend on it.

Taken together, the three ablations point to a sharp conclusion. It is not the equation, it is not the token count, and it is not knowledge-jogging. It is natural-language reasoning, placed before the answer, where the answer can follow from it. That is the one variant that moves the number: 14.3% against a clustered 5 to 6.5% for everything else.
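What each ablation keeps and what it throws away is clearest when the four exemplar answers are built from the same pieces. The sketch below does that for the Roger example; the function names and exact formatting are assumptions for illustration, and the paper's actual exemplar text may differ.

```python
from typing import Callable, Dict

# One worked exemplar, split into the parts the ablations recombine.
CHAIN = ("Roger started with 5 balls. 2 cans of 3 tennis balls each "
         "is 6 tennis balls. 5 + 6 = 11.")
EQUATION = "5 + 6 = 11"
ANSWER = "11"


def chain_of_thought(chain: str, equation: str, answer: str) -> str:
    """Full method: natural-language reasoning, then the answer."""
    return f"{chain} The answer is {answer}."


def equation_only(chain: str, equation: str, answer: str) -> str:
    """Keeps the arithmetic expression, drops the words around it."""
    return f"{equation} The answer is {answer}."


def variable_compute_only(chain: str, equation: str, answer: str) -> str:
    """Keeps the length (one dot per character of the equation), drops the content."""
    return f"{'.' * len(equation)} The answer is {answer}."


def reasoning_after_answer(chain: str, equation: str, answer: str) -> str:
    """Keeps the words, but places them where the answer cannot depend on them."""
    return f"The answer is {answer}. {chain}"


VARIANTS: Dict[str, Callable[[str, str, str], str]] = {
    "chain of thought": chain_of_thought,
    "equation only": equation_only,
    "variable compute only": variable_compute_only,
    "reasoning after answer": reasoning_after_answer,
}

for name, build in VARIANTS.items():
    print(f"{name:>24} | A: {build(CHAIN, EQUATION, ANSWER)}")
```

Each variant removes exactly one thing from the full chain: the prose, the content, or the ordering. Only the variant that keeps all three moved the GSM8K number.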

Beyond arithmetic, beyond the examples

The effect is strongest in math, but the method is made of language, so it travels. The paper runs it on two more families of task.

Commonsense reasoning is questions about how the everyday world works: would a pear sink in water, is this sports sentence plausible, what sequence of actions fulfills a spoken request. Here a chain of thought is the model stating the relevant facts before it commits. On Sports Understanding, PaLM 540B with a chain of thought reaches 95.4%, past both the prior system and an unaided human sports fan at 84%. On StrategyQA it clears the prior state of the art. The gain is not universal, and the authors say so: on CSQA it was minimal. Chain of thought helps most where the task genuinely breaks into stated steps.

Symbolic reasoning gives the most revealing evidence, because two toy tasks isolate something the math problems cannot. Last-letter concatenation: take the last letter of each word in a name and string them together. Coin flip: a coin starts heads, some people flip it and some do not, is it still heads? Both are trivial to specify, and both have a property the authors exploit. You can build a test case longer than any example you showed the model. Show it only two-word names, then ask about four-word names. Show it two-person coin flips, then ask about four-person ones.

Figure 4 · length generalization

A coin starts heads; each person flips it or not. A chain of thought carries the running state down one person at a time, so it answers correctly for more people than it ever saw in the examples. Standard prompting, having only seen two-person cases, guesses past two. N = 2 is the in-domain length; anything longer, such as N = 4, is out of domain (OOD).

This is the cleanest evidence that the model is running a procedure rather than matching complete answers. With standard prompting, the model saw two-step cases and fails the moment you go longer; out of domain it collapses toward chance. With a chain of thought, it keeps the running state and applies the same per-step rule to however many steps you hand it, so it generalizes to lengths it was never shown. PaLM 540B solves the in-domain coin flip essentially perfectly and keeps tracking on longer sequences (accuracy does sag as the chains get long, so this is graceful degradation, not magic). A model that only memorized complete answers for two-flip cases could not do that. Writing the running state out at each step and reading it back at the next can. (The paper shows this behavior; it does not pin down how the model holds the state inside.)
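Both toy tasks have a ground truth that a few lines of code can compute, and the coin flip makes the "running state" idea concrete. In this sketch the step wording is invented for illustration and is not the paper's exemplar text; the names in the example are arbitrary.

```python
from typing import List, Tuple


def last_letter_concat(name: str) -> str:
    """Take the last letter of each word and string them together."""
    return "".join(word[-1] for word in name.split())


def coin_flip_chain(flips: List[Tuple[str, bool]]) -> Tuple[List[str], bool]:
    """Track the coin one person at a time, writing the state after each step.

    `flips` is a list of (person, did_flip). Returns the written steps and
    whether the coin ends heads up. The same per-step rule handles any length.
    """
    heads = True
    steps = ["The coin starts heads up."]
    for person, did_flip in flips:
        if did_flip:
            heads = not heads
        verb = "flips" if did_flip else "does not flip"
        side = "heads" if heads else "tails"
        steps.append(f"{person} {verb} the coin, so it is now {side} up.")
    return steps, heads


print(last_letter_concat("Elon Musk"))  # nk

# Two people is the in-domain length; four is longer than any example shown.
four_people = [("Ann", True), ("Bob", False), ("Cy", True), ("Di", True)]
steps, still_heads = coin_flip_chain(four_people)
print("\n".join(steps))
print("Still heads up?", "yes" if still_heads else "no")  # no
```

The loop body never looks at how many people there are. That is the property a written-out chain gives the model: each step needs only the previous line and the next person.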

And again the floor is scale. These tasks are toys precisely because the chain in the exemplars already spells out the perfect procedure; all the model has to do is repeat the steps with new symbols. Small models still fail even that. The ability to mechanically apply a shown procedure to unseen symbols is, itself, something that only shows up around 100 billion parameters.

Not a fragile prompt trick

A fair worry about any prompting result is that it is an artifact of one lucky prompt. Few-shot prompting is famously twitchy: there are documented cases where merely reordering the demonstrations swings accuracy from near chance to near state of the art. So is chain of thought just a good roll of the dice on eight hand-picked examples?

The paper checks, and the answer is no. Three different people wrote independent chains of thought for the same questions. A deliberately concise rewrite was tried. And three more sets of eight examples were sampled at random from the GSM8K training data instead of being hand-written. Every one of these prompt sets beats standard prompting by a wide margin. On GSM8K with LaMDA 137B they land between roughly 11% and 18%, all well above the 6.5% baseline. There is real variance between them, as expected for prompting, but the floor of the chain-of-thought results sits comfortably above the ceiling of standard prompting. The effect survives changing the author, the wording, the style, and the specific examples. The method, not one particular prompt, accounts for the gain. (They report it is robust to the order of the examples and to how many you use, as well.)

What it does, and what it does not

With the ablations behind it, the headline result deserves a second look.

Figure 5 · the headline result

GSM8K, four contenders. The prior record was a 175B model finetuned on the whole dataset with a verifier re-ranking 100 sampled answers: 55%. PaLM 540B with eight chain-of-thought examples and no training: 56.9%. The same model with standard prompting: 17.9%. The two on the left were trained on GSM8K; the two on the right were only prompted.

On GSM8K, PaLM 540B with eight chain-of-thought examples and no training scores 56.9%, edging past the 55% the previous record-holder reached by finetuning a 175B model on the full dataset and bolting on a verifier that re-ranked a hundred sampled solutions. The same PaLM with standard prompting manages 18%. The chain accounts for the distance between 18 and 57, and it cost eight hand-written examples.

The authors draw a broader claim. Standard prompting had made these models look like they could not do multi-step reasoning, and that was largely a measurement artifact. The capability was latent; standard prompting did not surface it. Their phrasing: standard prompting gives only a lower bound on what a large model can do. With a different way of asking, the flat scaling curve becomes a steep one. That reframing, more than any single benchmark number, is why this short paper mattered. It suggested that much of "the model cannot do X" was really "we have not asked for X correctly," and it set off a wave of prompting methods: self-consistency, least-to-most, the zero-shot "think step by step," and eventually models trained to produce long chains by default.

The paper is careful to bound its own claim. It does not show the model is "reasoning" in any deep sense. It shows that producing reasoning-shaped text helps, which is not the same claim, and they leave the deeper question open. There is no guarantee the chain is correct: a model can reach the right answer through a broken chain (in one hand-checked sample of 50 correct GSM8K answers from LaMDA, two got there by luck on faulty reasoning) or a wrong answer through a chain that looked fine. The method is also costly to deploy, precisely because it works only on very large models, which are expensive to serve. And while writing eight examples is cheap, building a large labeled set of chains for finetuning would not be. None of this dents the finding.

What lasts is almost too plain to state. The right answers were reachable all along, and the field had been asking for them the wrong way. When a large enough model is shown the working on enough examples, the steps come out and the answers land. Whether that adds up to reasoning is the question the paper leaves open. That the capability was sitting in the weights, waiting on the shape of the prompt, is not.

Provenance · Verified against primary literature

GPT-3 (2020), Brown et al.: few-shot / in-context learning, demonstrations in the prompt with no weight updates. The baseline this rides on.

GSM8K (2021), Cobbe et al.: the grade-school math benchmark (8.5K problems, 2 to 8 steps) and the finetuned-175B-plus-verifier system (~55%) that was the prior best.

Emergent abilities (2022), Wei et al.: the "absent in small, present in large" framing, and the caution that the scale threshold is task-specific, not a universal law.

PaLM / LaMDA (2022), Chowdhery et al. and Thoppilan et al.: the 8B / 62B / 540B and up-to-137B models whose scaling curves the paper plots.

Correction: two numbers to read with care. The paper's text reports 75.6% on StrategyQA while its own Appendix Table 4 lists 77.8% for PaLM 540B (both beat the prior best of 69.4%); we follow the table and note the gap. And the GPT-3 sizes (notably text-davinci-002 at ~175B) are the paper's own stated presumption, not an OpenAI-confirmed figure; text-davinci-002 is InstructGPT-style supervised tuning (instruction-following fine-tuning, see /instructgpt/), not the full RLHF (reinforcement learning from human feedback) pipeline.

Questions you might still have

Does chain-of-thought prompting change the model’s weights?
No. Nothing about the model changes. The worked examples are just extra tokens in the prompt, read in one forward pass. The paper finetunes nothing; a single frozen checkpoint does every task.

Why does it only work for big models?
Below roughly 10B parameters the chains are fluent but illogical, and they lower accuracy. Producing a coherent, correct chain is itself a capability that only appears with scale, on these math tasks around 100B. The companion paper calls it an emergent ability and is explicit that the threshold is task-specific, not a universal law.

Is the model actually reasoning, or imitating the example’s shape?
The paper is careful here. It works even on inputs longer than any example (more people in the coin flip, more words in the letter task), which looks like a per-step procedure rather than whole-answer copying. But there is no guarantee of a correct path: a chain can be wrong yet reach the right answer, or right yet for the wrong reason. Whether this is "reasoning" is left open.

Is this the same as "Let’s think step by step"?
No. That is zero-shot chain of thought (Kojima et al., May 2022), a later and separate paper that needs no examples. This one is few-shot: you supply eight worked examples. Self-consistency, which samples many chains and takes the majority answer, is another later follow-up. This paper is few-shot and greedy.

Footnotes & further reading

The paper: Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Google Research, NeurIPS 2022).
The baseline it rides on: Brown et al., Language Models are Few-Shot Learners (GPT-3), which introduced the zero / one / few-shot in-context-learning setup with no weight updates.
The benchmark and the prior best: Cobbe et al., Training Verifiers to Solve Math Word Problems (GSM8K), whose finetuned-175B-plus-verifier system set the ~55% record on GSM8K.
The emergence framing: Wei et al., Emergent Abilities of Large Language Models (TMLR), the companion paper that defines emergence and lists chain-of-thought reasoning as an example.
The models: Chowdhery et al., PaLM (8B / 62B / 540B), and Thoppilan et al., LaMDA (up to 137B). The GPT-3 sizes used in the paper (text-davinci-002 and friends) are the authors' stated presumption, not OpenAI-confirmed figures.
Later and related, both distinct from this paper: Wang et al., Self-Consistency Improves Chain-of-Thought Reasoning (sample many chains, take the majority answer), and Kojima et al., Large Language Models are Zero-Shot Reasoners (the zero-shot "Let's think step by step").
