Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Show a model the steps, and reasoning emerges.
A 540-billion-parameter model, shown eight worked examples and trained on none of the test data, beats a specialist finetuned on the whole dataset. The trick is to make the model write its steps out loud, and it only starts working once the model is large enough to follow along.
What if the way to make a model reason is not to retrain it, but to show it how to think out loud?
For a few years the recipe for a better language model was simple: make it bigger. Scaling bought a lot, but it did not buy multi-step reasoning. A model that could draft a sonnet would still trip over a two-step word problem a ten-year-old can do. And the usual lever, scale, barely helped: on arithmetic and logic the accuracy-versus-size curve was nearly flat. You could multiply the parameter count and watch the math scores sit exactly where they were.
This paper, out of Google in January 2022, found the missing piece, and it was not in the model. It was in the prompt. Take a large model off the shelf, change nothing about its weights, and instead of showing it example questions paired with bare answers, show it example questions paired with the reasoning that leads to each answer. The model, continuing the pattern, writes out its own reasoning on the next question, and the accuracy jumps. On the hardest math benchmark in the paper, it more than tripled.
The headline result sounds wrong at first. GSM8K is a set of grade-school math word problems (8,500 of them, each taking two to eight steps). The prior record on it belonged to a 175-billion-parameter model finetuned on the entire training set and paired with a separate trained verifier that re-ranked a hundred candidate answers. That whole system scored about 55%. A 540-billion-parameter model called PaLM, given eight worked examples in its prompt and trained on none of GSM8K, scored 57%. Eight examples beat a finetuned specialist.
The authors call the trick chain-of-thought prompting, and a chain of thought is just the series of intermediate steps a person would say out loud while solving the problem. The paper is the careful version of "we tried showing the model the working, and it helped a lot, but only past a certain size, and here is everything we did to rule out the boring explanations." To see why it works, and why it is stranger than it first looks, we will build up a short tower: what in-context learning is and why reasoning was its blind spot, what a chain of thought actually changes about the prompt, the deflationary explanations and how the paper kills them, why none of it works until the model is huge, and what the result does and does not tell us about whether the model is "reasoning" at all.
A prompt is a program
Start with the thing chain-of-thought is built on. A large language model is, mechanically, a next-token predictor: you feed it a string, it returns a probability distribution over the next token, you pick one, append it, and repeat. Nothing in that loop says "trained for task X." So how do you get it to do a specific task?
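The loop described above can be sketched with a toy stand-in for the model. Everything here is an assumption made for illustration: `train_bigram`, `next_token_distribution`, and `generate` are hypothetical names, and a bigram counter is nothing like a real transformer. Only the shape of the loop carries over: get a distribution, pick a token, append it, repeat.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count which token follows which (a toy stand-in for a trained model)."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, prev):
    """Return a probability distribution over the token that follows `prev`."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def generate(counts, prompt, max_new_tokens=5):
    """Autoregressive loop: predict, pick the most likely token, append, repeat."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        dist = next_token_distribution(counts, tokens[-1])
        if not dist:  # the toy model has never seen this token; stop
            break
        tokens.append(max(dist, key=dist.get))
    return " ".join(tokens)

model = train_bigram("the answer is 11 . the answer is 9 .")
print(generate(model, "the", max_new_tokens=2))  # prints "the answer is"
```

Nothing in `generate` knows what task it is performing. Whatever steering happens has to come from the text in `prompt`.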
GPT-3 (Brown et al., 2020) popularized the answer this paper rides on: in-context learning, also called few-shot prompting. You do not finetune. You write a few demonstrations of the task directly into the prompt, an input paired with its output, then the real input, and let the model complete it. Zero-shot means no demonstrations, one-shot means one, and few-shot means as many as fit in the context window, typically ten to a hundred. The crucial part: no weights change. The demonstrations are just tokens the model reads in a single forward pass, and it infers the pattern from them on the fly. One frozen checkpoint, steered across many tasks by nothing but the text in front of it.
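As a rough sketch of what "writing demonstrations into the prompt" means in practice, the helper below assembles a zero-, one-, or few-shot prompt from a list of input/output pairs. The name `few_shot_prompt` and the blank-line separator are assumptions for illustration; the `Q:` / `A:` layout follows the prompts shown later in this explainer.

```python
def few_shot_prompt(demos, query):
    """Build an in-context prompt: k solved demonstrations, then the open query.

    demos: list of (question, answer) pairs. An empty list gives a zero-shot prompt.
    """
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in demos]
    blocks.append(f"Q: {query}\nA:")  # left open for the model to complete
    return "\n\n".join(blocks)

demos = [
    ("Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?",
     "The answer is 11."),
]
query = "The cafeteria had 23 apples, used 20, and bought 6 more. How many?"

zero_shot = few_shot_prompt([], query)
one_shot = few_shot_prompt(demos, query)
print(one_shot)
```

The model's weights never appear in this function. The only thing that differs between zero-shot and few-shot is the string handed to the same frozen checkpoint.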
This is a genuinely good deal, and it is why "prompting" became a field. But it had a conspicuous blind spot. On tasks that need several reasoning steps (arithmetic word problems, logic, symbolic manipulation), few-shot prompting worked poorly, and, more tellingly, it often did not improve much as the model grew. Most capabilities have a friendly upward scaling curve: bigger model, better score. Reasoning did not. You could pour in parameters and the math accuracy would barely move (this is what Rae and colleagues reported for the 280B Gopher model). That flat curve is the wall this paper walks up to. Hold onto it, because it is half the surprise later: standard prompting on these tasks is nearly flat in scale.
The other half is why nobody had a cheap fix. You could train the reasoning in: collect thousands of problems with worked solutions and finetune, which is exactly what the GSM8K authors did. That works, but it is expensive, specific to one dataset, and it throws away the "one model, many tasks" property that made prompting attractive in the first place. The appeal of what comes next is that it keeps that property. Nothing gets trained.
Chain of thought: show the work
Here is the entire method. In standard few-shot prompting, each demonstration is a pair: a question, then the answer. Chain-of-thought prompting makes the demonstration a triple: a question, then the reasoning steps, then the answer. That is the whole idea. You show the model not just what the answer is but how you got there, and you trust it to imitate the how.
A chain of thought is a series of short, natural-language intermediate steps that lead to the final answer. Take the example the paper opens with. The question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? The chain: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11. It reads like a solution, and it is one. The authors call it a chain of thought to stress that it mimics a step-by-step thought process rather than a polished write-up. (Note that written solutions and explanations usually come after the final answer; a chain of thought comes before it, and that ordering turns out to matter.)
The mechanism is pure imitation. The model has seen, in its prompt, that an "A:" is followed by a few sentences of arithmetic and then "The answer is N." So when it reaches the new question's "A:", it does the same: it writes sentences of arithmetic, and only then commits to a number. Watch what that changes on the cafeteria problem in Figure 1. With a bare-answer demonstration, the model pattern-matches "produce a number" and blurts 27, which is wrong. With a worked demonstration, it writes 23 minus 20 is 3, they bought 6 more, so 3 + 6 = 9, and answers 9. Same model, same question. The only thing that moved was the shape of the example.
A few specifics, because they matter for trusting the result. The exemplars were written by hand, eight of them, and the same eight were used for every arithmetic benchmark except the multiple-choice AQuA (which got four). They were not tuned: the authors say plainly that these particular exemplars did not undergo prompt engineering, and they later measure how much the exact wording matters (it does not matter much). Decoding is greedy, meaning the model takes its single most likely continuation, with no sampling and no voting. That last detail dates the paper. Sampling many chains and keeping the majority answer ("self-consistency") is a later improvement, and so is the zero-shot version that drops the examples and just appends "Let's think step by step." Both came afterward, and both are different papers. This one is few-shot and greedy.
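To make "greedy" and "The answer is N" concrete, here is a minimal sketch of picking the single most likely continuation and of scoring a completion by pulling out its final number. The regular expression and the names `greedy_pick`, `extract_answer`, and `accuracy` are assumptions for illustration, not the paper's evaluation code.

```python
import re
from typing import Optional

# Matches the final-answer sentence used in the exemplars, e.g. "The answer is 11."
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")

def greedy_pick(distribution):
    """Greedy decoding step: take the single most likely token, no sampling."""
    return max(distribution, key=distribution.get)

def extract_answer(completion) -> Optional[float]:
    """Return the number after 'The answer is', or None if the pattern is absent."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

def accuracy(completions, gold):
    """Fraction of completions whose extracted answer equals the gold answer."""
    correct = sum(extract_answer(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)

print(extract_answer("23 - 20 = 3. 3 + 6 = 9. The answer is 9."))  # 9.0
```

With greedy decoding there is exactly one completion per question, so scoring is a single extraction. Sampling-and-voting schemes would call `extract_answer` on many completions instead.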
Here is the literal artifact, the text fed to the model. Standard prompting:
# standard few-shot prompt: each example is question, then bare answer
Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?
A: The answer is 11.
Q: The cafeteria had 23 apples, used 20, and bought 6 more. How many?
A: # the model continues here -> "The answer is 27." (wrong)
and the chain-of-thought version, identical except the demonstrated answer shows its work:
# chain-of-thought prompt: question, the working, THEN the answer
Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?
A: Roger started with 5. 2 cans of 3 is 6 balls. 5 + 6 = 11.
The answer is 11.
Q: The cafeteria had 23 apples, used 20, and bought 6 more. How many?
A: # continues -> "23 - 20 = 3. 3 + 6 = 9. The answer is 9."
The model reads everything above the final "A:" and continues from there. Change those few demonstration lines and you change what it writes for every question that follows.
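The two prompts above can be generated from the same data, which makes the difference between them explicit. This is a sketch under assumptions: `Demo` and `build_prompt` are hypothetical names, and the exact whitespace is a guess rather than the paper's format. The point it illustrates is that the chain-of-thought prompt is the standard prompt plus one extra field per demonstration.

```python
from typing import NamedTuple

class Demo(NamedTuple):
    question: str
    chain: str    # the worked steps, written by hand
    answer: str

def build_prompt(demos, query, with_chain):
    """Render demonstrations as (question, answer) or (question, chain, answer)."""
    parts = []
    for d in demos:
        final = f"The answer is {d.answer}."
        shown = f"{d.chain} {final}" if with_chain else final
        parts.append(f"Q: {d.question}\nA: {shown}")
    parts.append(f"Q: {query}\nA:")  # the model continues from here
    return "\n\n".join(parts)

ROGER = Demo(
    question="Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?",
    chain="Roger started with 5. 2 cans of 3 is 6 balls. 5 + 6 = 11.",
    answer="11",
)
CAFETERIA = "The cafeteria had 23 apples, used 20, and bought 6 more. How many?"

standard = build_prompt([ROGER], CAFETERIA, with_chain=False)
cot = build_prompt([ROGER], CAFETERIA, with_chain=True)
print(cot)
```

Deleting the chain from `cot` gives back `standard` exactly, so any difference in the model's behavior has to come from those extra sentences.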
Why writing it out could help
Before the experiments, stay suspicious. Several reasons this might work are plausible, and at least three of them turn out to be wrong, or beside the point.
The intuitive story the authors offer has a few parts, and they are reasons to like the method rather than evidence that the chain does the work. A chain of thought lets the model break a multi-step problem into pieces and handle them one at a time, so a harder problem can get more intermediate steps (and so more of the model's computation) than an easy one. It gives a readable window into how the model reached its answer, useful for finding where the reasoning went wrong (they add, carefully, that a plausible-looking chain does not prove the model actually computed that way). And because the steps are just language, the same trick applies to anything you can reason about in words, not only to arithmetic.
Those are nice properties. They are not yet a reason to believe the chain itself matters. There are at least three deflationary explanations that would make the gains far less interesting, and each is a hypothesis to be tested, not a conclusion:
- Maybe it is the equation. A worked solution contains the arithmetic expression. Perhaps all the model needs is to be nudged into writing 5 + 6 before answering, and the prose around it is decoration.
- Maybe it is the extra compute. A chain is more tokens, and generating more tokens means more sequential computation before the answer commits. Perhaps any filler of the same length would do.
- Maybe it is just retrieval. Perhaps writing anything before the answer helps the model warm up and surface relevant knowledge, and the specific reasoning is incidental.
The value of the whole result depends on killing these. If the gain were really "the equation" or "more tokens," chain-of-thought would be a parlor trick. The ablations are what show it is not. First, the caveat that reframes everything.
Reasoning emerges with scale
Chain-of-thought prompting does not work on small models. It is worse than useless on them: below roughly 10 billion parameters, adding a chain of thought tends to lower accuracy. The small models produce chains that are fluent, grammatical, and wrong, confident reasoning that walks off a cliff, and they end up scoring below the model that just guessed.
The benefit appears only with scale, and arrives fast when it does. On GSM8K, PaLM at 8B scores 4.9% with standard prompting and 4.1% with a chain of thought, slightly worse. At 62B it is 9.6% against 29.9%. At 540B it is 17.9% against 56.9%. The standard-prompting curve creeps along near the floor; the chain-of-thought curve is flat-then-vertical. GPT-3 does the same thing: chain of thought hurts at 350M and at 6.7B, then at 175B it flips a 15.6% into a 46.9%.
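The flat-then-vertical shape is easy to see by computing the gain at each size from the PaLM numbers quoted above. The figures are the ones in this paragraph; the helper names `cot_gain` and `cot_ratio` are just for illustration.

```python
# GSM8K solve rates (%) for PaLM as quoted in the text:
# size -> (standard prompting, chain-of-thought prompting)
GSM8K_PALM = {"8B": (4.9, 4.1), "62B": (9.6, 29.9), "540B": (17.9, 56.9)}

def cot_gain(results):
    """Percentage-point change from adding a chain of thought, per model size."""
    return {size: round(cot - std, 1) for size, (std, cot) in results.items()}

def cot_ratio(results):
    """Chain-of-thought score as a multiple of the standard-prompting score."""
    return {size: round(cot / std, 2) for size, (std, cot) in results.items()}

for size, gain in cot_gain(GSM8K_PALM).items():
    print(f"PaLM {size}: {gain:+.1f} points")
```

The gain is slightly negative at 8B (-0.8 points), then +20.3 at 62B and +39.0 at 540B, where the chain-of-thought score is more than three times the standard one.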
This pattern has a name. An ability is emergent (Wei and colleagues, in a companion paper) if it is absent in small models and present in large ones, in a way you could not have predicted by extrapolating the small models' performance. The curve stays near random until some critical scale, then rises sharply, a qualitative change that looks like a phase transition. Chain-of-thought reasoning on math word problems is one of that paper's examples, emerging at a training-compute threshold that corresponds to very roughly 100 billion parameters.
Two honest qualifications. The "around 100 billion" figure is specific to these reasoning tasks, not a universal law. The companion paper is explicit that the scale at which an ability emerges depends on the task and the model, ranging from a few billion to several hundred billion across different abilities. And "emergent" here is an empirical description, not a mechanism. Nobody claims a switch flips at 100B. What the data says is narrow and strong: for these problems, the chain-of-thought gain is roughly zero or negative until the model is very large, and then it is big.
The size of the gain also tracks the difficulty of the problem, which is a good sign that something real is going on. The improvement is largest exactly where the baseline is worst. GSM8K, the hardest set with the lowest standard-prompting score, more than doubled for the biggest GPT and PaLM models. On SingleOp, the easiest subset (a single arithmetic operation, which large models already handle), chains of thought barely help or even slightly hurt. You do not get a windfall on problems that did not need steps. You get it on the problems that did.
Isolating what actually matters
Now kill the three deflationary explanations. The setup is clean: one model (LaMDA 137B), the same problems, and three alternative prompts, each built to supply one suspected ingredient of a chain of thought without the rest.
Chain of thought. Reasoning in plain language, then the answer. The full method.
Equation only. Prompt the model to emit just the mathematical expression (the 5 + 6 = 11 with none of the words) and then the answer. If the gain were really about getting the equation onto the page, this should recover most of it. On GSM8K it does not: 5.4%, actually below the 6.5% standard baseline. The problems are too tangled to compress straight into an equation without the natural-language steps that set it up. (On the easier MAWPS benchmark, equation-only does help, 50.1% against a 43.2% baseline, because once you have the equation those problems are trivial. So equation-only lands in between, useful when the translation is easy and useless when it is hard. The full chain is what carries the hard cases.)
Variable compute only. This tests the worry that the win is just spending more tokens. So prompt the model to output a row of dots, the same number of characters the equation would have, then the answer. Pure compute, the same token budget, zero content. Result: 6.4%, indistinguishable from the 6.5% baseline. Spending more tokens is not the mechanism. The tokens have to mean something.
Reasoning after the answer. This tests the worry that the chain merely warms the model up and helps it surface knowledge, with the reasoning incidental. So flip the order: have the model give the answer first and the reasoning afterward. Now the chain cannot inform the answer, since the answer already came out. Result: 6.1%, again at baseline. The reasoning has to come before the answer, where the answer can depend on it. The sequence is doing real work; it is not a warm-up.
Put the three together and the conclusion is sharp. It is not the equation, it is not the token count, and it is not knowledge-jogging. It is natural-language reasoning, placed before the answer, where the answer can follow from it. That is the one variant that moves the number: 14.3% against a clustered 5 to 6.5% for everything else. The deflationary explanations are dead.
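One way to picture the ablation is as a single exemplar rendered five ways, each keeping one ingredient of the chain and dropping the rest. The rendering below is a sketch: the variant names and exact string layouts are assumptions, not the paper's prompt files. The GSM8K scores are the LaMDA 137B figures quoted in this section.

```python
EXEMPLAR = {
    "question": "Roger has 5 balls. He buys 2 cans of 3 balls each. How many now?",
    "chain": "Roger started with 5. 2 cans of 3 is 6 balls. 5 + 6 = 11.",
    "equation": "5 + 6 = 11",
    "answer": "11",
}

def demonstrated_answer(ex, variant):
    """Render the text that follows 'A:' in a demonstration, for one ablation variant."""
    final = f"The answer is {ex['answer']}."
    variants = {
        "standard": final,
        "chain_of_thought": f"{ex['chain']} {final}",
        "equation_only": f"{ex['equation']} {final}",
        # same character count as the equation, but no content
        "variable_compute": f"{'.' * len(ex['equation'])} {final}",
        # the reasoning is present but cannot inform the answer
        "reasoning_after_answer": f"{final} {ex['chain']}",
    }
    return variants[variant]

# GSM8K solve rates (%) for LaMDA 137B, as quoted in the text
GSM8K_LAMDA_137B = {
    "standard": 6.5,
    "equation_only": 5.4,
    "variable_compute": 6.4,
    "reasoning_after_answer": 6.1,
    "chain_of_thought": 14.3,
}

for name, score in sorted(GSM8K_LAMDA_137B.items(), key=lambda kv: kv[1]):
    print(f"{score:5.1f}%  {name:24s} A: {demonstrated_answer(EXEMPLAR, name)}")
```

Sorted by score, four variants sit between 5.4% and 6.5%, and only the one with natural-language steps before the answer stands apart.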
Beyond arithmetic, beyond the examples
Math is where the effect is sharpest, but the method is made of language, so it travels. The paper runs it on two more families of task.
Commonsense reasoning is questions about how the everyday world works: would a pear sink in water, is this sports sentence plausible, what sequence of actions fulfills a spoken request. Here a chain of thought is the model stating the relevant facts before it commits. On Sports Understanding, PaLM 540B with a chain of thought reaches 95.4%, past both the prior system and an unaided human sports fan at 84%. On StrategyQA it clears the prior state of the art. The gain is not universal, and the authors say so: on CSQA it was minimal. Chain of thought is a lever, not a miracle, and it helps most where the task genuinely breaks into stated steps.
Symbolic reasoning is where the most revealing evidence lives, because two toy tasks isolate something the math problems cannot. Last-letter concatenation: take the last letter of each word in a name and string them together. Coin flip: a coin starts heads, some people flip it and some do not, is it still heads? Both are trivial to specify, and both have a property the authors exploit. You can build a test case longer than any example you showed the model. Show it only two-word names, then ask about four-word names. Show it two-person coin flips, then ask about four-person ones.
This is the cleanest evidence that the model is running a procedure rather than matching whole answers. With standard prompting, the model saw two-step cases and fails the moment you go longer; out of domain it collapses toward chance. With a chain of thought, it keeps the running state and applies the same per-step rule to however many steps you hand it, so it generalizes to lengths it was never shown. PaLM 540B solves the in-domain coin flip essentially perfectly and keeps tracking on longer sequences (accuracy does sag as the chains get long, so this is graceful degradation, not magic). A model that only memorized whole answers for two-flip cases could not do that. Writing the running state out at each step and reading it back at the next can. (The paper shows this behavior; it does not pin down how the model holds the state inside.)
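Both toy tasks are a per-step rule applied over a list, which is why they can be extended to any length. The sketch below writes that rule out and produces a chain alongside the answer. The function names and the wording of the generated chains are assumptions for illustration, not the paper's exemplars, and the names used are made up.

```python
def last_letter_chain(name):
    """Last-letter concatenation: one step per word, then join the letters."""
    words = name.split()
    steps = [f'The last letter of "{w}" is "{w[-1]}".' for w in words]
    answer = "".join(w[-1] for w in words)
    steps.append(f'Concatenating them gives "{answer}".')
    return " ".join(steps), answer

def coin_flip_chain(flips):
    """Coin flip: carry the running state forward, updating it once per person.

    flips: list of booleans, True if that person flips the coin.
    """
    heads = True
    steps = ["The coin starts heads up."]
    for i, flipped in enumerate(flips, start=1):
        if flipped:
            heads = not heads
        action = "flips" if flipped else "does not flip"
        side = "heads" if heads else "tails"
        steps.append(f"Person {i} {action} the coin, so it is {side} up.")
    return " ".join(steps), ("yes" if heads else "no")

# In-domain length (2 steps) versus a longer, out-of-domain case (4 steps)
print(coin_flip_chain([True, False])[0])
print(coin_flip_chain([True, True, False, True])[1])
print(last_letter_chain("Ada Grace Alan Kurt")[1])
```

The loop body is identical for two steps or four, which is the sense in which a written-out running state generalizes to longer inputs while a memorized two-step answer cannot.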
And again the floor is scale. These tasks are toys precisely because the chain in the exemplars already spells out the perfect procedure; all the model has to do is repeat the steps with new symbols. Small models still fail even that. The ability to mechanically apply a shown procedure to unseen symbols is, itself, something that only shows up around 100 billion parameters.
Not a fragile prompt trick
A fair worry about any prompting result is that it is an artifact of one lucky prompt. Few-shot prompting is famously twitchy: there are documented cases where merely reordering the demonstrations swings accuracy from near chance to near state of the art. So is chain of thought just a good roll of the dice on eight hand-picked examples?
The paper checks, and the answer is no. Three different people wrote independent chains of thought for the same questions. A deliberately concise rewrite was tried. And three more sets of eight examples were sampled at random from the GSM8K training data instead of being hand-written. Every one of these prompt sets beats standard prompting by a wide margin. On GSM8K with LaMDA 137B they land between roughly 11% and 18%, all well above the 6.5% baseline. There is real variance between them, as expected for prompting, but the floor of the chain-of-thought results sits comfortably above the ceiling of standard prompting. The effect survives changing the author, the wording, the style, and the specific examples. It is the method that helps, not one magic prompt. (They report it is robust to the order of the examples and to how many you use, as well.)
What it does, and what it does not
Step back to the headline, now that it is earned.
On GSM8K, PaLM 540B with eight chain-of-thought examples and no training scores 57%, edging past the 55% the previous record-holder reached by finetuning a 175B model on the full dataset and bolting on a verifier that re-ranked a hundred sampled solutions. The same PaLM with standard prompting manages 18%. The distance between 18 and 57 is the chain, and it cost eight hand-written examples.
The broader claim the authors draw is the one worth keeping. Standard prompting had made these models look like they could not do multi-step reasoning, and that was largely a measurement artifact. The capability was latent; the prompt was hiding it. Their phrasing: standard prompting gives only a lower bound on what a large model can do. Change how you ask, and a flat scaling curve becomes a steep one. That reframing, more than any single benchmark number, is why this short paper mattered. It suggested that much of "the model cannot do X" was really "we have not asked for X correctly," and it set off a wave of prompting methods: self-consistency, least-to-most, the zero-shot "think step by step," and eventually models trained to produce long chains by default.
The limits are real, and the paper states them. It does not show the model is "reasoning" in any deep sense. It shows that producing reasoning-shaped text helps, which is not the same claim, and they leave the deeper question open. There is no guarantee the chain is correct: a model can reach the right answer through a broken chain (in one hand-checked sample of 50 correct GSM8K answers from LaMDA, two got there by luck on faulty reasoning) or a wrong answer through a chain that looked fine. The method is also costly to deploy, precisely because it works only on very large models, which are expensive to serve. And while writing eight examples is cheap, building a large labeled set of chains for finetuning would not be. None of this dents the finding. It marks its edges.
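The gap between "right answer" and "right chain" can be partly illustrated in code. The sketch below checks only the explicit `a op b = c` statements in a chain. It is an assumption-laden illustration, not how the paper audited chains (that was done by hand), and it cannot catch a step that is arithmetically true but logically irrelevant. `faulty_steps` is a hypothetical name.

```python
import operator
import re

NUM = r"-?\d+(?:\.\d+)?"
# Matches simple binary statements such as "23 - 20 = 3"
STEP_RE = re.compile(rf"({NUM})\s*([+\-*/])\s*({NUM})\s*=\s*({NUM})")
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def faulty_steps(chain):
    """Return the arithmetic statements in a chain whose stated result is wrong."""
    bad = []
    for a, op, b, claimed in STEP_RE.findall(chain):
        statement = f"{a} {op} {b} = {claimed}"
        try:
            actual = OPS[op](float(a), float(b))
        except ZeroDivisionError:
            bad.append(statement)
            continue
        if abs(actual - float(claimed)) > 1e-9:
            bad.append(statement)
    return bad

good = "23 - 20 = 3. 3 + 6 = 9. The answer is 9."
broken = "23 - 20 = 4. 4 + 6 = 10. The answer is 10."
print(faulty_steps(good))    # []
print(faulty_steps(broken))  # ['23 - 20 = 4']
```

A chain that passes this check can still be wrong in ways no regular expression sees, which is why the paper's caveat stands: a plausible-looking chain is not proof of a correct path.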
The lasting lesson is embarrassingly simple. The right answers were reachable the whole time, and we were asking for them the wrong way. Show a model the working, on enough examples, once it is large enough to follow along, and the steps come out and the answers land. Whether that adds up to reasoning is the question the paper leaves open. That the capability was sitting in the weights, waiting on the shape of the prompt, is not.
Questions you might still have
Does chain-of-thought prompting change the model’s weights?
No. Nothing about the model changes. The worked examples are just extra tokens in the prompt, read in one forward pass. The paper finetunes nothing; a single frozen checkpoint does every task.
Why does it only work for big models?
Below roughly 10B parameters the chains are fluent but illogical, and they lower accuracy. Producing a coherent, correct chain is itself a capability that only appears with scale, on these math tasks around 100B. The companion paper calls it an emergent ability and is explicit that the threshold is task-specific, not a universal law.
Is the model actually reasoning, or imitating the example’s shape?
The paper is careful here. It works even on inputs longer than any example (more people in the coin flip, more words in the letter task), which looks like a per-step procedure rather than whole-answer copying. But there is no guarantee of a correct path: a chain can be wrong yet reach the right answer, or right yet for the wrong reason. Whether this is "reasoning" is left open.
Is this the same as "Let’s think step by step"?
No. That is zero-shot chain of thought (Kojima et al., May 2022), a later and separate paper that needs no examples. This one is few-shot: you supply eight worked examples. Self-consistency, which samples many chains and takes the majority answer, is another later follow-up. This paper is few-shot and greedy.
Footnotes & further reading
- The paper: Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Google Research, NeurIPS 2022).
- The baseline it rides on: Brown et al., Language Models are Few-Shot Learners (GPT-3), which introduced the zero / one / few-shot in-context-learning setup with no weight updates.
- The benchmark and the prior best: Cobbe et al., Training Verifiers to Solve Math Word Problems (GSM8K), whose finetuned-175B-plus-verifier system set the ~55% record on GSM8K.
- The emergence framing: Wei et al., Emergent Abilities of Large Language Models (TMLR), the companion paper that defines emergence and lists chain-of-thought reasoning as an example.
- The models: Chowdhery et al., PaLM (8B / 62B / 540B), and Thoppilan et al., LaMDA (up to 137B). The GPT-3 sizes used in the paper (text-davinci-002 and friends) are the authors' stated presumption, not OpenAI-confirmed figures.
- Later and related, both distinct from this paper: Wang et al., Self-Consistency Improves Chain-of-Thought Reasoning (sample many chains, take the majority answer), and Kojima et al., Large Language Models are Zero-Shot Reasoners (the zero-shot "Let's think step by step").