Verified · arXiv:2203.02155 · 26 min
RL & alignment · LLMs

Training language models to follow instructions with human feedback

A 1.3B model, aligned with human feedback, beats one 100× its size.

GPT-3 was trained to predict the next word, not to do what you ask. InstructGPT fixes the objective with three steps and a crowd of human judges, and the result is a smaller model people would rather talk to.

Explaining the paper: Training language models to follow instructions with human feedback. Ouyang, Wu, Jiang, et al. (OpenAI) · NeurIPS 2022 · arXiv:2203.02155

What does it take to turn a model that finishes your sentence into one that answers your question?

Bigger is not more obedient

GPT-3 is trained on one objective: given a stretch of internet text, predict the next token. Do that across a few hundred billion tokens and you get something remarkable at exactly one skill, continuing text in a way that looks plausible. What you do not get is an assistant. Ask the raw model a question and it might answer it, or it might list three more questions, or imitate a forum thread it half-remembers, because all of those are plausible continuations of a question on the internet.

This is the alignment gap: the objective we can train (predict the next token) is a stand-in for the objective we actually want (follow the user's instructions helpfully, honestly, and harmlessly), and the two come apart in the open. Scaling the model up does not close the gap. It only sharpens the text-continuer: a 175B model is better autocomplete, not a better assistant.

InstructGPT is OpenAI's answer to that gap. The method itself is not new. The paper says so plainly: it follows Ziegler et al. (2019) and Stiennon et al. (2020), who fine-tuned language models from human preferences for stylistic continuation and summarization, which in turn build on Christiano et al. (2017), the paper that first learned a reward from human preferences (for game-playing and simulated robots, with no language and no PPO in sight). What InstructGPT does is point that machinery at the messy, open-ended distribution of things people actually type into an API, and show it works there.

The headline is the kind of result that reorders priorities. A 1.3B InstructGPT model, one hundred times smaller than the 175B GPT-3, produces answers that labelers prefer to the giant's. Same model family, far fewer parameters, and better liked, because it was trained against the objective people care about. The rest of this piece is how that happens.

Three terms will recur, so pin them down now. Helpful means it does what you asked. Honest means it does not make things up. Harmless means it declines to do damage. These are the axes the labelers were asked to judge, and the whole apparatus below exists to push the model along them.

The plan: three steps

You cannot write down "be helpful" as a loss function. So InstructGPT learns it in three steps, each one handing something to the next.

Figure 1 · the pipeline
Step 1 fine-tunes GPT-3 to imitate demonstrations. Step 2 trains a reward model on ranked comparisons. Step 3 optimizes the policy on fresh prompts to maximize that reward, leashed to the step-1 model. Steps 2 and 3 can repeat on the improved policy.

Step 1, supervised fine-tuning (SFT). Labelers write good answers to a batch of prompts, and GPT-3 is fine-tuned to copy them. This is ordinary supervised learning: show the model the desired behavior and nudge its weights toward producing it. The paper used about 13,000 demonstration prompts. Oddly, the SFT model overfits its validation loss after a single epoch, yet training it longer (16 epochs) still improves the downstream reward and human ratings, so they keep going past the point where the loss says to stop.

SFT gets you a model that follows instructions reasonably well. It is also where the other two steps start from. The reward model in step 2 is built on top of the SFT model, and in step 3 both the policy being trained and the reference it is held against are copies of the SFT model. SFT is the floor the rest of the method stands on.

So why not stop here and just collect more demonstrations? Because demonstrations have a ceiling. They teach the model to imitate what a labeler would write, and no more. To get past that ceiling you need a way to tell the model that one answer is better than another, including answers no labeler wrote. That is steps 2 and 3, and the reason they exist starts with a fact about human effort.

Comparing beats writing

Judging is cheaper than producing. It is faster to taste two dishes and say which is better than to cook a great one. It is faster to read two answers and pick the stronger than to write the strong one yourself. And the judgment is often more reliable, because you can recognize quality you could not generate.

InstructGPT leans on this hard. Instead of asking labelers to write answers, it shows them several answers the model already produced and asks them to rank them. A single ranking is generous with data. Rank $K$ responses and you can read off the winner of every pair at once, which is $\binom{K}{2}$ comparisons from one act of judgment. The paper shows labelers between four and nine responses per prompt, and the harvest grows quickly with $K$:

Figure 2 · one ranking, many pairs
Ranking $K$ responses to a prompt yields $\binom{K}{2}$ pairwise comparisons, from 6 at $K=4$ to 36 at $K=9$. Each chord is one training pair for the reward model, and all the pairs from a prompt are fed as a single batch element.
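The counting is easy to check by hand. The sketch below expands a best-to-worst ranking into its (winner, loser) pairs. The helper name `ranking_to_pairs` is hypothetical, and the sketch assumes a strict ranking with no ties.

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (winner, loser) pairs.

    combinations() preserves input order, so the first item of each
    pair is always the one the labeler ranked higher.
    """
    return list(combinations(ranked, 2))

# One ranking of K responses yields C(K, 2) comparisons.
pair_counts = {k: len(ranking_to_pairs(list(range(k)))) for k in range(4, 10)}
print(pair_counts)  # {4: 6, 5: 10, 6: 15, 7: 21, 8: 28, 9: 36}
```

For example, `ranking_to_pairs(["A", "B", "C"])` gives the three pairs A over B, A over C, and B over C.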

That last clause hides a real lesson. The obvious thing to do with all these pairs is to throw them into one pile and shuffle. The paper tried that and the reward model overfit after a single pass, because the $\binom{K}{2}$ pairs from one prompt are highly correlated (they reuse the same handful of responses), so shuffling lets the model see each response in up to $K-1$ separate gradient updates. The fix is to treat all of a prompt's comparisons as one batch element: one forward pass per response, every pair scored from those same scores. It stops overfitting and it is cheaper. With about 33,000 ranked prompts, this is the data the reward model learns from.

A model of human taste

Now the central trick. A pile of "A beats B" judgments is not yet something you can optimize against. You cannot run reinforcement learning on a lookup table of past comparisons; the model will produce new answers that are in nobody's table. So InstructGPT trains a second network, the reward model, to predict the judgments. Feed it a prompt and a response and it returns a single number, a scalar reward $r_\theta$, that stands in for "how much would a labeler like this."

How do you turn pairwise preferences into a scalar? With a model that is older than deep learning. Give each response a score, and say the probability a human prefers the winner is the logistic of the score gap. This is the Bradley-Terry model (1952); the paper does not call it that, but that is what it is, framed instead as a cross-entropy loss in which the reward difference is the log-odds of preference:

$$P(y_w \succ y_l) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) = \frac{e^{\,r_\theta(x, y_w)}}{e^{\,r_\theta(x, y_w)} + e^{\,r_\theta(x, y_l)}}$$

Only the gap $r_w - r_l$ matters. Put numbers on it: if the model scores response A at 5 and response B at 3, the gap is 2, and the model says a labeler prefers A with probability $\sigma(5 - 3) = \sigma(2) \approx 0.88$. The score gap is the log-odds of the preference: a gap of 2 means A is favored about $0.88 / 0.12 \approx 7$ to one. That is the standard way to model a noisy judgment: one party is better by some amount, and the verdict goes their way more often the larger that amount, never with certainty. When the two scores are equal the preference is a coin flip at fifty percent; as the winner pulls ahead, the probability rises along an S-curve toward one.

Figure 3 · the reward gap becomes a preference
The chance a labeler prefers the winner is $\sigma(r_w - r_l)$, the logistic of the reward gap. A gap of zero is a coin flip; a positive gap bends the vote split toward the winner. Training the reward model means choosing scores so this matches the labelers.
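The worked numbers can be reproduced in a few lines. This is a minimal sketch with a hypothetical helper name, `preference_probability`. It also checks that the logistic form and the softmax-over-two-scores form of the equation agree.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(r_winner, r_loser):
    """Bradley-Terry probability that the first response is preferred."""
    return sigmoid(r_winner - r_loser)

p = preference_probability(5.0, 3.0)
odds = p / (1.0 - p)  # equals exp(gap), so the gap is the log-odds

# The same probability written as a softmax over the two scores.
p_softmax = math.exp(5.0) / (math.exp(5.0) + math.exp(3.0))

print(round(p, 4), round(odds, 2))  # 0.8808 7.39
```

Shifting both scores by the same amount leaves `p` unchanged, because only the difference enters the sigmoid.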

To train the reward model you minimize the negative log-likelihood of the labelers' choices under this model. That is equation (1), with the $1/\binom{K}{2}$ averaging each prompt's pairs:

$$\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big] \tag{1}$$

Read it as one instruction: make the preferred response score higher than the dispreferred one, and pay a penalty, $-\log\sigma(r_w - r_l)$, that shrinks toward zero as the gap grows the right way and blows up when the model ranks a pair backwards. In code it is a loop over a prompt's pairs, scored from one forward pass per response:

# reward model: one batch element = all C(K,2) pairs from ONE prompt
resps  = rank_responses(prompt)       # K=4..9, labeler-ranked best->worst
scores = reward_model(prompt, resps)  # one forward pass per response, one scalar each
pairs  = winner_loser_pairs(resps)    # every (preferred, dispreferred) index pair
loss = 0
for w, l in pairs:
    loss += -log_sigmoid(scores[w] - scores[l])
loss = loss / len(pairs)              # the 1 / C(K,2) average
loss.backward()                       # one backward pass for the whole prompt
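For readers who want to run the loss on numbers, here is a dependency-free sketch of the per-prompt average. It takes scores that are already computed (the values are made up) instead of calling a real reward model, and it computes no gradients. The names `log_sigmoid` and `prompt_loss` are illustrative, not the paper's code.

```python
import math
from itertools import combinations

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def prompt_loss(scores_best_to_worst):
    """Average -log sigmoid(r_w - r_l) over all C(K, 2) pairs of one prompt."""
    pairs = list(combinations(range(len(scores_best_to_worst)), 2))
    total = 0.0
    for w, l in pairs:  # index w is ranked above index l
        total += -log_sigmoid(scores_best_to_worst[w] - scores_best_to_worst[l])
    return total / len(pairs)

agrees = prompt_loss([3.0, 2.0, 1.0, 0.0])     # scores follow the labeler's ranking
undecided = prompt_loss([0.0, 0.0, 0.0, 0.0])  # every pair is a coin flip: log(2)
reversed_ = prompt_loss([0.0, 1.0, 2.0, 3.0])  # scores contradict the ranking
print(agrees < undecided < reversed_)  # True
```

The three cases line up with the reading of equation (1): the loss is small when scores agree with the ranking, exactly `log(2)` when the model is indifferent, and large when the order is backwards.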

Two practical notes the paper is careful about. First, the loss depends only on the gap, so adding a constant to every reward changes nothing. The scores are determined only up to a shift. Before RL begins they pin the scale down by adding a bias so the labeler demonstrations average a reward of zero, giving the numbers a fixed reference point.
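Both halves of that note can be checked numerically. The sketch below uses made-up reward values. It first confirms that a constant shift leaves the pairwise loss unchanged, then picks the bias that centers a hypothetical set of demonstration rewards at zero.

```python
import math

def pair_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l), written as log(1 + exp(-gap))."""
    return math.log1p(math.exp(-(r_w - r_l)))

# 1) Shift invariance: adding the same constant to both rewards changes nothing.
r_w, r_l, shift = 1.7, 0.4, 10.0  # made-up values
same_loss = math.isclose(pair_loss(r_w, r_l), pair_loss(r_w + shift, r_l + shift))

# 2) Fixing the reference point: choose a bias so demonstrations average zero.
demo_rewards = [2.1, 1.4, 3.0, 0.9]  # made-up rewards on labeler demonstrations
bias = -sum(demo_rewards) / len(demo_rewards)
normalized = [r + bias for r in demo_rewards]
mean_after = sum(normalized) / len(normalized)

print(same_loss, abs(mean_after) < 1e-12)  # True True
```

Because the bias is itself a constant shift, adding it does not change any pairwise preference the reward model has learned.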

Second, the size. You might expect the reward model to be the 175B giant. It is not. They only use 6B reward models, and one single 6B model is reused across every policy size. The reason is twofold: training a 175B reward model was unstable, which made it a poor initialization for the value function PPO needs, and a 175B reward model plus value function would have ballooned the compute of step 3. The 6B model was stable and led to equally strong policies. The judge does not have to be larger than what it judges.

Chase the reward

With a reward model in hand, step 3 is reinforcement learning. The setup is a one-shot bandit: a prompt arrives, the model (now the policy $\pi^{\mathrm{RL}}$) writes a full response, the reward model scores it, and the episode ends. The policy's job is to produce responses the reward model scores highly. The algorithm that does the optimizing is PPO (Schulman et al., 2017), which has its own explainer. The value function PPO needs is initialized from the reward model, which already knows the shape of "good."

If reward were the whole story, you would just maximize $r_\theta$. It is not the whole story, and the term that protects against that is the heart of the method. The objective InstructGPT actually maximizes is equation (2):

$$\operatorname{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi^{\mathrm{RL}}_\phi}}\!\Big[\, r_\theta(x, y) - \beta \log \frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \,\Big] + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\big[\log \pi^{\mathrm{RL}}_\phi(x)\big] \tag{2}$$

Three pieces. The first, $r_\theta(x,y)$, is the reward we want to climb. The third, the $\gamma$ term, is a separate fix we will get to in the alignment-tax section; for plain "PPO" models $\gamma$ is set to zero. The middle term is the leash:

$$-\,\beta \log \frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}$$

That log-ratio is, in expectation over tokens the policy samples, the KL divergence from the frozen SFT model. (Strictly: a single token's log-ratio is not a KL value and can even be negative; it is an unbiased estimator of the KL only on average. The paper adds it as a per-token penalty.) Read the KL as a probabilistic odometer: it is zero when the policy assigns text the same probabilities the SFT model does, and it grows as the policy shifts probability mass onto text the SFT model considered unlikely. The penalty does not pin any particular output; it charges for total redistribution. Subtracting it with coefficient $\beta$ means: chase the reward, but pay for every step you take away from where you started. The figure below lays that bargain out as curves and shows where the peak of the combined objective settles as $\beta$ changes. Why you would ever want to hold the model back is the next section.

Figure 4 · where the optimum settles
The two terms of equation (2), over a schematic 1D policy space (drift from SFT on the horizontal axis). The reward proxy rises and flattens; the drift toll $-\beta\,\mathrm{KL}$ bows downward from the SFT anchor and accelerates. Their sum, the bright combined objective, peaks at finite drift: the dot. A bigger $\beta$ deepens the toll and pulls the optimum home; a smaller one lets it settle far out. The paper charges the log-ratio per token, so the toll accrues along the whole generation. Curve shapes are schematic, not measurements; the shape is the claim.
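The point about single log-ratios being possibly negative, while their average under the policy is the KL, can be seen on a toy example. The two three-token distributions below are made up for illustration. The sketch compares the exact KL with a Monte Carlo average of sampled log-ratios.

```python
import math
import random

pi_rl = [0.7, 0.2, 0.1]   # hypothetical next-token distribution of the policy
pi_sft = [0.5, 0.3, 0.2]  # hypothetical next-token distribution of the frozen SFT model

def log_ratio(token):
    """log pi_RL(token) - log pi_SFT(token) for one sampled token."""
    return math.log(pi_rl[token] / pi_sft[token])

# Exact KL(pi_RL || pi_SFT): the expectation of the log-ratio under pi_RL.
exact_kl = sum(p * log_ratio(t) for t, p in enumerate(pi_rl))

# Monte Carlo estimate: sample tokens from the policy and average the log-ratio.
rng = random.Random(0)
samples = rng.choices(range(3), weights=pi_rl, k=200_000)
estimate = sum(log_ratio(t) for t in samples) / len(samples)

print([round(log_ratio(t), 3) for t in range(3)])  # [0.336, -0.405, -0.693]
print(round(exact_kl, 4))                          # 0.0851
```

Two of the three per-token log-ratios are negative, yet the exact KL is positive, and `estimate` lands close to `exact_kl` because tokens are sampled from the policy itself.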

The reward PPO sees at each token is the model's score minus that penalty, assembled like this:

# step 3: build the reward PPO actually optimizes, per token
y      = policy.sample(x)             # policy = pi_RL, initialized from SFT
r_end  = reward_model(x, y)           # one scalar, credited at the last token
# per-token KL penalty: stay close to the frozen SFT reference
kl_pen = beta * (logp(policy, y, x) - logp(sft, y, x))   # one value per token of y
reward = -kl_pen                      # every token pays its penalty
reward[-1] += r_end                   # only the last token also gets the RM score
# PPO-ptx only: also raise log-prob on a batch of pretraining text
loss_ptx = -gamma * logp(policy, pretrain_batch)
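To make the shaping concrete, here is a small runnable sketch with made-up per-token log-probabilities in place of real models. The helper `shaped_rewards` is hypothetical and the `beta` value is illustrative only. The layout, with the reward-model score credited at the final token, follows the comment in the pseudocode above and is an assumption of this sketch.

```python
def shaped_rewards(logp_rl, logp_sft, r_end, beta):
    """Per-token rewards: a KL penalty at every token, RM score added at the last."""
    rewards = [-beta * (lr - ls) for lr, ls in zip(logp_rl, logp_sft)]
    rewards[-1] += r_end
    return rewards

logp_rl = [-1.0, -0.5, -2.0]   # hypothetical per-token log-probs under the policy
logp_sft = [-1.2, -0.9, -1.5]  # hypothetical per-token log-probs under frozen SFT
rewards = shaped_rewards(logp_rl, logp_sft, r_end=1.5, beta=0.02)

# Summed over the response, the shaped reward is r_end - beta * (total log-ratio).
total = sum(rewards)
expected_total = 1.5 - 0.02 * (sum(logp_rl) - sum(logp_sft))
print(abs(total - expected_total) < 1e-12)  # True
```

Note that the third token has a negative log-ratio, so its penalty is a small bonus. Summed over the response, the per-token terms recover the bracketed quantity in equation (2) for that sample.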

A note to head off a common confusion. There are two distinct uses of KL in this story, and they are not the same leash. PPO has its own mechanism, a clipped surrogate that limits how far each update moves the policy from the previous step's policy. The KL penalty above is different: it measures distance from the fixed SFT model and is folded into the reward, not the optimizer. PPO's clip keeps each step small; the KL penalty keeps the whole journey near home.

The reason both are needed is that they guard against different failures. PPO's clip is a per-update trust region: it stops any single gradient step from lurching the policy somewhere the reward estimate cannot be trusted, which keeps the optimization stable. But many small, individually-clipped steps still compound, and nothing in the clip cares where they started. A run that takes a thousand tiny, well-behaved steps can still walk the policy a long way from the SFT model it began as. The KL penalty is the cumulative leash that the clip is not: it charges for total drift from SFT no matter how gently you accumulated it, so the policy can chase reward but pays a growing toll the further it strays from where it started. One bounds the size of a step; the other bounds the distance of the destination.
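A toy calculation shows the difference between the two bounds. The function below is the standard per-sample clipped surrogate from Schulman et al. (2017). The drift loop after it is a deliberately crude, hypothetical model: it assumes each update multiplies one token's probability by the full 1.2 the clip range allows, which real PPO runs need not do.

```python
import math

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample; ratio = pi_new / pi_old."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Inside the clip range the objective tracks ratio * advantage; past it, it goes flat,
# so there is no incentive to move further in a single update.
inside = clipped_surrogate(1.1, 1.0)
outside = clipped_surrogate(1.5, 1.0)
print(inside, outside)  # 1.1 1.2

# Toy drift: twenty updates, each only 1.2x relative to the PREVIOUS policy.
p_sft = 0.01  # hypothetical probability the SFT model gives some token
p = p_sft
for _ in range(20):
    p = min(1.0, p * 1.2)
drift = math.log(p / p_sft)  # log-ratio against the fixed SFT anchor
print(round(drift, 2))  # 3.65
```

Every individual step respects the clip, but the log-ratio against SFT has grown to about 3.65, a roughly 38-fold change in probability. The KL penalty in equation (2) charges for exactly this accumulated quantity.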

Why you cannot just maximize it

The reward model is not human judgment. It is a 6B-parameter guess at human judgment, trained on a finite cloud of (prompt, response) pairs, and its judgment is trustworthy only near that cloud. Push the policy far enough outside it and the score becomes extrapolation: optimize hard enough and the policy will find responses that score high not because they are good, but because they exploit the reward model's blind spots, outputs the reward model rates highly for no reason a human would endorse. This is Goodhart's law: a measure that becomes a target stops being a good measure.

The picture below is the one to hold. As the policy moves away from SFT (measured as KL distance, on the horizontal axis), the proxy reward the model reports keeps climbing. But the true quality a human would assign rises, peaks, and then falls. Past the peak you are no longer improving the model; you are teaching it to game the proxy. The leash strength $\beta$ decides where the optimum lands:

Figure 5 · the leash and over-optimization
The proxy reward rises with distance from SFT; true quality peaks then falls (the shaded over-optimization zone). The objective $r - \beta\,\mathrm{KL}$ is maximized at a finite distance. A bigger $\beta$ shortens the leash toward SFT; too small and the policy runs past the peak. The turn-over curve is the documented phenomenon (Stiennon 2020; Gao 2022), shown schematically, not a plot from InstructGPT.

This is exactly what $\beta$ buys. The KL penalty does not care about reward at all; it only cares about distance from the frozen SFT model. Adding it bends the objective so its maximum sits at a finite KL, short of the cliff where true quality collapses. Set $\beta$ too high and the leash is so tight the model barely improves. Set it too low and the policy sprints past the peak into the over-optimized region, scoring beautifully on the reward model and badly with people. InstructGPT cites over-optimization as the reason the term is there at all.
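How $\beta$ moves the optimum can be seen in a toy model. The sketch assumes, purely for illustration, a proxy reward that grows like $a\sqrt{\mathrm{KL}}$. That shape is made up, not measured from InstructGPT. Under it, the objective $a\sqrt{\mathrm{KL}} - \beta\,\mathrm{KL}$ has a closed-form maximizer at $(a / 2\beta)^2$.

```python
import math

def objective(kl, beta, a=1.0):
    """Toy combined objective: made-up proxy reward a*sqrt(kl) minus the KL toll."""
    return a * math.sqrt(kl) - beta * kl

def optimum_kl(beta, a=1.0):
    """Maximizer of the toy objective, from setting its derivative to zero."""
    return (a / (2.0 * beta)) ** 2

for beta in (0.05, 0.1, 0.2):
    k_star = optimum_kl(beta)
    # Sanity check: nearby points do no better than the closed-form optimum.
    assert objective(k_star, beta) >= objective(0.9 * k_star, beta)
    assert objective(k_star, beta) >= objective(1.1 * k_star, beta)
    print(beta, round(k_star, 2))
```

Doubling `beta` cuts the optimal drift by a factor of four in this toy model. The direction matches the text (a stronger penalty keeps the policy closer to SFT); the specific numbers carry no meaning.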

One honest caveat about the figure. The turn-over of true quality is a real, documented phenomenon, but it is not InstructGPT's own result. The canonical version is in Stiennon et al. (2020), and it was later quantified as clean scaling laws by Gao, Schulman, and Hilton (2022), who measured it against the square root of the KL distance. The curve above is schematic, drawn to show the shape, not to report numbers. InstructGPT inherits the lesson and the leash, not the plot.

The alignment tax

Steps 1 to 3 produce a far more helpful model. They also quietly cost something. When you push a model hard toward one objective, it can get worse at the things that objective does not mention. The paper measured this directly: a plain PPO model regresses against GPT-3 on a spread of public NLP benchmarks, including SQuADv2, DROP, HellaSwag, and WMT 2015 French-to-English translation. The authors call this the alignment tax, an extra cost you pay for aligning the model. If the tax is high, people will quietly keep using the unaligned-but-capable model, which is the opposite of what you want.

The fix is the $\gamma$ term from equation (2), the one we set aside. The regressions say the policy is losing general language competence it had absorbed in pretraining, so the remedy is to keep rehearsing it: during RL, mix in batches of the original pretraining data and reward the model for keeping its log-probability on them high. This pulls the policy back toward being a competent language model while the reward term pulls it toward being helpful. Models trained this way are called PPO-ptx ("ptx" for pretraining mix), and unless the paper says otherwise, "InstructGPT" means PPO-ptx. The two compare like this:

Figure 6 · paying the tax back
Both PPO and PPO-ptx are far more helpful than GPT-3. But on public NLP benchmarks plain PPO drops below the GPT-3 baseline (the tax), while PPO-ptx recovers to it. Bar heights are schematic; the directions follow the paper's Figures 1 and 29.

The win is real but narrower than a slogan, in two respects. Adding the pretraining mix mitigates the regressions on all of these datasets, and on HellaSwag it even surpasses GPT-3, but on DROP, SQuADv2, and translation it recovers toward the baseline rather than beating it. And the paper checked the obvious alternative, fighting the tax by cranking up the KL coefficient $\beta$ to keep the policy closer to GPT-3: mixing in pretraining works better, since raising $\beta$ costs reward and never fully recovers DROP and SQuAD. The two terms in equation (2) do different jobs: $\beta$ fights over-optimization, $\gamma$ pays the alignment tax.

Does it actually work

On the metric the whole method targets, yes, and convincingly. Labelers compared outputs head to head. The 175B InstructGPT is preferred to the 175B GPT-3 about 85% of the time, and preferred 71% of the time even to a GPT-3 that has been hand-prompted into instruction-following mode. The result that traveled is the small one: the 1.3B InstructGPT, a hundred times smaller, is still preferred to the 175B GPT-3.

Figure 7 · who wins the head-to-head
Labeler win-rate, drawn as a tug-of-war. The 175B InstructGPT beats the 175B GPT-3 85 ± 3% of the time and a few-shot GPT-3 71 ± 4%. The punchline: the 1.3B InstructGPT still beats the 175B GPT-3, a 100× smaller model. (The exact rate for that last matchup is not reported, so it is shown as above 50%.)

The helpfulness shows up on concrete axes too. Compared to GPT-3, InstructGPT more often attempts the right instruction, follows explicit constraints ("answer in two paragraphs"), and makes up fewer facts on closed-domain tasks. It even generalizes a little past its training: it can follow instructions in languages other than English and answer questions about code, despite those being a tiny slice of the fine-tuning data, which is a hint that alignment can transfer to inputs no labeler directly supervised.

On honesty and harmlessness the story is mixed, and the paper is candid about it. On TruthfulQA the PPO models are modestly but genuinely more truthful, with one exception, the 1.3B PPO-ptx, which is slightly worse than a GPT-3 of the same size. On toxicity, when asked for a respectful answer InstructGPT is less toxic than GPT-3; with no such instruction it is about the same; and when explicitly asked to be toxic, it is more toxic than GPT-3, because it follows instructions well. On bias it is not an improvement at all. The model became obedient, and obedience cuts both ways.

The economics are the quiet argument for doing any of this. Aligning the model is cheap next to building it. Training the 175B SFT model took about 4.9 petaflops/s-days, and the 175B PPO-ptx model about 60, against the 3,640 petaflops/s-days that GPT-3 itself cost. Spending under 2% of the pretraining compute on alignment bought more user-rated helpfulness than a 100-times increase in model size would have. That is the sentence that moved budgets.
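The ratio is worth computing rather than taking on faith. The figures below are the three numbers quoted in the paragraph above; the variable names are ours.

```python
# Compute figures quoted above, in petaflops/s-days.
gpt3_pretraining = 3640.0
sft_175b = 4.9
ppo_ptx_175b = 60.0

sft_share = sft_175b / gpt3_pretraining
ppo_ptx_share = ppo_ptx_175b / gpt3_pretraining

print(f"{sft_share:.2%} {ppo_ptx_share:.2%}")  # 0.13% 1.65%
```

SFT alone is a small fraction of a percent of the pretraining cost, and the RL stage is under two percent.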

The word "aligned" does more work here than it should, so be exact about what it means. InstructGPT is aligned to the preferences of about 40 contractors, hired through Upwork and Scale AI, following instructions written by the researchers, on prompts drawn from OpenAI's own API users. Those labelers agreed with each other roughly 73% of the time, so even "what people prefer" is a noisy target. The paper does not claim this is the right group to align to. It claims something narrower and more useful: that the technique can align a model to a specific human reference group, and that doing so is cheap and effective. Who that group should be is left open.

Step back and the shape is simple. You cannot write "be helpful" as a loss, so you learn it. Demonstrate the behavior, collect cheap comparisons, distill them into a reward model, and optimize against that reward while a leash keeps you from gaming it and a pretraining mix keeps you from forgetting how to read. None of the four pieces was invented here. Pointed at what people actually ask, together they made a small model people preferred to a giant, and that is most of why the assistants that followed exist. If you want the sequel that drops the reward model entirely, it is Direct Preference Optimization.

Provenance: verified against primary literature
Ziegler et al. (2019): First to pair a learned reward model with PPO to fine-tune a language model.
Stiennon et al. (2020): The full SFT then reward-model then PPO recipe, on summarization.
Christiano et al. (2017): RLHF's root, a reward from preferences. Atari and robotics, A2C and TRPO, not PPO.
Gao et al. (2022): Reward over-optimization quantified as scaling laws in the KL distance.
Correction: InstructGPT applies an existing recipe; it did not invent RLHF. And the reward over-optimization curve belongs to Stiennon (2020) and Gao (2022), not to InstructGPT, which only cites over-optimization to motivate the KL penalty.

Questions you might still have

Why not just collect more demonstrations instead of all this reinforcement learning?
Demonstrations cap the model at imitating its labelers, and writing a good answer for every prompt is slow and expensive. Judging two answers is faster than writing one, and often more reliable. The reward model turns that cheap judgment into a score for millions of samples the labelers never see.

Is the KL penalty the same thing as PPO clipping?
No, and the paper keeps them separate. PPO clips each update relative to the previous policy, its own trust region. The KL penalty here is added into the reward and measures distance from the frozen SFT model. Two different leashes pointing at two different anchors.

Why would a 1.3B model beat one more than 100 times larger?
Not by being a better language model. Alignment changes what the model tries to do. On the thing labelers actually score, following the intent of the prompt, a small obedient model beats a big one that is only continuing text. The 175B GPT-3 is a better autocomplete, not a better assistant.

Does this make the model truthful and unbiased?
Only partly. Truthfulness improves modestly. Toxicity drops when the model is asked to be respectful and rises when it is asked to be toxic. Bias does not improve. The model is aligned to about 40 labelers and the instructions they were given, not to truth in the abstract.

Footnotes & further reading

  1. The paper: Ouyang, Wu, Jiang, Almeida, Wainwright, et al., Training language models to follow instructions with human feedback (OpenAI, NeurIPS 2022).
  2. The recipe InstructGPT follows: Stiennon et al., Learning to summarize from human feedback (2020), and Ziegler et al., Fine-Tuning Language Models from Human Preferences (2019), the first to pair a learned reward model with PPO for a language model.
  3. The root of RLHF: Christiano et al., Deep Reinforcement Learning from Human Preferences (2017), which learned a reward from preferences for Atari and simulated robotics using A2C and TRPO, not PPO.
  4. The optimizer: Schulman et al., Proximal Policy Optimization Algorithms (2017). We have an explainer of PPO if you want the clip in detail.
  5. Reward over-optimization, quantified: Gao, Schulman, Hilton, Scaling Laws for Reward Model Overoptimization (2022), measured against the square root of the KL distance from the initial policy.
  6. The base model: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020), source of the 175B model and the 3,640 petaflops/s-days figure. See also our GPT-3 explainer.
  7. The pairwise-preference model behind equation (1) is Bradley & Terry (1952); the paper frames it as a cross-entropy loss on the reward gap rather than by name. The reward-model-free alternative is Direct Preference Optimization (Rafailov et al., 2023).