Training language models to follow instructions with human feedback

A 1.3B model, aligned with human feedback, beats one 100× its size.

GPT-3 was trained to predict the next word, not to do what you ask. InstructGPT fixes the objective with three steps and a crowd of human judges, and the result is a smaller model people would rather talk to.

Explaining the paper: Training language models to follow instructions with human feedback · Ouyang, Wu, Jiang, et al. (OpenAI) · NeurIPS 2022 · arXiv:2203.02155

What does it take to turn a model that finishes your sentence into one that answers your question?

Bigger is not more obedient

GPT-3 is trained on one objective: given a stretch of internet text, predict the next token. Do that across a few hundred billion tokens and you get something remarkable at exactly one skill, continuing text in a way that looks plausible. What you do not get is an assistant. Ask the raw model a question and it might answer it, or it might list three more questions, or imitate a forum thread it half-remembers, because all of those are plausible continuations of a question on the internet.

This is the alignment gap: the objective we can train (predict the next token) is a stand-in for the objective we actually want (follow the user's instructions helpfully, honestly, and harmlessly), and the two come apart in the open. Scaling the model up does not close the gap; it only sharpens the text-continuer. A 175B model is better autocomplete, not a better assistant.

InstructGPT is OpenAI's answer to that gap. The method itself is not new. The paper says so plainly: it follows Ziegler et al. (2019) and Stiennon et al. (2020), who fine-tuned language models from human preferences for stylistic continuation and summarization, which in turn build on Christiano et al. (2017), the paper that first learned a reward from human preferences (for game-playing and simulated robots, with no language and no PPO in sight). What InstructGPT does is point that machinery at the messy, open-ended distribution of things people actually type into an API, and show it works there.

The headline result reordered priorities. A 1.3B InstructGPT model, one hundred times smaller than the 175B GPT-3, produces answers that labelers prefer to the larger model's. Same model family, far fewer parameters, and better liked, because it was trained against the objective people care about. The rest of this piece is how that happens.

Three terms will recur, so here is what each one means. Helpful means it does what you asked. Honest means it does not make things up. Harmless means it declines to do damage. These are the axes the labelers were asked to judge, and everything below exists to push the model along them.

The plan: three steps

You cannot write down "be helpful" as a loss function. So InstructGPT learns it in three steps, each one handing something to the next:

Figure 1 · the pipeline

Step 1 fine-tunes GPT-3 to imitate demonstrations. Step 2 trains a reward model on ranked comparisons. Step 3 optimizes the policy on fresh prompts to maximize that reward, held close to the step-1 model. Steps 2 and 3 can repeat on the improved policy.

Step 1, supervised fine-tuning (SFT). Labelers write good answers to a batch of prompts, and GPT-3 is fine-tuned to copy them. This is ordinary supervised learning: show the model the desired behavior and nudge its weights toward producing it. The paper used about 13,000 demonstration prompts. Oddly, the SFT model overfits its validation loss after a single epoch, yet training it longer (16 epochs) still improves the downstream reward and human ratings, so they keep going past the point where the validation loss would tell them to stop.

SFT gets you a model that follows instructions reasonably well. The other two steps also start from it. The reward model in step 2 is built on top of the SFT model, and in step 3 both the policy being trained and the reference it is held against are copies of the SFT model.

So why not stop here and collect more demonstrations? Because demonstrations have a ceiling. They teach the model to imitate what a labeler would write, and no more. To get past that ceiling you need a way to tell the model that one answer is better than another, including answers no labeler wrote. That is steps 2 and 3, and the reason they exist starts with a fact about human effort.

Comparing beats writing

Judging is cheaper than producing. It is faster to taste two dishes and say which is better than to cook a great one. It is faster to read two answers and pick the stronger than to write the strong one yourself. And the judgment is often more reliable, because you can recognize quality you could not generate.

InstructGPT leans on this hard. Instead of asking labelers to write answers, it shows them several answers the model already produced and asks them to rank them. A single ranking is generous with data. Rank $K$ responses and you can read off the winner of every pair at once, which is $\binom{K}{2}$ comparisons from one act of judgment. The paper shows labelers between four and nine responses per prompt, and the harvest grows quickly with $K$:

Figure 2 · one ranking, many pairs

Ranking $K$ responses to a prompt yields $\binom{K}{2}$ pairwise comparisons, from 6 at $K=4$ to 36 at $K=9$. Each chord is one training pair for the reward model, and all the pairs from a prompt are fed as a single batch element.

The obvious thing to do with all these pairs is to throw them into one pile and shuffle. The paper tried that and the reward model overfit after a single pass, because the $\binom{K}{2}$ pairs from one prompt are highly correlated (they reuse the same handful of responses), so shuffling lets the model see each response in up to $K-1$ separate gradient updates. Treating all of a prompt's comparisons as one batch element fixes it: one forward pass per response, every pair scored from those same scores. It stops overfitting and it is less expensive. With about 33,000 ranked prompts, this is the data the reward model learns from.
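The pair counts above are easy to check. The sketch below is illustrative only: `ranking_to_pairs` and the `resp_i` labels are hypothetical names, not anything from the paper. It expands one best-to-worst ranking into its (winner, loser) pairs and counts how often a single response recurs, which is the correlation that makes shuffled pairs overfit.

```python
from itertools import combinations
from math import comb

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (winner, loser) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one ranked higher.
    return list(combinations(ranked, 2))

for k in (4, 9):
    ranked = [f"resp_{i}" for i in range(k)]   # resp_0 is the labeler's favorite
    pairs = ranking_to_pairs(ranked)
    # how many pairs reuse the same single response
    appearances = sum(1 for w, l in pairs if "resp_0" in (w, l))
    print(k, len(pairs), comb(k, 2), appearances)
# prints:
# 4 6 6 3
# 9 36 36 8
```

Every response appears in $K-1$ pairs, so scattering the pairs across shuffled batches would push the same response through up to $K-1$ separate updates.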

A model of human taste

A pile of "A beats B" judgments is not yet something you can optimize against. You cannot run reinforcement learning on a lookup table of past comparisons; the model will produce new answers that are in nobody's table. So InstructGPT trains a second network, the reward model, to predict the judgments. Feed it a prompt and a response and it returns a single number, a scalar reward $r_\theta$, that stands in for "how much would a labeler like this."

How do you turn pairwise preferences into a scalar? With a model that is older than deep learning. Give each response a score, and say the probability a human prefers the winner is the logistic of the score gap. This is the Bradley-Terry model (1952); the paper does not call it that, but that is what it is, framed instead as a cross-entropy loss in which the reward difference is the log-odds of preference:

P(y_w \succ y_l) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) = \frac{e^{\,r_\theta(x, y_w)}}{e^{\,r_\theta(x, y_w)} + e^{\,r_\theta(x, y_l)}}

Only the gap $r_w - r_l$ matters. For a concrete case: if the model scores response A at $5$ and response B at $3$, the gap is $2$, and the model says a labeler prefers A with probability $\sigma(5 - 3) = \sigma(2) \approx 0.88$. The score gap is the log-odds of the preference: a gap of $2$ means A is favored about $0.88 / 0.12 \approx 7$ to one. That is the standard way to model a noisy judgment: one party is better by some amount, and the verdict goes their way more often the larger that amount, never with certainty. When the two scores are equal the preference is a coin flip at fifty percent; as the winner pulls ahead, the probability rises along an S-curve toward one:

Figure 3 · the reward gap becomes a preference

The chance a labeler prefers the winner is $\sigma(r_w - r_l)$, the logistic of the reward gap. A gap of zero is a coin flip; a positive gap bends the vote split toward the winner. Training the reward model means choosing scores so this matches the labelers.
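The worked numbers for the 5-versus-3 example can be reproduced in a few lines. This is a minimal sketch of the logistic preference model; the function name `preference_probability` is made up for illustration.

```python
import math

def preference_probability(r_w, r_l):
    """P(labeler prefers the first response) = sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

p = preference_probability(5.0, 3.0)
odds = p / (1.0 - p)                      # equals exp(gap), since the gap is the log-odds
print(round(p, 2), round(odds, 2))        # 0.88 7.39
print(preference_probability(1.0, 1.0))   # 0.5, equal scores are a coin flip
```

Swapping the two scores gives the complementary probability, so the model never assigns more than 100% in total to the two outcomes.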

To train the reward model you minimize the negative log-likelihood of the labelers' choices under this model. That is equation (1), with the $1/\binom{K}{2}$ averaging each prompt's pairs:

\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]

(1)

It says one thing: make the preferred response score higher than the dispreferred one, and pay a penalty, $-\log\sigma(r_w - r_l)$, that shrinks toward zero as the gap grows the right way and blows up when the model ranks a pair backwards. In code it is a loop over a prompt's pairs, scored from one forward pass per response:

# reward model: one batch element = all C(K,2) pairs from ONE prompt
resps  = rank_responses(prompt)       # K=4..9, labeler-ranked best->worst
scores = reward_model(prompt, resps)  # one forward pass per response -> one scalar each
pairs  = winner_loser_pairs(resps)    # every (preferred, dispreferred) index pair
loss = 0
for w, l in pairs:
    loss += -log_sigmoid(scores[w] - scores[l])
loss = loss / len(pairs)              # the 1 / C(K,2) average
loss.backward()                       # one backward pass for the whole prompt

Two practical notes. First, the loss depends only on the gap, so adding a constant to every reward changes nothing. The scores are determined only up to a shift. Before RL begins they pin the offset down by adding a bias so the labeler demonstrations average a reward of zero, giving the numbers a fixed reference point.
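Equation (1) and the shift invariance can both be checked numerically. The sketch below uses plain Python with hand-picked scores in place of a real reward model; `ranking_loss` and `center_on` are hypothetical helper names, and the score values are invented for the demonstration.

```python
import math
from itertools import combinations

def log_sigmoid(z):
    # numerically stable log(sigmoid(z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def ranking_loss(scores):
    """Equation (1) for one prompt; scores are ordered best to worst."""
    pairs = list(combinations(range(len(scores)), 2))
    total = sum(-log_sigmoid(scores[w] - scores[l]) for w, l in pairs)
    return total / len(pairs)              # the 1 / C(K,2) average

def center_on(scores, reference_scores):
    """Add one bias so the reference scores average zero."""
    bias = -sum(reference_scores) / len(reference_scores)
    return [s + bias for s in scores]

good = [2.0, 1.0, 0.0, -1.0]               # scores agree with the ranking
bad = list(reversed(good))                 # every pair ranked backwards
shifted = [s + 100.0 for s in good]        # same gaps, different offset

print(ranking_loss(good) < ranking_loss(bad))                   # True
print(math.isclose(ranking_loss(good), ranking_loss(shifted)))  # True
print(sum(center_on(good, good)))                               # 0.0
```

Because the loss is blind to the offset, the bias can be chosen freely after training without changing which response wins any comparison.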

Second, the size. You might expect the reward model to be the 175B model. It is not. They only use 6B reward models, and one single 6B model is reused across every policy size. Two reasons drove the choice. Training a 175B reward model was unstable, which made it a poor initialization for the value function PPO needs (PPO trains a second network, the value function, to estimate how much reward a state is worth; here it is naturally warm-started from the reward model, which already scores responses), and a 175B reward model plus value function would have ballooned the compute of step 3. The 6B model was stable and led to equally strong policies.

Chase the reward

With a reward model in hand, step 3 is reinforcement learning. The setup is a one-shot bandit (a single decision with no follow-on states, unlike multi-step RL): a prompt arrives, the model (now the policy $\pi^{\mathrm{RL}}$) writes a full response, the reward model scores it, and the episode ends. The policy's job is to produce responses the reward model scores highly. The algorithm that does the optimizing is PPO (Schulman et al., 2017), which has its own explainer. The value function PPO needs is initialized from the reward model, which already encodes the shape of "good."

If reward were all that mattered, you would simply maximize $r_\theta$. It is not, and a second term guards against chasing it too far. The objective InstructGPT actually maximizes is equation (2):

\operatorname{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi^{\mathrm{RL}}_\phi}}\!\Big[\, r_\theta(x, y) - \beta \log \frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \,\Big] + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\big[\log \pi^{\mathrm{RL}}_\phi(x)\big]

(2)

Three pieces. The first, $r_\theta(x,y)$, is the reward we want to climb. The third, the $\gamma$ term, is a separate fix, addressed below where the alignment tax is covered; for plain "PPO" models $\gamma$ is set to zero. The middle term is the penalty that holds the policy near where it started:

-\,\beta \log \frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}

That log-ratio is, in expectation over tokens the policy samples, the KL divergence from the frozen SFT model. (Strictly: a single token's log-ratio is not a KL value and can even be negative; only its average over sampled tokens is an unbiased estimate of the KL. The paper adds it as a per-token penalty.) The KL measures how far the policy has moved: it is zero when the policy assigns text the same probabilities the SFT model does, and it grows as the policy shifts probability mass onto text the SFT model considered unlikely. The penalty does not pin any particular output; it charges for total redistribution. Subtracting it with coefficient $\beta$ means: chase the reward, but pay for every step you take away from where you started. The figure below lays that trade out as curves and shows where the peak of the combined objective settles as $\beta$ changes. Why you would ever want to hold the model back is the next section.
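The claim that the sampled log-ratio averages out to the KL can be tested on a toy case. The two three-token distributions below are invented stand-ins for $\pi^{\mathrm{RL}}$ and $\pi^{\mathrm{SFT}}$, not anything measured; the sketch compares the exact KL with a Monte Carlo average of log-ratios over tokens drawn from the policy.

```python
import math
import random

policy = {"a": 0.6, "b": 0.3, "c": 0.1}   # toy stand-in for pi_RL
sft    = {"a": 0.3, "b": 0.3, "c": 0.4}   # toy stand-in for pi_SFT

def log_ratio(token):
    return math.log(policy[token] / sft[token])

# exact KL(policy || sft): expectation of the log-ratio under the policy
exact_kl = sum(p * log_ratio(t) for t, p in policy.items())

# Monte Carlo estimate: sample tokens from the policy, average their log-ratios
rng = random.Random(0)
tokens = rng.choices(list(policy), weights=list(policy.values()), k=100_000)
estimate = sum(log_ratio(t) for t in tokens) / len(tokens)

print(log_ratio("c") < 0)        # True: one sample's log-ratio can be negative
print(exact_kl, estimate)        # the two agree closely
```

Token "c" is one the policy has moved away from, so its individual log-ratio is negative even though the KL itself is always non-negative.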

Figure 4 · where the optimum settles

The two terms of equation (2), over a schematic 1D policy space (distance from SFT on the horizontal axis). The reward proxy rises and flattens; the KL penalty $-\beta\,\mathrm{KL}$ bows downward from the SFT anchor and accelerates. Their sum, the bright combined objective, peaks at finite distance: the dot. A bigger $\beta$ deepens the penalty and pulls the optimum back toward SFT; a smaller one lets it settle far out. The paper charges the log-ratio per token, so the penalty accrues across the full generation. Curve shapes are schematic, not measurements; the shape is the claim.

The reward PPO receives is assembled per token: every token is charged the KL penalty, and the reward model's score is added at the last one:

# step 3: build the reward PPO actually optimizes, per token
y      = policy.sample(x)             # policy = pi_RL, initialized from SFT
r_end  = reward_model(x, y)           # one scalar for the whole response
# per-token KL penalty: stay close to the frozen SFT reference
kl     = beta * (logp(policy, y, x) - logp(sft, y, x))   # one value per token of y
reward = -kl                          # every token pays its penalty
reward[-1] += r_end                   # the score is credited at the last token
# PPO-ptx only: also raise log-prob on a batch of pretraining text
loss_ptx = -gamma * logp(policy, pretrain_batch)
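A runnable toy version makes the bookkeeping concrete. This is a sketch under stated assumptions: `shaped_rewards` is a hypothetical helper, the per-token log-probabilities are made-up numbers, and `beta=0.1` is chosen for readable output, not taken from the paper.

```python
import math

def shaped_rewards(r_end, logp_policy, logp_sft, beta):
    """Per-token rewards: -beta * log-ratio at every token, plus r_end at the last."""
    rewards = [-beta * (lp - ls) for lp, ls in zip(logp_policy, logp_sft)]
    rewards[-1] += r_end
    return rewards

# hypothetical per-token log-probs of one 3-token response under each model
logp_policy = [-0.5, -1.0, -0.2]
logp_sft    = [-0.7, -1.0, -1.2]

rewards = shaped_rewards(r_end=2.0, logp_policy=logp_policy,
                         logp_sft=logp_sft, beta=0.1)
total_log_ratio = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))

# summed over the response, the shaped reward is r_end - beta * (sequence log-ratio)
print(math.isclose(sum(rewards), 2.0 - 0.1 * total_log_ratio))   # True
```

The sum over tokens recovers the bracketed term of equation (2) for this one sample, which is why the per-token form and the sequence-level form describe the same objective.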

There are two distinct constraints on the policy in this story, and they do not measure the same thing. PPO has its own mechanism, a clipped surrogate that limits how far each update moves the policy from the previous step's policy. The KL penalty above is different: it measures distance from the fixed SFT model and is folded into the reward, not the optimizer. PPO's clip keeps each step small; the KL penalty keeps the total distance from SFT small.

Both are needed because they guard against different failures. PPO's clip is a per-update trust region: it stops any single gradient step from lurching the policy somewhere the reward estimate cannot be trusted, which keeps the optimization stable. But many small, individually-clipped steps still compound, and the clip constrains only the size of each step, not where the sequence of steps ends up. A run that takes a thousand tiny, well-behaved steps can still walk the policy a long way from the SFT model it began as. The KL penalty bounds that cumulative distance, which the clip does not: it charges for total drift from SFT no matter how gently you accumulated it, so the policy can chase reward but pays a growing penalty the further it strays from where it started.
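For reference, the clip itself is a one-line rule. The sketch below is the per-sample clipped surrogate from Schulman et al. (2017), written for a single scalar; `eps=0.2` is an assumed setting used for illustration, and `ratio` is the new policy's probability divided by the previous step's policy, not by the SFT model.

```python
import math

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's per-sample objective: the gain stops growing outside [1-eps, 1+eps]."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# good action (advantage > 0): raising its probability pays off only up to 1 + eps
print(clipped_surrogate(1.1, 1.0))    # 1.1, inside the clip range
print(clipped_surrogate(1.5, 1.0))    # 1.2, capped
# bad action (advantage < 0): lowering its probability pays off only down to 1 - eps
print(math.isclose(clipped_surrogate(0.5, -1.0), -0.8))   # True
```

Nothing in this rule mentions the SFT model: the ratio is re-anchored at every update, so the cap on one step says nothing about the sum of many.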

Why you cannot just maximize it

The reward model is not human judgment. It is a 6B-parameter guess at human judgment, trained on a finite cloud of (prompt, response) pairs, and its judgment is trustworthy only near that cloud. Push the policy far enough outside it and the score becomes extrapolation: optimize hard enough and the policy will find responses that score high not because they are good, but because they exploit the reward model's blind spots, outputs the reward model rates highly for no reason a human would endorse. This is Goodhart's law: a measure that becomes a target stops being a good measure.

The picture below plots the reward and quality curves. As the policy moves away from SFT (measured as KL distance, on the horizontal axis), the proxy reward the model reports keeps climbing. But the true quality a human would assign rises, peaks, and then falls. Past the peak you are no longer improving the model; you are teaching it to game the proxy. The coefficient $\beta$ sets how far the policy may move from SFT, and so where the optimum lands:

Figure 5 · the KL penalty and over-optimization

The proxy reward rises with distance from SFT; true quality peaks then falls (the shaded over-optimization zone). The objective $r - \beta\,\mathrm{KL}$ is maximized at a finite distance. A bigger $\beta$ holds the policy closer to SFT; too small and the policy runs past the peak. The turn-over curve is the documented phenomenon (Stiennon 2020; Gao 2022), shown schematically, not a plot from InstructGPT.

This is exactly what $\beta$ buys. The KL penalty ignores reward entirely; it depends only on distance from the frozen SFT model. Adding it bends the objective so its maximum sits at a finite KL, short of the cliff where true quality collapses. Set $\beta$ too high and the policy is held so close to SFT that it barely improves. Set it too low and the policy sprints past the peak into the over-optimized region, scoring beautifully on the reward model and badly with people. InstructGPT cites over-optimization as the reason the term is there at all.
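How $\beta$ moves the optimum can be seen on a schematic curve. Everything here is an assumption made for illustration: the proxy reward is modeled as $1 - e^{-d}$ (rising, then flattening) over a KL distance $d$, and the penalty as $\beta d$. Neither shape is fitted to InstructGPT or to any measurement.

```python
import math

def best_distance(beta, max_kl=10.0, steps=10_000):
    """Grid-search the KL distance that maximizes (schematic reward) - beta * KL."""
    grid = [max_kl * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda d: (1.0 - math.exp(-d)) - beta * d)

for beta in (0.05, 0.2, 0.8):
    # for this toy curve the optimum sits at ln(1 / beta)
    print(beta, round(best_distance(beta), 2), round(math.log(1.0 / beta), 2))
```

For this particular toy curve the optimum is $\ln(1/\beta)$, so each increase in $\beta$ pulls the resting point back toward the SFT model, and a $\beta$ near zero lets it run far out along the flat part of the proxy.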

The turn-over of true quality in the figure is a real, documented phenomenon, but it is not InstructGPT's own result. The canonical version is in Stiennon et al. (2020), and it was later quantified as clean scaling laws by Gao, Schulman, and Hilton (2022), who measured it against the square root of the KL distance. The curve above is schematic, drawn to show the shape, not to report numbers.

The alignment tax

Steps 1 to 3 produce a far more helpful model. They also cost something. When you push a model hard toward one objective, it can get worse at the things that objective does not mention. The paper measured this directly: a plain PPO model regresses against GPT-3 on a spread of public NLP benchmarks, including SQuADv2, DROP, HellaSwag, and WMT 2015 French-to-English translation. The authors call this the alignment tax, an extra cost you pay for aligning the model. If the tax is high, people will keep using the unaligned-but-capable model, which is the opposite of what you want.

The $\gamma$ term from equation (2), the one we set aside, pays the tax back. The regressions say the policy is losing general language competence it had absorbed in pretraining, so the remedy is to keep rehearsing it: during RL, mix in batches of the original pretraining data and reward the model for keeping its log-probability on them high. This pulls the policy back toward being a competent language model while the reward term pulls it toward being helpful. Models trained this way are called PPO-ptx ("ptx" for pretraining mix), and unless the paper says otherwise, "InstructGPT" means PPO-ptx. The two variants compare like this:

Figure 6 · paying the tax back

Both PPO and PPO-ptx are far more helpful than GPT-3. But on public NLP benchmarks plain PPO drops below the GPT-3 baseline (the tax), while PPO-ptx recovers to it. Bar heights are schematic; the directions follow the paper's Figures 1 and 29.

The improvement is real but narrower than a slogan, in two respects. Adding the pretraining mix mitigates the regressions on all of these datasets, and on HellaSwag it even surpasses GPT-3, but on DROP, SQuADv2, and translation it recovers toward the baseline rather than beating it. And the paper checked the obvious alternative, fighting the tax by cranking up the KL coefficient $\beta$ to keep the policy closer to the SFT model: mixing in pretraining works better, since raising $\beta$ costs reward and never fully recovers DROP and SQuAD. The two terms in equation (2) do different jobs: $\beta$ fights over-optimization, $\gamma$ pays the alignment tax.

What the labelers preferred

On the metric the method targets, it works convincingly. Labelers compared outputs head to head. The 175B InstructGPT is preferred to the 175B GPT-3 about 85% of the time, and preferred 71% of the time even to a GPT-3 that has been hand-prompted into instruction-following mode. The widely cited result is the small-model comparison: the 1.3B InstructGPT, a hundred times smaller, is still preferred to the 175B GPT-3.

Figure 7 · who wins the head-to-head

Labeler win-rate, drawn as opposing bars. The 175B InstructGPT beats the 175B GPT-3 85 ± 3% of the time and a few-shot GPT-3 71 ± 4%. The result that traveled: the 1.3B InstructGPT still beats the 175B GPT-3, a 100× smaller model. (The exact rate for that last matchup is not reported, so it is shown as above 50%.)

The helpfulness shows up on concrete axes too. Compared to GPT-3, InstructGPT more often attempts the right instruction, follows explicit constraints ("answer in two paragraphs"), and makes up fewer facts on closed-domain tasks. It even generalizes a little past its training: it can follow instructions in non-English languages and answer questions about code, despite those being a tiny slice of the fine-tuning data, which is a hint that alignment can transfer to inputs no labeler directly supervised.

On honesty and harmlessness the story is mixed, and the paper is candid about it. On TruthfulQA the PPO models are modestly but genuinely more truthful, with one exception, the 1.3B PPO-ptx, which is slightly worse than a GPT-3 of the same size. On toxicity, when asked for a respectful answer InstructGPT is less toxic than GPT-3; with no such instruction it is about the same; and when explicitly asked to be toxic, it is more toxic than GPT-3, because it follows instructions well. On bias it is not an improvement at all.

The economics also favor doing this. Aligning the model is cheap next to building it. Training the 175B SFT model took about 4.9 petaflops/s-days, and the 175B PPO-ptx model about 60, against the 3,640 petaflops/s-days that GPT-3 itself cost. Spending under 2% more compute on alignment bought more user-rated helpfulness than a 100-times increase in model size would have.
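The ratios behind that comparison are quick to verify from the three figures quoted above. The constant and function names below are made up for the sketch.

```python
# compute figures quoted above, in petaflops/s-days
GPT3_PRETRAIN = 3640.0
SFT_175B = 4.9
PPO_PTX_175B = 60.0

def share_of_pretraining(cost):
    """Cost as a percentage of GPT-3's pretraining compute."""
    return 100.0 * cost / GPT3_PRETRAIN

print(round(share_of_pretraining(SFT_175B), 2))                  # 0.13
print(round(share_of_pretraining(PPO_PTX_175B), 2))              # 1.65
print(round(share_of_pretraining(SFT_175B + PPO_PTX_175B), 2))   # 1.78
```

SFT alone is about a tenth of a percent of the pretraining budget; the RL stage is the larger share, and even together the two stay under 2%.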

The word "aligned" does more work here than it should, and the precise version is narrower. InstructGPT is aligned to the preferences of about 40 contractors, hired through Upwork and Scale AI, following instructions written by the researchers, on prompts drawn from OpenAI's own API users. Those labelers agreed with each other roughly 73% of the time, so even "what people prefer" is a noisy target. The paper does not claim this is the right group to align to. It claims something narrower and more useful: that the technique can align a model to a specific human reference group, and that doing so is inexpensive and effective. Who that group should be is left open.

The method works around one impossibility. You cannot write "be helpful" as a loss, so you learn it. Demonstrate the behavior, collect low-cost comparisons, distill them into a reward model, and optimize against that reward while a KL penalty keeps you from gaming it and a pretraining mix keeps you from forgetting how to read. None of the four pieces was invented here. Pointed at what people actually ask, together they made a small model people preferred to a giant, and that is most of why the assistants that followed exist. If you want the sequel that drops the reward model entirely, it is Direct Preference Optimization.

Provenance · verified against primary literature

Ziegler et al. (2019) · First to pair a learned reward model with PPO to fine-tune a language model.

Stiennon et al. (2020) · The full SFT then reward-model then PPO recipe, on summarization.

Christiano et al. (2017) · RLHF's root: a reward from preferences. Atari and robotics, A2C and TRPO, not PPO.

Gao et al. (2022) · Reward over-optimization quantified as scaling laws in the KL distance.

Correction · InstructGPT applies an existing recipe for RLHF (reinforcement learning from human feedback); it did not invent it. And the reward over-optimization curve belongs to Stiennon (2020) and Gao (2022), not to InstructGPT, which only cites over-optimization to motivate the KL penalty.

Questions you might still have

Why not just collect more demonstrations instead of all this reinforcement learning?
Demonstrations cap the model at imitating its labelers, and writing a good answer for every prompt is slow and expensive. Judging two answers is faster than writing one, and often more reliable. The reward model turns that quick judgment into a score for millions of samples the labelers never see.

Is the KL penalty the same thing as PPO clipping?
No, and the paper keeps them separate. PPO clips each update relative to the previous policy, its own trust region. The KL penalty here is added into the reward and measures distance from the frozen SFT model. Two different constraints measured against two different reference points.

Why would a 1.3B model beat one more than 100 times larger?
Not by being a better language model. Alignment changes what the model tries to do. On the thing labelers actually score, following the intent of the prompt, a small obedient model beats a big one that is only continuing text. The 175B GPT-3 is a better autocomplete, not a better assistant.

Does this make the model truthful and unbiased?
Only partly. Truthfulness improves modestly. Toxicity drops when the model is asked to be respectful and rises when it is asked to be toxic. Bias does not improve. The model is aligned to about 40 labelers and the instructions they were given, not to truth in the abstract.

Footnotes & further reading

The paper: Ouyang, Wu, Jiang, Almeida, Wainwright, et al., Training language models to follow instructions with human feedback (OpenAI, NeurIPS 2022).
The recipe InstructGPT follows: Stiennon et al., Learning to summarize from human feedback (2020), and Ziegler et al., Fine-Tuning Language Models from Human Preferences (2019), the first to pair a learned reward model with PPO for a language model.
The root of RLHF: Christiano et al., Deep Reinforcement Learning from Human Preferences (2017), which learned a reward from preferences for Atari and simulated robotics using A2C and TRPO, not PPO.
The optimizer: Schulman et al., Proximal Policy Optimization Algorithms (2017). We have an explainer of PPO if you want the clip in detail.
Reward over-optimization, quantified: Gao, Schulman, Hilton, Scaling Laws for Reward Model Overoptimization (2022), measured against the square root of the KL distance from the initial policy.
The base model: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020), source of the 175B model and the 3,640 petaflops/s-days figure. See also our GPT-3 explainer.
The pairwise-preference model behind equation (1) is Bradley & Terry (1952); the paper frames it as a cross-entropy loss on the reward gap rather than by name. The reward-model-free alternative is Direct Preference Optimization (Rafailov et al., 2023).
