Verified · arXiv:1706.03741 · 26 min
Reinforcement learning · Alignment

Deep Reinforcement Learning from Human Preferences

An agent learns the goal from people comparing short clips of its behavior, with no reward function to write.

Writing a reward by hand is brittle and easy to game. Instead a person just says which of two short clips looks better, and a few thousand of those judgments teach the agent tasks a programmer could never script.

Explaining the paper: Deep Reinforcement Learning from Human Preferences · Christiano, Leike, Brown, Martic, Legg, Amodei · OpenAI · DeepMind · NeurIPS 2017 · arXiv:1706.03741

Show a person two one-second clips of a robot flailing and ask only "which is closer to a backflip?" Do that under a thousand times, and the robot learns to do backflips, with nobody ever writing down what a backflip is worth.

Reinforcement learning needs a reward: a number, handed to the agent every step, that says how well it is doing. For games and simulated chores that number is usually easy to get, the score on the screen or the distance walked, and a decade of progress rode on that convenience. But most things people actually want are not like that. Try to write a reward for "clean the table" or "drive politely" and you end up with a function of the robot's raw sensors that you can only guess at, and the agent will happily satisfy your guess while violating what you meant. That gap between the reward you wrote and the behavior you wanted is the practical face of AI alignment.

This 2017 paper from OpenAI and DeepMind takes the obvious escape hatch and makes it work at scale: don't write the reward, learn it, by watching a human react to the agent's behavior. It is the paper that seeded RLHF, reinforcement learning from human feedback, the recipe behind InstructGPT and the chat models that followed. Two things had to be true for it to be practical. The human can only be asked a tiny number of questions, because a person is far more expensive than a simulator. And the questions have to be ones a non-expert can answer reliably. Both pushed the authors to the same interface: instead of scoring behavior, the human just compares it. Comparing does not, on its own, stop the agent from gaming the reward it learns; that is a separate problem, handled by keeping the human in the loop while the agent trains, and a later section comes back to it. What comparing buys is the goal in the first place: a way for a non-expert to hand over a target they could never have scored or coded.

Across Atari games and simulated robots, the agent reaches roughly the performance of ordinary reinforcement learning while never seeing the real reward, on feedback covering under 1% of its experience. And it learns genuinely new tricks, a Hopper doing repeated backflips, from about an hour of a person's time. A few ideas explain how: comparing clips instead of scoring them, turning those comparisons into a reward with the Bradley-Terry model, keeping the human in the loop while the agent trains, and optimizing the learned reward with a policy-gradient method robust to its constant drift.

Some goals have no reward to write down

Start with what reinforcement learning assumes. An agent and an environment take turns: at each step the agent sees an observation $o_t$ and picks an action $a_t$, and the environment hands back a reward $r_t$. The agent's job is to maximize the discounted sum of those rewards over time. Everything the agent learns about what you want flows through that one scalar.
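To make "the discounted sum of those rewards" concrete, here is a minimal sketch. The function name and the three-step episode are made up for illustration; the discount value 0.99 is the one the paper's Atari setup uses for policy optimization.

```python
def discounted_return(rewards, gamma=0.99):
    """Return the sum over t of gamma**t * rewards[t], accumulated back to front."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step episode: nothing, nothing, then a reward of 1.
episode = [0.0, 0.0, 1.0]
print(discounted_return(episode, gamma=0.5))    # 0.25: the reward arrives two steps late
print(discounted_return(episode, gamma=0.99))   # close to 1: a mild discount barely matters here
```

Whatever produces the per-step numbers, hand-written or learned, this sum is the only thing the agent is trying to make large.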

When the reward is a true description of the goal, this works beautifully. The trouble is that for most interesting goals the reward is not given, and a hand-written stand-in tends to be a leaky description of what you meant. The paper's own framing of the desired solution is precise: a way to solve tasks where you can only recognize the behavior you want, not necessarily demonstrate it or write it down. A backflip is the clean example. You know one when you see one. You probably cannot demonstrate one yourself for a many-jointed simulated robot, which rules out imitation learning, and you certainly cannot write the reward function whose maximum is a clean backflip-and-land.

So the goal lives in your head, and the only reliable channel to it is your reaction to what the agent does. The whole design problem becomes: how do you read a goal out of a human cheaply enough, and reliably enough, that a deep RL agent needing thousands of hours of practice can still afford to ask?

Compare clips, don't score them

The first decision sets everything else. The human is never asked to put a number on behavior. They are shown two short video clips of the agent, one to two seconds each, and asked only which one is better. They can also answer that the two are equally good, or that they cannot tell.

Asking for a comparison instead of a score is the choice that makes non-expert feedback usable, and it is worth being concrete about why. People are bad at absolute judgments and good at relative ones. Ask someone to rate a clip of a half-trained walker out of ten and you get noise that drifts between sessions and between labelers. Ask them which of two walkers looks better and the answer is stable, because a comparison only has to get the sign of the difference right, not a calibrated magnitude. The optometrist's "better, or worse? one, or two?" works for exactly this reason: you cannot name your prescription in diopters, but you can always say which lens is sharper. The paper's ablation confirms the intuition on the continuous-control tasks, where predicting comparisons clearly beat predicting numeric scores. The true reward can swing by orders of magnitude across states, so a network forced to regress those absolute scores chases a target dominated by a few huge values and fits the rest poorly, while a comparison needs only the gap's sign, which does not care about scale. (On Atari the rewards were clipped to their sign already, which removes the scale problem, and there the two were a wash.)

The unit being compared is a trajectory segment, or clip: a short window of the agent's experience, written $\sigma = \big((o_0,a_0),\dots,(o_{k-1},a_{k-1})\big)$, a run of $k$ observation-action pairs. Write $\sigma^1 \succ \sigma^2$ to mean the human preferred clip 1. Why a clip and not a single frame? Because a still frame usually cannot tell you what is happening. One photo of a Pong paddle does not reveal whether it is about to score or whiff; a brief clip shows the motion and the outcome. The paper found longer clips more informative per clip but less informative per frame, and that evaluation time grows with length, so they used the shortest clip a person could still judge well, around 25 Atari frames (1.7 seconds) or 1.5 seconds of robot motion.

Each answer is stored as a triple $(\sigma^1, \sigma^2, \mu)$ in a database $\mathcal{D}$, where $\mu$ records the verdict as a distribution over the two clips. A clear preference puts all of $\mu$'s mass on the winner. "Equally good" sets $\mu$ to a 50/50 split, a soft label that will gently pull the two clips' scores together. "Cannot compare" is different: that pair is dropped from $\mathcal{D}$ and never trains anything. With the interface settled, the question becomes how a pile of these little votes turns into a reward the agent can chase.
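The bookkeeping for those verdicts is small enough to sketch. This is an illustration under assumptions, not the authors' code: the `Comparison` class, the `record` helper, and the verdict strings are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Step = Tuple[object, object]            # one (observation, action) pair


@dataclass
class Comparison:
    seg1: List[Step]
    seg2: List[Step]
    mu: Tuple[float, float]             # probability mass on (clip 1, clip 2)


# Hypothetical verdict names mapped to the soft label described in the text.
VERDICT_TO_MU = {
    "first": (1.0, 0.0),
    "second": (0.0, 1.0),
    "equal": (0.5, 0.5),
}


def record(database: List[Comparison], seg1, seg2, verdict: str) -> Optional[Comparison]:
    """Append a labeled pair to the database; drop pairs the human could not compare."""
    if verdict == "incomparable":
        return None
    item = Comparison(seg1, seg2, VERDICT_TO_MU[verdict])
    database.append(item)
    return item


D: List[Comparison] = []
clip_a = [("obs0", "left"), ("obs1", "left")]
clip_b = [("obs0", "right"), ("obs1", "noop")]
record(D, clip_a, clip_b, "first")
record(D, clip_a, clip_b, "equal")
record(D, clip_a, clip_b, "incomparable")   # never stored
print(len(D), [c.mu for c in D])            # 2 [(1.0, 0.0), (0.5, 0.5)]
```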

From a vote to a reward

The reward model $\hat r$ is an ordinary network that scores a single step, $\hat r(o,a)$. To score a whole clip, the paper just adds up the per-step scores along it, $s_i = \sum_t \hat r(o^i_t, a^i_t)$. Two clips give two totals, $s_1$ and $s_2$. Now you need a rule that turns those totals into a prediction of which one a human will pick, and the rule has a long pedigree.

It is the Bradley-Terry model (1952), the standard way to turn pairwise wins into scores. It says the probability clip 1 is preferred is its total's share of the exponentiated totals:

$$\hat P\big[\sigma^1 \succ \sigma^2\big] = \frac{\exp \sum_t \hat r(o^1_t,a^1_t)}{\exp \sum_t \hat r(o^1_t,a^1_t) + \exp \sum_t \hat r(o^2_t,a^2_t)} \tag{1}$$

Divide top and bottom by $\exp(s_1)$ and the numerator becomes 1 and the denominator $1 + e^{-(s_1 - s_2)}$, which is exactly the logistic sigmoid of the difference of the two totals:

$$\hat P\big[\sigma^1 \succ \sigma^2\big] = \sigma(s_1 - s_2) = \frac{1}{1 + e^{-(s_1 - s_2)}}, \qquad s_i = \sum_t \hat r(o^i_t, a^i_t)$$

This is the same shape as a chess Elo rating, where your chance of winning depends only on your rating minus your opponent's, run through an S-curve. Each clip's total $s_i$ plays the part of an Elo rating, and the human's vote is the game result. The figure below is the model with nothing hidden: drag the gap $s_1 - s_2$ and watch the predicted preference slide along the curve. A large gap is a near-certain vote; a gap of zero is a coin flip.

Figure 1 · the preference model
The chance a human prefers clip 1 is a sigmoid of the gap between the two clips' summed rewards. Drag the gap. The bright curve is the version the loss actually uses: $0.9\,\sigma + 0.05$, flattened at the amber floors near 0.05 and 0.95 because the paper assumes a 10% chance the human answers at random.
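The step from Equation (1) to the sigmoid can be checked numerically. A minimal sketch, with made-up per-step rewards and hypothetical function names:

```python
import math


def clip_total(step_rewards):
    """Undiscounted sum of per-step predicted rewards along one clip."""
    return sum(step_rewards)


def bt_share(s1, s2):
    """Equation (1): clip 1's share of the exponentiated totals."""
    m = max(s1, s2)                      # subtract the max so exp() cannot overflow
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return e1 / (e1 + e2)


def bt_sigmoid(s1, s2):
    """The same probability written as a logistic of the gap s1 - s2."""
    return 1.0 / (1.0 + math.exp(-(s1 - s2)))


# Made-up per-step rewards for two three-step clips.
s1 = clip_total([0.5, 0.4, 0.3])
s2 = clip_total([0.1, 0.0, -0.1])
print(bt_share(s1, s2), bt_sigmoid(s1, s2))   # the two forms agree
print(bt_sigmoid(2.0, 2.0))                   # 0.5: a gap of zero is a coin flip
```

Only the gap matters: shifting both totals by the same amount leaves both functions unchanged.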

Notice that only the sum over the clip appears, with no discounting inside it. The discount factors you will meet later ($\gamma = 0.99$ and friends) belong to policy optimization, where future reward is worth less than present reward; here the model treats the human as indifferent about when in the clip the good moments happen, so it just totals them.

With a probability in hand, fitting $\hat r$ is plain supervised learning. You want the model's predicted preference to match the human's vote, so you minimize the cross-entropy between them:

$$\mathrm{loss}(\hat r) = -\!\!\sum_{(\sigma^1,\sigma^2,\mu)\in\mathcal{D}} \mu(1)\,\log \hat P\big[\sigma^1 \succ \sigma^2\big] + \mu(2)\,\log \hat P\big[\sigma^2 \succ \sigma^1\big] \tag{2}$$

This is not a loosely-chosen classification loss. It is exactly the negative log-likelihood of the Bradley-Terry model, so minimizing it is maximum-likelihood estimation of the reward that best explains the votes. If you have seen logistic regression, you have seen this: the feature is the reward gap $s_1 - s_2$, the label is the human's choice, and the only twist is that the feature is itself the output of a deep network being trained by the same loss.

One detail in Figure 1 is the paper's own, and it earns its keep. A pure sigmoid drives toward 0 or 1 at large gaps, which says a confident model should be certain a human will agree. But people slip: they misclick, or glance away, at some small constant rate that does not vanish no matter how obvious the answer. So the paper assumes a 10% chance the human just answers at random, which turns the prediction into

$$\hat P = 0.9\,\sigma(s_1 - s_2) + 0.05 \;\in\; [0.05,\,0.95].$$

Bounding the probability off 0 and 1 caps the loss a single comparison can contribute (it can never exceed $-\log 0.05 \approx 3$) and keeps the gradient finite, so one confidently-wrong human label cannot blow up training. It is a small piece of robustness, separate from how ties are handled, and it is the kind of detail that separates a method that works in a demo from one that survives real labelers.
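The cap is easy to see on one confidently wrong label. A minimal sketch, with hypothetical function names; `eps` is the assumed random-answer rate, 0.1 in the paper and 0.0 for the pure sigmoid:

```python
import math


def preference_prob(s1, s2, eps=0.1):
    """Predicted chance that clip 1 wins, assuming a fraction eps of answers are random."""
    sig = 1.0 / (1.0 + math.exp(-(s1 - s2)))
    return (1.0 - eps) * sig + eps * 0.5


def comparison_loss(s1, s2, mu, eps=0.1):
    """Cross-entropy between the soft label mu and the predicted preference."""
    p = preference_prob(s1, s2, eps)
    return -(mu[0] * math.log(p) + mu[1] * math.log(1.0 - p))


# The model is sure clip 2 wins (gap of -30), but the human picked clip 1.
wrong = (-30.0, 0.0)   # totals s1, s2
print(comparison_loss(*wrong, mu=(1.0, 0.0), eps=0.0))   # pure sigmoid: about 30, growing with the gap
print(comparison_loss(*wrong, mu=(1.0, 0.0), eps=0.1))   # with the 10% floor: stays below the cap
print(-math.log(0.05))                                   # the cap itself, roughly 3
```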

What a preference cannot tell you

Because only the gap $s_1 - s_2$ enters the prediction, there are things about the reward the model can never learn, and naming them explains two engineering choices that would otherwise look arbitrary. Add the same constant $c$ to every reward, and since each clip's total is a sum over its $k$ steps, that total shifts by $kc$. But the gap between any two clips of the same length is unchanged, so every predicted preference is identical:

$$\hat r \to \hat r + c \;\;\Longrightarrow\;\; s_1 - s_2 \;\text{unchanged} \;\;\Longrightarrow\;\; \hat P\;\text{unchanged}.$$

The reward is pinned down only up to an additive constant. The overall scale is loose too: multiply every reward by some positive number and you sharpen or flatten every sigmoid without ever changing which clip wins, and preferences alone cannot say which scale is right. A thermostat that reads only temperature differences has the same blind spot, indifferent to where you set zero and to the size of a degree. The paper removes both freedoms by normalizing $\hat r$ to zero mean and a fixed standard deviation, which is also why the ensemble (more on that later) averages predictors only after normalizing each one independently: without that, two predictors that agree on every preference could sit at wildly different levels and their average would be meaningless.
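Both freedoms, and the normalization that removes them, fit in a few lines. This is a sketch under assumptions: the reward values are invented, and `normalize` is a hypothetical helper that rescales one batch of predictions rather than the running normalization a real training loop would need.

```python
import math
import statistics


def preference(s1, s2):
    return 1.0 / (1.0 + math.exp(-(s1 - s2)))


def normalize(rewards, target_std=1.0):
    """Rescale a batch of predicted rewards to zero mean and a fixed standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / std * target_std for r in rewards]


clip1 = [0.2, -0.1, 0.4]       # made-up per-step rewards, two clips of equal length
clip2 = [0.0, 0.3, -0.2]
c = 5.0                        # arbitrary constant added to every reward

before = preference(sum(clip1), sum(clip2))
after = preference(sum(r + c for r in clip1), sum(r + c for r in clip2))
print(before, after)           # equal up to floating-point rounding

# Two hypothetical predictors that order every step identically
# but sit at different levels and scales.
a = clip1 + clip2
b = [3.0 * r + 7.0 for r in a]
print(normalize(a))
print(normalize(b))            # the same numbers (up to rounding), so averaging them is meaningful
```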

What survives all that ambiguity is the shape of the reward, and the shape is all the agent needs. The figure below makes a hidden reward function and lets you fit it from comparisons alone. Press play and watch the learned curve climb out of nothing and lock onto the true peaks and valleys, while its absolute height never settles, which is exactly why both curves are drawn centered. A constant offset like that changes no decision the agent will ever make.

Figure 2 · fitting a reward from comparisons
A hidden true reward over a space of behaviors, recovered by fitting the Bradley-Terry loss to pairwise comparisons. Scrub the number of comparisons and the learned reward converges to the true shape. Its height is never pinned down, so both curves are mean-centered: comparisons fix the shape, not the level.

This is also why comparisons are enough to do real work in the first place: you were never going to get an absolute reward scale out of a human anyway.
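A toy version of that fit, as a sketch rather than the paper's setup: the reward is a table over five made-up behaviors instead of a network, the "human" is a simulated Bradley-Terry oracle, and the learning rate and epoch count are arbitrary choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rewards(comparisons, n_items, lr=2.0, epochs=300):
    """Gradient descent on the Bradley-Terry cross-entropy for a tabular reward."""
    r = [0.0] * n_items
    for _ in range(epochs):
        grad = [0.0] * n_items
        for i, j, mu1 in comparisons:          # mu1 = 1.0 if item i was preferred
            g = sigmoid(r[i] - r[j]) - mu1     # d(loss)/d(r[i]); r[j] gets the negative
            grad[i] += g
            grad[j] -= g
        r = [rk - lr * gk / len(comparisons) for rk, gk in zip(r, grad)]
    return r


def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


rng = random.Random(0)
true_reward = [0.0, 1.0, -1.0, 2.0, 0.5]       # hidden from the learner
comparisons = []
for _ in range(2000):
    i, j = rng.sample(range(len(true_reward)), 2)
    vote = 1.0 if rng.random() < sigmoid(true_reward[i] - true_reward[j]) else 0.0
    comparisons.append((i, j, vote))

learned = fit_rewards(comparisons, len(true_reward))
print([round(v, 2) for v in centered(true_reward)])   # true shape, mean-centered
print([round(v, 2) for v in centered(learned)])       # learned shape, mean-centered
```

Every gradient step adds to one entry exactly what it subtracts from another, so the learned table's mean never leaves its starting value of zero. The data says nothing about the level; only the initialization sets it.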

Three processes, running at once

Put the pieces in motion and the method is a loop of three processes that run at the same time, each feeding the next. A policy $\pi$ acts in the environment and is trained by reinforcement learning to maximize the predicted reward $\hat r$. Pairs of clips from its behavior are sent to a human, who compares them. The comparisons train $\hat r$, which flows back to the policy. Nothing ever consults a real environment reward.

The figure makes the rates visible, and those rates are why this is practical at all. The policy runs flat out, taking millions of environment steps, while the human answers a few thousand comparisons total. Click through the three boxes to see what each consumes and produces, and watch the counter: on the Atari runs that works out to roughly one human label per ten thousand environment steps, a few thousand labels against tens of millions of steps. That ratio, well under a percent, is the concrete form of the paper's "feedback on under 1% of the agent's interactions."

Figure 3 · the asynchronous loop
The policy trains on the predicted reward and produces clips; a human compares pairs; the reward model is refit and handed back. The three run concurrently and at very different speeds: millions of environment steps against a few thousand human labels.

Running the three asynchronously, rather than in lockstep, is a throughput choice. A synchronous loop would stall the simulator every time it waited on a human, who answers only a few queries a minute; with the processes on separate clocks the policy keeps training while labels trickle in and the reward model refits in the background. The simulator does not wait for the human, and the human does not wait for the simulator. Trajectories pile up for the labeler to sample from; comparisons pile up for the reward model to fit; the freshest reward model is grabbed by the policy whenever a new one is ready. That is separate from the question of whether the human keeps labeling as the agent learns, which the next section shows matters just as much. Two of the three numbers that matter, environment steps and reward-model updates, are cheap and run continuously; only the third, human labels, is scarce, and the rest of the system is built to wring the most out of each one.

Chasing a reward that keeps moving

The policy half of the loop is, in the authors' words, a traditional RL problem: once $\hat r$ gives you a reward each step, you can hand it to a standard algorithm. But the choice of algorithm is not free, because of one subtlety that runs through the entire paper. The reward is non-stationary. The thing being optimized is itself being retrained, concurrently, on fresh comparisons, so the landscape the policy is climbing keeps reshaping under its feet. Here "non-stationary" means the learned reward changes, not the environment, whose physics are fixed.

That rules out leaning on a method that memorizes the value of every state for a fixed reward, the way the DQN family does, because a change in what reward means makes much of that memorized value wrong all at once. Policy-gradient methods are a better fit: they nudge the policy directly toward whatever currently scores well, taking small steps from recent experience, so when the reward shifts the next batch adjusts. It is hill-climbing in fog where the hill itself slowly reshapes, and small local steps tolerate that far better than a global height-map does.

So the paper picks two policy-gradient methods off the shelf, one per domain. Atari uses A2C, advantage actor-critic: a "doer" that picks actions and a "critic" that scores how much better than average each action turned out, with the policy nudged in proportion to that advantage. (A2C is the synchronous version of A3C, from Mnih et al. 2016, which introduced the asynchronous original.) The robots use TRPO, trust region policy optimization: take the largest improving step you can while staying close enough to the current policy that the estimate behind the step is still trustworthy. The one knob the authors retuned was TRPO's entropy bonus, the small reward a policy gets for keeping its action choices spread out rather than committing early. With the reward still moving, premature commitment is dangerous: the policy can lock onto behavior that scored well a moment ago and stop exploring before the reward has settled, so the authors turned this bonus up.

Provenance · Verified against primary literature
Bradley & Terry (1952) · The paired-comparison model behind Eq. (1): a logistic of the reward gap.
Mnih et al. (2016) · A3C / A2C, the actor-critic optimizing the policy on Atari.
Schulman et al. (2015) · TRPO, the trust-region policy optimizer used on the robotics tasks.
Amodei et al. (2016) · Concrete Problems in AI Safety: the reward-hacking framing this paper cites.
Wilson; Akrour et al. (2012) · Earlier preference-based RL this work scales up to deep networks.
Correction · The popular account that this first RLHF paper used PPO is wrong: PPO did not exist yet. It used A2C for Atari and TRPO for robotics. PPO arrived about five weeks later (arXiv:1707.06347) and joined the RLHF recipe only with later work like InstructGPT.

It is worth dwelling on why the PPO association sticks. The modern RLHF stack is so welded to PPO that the optimizer can feel load-bearing, as if learning a reward from preferences and running PPO were one idea. They are not. The part that mattered, comparisons to a reward to a policy, was already complete here in 2017; the box at the end just held A2C or TRPO, and PPO slid into that same slot later without changing anything upstream of it.

Why the human stays in the loop

One failure makes the whole architecture necessary. Suppose you collect a batch of comparisons, fit $\hat r$ once, freeze it, and only then turn the RL optimizer loose. It breaks, and it breaks in an instructive way. A frozen reward model is only trustworthy where it was trained: on the behaviors the human actually saw and compared. A reinforcement learner is an optimizer, and an optimizer is, in effect, an adversarial search for the highest-scoring behavior it can find. It drifts into the corners of behavior space the reward model never saw, where the model is unconstrained and happens to score high, and it parks there. Proxy reward climbs while the true goal goes unserved.

The figure makes it tangible. The amber curve is the true reward, peaking at the real goal. The teal curve is the learned reward the agent actually climbs. In offline mode the learned reward is frozen after early data and keeps rising past the real goal, so as the agent ascends it the proxy number goes up while the true reward it achieves falls off a cliff. Switch to online mode and fresh comparisons correct the model wherever the agent currently is, so the false slope never forms and the agent settles on the real goal.

Figure 4 · online versus offline
The agent climbs the learned reward, not the true reward. Offline, the frozen model keeps rising past the goal, so the agent's proxy score climbs while what it actually achieves collapses. Online, the model is corrected wherever the agent goes, and it lands on the goal. Toggle the mode and scrub the run.

The paper's own demonstration is the Pong ablation, and it is worth walking through because you can watch it happen. Trained offline, the agent learns that losing a point is bad and drives its rate of losing to zero, but it never learns that scoring is good, because the frozen reward did not pin that down where the agent ended up. So it volleys forever, keeping the ball in play in an endless rally that scores nothing. Keeping the human in the loop heads this off: as the agent invents the volley, the labeler is shown those new clips and corrects the reward there. In the paper's words, "human feedback needs to be intertwined with RL learning rather than provided statically."

This is the pattern usually called reward hacking, a case of Goodhart's law: a proxy optimized hard enough stops tracking the thing it proxied. The paper does not use those words; it describes a predictor that "captures only part of the true reward" and attributes the danger to the shifting distribution of states the agent visits, citing Concrete Problems in AI Safety (Amodei et al. 2016) for the framing. The loop already answers it: it keeps collecting comparisons on the agent's current behavior, so the reward model's training distribution chases the agent into whatever corner it wanders toward.

Asking the questions that matter

If a human label is your scarcest resource, you should not spend it on a comparison whose answer you can already guess. The paper's heuristic for picking which clip pairs to show is to use disagreement among an ensemble. Train not one reward predictor but three, each on its own resampled view of the data. Where the humans have labeled, the three agree, pinned to the comparisons they have seen; where labels are sparse, the three fan out, because nothing constrains them there. Sample many candidate clip pairs, and send the human the one where the ensemble's predictions disagree most, which is a crude stand-in for "the pair we are most uncertain about."

Click anywhere in the figure to drop a label and watch the disagreement collapse there; the marker hunts for the next widest gap. Press play to let it pick the worst spot itself, round after round.

Figure 5 · query selection by disagreement
Three reward predictors agree where labels exist and fan into a wide disagreement band where they do not. The next query is placed at the widest part of the band. Click to add a label and the disagreement there collapses.

The ensemble does two jobs at once. It picks better questions, and, because each member is normalized and the three are averaged, it gives a steadier reward estimate than any single network would, which matters when that estimate is the only signal the agent ever gets. The paper does not oversell the trick. The disagreement heuristic is crude, and the ablations show it is not a clean win: on some tasks, choosing queries this way does worse than picking them at random. The authors flag it as a rough approximation to the value-of-information question they would rather be answering, and leave the better version to future work.
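The selection rule itself is short. In this sketch the three ensemble members are hand-written stand-ins, not trained networks: clips are reduced to a single number x, and the members are assumed to agree on x in [0, 1] (where labels exist) and fan out beyond it.

```python
import math
import statistics


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pick_query(candidate_pairs, ensemble):
    """Return the clip pair whose predicted preference varies most across the ensemble."""
    def disagreement(pair):
        clip_a, clip_b = pair
        preds = [sigmoid(member(clip_a) - member(clip_b)) for member in ensemble]
        return statistics.pvariance(preds)
    return max(candidate_pairs, key=disagreement)


# Hypothetical ensemble members: identical on [0, 1], diverging past 1.
def member_flat(x):
    return x


def member_up(x):
    return x + max(0.0, x - 1.0) ** 2


def member_down(x):
    return x - max(0.0, x - 1.0) ** 2


ensemble = [member_flat, member_up, member_down]
candidates = [(0.2, 0.8), (0.5, 0.9), (0.3, 3.0)]
print(pick_query(candidates, ensemble))   # (0.3, 3.0): the only pair reaching unlabeled territory
```

Variance of the predicted preference is a stand-in for uncertainty, not a measure of how much the answer would help, which is the gap the authors acknowledge.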

Feedback on under one percent

With the loop assembled, the headline result is how little of it the human touches. On the simulated robots, eight MuJoCo tasks, 700 human comparisons are enough to nearly match ordinary reinforcement learning trained on the true reward, which the agent never sees. Push the synthetic-feedback version (an oracle that answers from the true reward) to around 1400 labels and it slightly beats training on the true reward, apparently because the learned reward is a touch better shaped: it hands out a little positive credit for behaviors that tend to precede success, smoothing the path. On the Ant task, real human feedback beat the oracle outright, because the labelers were told to prefer clips where the robot stayed upright, and that turned out to be useful shaping the hand-written bonus had missed.

On Atari, the same seven games as the original DQN paper, it is harder. With 5,500 human comparisons the agent shows substantial learning on most games and matches or beats RL on a few, but trails it on others; Breakout and Space Invaders improve but never catch up, while Enduro, which is hard for the RL baseline to explore at all, comes out ahead because human labelers reward any progress toward passing cars. A useful comparison the paper draws is that real human feedback often performs about as well as synthetic feedback with 40% fewer labels, a measure of how much noisier a real person is than a perfect oracle.

The novel behaviors carry the most weight, because they have no reward function to fall back on at all. A Hopper learns a sequence of backflips, landing upright each time, from 900 comparisons in under an hour of human time. A Half-Cheetah learns to run forward on one leg from 800. An Enduro car learns to ride alongside other cars rather than race past them from about 1,300. None of these has an obvious reward you could have written, and all of them came from a person watching clips and clicking.

What it looks like in practice

Make the Atari case concrete. The reward predictor takes the same input as the policy, four stacked $84\times84$ frames, through four convolutional layers (filters $7\times7$, $5\times5$, $3\times3$, $3\times3$ at strides 3, 2, 1, 1, 16 channels each, leaky ReLU), a 64-unit dense layer, and a single scalar output per frame-stack, with batch norm and dropout to keep it from overfitting the small pile of labels. To score a clip of 25 steps you run that net over the clip and sum the 25 scalars into one total $s_i$. A pair of clips gives $s_1, s_2$; the predicted preference is $0.9\,\sigma(s_1 - s_2) + 0.05$; the human's vote and the cross-entropy do the rest:

# one reward-model step from a single labeled clip pair
seg1, seg2, mu = sample_comparison(D)   # mu: soft label over {1, 2}
s1 = r_hat(seg1).sum()                  # sum per-step reward over the clip
s2 = r_hat(seg2).sum()                  # no discount inside a clip
p  = 0.9 * sigmoid(s1 - s2) + 0.05      # 10% chance of a random answer
loss = -(mu[0] * log(p) + mu[1] * log(1 - p))    # cross-entropy
loss.backward()                         # gradient flows into r_hat only

That predictor's output is normalized (to standard deviation 0.05 on Atari, 1 on the robots) and handed to the policy optimizer, A2C here with 16 parallel workers, $\gamma = 0.99$, an entropy bonus, and Adam. The exact spread is arbitrary, since it is the very scale preferences cannot pin down, so each domain just uses whatever value sat well with its optimizer's step size; what matters is that the rewards reach the optimizer re-centered to zero mean and a fixed spread. The three processes share the database $\mathcal{D}$ and the latest $\hat r$ and otherwise run on their own clocks:

# three processes run at once, sharing D and r_hat
policy:   roll out pi, train it by A2C / TRPO on r_hat   # millions of steps
labeler:  pick the highest-variance clip pair, ask a human   # a few thousand
reward:   refit r_hat to D by the loss above, normalize, hand it back
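The single reward-model step in the first listing leans on autodiff and on helpers that are not defined here. A runnable stand-in, under loud assumptions: the convolutional network is swapped for a linear reward over a hypothetical two-number feature vector per step, and the gradient is written out by hand with the chain rule.

```python
import math


def clip_features(segment):
    """Sum the per-step feature vectors, so dot(w, clip_features(seg)) is the clip total."""
    dim = len(segment[0])
    return [sum(step[k] for step in segment) for k in range(dim)]


def reward_step(w, seg1, seg2, mu, lr=0.1, eps=0.1):
    """One gradient step on a single labeled pair; returns (new_w, loss_before_step)."""
    f1, f2 = clip_features(seg1), clip_features(seg2)
    diff = [a - b for a, b in zip(f1, f2)]
    gap = sum(wk * dk for wk, dk in zip(w, diff))          # s1 - s2 for a linear reward
    sig = 1.0 / (1.0 + math.exp(-gap))
    p = (1.0 - eps) * sig + eps / 2.0                      # 10% chance of a random answer
    loss = -(mu[0] * math.log(p) + mu[1] * math.log(1.0 - p))
    # Chain rule: d(loss)/dp times dp/d(gap); d(gap)/dw is `diff`.
    dloss_dgap = -(mu[0] / p - mu[1] / (1.0 - p)) * (1.0 - eps) * sig * (1.0 - sig)
    new_w = [wk - lr * dloss_dgap * dk for wk, dk in zip(w, diff)]
    return new_w, loss


# Made-up clips: three steps each, features chosen so the two clips are easy to tell apart.
seg1 = [[1.0, 0.0]] * 3
seg2 = [[0.0, 1.0]] * 3
w = [0.0, 0.0]
for _ in range(3):
    w, loss = reward_step(w, seg1, seg2, mu=(1.0, 0.0))   # the human preferred clip 1
    print(round(loss, 4), [round(wk, 4) for wk in w])
```

The loss starts at log 2 (a coin flip) and falls as the weights learn to score clip 1's features above clip 2's.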

A few production details keep it stable. The reward model is an ensemble of three predictors. Each predictor holds out 1/e of the data (about a third) as validation, with the regularization tuned to keep validation loss between 1.1 and 1.5 times training loss, a deliberately mild overfit band: at a ratio near 1 the model is leaving signal on the table, and much above 1.5 it is memorizing the few thousand labels. A buffer keeps only the last 3,000 labels so the model stays current, and the labeling rate anneals as training proceeds (frequent early, when the reward matters most, sparse later). Put numbers to the headline claim: roughly 5,500 labels over a 50-million-step Atari run is about one label per nine thousand steps, on the order of a hundredth of a percent when counted as labels per step, at a cost the paper estimates in the tens of dollars each for compute and for a person's time. The paper frames the same saving from the overseer's side as cutting the human interaction needed to specify a task by roughly three orders of magnitude.
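The arithmetic behind those ratios, using only the figures quoted above (variable names are illustrative):

```python
import math

labels = 5_500               # human comparisons in the Atari runs
env_steps = 50_000_000       # environment steps in one run

steps_per_label = env_steps / labels
labels_per_step_percent = 100.0 * labels / env_steps
print(round(steps_per_label))              # 9091 environment steps per human label
print(round(labels_per_step_percent, 3))   # 0.011 percent

holdout_fraction = 1.0 / math.e
print(round(holdout_fraction, 3))          # 0.368, the "about a third" held out per predictor
```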

Step away from the machinery and the contribution is a clean separation. The hard, expensive thing, knowing what you want, is handled by a person doing something people are good at: comparing. The cheap, scalable thing, optimizing relentlessly, is handled by a machine doing what machines are good at. A reward model glued between them turns a few thousand comparisons into a signal an agent can chase for millions of steps, and the agent comes out able to do backflips nobody could have scored.

That separation is exactly what the language-model world picked up a few years later. The reward model here, Bradley-Terry plus cross-entropy, is the same one that scores answers in InstructGPT and the assistants built on it; what that recipe added was a pretrained model to start from, PPO in the optimizer slot, and a leash holding the policy near its starting point so it cannot wander into the reward model's blind spots, the same blind-spot problem you just watched in Figure 4. Direct Preference Optimization later folded the reward model back into the policy and skipped the RL loop entirely, but it is solving the very same Bradley-Terry objective written down in 2017. The residual idea, learn the reward from comparisons instead of writing it, has outlived every optimizer that has carried it.

Questions you might still have

Did this paper use PPO?
No. It used A2C (synchronous A3C) for Atari and TRPO for robotics. PPO was submitted about five weeks later, in July 2017, and joined the RLHF recipe only with later language-model work like InstructGPT. The "RLHF means PPO" association is real but postdates this paper.

If the agent never sees the true reward, how do we know it worked?
The true reward is withheld from the agent but used by the researchers as a yardstick, alongside a synthetic oracle and a normal RL baseline trained on it. With 700 comparisons the agent nearly matched that baseline on the MuJoCo tasks, and on a few it did better.

Why ask for comparisons instead of numeric scores?
People give inconsistent absolute numbers but consistent comparisons, because a comparison only needs the sign of the difference, not a calibrated value. The ablations confirm it: comparisons clearly beat numeric targets on continuous control, where the reward scale varies a lot.

Why must the reward model be trained online, not once up front?
A frozen reward model is trustworthy only where it was trained. A free optimizer drifts into the gaps and finds behavior the model overrates, like the agent that volleys forever in Pong without scoring. Live feedback re-grounds the model wherever the agent actually goes. The reward-hacking framing is from Amodei et al. 2016, which the paper cites.

How does this connect to ChatGPT-style RLHF?
This is where RLHF started. The reward model, Bradley-Terry fit by cross-entropy, carries straight over. The language-model recipe added a pretrained start, PPO, and a KL penalty keeping the policy near a reference. DPO later collapsed the reward model into the policy. The explainers on InstructGPT, PPO, and DPO trace that lineage forward.

Footnotes & further reading

  1. The paper: Christiano, Leike, Brown, Martic, Legg, Amodei, Deep Reinforcement Learning from Human Preferences (OpenAI / DeepMind, NeurIPS 2017). Christiano and Amodei were at OpenAI; Leike, Martic, and Legg at DeepMind; Brown is listed without an institution.
  2. The preference model is Bradley and Terry, Rank Analysis of Incomplete Block Designs (1952), the binary case of the Luce choice rule, and the same logistic-of-a-rating-difference shape as the Elo chess system (Elo's own original curve was Gaussian; the logistic form came later).
  3. Policy optimizers: A2C is the synchronous form of A3C from Mnih et al., Asynchronous Methods for Deep Reinforcement Learning (2016); robotics used TRPO (Schulman et al., 2015). PPO (Schulman et al., arXiv:1707.06347) appeared about five weeks later and is not used here.
  4. The reward-hacking and distributional-shift framing is from Amodei et al., Concrete Problems in AI Safety (2016), which the paper cites for the offline-training failure.
  5. A note on the Atari label counts. The paper's body says its 5,500-label human run was compared to "350, 700, or 1400 synthetic queries," but that phrase is a copy-paste slip from the identical MuJoCo sentence: it is contradicted two sentences later ("only 3,300 such labels") and by the ablations, which use 5,500 synthetic labels. The synthetic Atari curves span the low thousands up to about ten thousand labels; the MuJoCo counts of 350 / 700 / 1400 are correct.
  6. The lineage to language models runs through Ziegler et al. 2019, Stiennon et al. 2020 (summarization from human feedback), InstructGPT (Ouyang et al. 2022), and DPO (Rafailov et al. 2023). They reuse the same Bradley-Terry reward-model skeleton, specialized to whole text completions, and add the pretrained start, PPO, and KL-to-reference that this paper did not need.