Direct Preference Optimization: Your Language Model is Secretly a Reward Model

The reward model was inside the policy all along.

RLHF fits a reward model, then runs reinforcement learning against it. DPO proves you can skip both. The same preference data trains the language model directly, with one classification loss, and the reward model comes baked into the result.

Explaining the paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model · Rafailov, Sharma, Mitchell, Ermon, Manning, Finn · Stanford · NeurIPS 2023 · arXiv:2305.18290

The reward model and the reinforcement learning are both unnecessary: the reward can be read straight off the policy you were training anyway.

A pretrained language model has broad capabilities but no judgment. It will finish your sentence, but it has no preference for the helpful finish over the rude one, the safe one over the dangerous one. Teaching it those preferences is the alignment problem, and the standard answer, the one behind ChatGPT and InstructGPT, is RLHF: reinforcement learning from human feedback. It works. It is also a three-stage machine with a reputation for being finicky. You fine-tune a base model, then train a separate reward model to imitate human taste, then run a reinforcement-learning algorithm (usually PPO) that samples from the policy, scores each sample with the reward model, and nudges the policy uphill, over and over, while a leash keeps it from drifting too far from where it started.

Direct Preference Optimization (DPO, from Stanford) removes the middle and the end of that pipeline. No reward model. No reinforcement learning. No sampling from the model during training. You take the same pairs of responses a human ranked, and you train the language model on them with a loss that looks like ordinary classification. The title says why it works: the language model you are training is a reward model. The reward is not a separate network you fit and throw away. It is a simple function of the policy's own probabilities, and once you see that, the entire RLHF objective rearranges into something you can minimize with gradient descent and nothing else.

Getting there takes a few ideas. How a pairwise preference becomes a number (a reward). What RLHF is really maximizing. Why that objective has a known optimal solution in closed form. And the one algebraic move, inverting that solution, that turns the objective into a classification loss. We build them in order.

Aligning by trial and reward

You begin with a model that can write and a pile of human judgments. The judgments are comparisons. A human sees a prompt $x$ and two responses, picks the better one, and you record the winner $y_w$ and the loser $y_l$ . Nobody assigns a score. People are bad at "rate this 7.3 out of 10" and good at "this one is better," so preference data is pairwise by design.

RLHF turns that pile into an aligned model in three stages. Stage one, supervised fine-tuning (SFT): train the base model on good example responses so it at least answers in the right format. Call the result $\pi_{\text{ref}}$ , the reference policy. Stage two, fit a reward model $r_\phi(x,y)$ , a separate network that reads a prompt and a response and outputs a scalar "how good," trained so that it scores the human-preferred response higher. Stage three, reinforcement learning: let the policy generate responses, score them with $r_\phi$ , and push the policy toward higher-scoring output, with a penalty for drifting too far from $\pi_{\text{ref}}$ .

That third stage is the expensive, temperamental one. It samples from the model thousands of times per step, keeps two large networks live (the policy and the reward model), and PPO has a thicket of knobs that have to be set right or training diverges. DPO keeps stage one, deletes stages two and three, and replaces them with a single supervised loss. The figure below is the before and after.

Figure 1 · two pipelines

RLHF is three stages with a sampling loop: SFT, then a separate reward model, then PPO whose policy is sampled and re-scored every step. DPO is one stage: the SFT model becomes a frozen reference and a single classification loss over the fixed preference pairs produces the aligned policy. No reward model, no sampling.

Preferences are a logistic curve

Before we can train anything we need to connect "a human picked this one" to a number. The standard bridge is the Bradley-Terry model, from 1952, the same idea behind chess Elo. Suppose every response has a hidden quality score, a reward $r^*(x,y)$ . The model says the chance a human prefers $y_1$ over $y_2$ is set by the gap between their rewards, run through a softmax of two terms:

p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x,y_1)\big)}{\exp\big(r^*(x,y_1)\big) + \exp\big(r^*(x,y_2)\big)} = \sigma\big(r^*(x,y_1) - r^*(x,y_2)\big)

(1)

The two forms are the same thing. Divide top and bottom of the fraction by $\exp(r^*(x,y_1))$ and it collapses to the logistic sigmoid $\sigma(z) = 1/(1+e^{-z})$ of the reward difference. It is a statement about confidence. If the two responses have equal reward, the gap is zero, $\sigma(0) = 0.5$, and the human is a coin flip. As the winner pulls ahead, the probability of preferring it climbs an S-curve toward one but never quite gets there, because humans are noisy and sometimes pick the worse answer. Only the difference of rewards matters, a fact we will lean on hard later.

Figure 2 · the Bradley-Terry model

The chance a human prefers the winner is σ of the reward gap. At gap zero it is a coin flip; a positive gap bends the preference toward the winner along the logistic curve. Only the difference of rewards enters, never the absolute level.
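To check the claim that the two forms of equation (1) agree, here is a minimal sketch in plain Python. The function names and the reward values are made up for illustration; nothing here comes from the paper's code.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def bt_softmax(r1: float, r2: float) -> float:
    """P(y1 beats y2) written as a two-term softmax over rewards."""
    e1, e2 = math.exp(r1), math.exp(r2)
    return e1 / (e1 + e2)

def bt_sigmoid(r1: float, r2: float) -> float:
    """The same probability written as a sigmoid of the reward gap."""
    return sigmoid(r1 - r2)

# Equal rewards give a coin flip.
p_equal = bt_sigmoid(2.0, 2.0)

# Hypothetical rewards with a gap of 1.2.
p_base = bt_softmax(1.5, 0.3)

# Shifting both rewards by the same constant leaves the preference unchanged:
# only the difference enters.
p_shifted = bt_softmax(1.5 + 10.0, 0.3 + 10.0)

print(p_equal, p_base, p_shifted)
```

The shift invariance in the last step is the property the later derivation relies on: any term shared by both rewards disappears from the preference probability.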

This gives you a way to fit a reward model. You have a dataset $\mathcal{D}$ of triples $(x, y_w, y_l)$ . Pick the reward function that makes the observed preferences most likely. Under Bradley-Terry, the probability of each observed " $y_w$ beat $y_l$ " is $\sigma(r(x,y_w) - r(x,y_l))$ , so maximizing the likelihood means minimizing its negative log:

\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)\Big]

(2)

That is RLHF's stage two in one line: a binary classifier that learns to score winners above losers. Hold onto its shape, the log-sigmoid of a reward difference, because DPO's loss is this exact expression with the reward written a different way.
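A small sketch of loss (2) on scalar scores may make its behavior concrete. It assumes the reward model has already produced a score for each winner and loser; the score values and helper names are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(pairs):
    """Mean of -log sigma(r_w - r_l) over (winner_score, loser_score) pairs, as in eq (2)."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical reward-model scores for three preference pairs: (winner, loser).
good_scores = [(2.0, 0.0), (1.0, -1.0), (0.5, -0.5)]
flipped_scores = [(r_l, r_w) for r_w, r_l in good_scores]   # every pair ranked backwards
uninformative = [(0.0, 0.0)] * 3                             # no opinion at all

loss_good = reward_model_loss(good_scores)
loss_flipped = reward_model_loss(flipped_scores)
loss_flat = reward_model_loss(uninformative)   # equals log(2), the coin-flip loss

print(loss_good, loss_flat, loss_flipped)
```

A scorer that ranks winners above losers lands below log 2, one that ranks them backwards lands above it, which is all "binary classifier" means here.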

What RLHF actually maximizes

Now stage three. With a reward model in hand, you want a policy $\pi_\theta$ that earns high reward. But naively maximizing reward is a trap. The reward model is an imperfect imitation of human taste, and a policy that maximizes it without constraint lands on the reward model's errors: degenerate text that scores high and reads like nonsense. (This is reward hacking, an instance of Goodhart's law.) So RLHF adds a constraint. Maximize reward, but stay close to the reference policy $\pi_{\text{ref}}$ you started from, measured by KL divergence:

\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y\mid x)\,\|\,\pi_{\text{ref}}(y\mid x)\big]

(3)

The objective has two terms. The first rewards good output. The second is a penalty, the KL divergence, which measures how far the new policy's distribution has moved from the reference. The coefficient $\beta$ sets the strength of the leash. KL is a natural choice here: it is computed from the log-probabilities that the policy and the reference assign to the text the policy actually generates, so it needs no model beyond those two, and it charges the policy exactly where it redistributes probability mass away from the reference, which is precisely the failure mode being guarded against. The KL term keeps the policy fluent and on-distribution while it shifts toward what humans prefer, and it is the reason DPO needs to keep $\pi_{\text{ref}}$ around at all.

This objective looks like it demands reinforcement learning, because $y$ is sampled from $\pi_\theta$ (the expectation is over the policy's own output, which moves as you train, which is exactly what makes it an RL problem). RLHF solves it with PPO. But this particular objective, reward minus a KL penalty, is special. It has a known optimal solution you can write down in one line.

The optimal policy has a formula

Set gradient descent aside and consider which of all possible policies maximizes (3). This is a textbook result (it shows up under names like "control as inference" and the Gibbs variational principle), and it has a clean closed form. The best policy reweights the reference by the exponential of the reward:

\pi_r(y\mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big)

(4)

This is a tilt of the reference. Start with the reference distribution $\pi_{\text{ref}}$ . For each response, multiply its probability by $\exp(r/\beta)$ , which boosts high-reward responses and suppresses low-reward ones. Then divide by the partition function $Z(x)$ , a per-prompt number that rescales everything back into a valid probability distribution that sums to one:

Z(x) = \sum_y \pi_{\text{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)

The role of $\beta$ is now vivid. When $\beta$ is large, $r/\beta$ is small, the exponential is near one, and the optimal policy barely budges from $\pi_{\text{ref}}$ (a strong restraint). When $\beta$ is small, $r/\beta$ is huge, the exponential is sharp, and the optimal policy piles all its mass on the single highest-reward response (no restraint, pure greed). Figure 3 shows the reference distribution tilting as $\beta$ changes.

Figure 3 · the KL-constrained optimum

The optimal policy is the reference distribution reweighted by exp(r/β). Each candidate response keeps its reference probability times the exponential of its reward, then the whole thing is renormalized. Large β stays near the reference; small β collapses onto the highest-reward response. This is the exact optimum of objective (3), no RL required to state it.
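On a toy prompt with only a handful of candidate responses, equation (4) can be computed by brute force and compared against objective (3). The sketch below assumes four responses with made-up reference probabilities and rewards; `tilt` and `objective` are illustrative names, not functions from any library.

```python
import math

def tilt(pi_ref, rewards, beta):
    """Return (pi_r, Z): the reference reweighted by exp(r / beta), as in eq (4)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref): eq (3) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy prompt with four candidate responses.
pi_ref = [0.4, 0.3, 0.2, 0.1]
rewards = [0.0, 1.0, 2.0, 3.0]
beta = 0.7

pi_star, z = tilt(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# Two alternatives the tilted policy should beat: not moving, and pure greed.
stay = objective(pi_ref, pi_ref, rewards, beta)
greedy = objective([0.0, 0.0, 0.0, 1.0], pi_ref, rewards, beta)

# Large beta hugs the reference; small beta piles mass on the top reward.
timid, _ = tilt(pi_ref, rewards, beta=100.0)
bold, _ = tilt(pi_ref, rewards, beta=0.05)

print(best, stay, greedy)
print(timid, bold)
```

On this toy set the value of the objective at the optimum works out to $\beta \log Z(x)$, which follows from substituting (4) into (3). The brute-force sum over responses is exactly what stops being possible once the response set is every possible string.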

So why does anyone bother with PPO if the optimum is a formula? Because the formula is useless as written. To actually sample from $\pi_r$ you need $Z(x)$ , and $Z(x)$ is a sum over every possible response to the prompt, an astronomically large set. Estimating it is, in the paper's words, expensive enough to make this representation hard to use in practice. RLHF gives up on the formula and grinds toward it with sampling. DPO takes a different route.

The reward was in the policy all along

Equation (4) expresses the optimal policy in terms of the reward. The whole paper turns on solving it the other way: express the reward in terms of the policy. Take the log of both sides of (4) and rearrange for $r$ :

r(x,y) = \beta\,\log\frac{\pi_r(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\,\log Z(x)

(5)

This says any reward function can be rewritten as $\beta$ times the log-ratio of its own optimal policy to the reference, plus a term that depends only on the prompt. The reward and the policy are two encodings of the same thing. If you know the optimal policy you know the reward, exactly, up to that $\beta\log Z(x)$ offset. And notice what $Z(x)$ is: a sum over responses, so its value depends only on the prompt, not on any particular response.

And the offset does not matter. Look back at the Bradley-Terry model (1): preferences depend only on the difference of two rewards for the same prompt. Plug (5) into that difference. The term $\beta\log Z(x)$ depends on $x$ alone, so it is identical for $y_w$ and $y_l$ , and it cancels:

r(x,y_w) - r(x,y_l) = \beta\,\log\frac{\pi_r(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\,\log\frac{\pi_r(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}

(6)

The intractable partition function is gone. The thing that made (4) unusable was the one thing that did not survive the subtraction. What is left is a reward difference written entirely in quantities you can compute: a forward pass of the policy and a forward pass of the frozen reference. (This works for any reward in the Bradley-Terry equivalence class. Two rewards that differ by a prompt-only function $f(x)$ induce the same preferences and the same optimal policy, and the log-ratio parameterization can represent every such class. That is Theorem 1 in the paper. The $\beta\log Z(x)$ offset is exactly one of those harmless shifts.)
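The cancellation can be checked numerically on a toy set where $Z(x)$ is small enough to compute. This is a sketch under made-up numbers: three responses, hypothetical rewards, and an illustrative helper `implicit_reward`.

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]       # reference probabilities for three responses
rewards = [1.0, -0.5, 2.0]     # hypothetical true rewards r(x, y)

# Build the optimal policy of eq (4) by brute force (only possible on a toy set).
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_r = [w / z for w in weights]

def implicit_reward(i: int) -> float:
    """beta * log(pi_r / pi_ref): eq (5) without the beta * log Z offset."""
    return beta * math.log(pi_r[i] / pi_ref[i])

# Every implicit reward misses the true one by the same prompt-only offset...
offsets = [rewards[i] - implicit_reward(i) for i in range(3)]

# ...so the gap between any two responses matches exactly, as in eq (6).
true_gap = rewards[2] - rewards[1]
implicit_gap = implicit_reward(2) - implicit_reward(1)

print(offsets, beta * math.log(z))
print(true_gap, implicit_gap)
```

All three offsets equal $\beta \log Z(x)$, and the two gaps agree, without $Z(x)$ ever appearing in the gap computation.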

DPO: one classification loss

Recall the reward-model loss (2): a log-sigmoid of a reward difference, fit to the preference data. We just wrote that reward difference (6) purely in terms of the policy. So substitute. Instead of fitting a separate reward model and then optimizing a policy against it, parameterize the reward by the policy and fit it directly to the preferences. The reward-model loss becomes a loss on $\pi_\theta$ :

\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta\,\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\,\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]

(7)

This is the entire DPO method. It is the reward-model objective (2) with the reward replaced by $\beta\log(\pi_\theta/\pi_{\text{ref}})$ , the policy's own implicit reward. There is no reward network, no sampling, no RL loop. You have a fixed dataset of preference pairs, and you minimize (7) by plain gradient descent, the same way you would train any classifier. The quantity

\hat{r}_\theta(x,y) = \beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}

(8)

is the implicit reward: the reward model defined by the language model and the reference, never trained on its own but emerging the instant you optimize (7). This is the reward model the title names. When DPO finishes, you have an aligned policy and, baked into it, a reward function consistent with the human preferences, readable off the log-ratios whenever you want it.

The loss can be read as a landscape. Setting the prompt-only offset aside, the loss reduces to two numbers: how much the policy raises the winner above the reference, $a_w = \log(\pi_\theta(y_w)/\pi_{\text{ref}}(y_w))$ , and how much it raises the loser, $a_l = \log(\pi_\theta(y_l)/\pi_{\text{ref}}(y_l))$ . The loss depends only on the margin $a_w - a_l$ . It is low in the upper-left, where the policy lifts the winner and pushes down the loser, and high in the lower-right, where it has them backwards.

Figure 4 · the DPO loss landscape

The two axes are how far the policy lifts the winner and the loser above the reference. The loss depends only on the margin between them, so it falls toward the upper-left. The gradient arrow always points up and left, raise the winner and lower the loser, and it shrinks once the winner is safely ahead.

What the update does

The loss is tidy, but the gradient is what makes the behavior intuitive. Differentiate (7) and you get

\nabla_\theta \mathcal{L}_{\text{DPO}} = -\,\beta\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x,y_l) - \hat{r}_\theta(x,y_w)\big)\,\big[\,\nabla_\theta \log\pi_\theta(y_w\mid x) - \nabla_\theta \log\pi_\theta(y_l\mid x)\,\big]\Big]

(9)

Two pieces. The bracket on the right is the direction: increase the log-probability of the winner, decrease the log-probability of the loser. Every preference pair pushes the policy the same way, up on $y_w$ , down on $y_l$ . The scalar in front is the interesting part. It is $\sigma(\hat{r}_\theta(x,y_l) - \hat{r}_\theta(x,y_w))$ , the probability that the model's current implicit reward ranks the pair backwards. When the policy already scores the winner well above the loser, that sigmoid is near zero and the example is nearly ignored. When the policy has the pair wrong, scoring the loser higher, the weight approaches one and the example gets the full push.

This adaptive weighting is built into the DPO loss itself, not added on top. DPO concentrates its gradient on the examples currently ranked wrong and applies almost no update to those already ranked correctly, because the weight is set by the model's own error. Figure 5 shows the weight and the push as the margin changes.

Figure 5 · the gradient weight

The update weight is σ(r̂_l − r̂_w), the chance the current model ranks the pair backwards. Examples the model already gets right (positive margin) are nearly ignored; examples it gets wrong get the full push. The right panel shows that push, scaled by the weight: raise the winner, lower the loser.
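One way to convince yourself of the weighting is a finite-difference check. The sketch below treats the loss as a function of the two log-ratios $a_w$ and $a_l$ only (an assumption that sidesteps the network entirely) and compares the numeric slope with $-\beta$ times the weight from equation (9). The function names and the sample points are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(a_w: float, a_l: float, beta: float) -> float:
    """-log sigma(beta * (a_w - a_l)), where each a is log(pi_theta / pi_ref)."""
    return -math.log(sigmoid(beta * (a_w - a_l)))

def grad_weight(a_w: float, a_l: float, beta: float) -> float:
    """sigma(r_hat_l - r_hat_w): the scalar in front of the gradient in eq (9)."""
    return sigmoid(beta * (a_l - a_w))

def numeric_slope_wrt_winner(a_w, a_l, beta, eps=1e-6):
    """Central finite difference of the loss with respect to a_w."""
    return (dpo_loss(a_w + eps, a_l, beta) - dpo_loss(a_w - eps, a_l, beta)) / (2 * eps)

beta = 0.1
# From badly wrong, through undecided, to safely right.
points = [(-20.0, 20.0), (0.0, 0.0), (4.0, -4.0), (30.0, -30.0)]

results = []
for a_w, a_l in points:
    weight = grad_weight(a_w, a_l, beta)
    slope = numeric_slope_wrt_winner(a_w, a_l, beta)
    results.append((a_w - a_l, weight, slope))
    print(f"margin={a_w - a_l:+.0f}  weight={weight:.3f}  dL/da_w={slope:+.4f}")
```

The slope with respect to $a_w$ is always negative (raising the winner lowers the loss), and its size tracks the weight: close to $\beta$ when the pair is ranked backwards, close to zero once the winner is well ahead.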

What it looks like in practice

Let me make the abstractions concrete with real shapes. A response $y$ is a sequence of tokens, say 20 of them. The model gives a probability to each next token, and $\log\pi_\theta(y\mid x)$ is the sum of those per-token log-probabilities over the response. (The official code sums by default rather than averaging, so longer responses accumulate more negative log-prob.)

# log pi(y|x) is the SUM of per-token log-probs over the response (average_log_prob=False)
# shapes: logits (batch, seq, vocab); labels and loss_mask (batch, seq)
logp = logits.log_softmax(-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # per-token log-prob
logp = (logp * loss_mask).sum(-1)                              # sum over the response tokens
# so for a 20-token answer, log pi(y|x) is a sum of 20 negative numbers, e.g. about -34
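The same computation can be sketched without tensors, which makes the masking and the summing easy to inspect. Everything below is a toy: a three-token vocabulary, four positions, and hypothetical helper names. It mirrors the logic of the listing above but is not the released code.

```python
import math

def log_softmax(logits):
    """Log-probabilities over the vocabulary for one position."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]

def sequence_logp(token_logits, labels, loss_mask):
    """Sum of the log-probs of `labels`, counting only positions where the mask is 1."""
    total = 0.0
    for logits, label, keep in zip(token_logits, labels, loss_mask):
        if keep:
            total += log_softmax(logits)[label]
    return total

# Toy input: vocabulary of 3 tokens, 4 positions.
# Position 0 stands in for a prompt token, so its mask is 0.
token_logits = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
]
labels = [1, 0, 1, 2]
loss_mask = [0, 1, 1, 1]

logp = sequence_logp(token_logits, labels, loss_mask)
logp_unmasked = sequence_logp(token_logits, labels, [1, 1, 1, 1])

print(logp, logp_unmasked)
```

Including the masked position adds another negative term, which is the length effect of summing rather than averaging.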

So for each preference pair you run four forward passes: the policy and the reference, each on the winner and the loser, giving four scalar log-probs. The loss is built from those four numbers and nothing else.

# trainers.py: the entire DPO loss (beta = 0.1 for dialogue)
import torch.nn.functional as F
pi_logratios  = policy_chosen_logps    - policy_rejected_logps     # log [pi_theta(y_w) / pi_theta(y_l)]
ref_logratios = reference_chosen_logps - reference_rejected_logps  # same ratio under pi_ref
logits = pi_logratios - ref_logratios        # = a_w - a_l, the implicit margin / beta
losses = -F.logsigmoid(beta * logits)         # eq (7): a binary classification loss
# the implicit rewards, for logging only (detached):
chosen_rewards   = beta * (policy_chosen_logps   - reference_chosen_logps).detach()
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

One example runs through it with $\beta = 0.1$. Suppose the reference assigns summed log-probs $\log\pi_{\text{ref}}(y_w) = -34$ and $\log\pi_{\text{ref}}(y_l) = -31$ (the reference slightly prefers the loser, which is why we are correcting it). After some training the policy has moved to $\log\pi_\theta(y_w) = -30$ and $\log\pi_\theta(y_l) = -35$. The policy's log-ratio is $-30 - (-35) = 5$, the reference's is $-34 - (-31) = -3$, and the logit is their difference, $\text{logits} = 5 - (-3) = 8$. With $\beta = 0.1$ that is $\beta \cdot \text{logits} = 0.8$, so the loss is $-\log\sigma(0.8) \approx 0.37$. The implicit rewards are $\hat{r}_w = 0.1(-30-(-34)) = 0.4$ and $\hat{r}_l = 0.1(-35-(-31)) = -0.4$, so the policy now scores the winner above the loser by a margin of $0.8$. The gradient weight on this example, $\sigma(\hat{r}_l - \hat{r}_w) = \sigma(-0.8) \approx 0.31$, is already below the $0.5$ it starts at when the policy equals the reference: the model mostly has this pair right and will spend less on it next time.
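The arithmetic of this example can be reproduced in a few lines of plain Python. The sketch follows the same steps as the listing above but on scalars; `dpo_terms` is an illustrative name, and the four log-probs are the ones from the worked example.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_terms(policy_w, policy_l, ref_w, ref_l, beta):
    """Loss, implicit rewards, and gradient weight for one preference pair."""
    logits = (policy_w - policy_l) - (ref_w - ref_l)
    loss = -log_sigmoid(beta * logits)
    reward_w = beta * (policy_w - ref_w)
    reward_l = beta * (policy_l - ref_l)
    weight = math.exp(log_sigmoid(reward_l - reward_w))   # sigma(r_hat_l - r_hat_w)
    return logits, loss, reward_w, reward_l, weight

# The four summed log-probs from the worked example.
logits, loss, reward_w, reward_l, weight = dpo_terms(
    policy_w=-30.0, policy_l=-35.0, ref_w=-34.0, ref_l=-31.0, beta=0.1
)
print(logits, round(loss, 2), round(reward_w, 2), round(reward_l, 2), round(weight, 2))
# prints: 8.0 0.37 0.4 -0.4 0.31

# At initialization the policy equals the reference: loss is log 2, weight is 0.5.
_, loss0, _, _, weight0 = dpo_terms(-34.0, -31.0, -34.0, -31.0, beta=0.1)
print(round(loss0, 3), weight0)
```

Comparing the two calls shows what training has bought so far: the loss has dropped from about 0.69 to about 0.37, and the weight on this pair has dropped with it.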

Two practical notes the code pins down. First, where does $\pi_{\text{ref}}$ come from? It is the SFT checkpoint, kept frozen. If you do not have an SFT model for your data, the paper's fallback is to make one by fine-tuning the base model on the preferred responses alone, $\pi_{\text{ref}} = \arg\max_\pi \mathbb{E}[\log\pi(y_w\mid x)]$, so the reference is at least on-distribution. Second, $\beta$: the released config has no default, and the example commands use $\beta = 0.1$ for dialogue (the paper's appendix lists $\beta = 0.5$ for TL;DR summarization). The paper did not meaningfully tune it, which is part of the "no significant hyperparameter tuning" claim.

The same file ships two optional variants. Setting label_smoothing to a small $\varepsilon$ gives conservative DPO, which assumes a fraction $\varepsilon$ of the human labels are flipped and softens the loss so the model never tries to drive any margin to infinity. Setting reference_free drops $\pi_{\text{ref}}$ entirely (treats it as uniform), which is simpler but loses the tether. The default is plain DPO with neither.
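A scalar sketch shows what the two switches do to the loss. It assumes the label-smoothed form is the usual mixture of the loss on the given label and the loss on the flipped label, weighted by $1-\varepsilon$ and $\varepsilon$; treat that as an assumption about the variant rather than a transcription of the released file. The function name and inputs are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def preference_loss_sketch(pi_logratio, ref_logratio, beta,
                           label_smoothing=0.0, reference_free=False):
    """DPO loss on scalar log-ratios, with the two optional variants.

    label_smoothing: assumed fraction of flipped labels (0 gives plain DPO).
    reference_free:  treat the reference as uniform, so its log-ratio is 0.
    """
    if reference_free:
        ref_logratio = 0.0
    logits = pi_logratio - ref_logratio
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * logits)
            - label_smoothing * log_sigmoid(-beta * logits))

beta = 0.1

# Plain DPO on the worked example's log-ratios (5 for the policy, -3 for the reference).
plain = preference_loss_sketch(5.0, -3.0, beta)

# Reference-free: the tether is gone, so only the policy's own log-ratio counts.
ref_free = preference_loss_sketch(5.0, -3.0, beta, reference_free=True)

# Plain DPO keeps rewarding a bigger margin; the smoothed loss has a finite optimum.
plain_mid = preference_loss_sketch(22.0, 0.0, beta)
plain_huge = preference_loss_sketch(100.0, 0.0, beta)
smooth_mid = preference_loss_sketch(22.0, 0.0, beta, label_smoothing=0.1)
smooth_huge = preference_loss_sketch(100.0, 0.0, beta, label_smoothing=0.1)

print(plain, ref_free)
print(plain_mid, plain_huge, smooth_mid, smooth_huge)
```

With $\varepsilon = 0.1$ the smoothed loss is lowest where $\sigma(\beta \cdot \text{logits}) = 0.9$ and rises again beyond it, which is the sense in which the model stops trying to drive margins to infinity.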

Matches RLHF, far simpler

It matches or beats RLHF while being far simpler to run. On controlling the sentiment of generated text, where you can measure the true reward-versus-KL frontier exactly, DPO sits on the best frontier at every KL budget, ahead of PPO: more reward for the same drift from the reference. On TL;DR summarization, judged by GPT-4 against human reference summaries, DPO wins about 61% of the time at temperature 0, against PPO's 57%. On Anthropic's single-turn dialogue, DPO is the one efficient method that improves on the preferred completions in the data. And DPO holds up across sampling temperatures where PPO degrades.

The reasons it is easier to live with are structural. There is no reward model to train, store, or overfit. There is no sampling from the policy during training, so a training step is a forward and backward pass on fixed data, like any supervised job. There is no PPO, so the long list of RL knobs is gone; the main dial is $\beta$ . The cost is four forward passes per pair instead of one, and you keep a frozen reference model in memory during training. At inference you ship one ordinary language model.

The limits follow from the design. DPO trains on a fixed set of preferences (it is off-policy), so it never sees the responses the improving model would actually generate, the way online PPO does. That can matter when the preference data is far from what the tuned model produces, and it has motivated online and iterative variants that refresh the pairs as the policy moves. DPO inherits Bradley-Terry's assumptions, including that a single scalar reward explains the comparisons. And like all of these methods it is only as aligned as the preferences it is fed; garbage preferences in, confidently-aligned garbage out. None of that has slowed it down. DPO is now a default first move for preference tuning, precisely because the reward model it removed was never a separate thing. It was a way of reading the policy.

Provenance · Verified against primary literature

DPO (2023) · Rafailov et al.: the reward reparameterization, the closed-form optimum, the DPO loss (eq 7), and the gradient analysis.

trainers.py (code) · Official implementation. preference_loss: logits = (chosen-rejected policy log-ratio) - (chosen-rejected ref log-ratio), loss = -logsigmoid(beta*logits). _get_batch_logps sums token log-probs by default.

Bradley & Terry (1952) · The logistic model of pairwise comparison: P(i beats j) depends on the difference of latent scores.

InstructGPT / RLHF (2022) · Ouyang et al.: the three-stage pipeline DPO collapses (SFT, reward model, PPO).

Control as inference · The KL-constrained reward objective has a known Boltzmann optimum, the exponential tilt of the reference policy.

Caveat · The paper has no single default for beta and the released config ships it unset (beta: ???); the README's example commands and the dialogue experiments use beta = 0.1, which is what we quote, while the paper's appendix lists beta = 0.5 for TL;DR summarization. Separately, eq (7) puts a beta inside the sigmoid on each log-ratio, while the code factors it out once (logits = pi-log-ratio - ref-log-ratio, then -logsigmoid(beta*logits)). These are algebraically identical; we teach eq (7) and show the code form in the listing.

Questions you might still have

If there is no separate reward model, what plays its role?
The policy itself. The quantity beta times log(pi_theta/pi_ref) is an implicit reward: train the policy with the DPO loss and that quantity becomes a reward model consistent with the preference data, at no extra cost. The title refers to that quantity.

Why does the intractable partition function Z(x) not matter?
Preferences depend only on reward differences, and Z(x) depends only on the prompt x, so it is identical for the winner and the loser and cancels in the subtraction. DPO never has to compute it.

Does DPO still need a reference model at run time?
Only during training, to compute the log-ratio. You keep a frozen copy of the SFT model as pi_ref. At inference you ship only pi_theta, an ordinary language model with no extra machinery.

What is beta doing?
It is the same KL temperature as in RLHF. Large beta keeps the policy close to pi_ref (timid updates); small beta lets preferences move it hard. The released example commands use beta = 0.1 for dialogue; the paper's appendix lists beta = 0.5 for TL;DR summarization.

Footnotes & further reading

The paper: Rafailov, Sharma, Mitchell, Ermon, Manning, Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Stanford, NeurIPS 2023). Code.
The RLHF pipeline DPO collapses: Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT), and Stiennon et al., Learning to summarize from human feedback.
The pairwise-comparison model: Bradley & Terry, Rank Analysis of Incomplete Block Designs (1952).
The closed-form optimum of the KL-regularized reward objective (control as inference / the Gibbs variational principle): see e.g. Levine, Reinforcement Learning and Control as Probabilistic Inference.
The conservative-DPO label-smoothing variant and the IPO objective both live in the same preference_loss function; IPO is Azar et al., A General Theoretical Paradigm to Understand Learning from Human Preferences.
