Direct Preference Optimization: Your Language Model is Secretly a Reward Model
The reward model was hiding inside the policy.
RLHF fits a reward model, then runs reinforcement learning against it. DPO proves you can skip both. The same preference data trains the language model directly, with one classification loss, and the reward model falls out for free.
What if the reward model and the reinforcement learning were both unnecessary, and you could read the reward straight off the policy you were training anyway?
A pretrained language model knows a great deal and wants nothing. It will finish your sentence, but it has no preference for the helpful finish over the rude one, the safe one over the dangerous one. Teaching it those preferences is the alignment problem, and the standard answer, the one behind ChatGPT and InstructGPT, is RLHF: reinforcement learning from human feedback. It works. It is also a three-stage machine with a reputation for being finicky. You fine-tune a base model, then train a separate reward model to imitate human taste, then run a reinforcement-learning algorithm (usually PPO) that samples from the policy, scores each sample with the reward model, and nudges the policy uphill, over and over, while a leash keeps it from wandering too far from where it started.
Direct Preference Optimization (DPO, from Stanford) removes the middle and the end of that pipeline. No reward model. No reinforcement learning. No sampling from the model during training. You take the same pairs of responses a human ranked, and you train the language model on them with a loss that looks like ordinary classification. The claim in the title is the reason it works: the language model you are training is a reward model. The reward is not a separate network you fit and throw away. It is a simple function of the policy's own probabilities, and once you see that, the entire RLHF objective rearranges into something you can minimize with gradient descent and nothing else.
Getting there takes a short tower of ideas, and none of them is heavy. How a pairwise preference becomes a number (a reward). What RLHF is really maximizing. Why that objective has a known optimal solution in closed form. And the one algebraic move, inverting that solution, that turns the whole thing into a classification loss. We build them in order.
Aligning by trial and reward
Start with what you have: a model that can write, and a pile of human judgments. The judgments are comparisons. A human sees a prompt and two responses, picks the better one, and you record the winner $y_w$ and the loser $y_l$. Nobody assigns a score. People are bad at "rate this 7.3 out of 10" and good at "this one is better," so preference data is pairwise by design.
RLHF turns that pile into an aligned model in three stages. Stage one, supervised fine-tuning (SFT): train the base model on good example responses so it at least answers in the right format. Call the result $\pi_{\text{ref}}$, the reference policy. Stage two, fit a reward model $r_\phi(x, y)$, a separate network that reads a prompt and a response and outputs a scalar "how good," trained so that it scores the human-preferred response higher. Stage three, reinforcement learning: let the policy generate responses, score them with $r_\phi$, and push the policy toward higher-scoring output, with a penalty for drifting too far from $\pi_{\text{ref}}$.
That third stage is the expensive, temperamental one. It samples from the model thousands of times per step, keeps two large networks live (the policy and the reward model), and PPO has a thicket of knobs that have to be set right or training quietly diverges. DPO's whole pitch is to keep stage one, delete stages two and three, and replace them with a single supervised loss. The figure below is the before and after.
Preferences are a logistic curve
Before we can train anything we need to connect "a human picked this one" to a number. The standard bridge is the Bradley-Terry model, from 1952, the same idea behind chess Elo. Suppose every response has a hidden quality score, a reward $r(x, y)$. The model says the chance a human prefers $y_w$ over $y_l$ is set by the gap between their rewards, run through a softmax of two terms:
$$p(y_w \succ y_l \mid x) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)} = \sigma\big(r(x, y_w) - r(x, y_l)\big) \tag{1}$$
The two forms are the same thing. Divide top and bottom by $\exp\big(r(x, y_w)\big)$ and the right side collapses to the logistic sigmoid of the reward difference. Read it as a statement about confidence. If the two responses have equal reward, the gap is zero, $\sigma(0) = \tfrac{1}{2}$, and the human is a coin flip. As the winner pulls ahead, the probability of preferring it climbs an S-curve toward one but never quite gets there, because humans are noisy and sometimes pick the worse answer. Only the difference of rewards matters, which is a fact we will lean on hard later.
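The equivalence of the two forms is easy to check numerically. The sketch below is illustrative only: the function names and the reward values are made up for this example, not taken from the paper or its code.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_softmax(r_w: float, r_l: float) -> float:
    """Two-term softmax form: exp(r_w) / (exp(r_w) + exp(r_l))."""
    m = max(r_w, r_l)  # subtracting the max leaves the ratio unchanged
    e_w, e_l = math.exp(r_w - m), math.exp(r_l - m)
    return e_w / (e_w + e_l)

def bt_sigmoid(r_w: float, r_l: float) -> float:
    """Sigmoid form: sigma(r_w - r_l)."""
    return sigmoid(r_w - r_l)

# Hypothetical rewards, chosen only for illustration.
for r_w, r_l in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (101.0, 100.0)]:
    print(r_w, r_l, round(bt_softmax(r_w, r_l), 4), round(bt_sigmoid(r_w, r_l), 4))
```

The pairs (1.0, 0.0) and (101.0, 100.0) give the same probability, which is the "only the difference matters" property in action.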
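This gives you a way to fit a reward model. You have a dataset $\mathcal{D}$ of triples $(x, y_w, y_l)$. Pick the reward function $r_\phi$ that makes the observed preferences most likely. Under Bradley-Terry, the probability of each observed "$y_w$ beat $y_l$" is $\sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$, so maximizing the likelihood means minimizing its negative log:
$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big] \tag{2}$$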
That is RLHF's stage two in one line: a binary classifier that learns to score winners above losers. Hold onto its shape, the log-sigmoid of a reward difference, because the punchline of this whole post is that DPO's loss is this exact expression with the reward written a different way.
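As a rough sketch of that loss, here it is evaluated on a handful of already-computed scores. In a real pipeline the scores would come from a reward network. Here they are plain floats, and `reward_model_loss` is a name invented for this example.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)), computed as -softplus(-z)."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def reward_model_loss(score_pairs):
    """Mean of -log sigma(r_w - r_l) over (winner_score, loser_score) pairs."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in score_pairs)
    return -total / len(score_pairs)

# Hypothetical scores: the first batch ranks winners above losers, the second does not.
good_ranking = [(2.0, 0.0), (1.5, -0.5), (0.3, 0.1)]
bad_ranking = [(r_l, r_w) for r_w, r_l in good_ranking]

print(reward_model_loss(good_ranking), reward_model_loss(bad_ranking))
```

A scorer that cannot tell the two responses apart (equal scores) sits at a loss of log 2, the coin-flip baseline.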
What RLHF actually maximizes
Now stage three. With a reward model in hand, you want a policy that earns high reward. But naively maximizing reward is a trap. The reward model is an imperfect imitation of human taste, and a policy free to chase it will find the cracks: degenerate text that scores high and reads like nonsense. (Reward hacking. Goodhart's law with a GPU.) So RLHF adds a leash. Maximize reward, but stay close to the reference policy you started from, measured by KL divergence:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\big] \tag{3}$$
Unpack the two terms. The first rewards good output. The second is a penalty, the KL divergence, which measures how far the new policy's distribution has moved from the reference. The coefficient $\beta$ sets the strength of the leash. The KL term is what keeps the policy fluent and on-distribution while it shifts toward what humans prefer, and it is the reason DPO needs to keep $\pi_{\text{ref}}$ around at all.
This objective looks like it demands reinforcement learning, because $y$ is sampled from $\pi_\theta$ (the expectation is over the policy's own output, which moves as you train, which is exactly what makes it an RL problem). RLHF solves it with PPO. But this particular objective, reward minus a KL penalty, is special. It has a known optimal solution you can write down in one line.
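To get a feel for the trade-off, the objective can be evaluated exactly when a prompt has only a few possible responses. The toy below assumes three candidate responses with made-up rewards and reference probabilities. A real language model has far too many responses to enumerate like this.

```python
import math

def kl_regularized_objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for one prompt over a finite response set."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)
    return expected_reward - beta * kl

# Hypothetical toy problem: three candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 2.0  # a fairly strong leash

stay_put = kl_regularized_objective(pi_ref, pi_ref, rewards, beta)
greedy = kl_regularized_objective([0.0, 0.0, 1.0], pi_ref, rewards, beta)
moderate = kl_regularized_objective([0.2, 0.3, 0.5], pi_ref, rewards, beta)

print(stay_put, moderate, greedy)
```

With this leash strength the all-in greedy policy scores worse than not moving at all, while a moderate shift toward the high-reward response scores best of the three.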
The optimal policy has a formula
Forget gradient descent for a moment and just ask: of all possible policies, which one maximizes (3)? This is a textbook result (it shows up under names like "control as inference" and the Gibbs variational principle), and the answer is clean. The best policy reweights the reference by the exponential of the reward:
$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\Big(\frac{1}{\beta}\, r(x, y)\Big) \tag{4}$$
Read it as a tilt. Start with the reference distribution $\pi_{\text{ref}}(y \mid x)$. For each response, multiply its probability by $\exp\big(r(x, y)/\beta\big)$, which boosts high-reward responses and suppresses low-reward ones. Then divide by the partition function $Z(x)$, a per-prompt number that rescales everything back into a valid probability distribution that sums to one:
$$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\, \exp\!\Big(\frac{1}{\beta}\, r(x, y)\Big)$$
The role of $\beta$ is now vivid. When $\beta$ is large, $r/\beta$ is small, the exponential is near one, and the optimal policy barely budges from $\pi_{\text{ref}}$ (a strong leash). When $\beta$ is small, $r/\beta$ is huge, the exponential is sharp, and the optimal policy piles all its mass on the single highest-reward response (no leash, pure greed).
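The tilt can be reproduced on a small discrete example. This is a sketch under toy assumptions: three responses, invented rewards and reference probabilities, and a helper name (`optimal_policy`) that is not from the paper's code.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Closed-form optimum: pi_ref(y) * exp(r(y) / beta), renormalized to sum to one."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # shift exponents for stability; the shift cancels on normalizing
    weights = [q * math.exp(s - m) for q, s in zip(pi_ref, scaled)]
    z = sum(weights)
    return [w / z for w in weights]

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference probabilities
rewards = [0.0, 1.0, 2.0]  # hypothetical rewards

for beta in (100.0, 1.0, 0.05):
    print(beta, [round(p, 3) for p in optimal_policy(pi_ref, rewards, beta)])
```

At the largest value of beta the output stays close to the reference. At the smallest, nearly all of the mass lands on the highest-reward response.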
So why does anyone bother with PPO if the optimum is a formula? Because the formula is useless as written. To actually sample from $\pi_r$ you need $Z(x)$, and $Z(x)$ is a sum over every possible response to the prompt, an astronomically large set. Estimating it is, in the paper's words, expensive enough to make this representation hard to use in practice. The optimum is known but unreachable. RLHF gives up on the formula and grinds toward it with sampling. DPO does something cleverer.
The reward was in the policy all along
Here is the move the whole paper turns on. Equation (4) expresses the optimal policy in terms of the reward. Solve it the other way: express the reward in terms of the policy. Take the log of both sides of (4) and rearrange for $r(x, y)$:
$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \tag{5}$$
Stare at this. It says any reward function can be rewritten as $\beta$ times the log-ratio of its own optimal policy to the reference, plus a term that depends only on the prompt. The reward and the policy are two encodings of the same thing. If you know the optimal policy you know the reward, exactly, up to that offset.
And the offset does not matter. Look back at the Bradley-Terry model (1): preferences depend only on the difference of two rewards for the same prompt. Plug (5) into that difference. The $\beta \log Z(x)$ term depends on $x$ alone, so it is identical for $y_w$ and $y_l$, and it cancels:
$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi_r(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_r(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \tag{6}$$
The intractable partition function is gone. The thing that made (4) unusable was the one thing that did not survive the subtraction. What is left is a reward difference written entirely in quantities you can compute: a forward pass of the policy and a forward pass of the frozen reference. (This works for any reward in the Bradley-Terry equivalence class. Two rewards that differ by a prompt-only function induce the same preferences and the same optimal policy, and the log-ratio parameterization can represent every such class. That is Theorem 1 in the paper. The $\beta \log Z(x)$ offset is exactly one of those harmless shifts.)
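The cancellation can be checked on a toy response set small enough that the partition function is computable. Everything numeric below is invented for illustration. The point is only that the log-ratios recover each reward up to one shared offset, so reward differences come out exactly.

```python
import math

pi_ref = [0.5, 0.3, 0.2]     # hypothetical reference probabilities
rewards = [0.3, -1.2, 2.0]   # hypothetical "true" rewards
beta = 0.7

# Forward direction: build the optimal policy from the reward.
weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
Z = sum(weights)
pi_star = [w / Z for w in weights]

# Inverse direction: read a reward back off the policy, without using Z.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]

# Each implicit reward misses the true one by the same prompt-only offset.
offsets = [r - h for r, h in zip(rewards, implicit)]
true_gap = rewards[2] - rewards[0]
implicit_gap = implicit[2] - implicit[0]

print(offsets, beta * math.log(Z))
print(true_gap, implicit_gap)
```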
DPO: one classification loss
We are one substitution from the end. Recall the reward-model loss (2): a log-sigmoid of a reward difference, fit to the preference data. We just wrote that reward difference (6) purely in terms of the policy. So substitute. Instead of fitting a separate reward model and then optimizing a policy against it, parameterize the reward by the policy and fit it directly to the preferences. The reward-model loss becomes a loss on $\pi_\theta$:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big] \tag{7}$$
This is DPO. The entire method. It is the reward-model objective (2) with the reward replaced by $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$, the policy's own implicit reward. There is no reward network, no sampling, no RL loop. You have a fixed dataset of preference pairs, and you minimize (7) by plain gradient descent, the same way you would train any classifier. The quantity
$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$
is the implicit reward: the reward model defined by the language model and the reference, never trained on its own but emerging the instant you optimize (7). That is the secret reward model of the title. When DPO finishes, you have an aligned policy and, baked into it, a reward function consistent with the human preferences, readable off the log-ratios whenever you want it.
It helps to picture the loss as a landscape. Forget the prompt-only offset and track two numbers: how much the policy raises the winner above the reference, $a_w = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}$, and how much it raises the loser, $a_l = \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$. The loss only cares about the margin $a_w - a_l$. It is low in the upper-left, where the policy lifts the winner and pushes down the loser, and high in the lower-right, where it has them backwards.
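A few points on that landscape, as a sketch. The function name and the sample coordinates are made up. `a_w` and `a_l` stand for the policy-minus-reference log-probabilities of the winner and the loser.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def dpo_loss_from_lifts(a_w: float, a_l: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss, -log sigma(beta * (a_w - a_l))."""
    return -log_sigmoid(beta * (a_w - a_l))

upper_left = dpo_loss_from_lifts(a_w=5.0, a_l=-5.0)    # winner lifted, loser pushed down
diagonal = dpo_loss_from_lifts(a_w=3.0, a_l=3.0)       # both lifted equally, margin zero
lower_right = dpo_loss_from_lifts(a_w=-5.0, a_l=5.0)   # pair ranked backwards

print(upper_left, diagonal, lower_right)
```

Moving both coordinates by the same amount leaves the loss unchanged, so the landscape is constant along every diagonal where `a_w - a_l` is fixed.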
What the update does
The loss is tidy, but the gradient is where the intuition lives. Differentiate (7) and you get
$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\, \big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big]\Big]$$
Two pieces. The bracket on the right is the direction: increase the log-probability of the winner, decrease the log-probability of the loser. Every preference pair pushes the policy the same way, up on $y_w$, down on $y_l$. The scalar in front is the interesting part. It is $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$, the probability that the model's current implicit reward ranks the pair backwards. When the policy already scores the winner well above the loser, that sigmoid is near zero and the example is nearly ignored. When the policy has the pair wrong, scoring the loser higher, the weight approaches one and the example gets the full push.
This is the built-in safeguard that hand-tuned RLHF tricks try to bolt on. DPO spends its gradient on the examples it currently gets wrong and coasts past the ones it already has right, automatically, because the weight is the model's own error.
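The weighting is simple enough to tabulate. In this sketch the margin is the implicit reward of the winner minus that of the loser, and the margins listed are arbitrary sample values.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dpo_gradient_weight(reward_margin: float) -> float:
    """sigma(r_hat_l - r_hat_w) = sigma(-margin): the chance the pair is ranked backwards."""
    return sigmoid(-reward_margin)

# Negative margin: the policy currently prefers the loser. Positive: it prefers the winner.
for margin in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(margin, round(dpo_gradient_weight(margin), 3))
```

A badly misranked pair gets a weight close to one, an undecided pair gets one half, and a confidently correct pair gets a weight close to zero.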
What it looks like in practice
Let me make the abstractions concrete with real shapes. A response is a sequence of tokens, say 20 of them. The model gives a probability to each next token, and $\log \pi(y \mid x)$ is the sum of those per-token log-probabilities over the whole response. (The official code sums, it does not average, by default. That detail matters: longer responses accumulate more negative log-prob, so the raw log-probs you compare are sums.)
# log pi(y|x) is the SUM of per-token log-probs over the response (average_log_prob=False)
# shapes: logits is (batch, seq, vocab); labels and loss_mask are (batch, seq)
logp = logits.log_softmax(-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # per-token log-prob
logp = (logp * loss_mask).sum(-1)  # sum over the response tokens only
# so for a 20-token answer, log pi(y|x) is a sum of 20 negative numbers, e.g. about -34

So for each preference pair you run four forward passes: the policy and the reference, each on the winner and the loser, giving four scalar log-probs. The loss is built from those four numbers and nothing else.
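# trainers.py: the entire DPO loss (beta = 0.1 for dialogue)
import torch.nn.functional as F

pi_logratios = policy_chosen_logps - policy_rejected_logps  # log[pi_theta(y_w|x) / pi_theta(y_l|x)]
ref_logratios = reference_chosen_logps - reference_rejected_logps  # same ratio under pi_ref
logits = pi_logratios - ref_logratios  # = a_w - a_l, the implicit reward margin divided by beta
losses = -F.logsigmoid(beta * logits)  # eq (7): a binary classification loss
# the implicit rewards, for logging only (detached):
chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

Walk one example through it with $\beta = 0.1$. Suppose the reference assigns summed log-probs $\log \pi_{\text{ref}}(y_w \mid x)$ and $\log \pi_{\text{ref}}(y_l \mid x)$, with the second slightly larger (the reference slightly prefers the loser, which is why we are correcting it). After some training the policy has moved to $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$, with the winner now ahead. Then the policy log-ratio is $\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)$, the reference log-ratio is $\log \pi_{\text{ref}}(y_w \mid x) - \log \pi_{\text{ref}}(y_l \mid x)$, and logits is their difference, $a_w - a_l$, where $a_w$ and $a_l$ are the policy-minus-reference log-probs of the winner and the loser. With $\beta = 0.1$ the sigmoid argument is $0.1\,(a_w - a_l)$, so the loss is $-\log \sigma\big(0.1\,(a_w - a_l)\big)$. The implicit rewards are $0.1\,a_w$ and $0.1\,a_l$, so the policy now scores the winner above the loser by a margin of $0.1\,(a_w - a_l)$. The gradient weight on this example, $\sigma\big(-0.1\,(a_w - a_l)\big)$, is already shrinking: the model mostly has this pair right and will spend less on it next time.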
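To put numbers on one pass through the loss, here is a plain-Python version with a single pair. The four log-probabilities are hypothetical values picked for this sketch (the reference slightly prefers the loser, the policy prefers the winner). They are not figures from the paper or its code.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

beta = 0.1

# Hypothetical summed log-probs for one preference pair, for illustration only.
reference_chosen_logp = -34.0
reference_rejected_logp = -33.0
policy_chosen_logp = -32.0
policy_rejected_logp = -36.0

pi_logratio = policy_chosen_logp - policy_rejected_logp          # 4.0
ref_logratio = reference_chosen_logp - reference_rejected_logp   # -1.0
logits = pi_logratio - ref_logratio                              # 5.0
loss = -log_sigmoid(beta * logits)

# Implicit rewards and the weight this pair would get in the gradient.
chosen_reward = beta * (policy_chosen_logp - reference_chosen_logp)
rejected_reward = beta * (policy_rejected_logp - reference_rejected_logp)
reward_margin = chosen_reward - rejected_reward
gradient_weight = math.exp(log_sigmoid(-reward_margin))

print(loss, chosen_reward, rejected_reward, reward_margin, gradient_weight)
```

With these inputs the loss comes out near 0.47 and the gradient weight near 0.38: below the coin-flip values of log 2 and one half, because the policy already ranks this pair correctly.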
Two practical notes the code pins down. First, where does $\pi_{\text{ref}}$ come from? It is the SFT checkpoint, kept frozen. If you do not have an SFT model for your data, the paper's fallback is to make one by fine-tuning the base model on the preferred responses alone, $\pi_{\text{ref}} = \arg\max_\pi \mathbb{E}_{(x, y_w) \sim \mathcal{D}}\big[\log \pi(y_w \mid x)\big]$, so the reference is at least on-distribution. Second, $\beta$: the released config has no default and the example commands use $\beta = 0.1$ for dialogue and summarization. The paper did not meaningfully tune it, which is part of the "no significant hyperparameter tuning" claim.
The same file ships two optional variants worth knowing. Setting label_smoothing to a small $\epsilon$ gives conservative DPO, which assumes a fraction $\epsilon$ of the human labels are flipped and softens the loss so the model never tries to drive any margin to infinity. Setting reference_free drops $\pi_{\text{ref}}$ entirely (treats it as uniform), which is simpler but loses the leash. The default is plain DPO with neither.
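A rough single-pair sketch of how those two switches could change the loss. `preference_loss_sketch` is a name invented here, and the label-smoothed form used (a mix of the loss for the recorded label and the loss for the flipped label) is an assumption about how conservative DPO is written. Check the released file for the exact expression.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def preference_loss_sketch(policy_chosen_logp, policy_rejected_logp,
                           ref_chosen_logp, ref_rejected_logp,
                           beta, label_smoothing=0.0, reference_free=False):
    """Single-pair DPO loss with optional label smoothing and reference-free mode."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    # Reference-free: a uniform reference contributes a log-ratio of zero.
    ref_logratio = 0.0 if reference_free else ref_chosen_logp - ref_rejected_logp
    logits = pi_logratio - ref_logratio
    # Assumed conservative-DPO form: weight the flipped-label loss by label_smoothing.
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * logits)
            - label_smoothing * log_sigmoid(-beta * logits))

# Hypothetical pair with a very large margin in favor of the winner.
plain = preference_loss_sketch(0.0, -200.0, 0.0, 0.0, beta=0.1)
smoothed = preference_loss_sketch(0.0, -200.0, 0.0, 0.0, beta=0.1, label_smoothing=0.1)
no_ref = preference_loss_sketch(-30.0, -40.0, -10.0, -5.0, beta=0.1, reference_free=True)

print(plain, smoothed, no_ref)
```

Plain DPO is happy to push the margin ever larger, since its loss keeps shrinking toward zero. Under this smoothed form the loss turns back up at very large margins, which is what stops the model from chasing infinity on labels that might be wrong.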
So what does it actually do
It matches or beats RLHF while being far simpler to run. On controlling the sentiment of generated text, where you can measure the true reward-versus-KL frontier exactly, DPO sits on the best frontier at every KL budget, ahead of PPO: more reward for the same drift from the reference. On TL;DR summarization, judged by GPT-4 against human reference summaries, DPO wins about 61% of the time at temperature 0, against PPO's 57%. On Anthropic's single-turn dialogue, DPO is the one efficient method that improves on the preferred completions in the data. And DPO holds up across sampling temperatures where PPO degrades, which matters because a brittle-to-temperature method is a brittle method.
The reasons it is easier to live with are structural. There is no reward model to train, store, or overfit. There is no sampling from the policy during training, so a training step is a forward and backward pass on fixed data, like any supervised job. There is no PPO, so the long list of RL knobs is gone; the main dial is $\beta$. The cost is four forward passes per pair instead of one, and you keep a frozen reference model in memory during training. At inference you ship one ordinary language model.
The limits are honest ones. DPO trains on a fixed set of preferences (it is off-policy), so it never sees the responses the improving model would actually generate, the way online PPO does. That can matter when the preference data is far from what the tuned model produces, and it has motivated online and iterative variants that refresh the pairs as the policy moves. DPO inherits Bradley-Terry's assumptions, including that a single scalar reward explains the comparisons. And like all of these methods it is only as aligned as the preferences it is fed; garbage preferences in, confidently-aligned garbage out. None of that has slowed it down. DPO is now a default first move for preference tuning, precisely because the reward model it removed was never a separate thing. It was a way of reading the policy. We just stopped fitting it twice.
Questions you might still have
If there is no separate reward model, what plays its role?
The policy itself. The quantity beta times log(pi_theta/pi_ref) is an implicit reward: train the policy with the DPO loss and that quantity becomes a reward model consistent with the preference data, for free. That is the title.
Why does the intractable partition function Z(x) not matter?
Preferences depend only on reward differences, and Z(x) depends only on the prompt x, so it is identical for the winner and the loser and cancels in the subtraction. DPO never has to compute it.
Does DPO still need a reference model at run time?
Only during training, to compute the log-ratio. You keep a frozen copy of the SFT model as pi_ref. At inference you ship only pi_theta, an ordinary language model with no extra machinery.
What is beta doing?
It is the same KL temperature as in RLHF. Large beta keeps the policy close to pi_ref (timid updates); small beta lets preferences move it hard. The released code uses beta = 0.1 for dialogue and summarization.
Footnotes & further reading
- The paper: Rafailov, Sharma, Mitchell, Ermon, Manning, Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Stanford, NeurIPS 2023). Code.
- The RLHF pipeline DPO collapses: Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT), and Stiennon et al., Learning to summarize from human feedback.
- The pairwise-comparison model: Bradley & Terry, Rank Analysis of Incomplete Block Designs (1952).
- The closed-form optimum of the KL-regularized reward objective (control as inference / the Gibbs variational principle): see e.g. Levine, Reinforcement Learning and Control as Probabilistic Inference.
- The conservative-DPO label-smoothing variant and the IPO objective both live in the same preference_loss function; IPO is Azar et al., A General Theoretical Paradigm to Understand Learning from Human Preferences.