Verified · arXiv:2305.18290 · 24 min
Alignment · LLMs

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

The reward model was hiding inside the policy.

RLHF fits a reward model, then runs reinforcement learning against it. DPO proves you can skip both. The same preference data trains the language model directly, with one classification loss, and the reward model falls out for free.

Explaining the paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model · Rafailov, Sharma, Mitchell, Ermon, Manning, Finn · Stanford · NeurIPS 2023 · arXiv:2305.18290

What if the reward model and the reinforcement learning were both unnecessary, and you could read the reward straight off the policy you were training anyway?

A pretrained language model knows a great deal and wants nothing. It will finish your sentence, but it has no preference for the helpful finish over the rude one, the safe one over the dangerous one. Teaching it those preferences is the alignment problem, and the standard answer, the one behind ChatGPT and InstructGPT, is RLHF: reinforcement learning from human feedback. It works. It is also a three-stage machine with a reputation for being finicky. You fine-tune a base model, then train a separate reward model to imitate human taste, then run a reinforcement-learning algorithm (usually PPO) that samples from the policy, scores each sample with the reward model, and nudges the policy uphill, over and over, while a leash keeps it from wandering too far from where it started.

Direct Preference Optimization (DPO, from Stanford) removes the middle and the end of that pipeline. No reward model. No reinforcement learning. No sampling from the model during training. You take the same pairs of responses a human ranked, and you train the language model on them with a loss that looks like ordinary classification. The claim in the title is the reason it works: the language model you are training is a reward model. The reward is not a separate network you fit and throw away. It is a simple function of the policy's own probabilities, and once you see that, the entire RLHF objective rearranges into something you can minimize with gradient descent and nothing else.

Getting there takes a short tower of ideas, and none of them is heavy. How a pairwise preference becomes a number (a reward). What RLHF is really maximizing. Why that objective has a known optimal solution in closed form. And the one algebraic move, inverting that solution, that turns the whole thing into a classification loss. We build them in order.

Aligning by trial and reward

Start with what you have: a model that can write, and a pile of human judgments. The judgments are comparisons. A human sees a prompt $x$ and two responses, picks the better one, and you record the winner $y_w$ and the loser $y_l$. Nobody assigns a score. People are bad at "rate this 7.3 out of 10" and good at "this one is better," so preference data is pairwise by design.

RLHF turns that pile into an aligned model in three stages. Stage one, supervised fine-tuning (SFT): train the base model on good example responses so it at least answers in the right format. Call the result $\pi_{\text{ref}}$, the reference policy. Stage two, fit a reward model $r_\phi(x,y)$, a separate network that reads a prompt and a response and outputs a scalar "how good," trained so that it scores the human-preferred response higher. Stage three, reinforcement learning: let the policy generate responses, score them with $r_\phi$, and push the policy toward higher-scoring output, with a penalty for drifting too far from $\pi_{\text{ref}}$.

That third stage is the expensive, temperamental one. It samples from the model thousands of times per step, keeps two large networks live (the policy and the reward model), and PPO has a thicket of knobs that have to be set right or training quietly diverges. DPO's whole pitch is to keep stage one, delete stages two and three, and replace them with a single supervised loss. The figure below is the before and after.

Figure 1 · two pipelines
RLHF is three stages with a sampling loop: SFT, then a separate reward model, then PPO whose policy is sampled and re-scored every step. DPO is one stage: the SFT model becomes a frozen reference and a single classification loss over the fixed preference pairs produces the aligned policy. No reward model, no sampling.

Preferences are a logistic curve

Before we can train anything we need to connect "a human picked this one" to a number. The standard bridge is the Bradley-Terry model, from 1952, the same idea behind chess Elo. Suppose every response has a hidden quality score, a reward $r^*(x,y)$. The model says the chance a human prefers $y_1$ over $y_2$ is set by the gap between their rewards, run through a softmax of two terms:

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x,y_1)\big)}{\exp\big(r^*(x,y_1)\big) + \exp\big(r^*(x,y_2)\big)} = \sigma\big(r^*(x,y_1) - r^*(x,y_2)\big) \tag{1}$$

The two forms are the same thing. Divide top and bottom of the fraction by $\exp(r^*(x,y_1))$ and it collapses to the logistic sigmoid $\sigma(z) = 1/(1+e^{-z})$ of the reward difference. Read it as a statement about confidence. If the two responses have equal reward, the gap is zero, $\sigma(0) = 0.5$, and the human is a coin flip. As the winner pulls ahead, the probability of preferring it climbs an S-curve toward one but never quite gets there, because humans are noisy and sometimes pick the worse answer. Only the difference of rewards matters, which is a fact we will lean on hard later.
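To check numerically that the softmax form and the sigmoid form in (1) agree, here is a minimal sketch in plain Python. The function names and the example rewards are illustrative assumptions, not taken from the paper.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def bt_softmax(r1: float, r2: float) -> float:
    """Bradley-Terry preference probability written as a two-term softmax."""
    e1, e2 = math.exp(r1), math.exp(r2)
    return e1 / (e1 + e2)

def bt_sigmoid(r1: float, r2: float) -> float:
    """The same probability written as a sigmoid of the reward gap."""
    return sigmoid(r1 - r2)

# Hypothetical rewards with a gap of 1.2.
p_softmax = bt_softmax(2.0, 0.8)
p_sigmoid = bt_sigmoid(2.0, 0.8)
# Shifting both rewards by the same amount leaves the preference unchanged.
p_shifted = bt_sigmoid(102.0, 100.8)
print(round(p_softmax, 4), round(p_sigmoid, 4), round(p_shifted, 4))
```

All three printed values should coincide, which is the "only the difference matters" property in action.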

Figure 2 · the Bradley-Terry model
The chance a human prefers the winner is σ of the reward gap. At gap zero it is a coin flip; a positive gap bends the preference toward the winner along the logistic curve. Only the difference of rewards enters, never the absolute level.

This gives you a way to fit a reward model. You have a dataset $\mathcal{D}$ of triples $(x, y_w, y_l)$. Pick the reward function that makes the observed preferences most likely. Under Bradley-Terry, the probability of each observed "$y_w$ beat $y_l$" is $\sigma(r(x,y_w) - r(x,y_l))$, so maximizing the likelihood means minimizing its negative log:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)\Big] \tag{2}$$

That is RLHF's stage two in one line: a binary classifier that learns to score winners above losers. Hold onto its shape, the log-sigmoid of a reward difference, because the punchline of this whole post is that DPO's loss is this exact expression with the reward written a different way.
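As a rough illustration of loss (2), the sketch below evaluates it on a handful of made-up reward scores. It assumes the reward model has already produced a scalar for each response; the helper names and numbers are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(pairs):
    """Mean of -log sigma(r_w - r_l) over (r_w, r_l) score pairs, as in eq. (2)."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical scores a reward model assigns to (winner, loser) responses.
good_model = [(2.0, -1.0), (0.5, -0.5), (1.5, 0.0)]
# The same scores with every pair ranked backwards.
bad_model = [(r_l, r_w) for r_w, r_l in good_model]

loss_good = reward_model_loss(good_model)
loss_bad = reward_model_loss(bad_model)
print(round(loss_good, 4), round(loss_bad, 4))
```

A model that scores winners above losers gets the lower loss; a model that cannot tell them apart (gap zero on every pair) sits at log 2.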

What RLHF actually maximizes

Now stage three. With a reward model in hand, you want a policy $\pi_\theta$ that earns high reward. But naively maximizing reward is a trap. The reward model is an imperfect imitation of human taste, and a policy free to chase it will find the cracks: degenerate text that scores high and reads like nonsense. (Reward hacking. Goodhart's law with a GPU.) So RLHF adds a leash. Maximize reward, but stay close to the reference policy $\pi_{\text{ref}}$ you started from, measured by KL divergence:

$$\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y\mid x)\,\|\,\pi_{\text{ref}}(y\mid x)\big] \tag{3}$$

Unpack the two terms. The first rewards good output. The second is a penalty, the KL divergence, which measures how far the new policy's distribution has moved from the reference. The coefficient $\beta$ sets the strength of the leash. The KL term is what keeps the policy fluent and on-distribution while it shifts toward what humans prefer, and it is the reason DPO needs to keep $\pi_{\text{ref}}$ around at all.
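To make the trade-off concrete, here is a toy sketch of objective (3) for a single prompt with three candidate responses. The distributions, rewards, and function names are invented for illustration; a real policy ranges over an astronomically larger response space.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward minus beta * KL(policy || reference), for one prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Hypothetical toy prompt with three candidate responses.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 3.0]
greedy = [0.01, 0.01, 0.98]   # chases the top reward, far from the reference
modest = [0.3, 0.3, 0.4]      # shifts toward reward, stays close

for beta in (0.1, 5.0):
    print(beta,
          round(rlhf_objective(greedy, reference, rewards, beta), 3),
          round(rlhf_objective(modest, reference, rewards, beta), 3))
```

With a weak leash (small beta) the greedy policy scores higher; with a strong leash the KL penalty dominates and the modest policy wins.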

This objective looks like it demands reinforcement learning, because $y$ is sampled from $\pi_\theta$ (the expectation is over the policy's own output, which moves as you train, which is exactly what makes it an RL problem). RLHF solves it with PPO. But this particular objective, reward minus a KL penalty, is special. It has a known optimal solution you can write down in one line.

The optimal policy has a formula

Forget gradient descent for a moment and just ask: of all possible policies, which one maximizes (3)? This is a textbook result (it shows up under names like "control as inference" and the Gibbs variational principle), and the answer is clean. The best policy reweights the reference by the exponential of the reward:

$$\pi_r(y\mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big) \tag{4}$$

Read it as a tilt. Start with the reference distribution $\pi_{\text{ref}}$. For each response, multiply its probability by $\exp(r/\beta)$, which boosts high-reward responses and suppresses low-reward ones. Then divide by the partition function $Z(x)$, a per-prompt number that rescales everything back into a valid probability distribution that sums to one:

$$Z(x) = \sum_y \pi_{\text{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)$$

The role of $\beta$ is now vivid. When $\beta$ is large, $r/\beta$ is small, the exponential is near one, and the optimal policy barely budges from $\pi_{\text{ref}}$ (a strong leash). When $\beta$ is small, $r/\beta$ is huge, the exponential is sharp, and the optimal policy piles all its mass on the single highest-reward response (no leash, pure greed). Figure 3 shows the reference distribution tilting as $\beta$ changes.

Figure 3 · the KL-constrained optimum
The optimal policy is the reference distribution reweighted by exp(r/β). Each candidate response keeps its reference probability times the exponential of its reward, then the whole thing is renormalized. Large β stays near the reference; small β collapses onto the highest-reward response. This is the exact optimum of objective (3), no RL required to state it.
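The tilt in (4) is easy to compute when the response space is tiny. The sketch below does so for three hypothetical candidate responses; the function name, distribution, and rewards are assumptions chosen only to show the two limits of β.

```python
import math

def optimal_policy(reference, rewards, beta):
    """Eq. (4): tilt the reference by exp(r / beta), then renormalize by Z."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(reference, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

reference = [0.5, 0.3, 0.2]   # hypothetical pi_ref over three candidate responses
rewards = [0.0, 1.0, 3.0]     # hypothetical rewards for those responses

strong_leash = optimal_policy(reference, rewards, beta=100.0)
weak_leash = optimal_policy(reference, rewards, beta=0.05)
print([round(p, 3) for p in strong_leash])   # close to the reference
print([round(p, 3) for p in weak_leash])     # nearly all mass on the best response
```

This only works because the sum defining Z has three terms here. For real language models the same sum runs over every possible response.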

So why does anyone bother with PPO if the optimum is a formula? Because the formula is useless as written. To actually sample from $\pi_r$ you need $Z(x)$, and $Z(x)$ is a sum over every possible response to the prompt, an astronomically large set. Estimating it is, in the paper's words, expensive enough to make this representation hard to use in practice. The optimum is known but unreachable. RLHF gives up on the formula and grinds toward it with sampling. DPO does something cleverer.

The reward was in the policy all along

Here is the move the whole paper turns on. Equation (4) expresses the optimal policy in terms of the reward. Solve it the other way: express the reward in terms of the policy. Take the log of both sides of (4) and rearrange for $r$:

$$r(x,y) = \beta\,\log\frac{\pi_r(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\,\log Z(x) \tag{5}$$

Stare at this. It says any reward function can be rewritten as $\beta$ times the log-ratio of its own optimal policy to the reference, plus a term that depends only on the prompt. The reward and the policy are two encodings of the same thing. If you know the optimal policy you know the reward, exactly, up to that $\beta\log Z(x)$ offset.

And the offset does not matter. Look back at the Bradley-Terry model (1): preferences depend only on the difference of two rewards for the same prompt. Plug (5) into that difference. The term $\beta\log Z(x)$ depends on $x$ alone, so it is identical for $y_w$ and $y_l$, and it cancels:

$$r(x,y_w) - r(x,y_l) = \beta\,\log\frac{\pi_r(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\,\log\frac{\pi_r(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \tag{6}$$

The intractable partition function is gone. The thing that made (4) unusable was the one thing that did not survive the subtraction. What is left is a reward difference written entirely in quantities you can compute: a forward pass of the policy and a forward pass of the frozen reference. (This works for any reward in the Bradley-Terry equivalence class. Two rewards that differ by a prompt-only function $f(x)$ induce the same preferences and the same optimal policy, and the log-ratio parameterization can represent every such class. That is Theorem 1 in the paper. The $\beta\log Z(x)$ offset is exactly one of those harmless shifts.)
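The cancellation can be checked on a toy problem small enough to compute Z exactly. In this sketch the three-response distribution, the rewards, and the helper name are all hypothetical; the point is only that the log-ratio recovers each reward up to one shared constant.

```python
import math

beta = 0.5
reference = [0.5, 0.3, 0.2]   # hypothetical pi_ref over three responses
rewards = [0.0, 1.0, 3.0]     # hypothetical true rewards

# Exact partition function and optimal policy from eq. (4).
z = sum(p * math.exp(r / beta) for p, r in zip(reference, rewards))
optimal = [p * math.exp(r / beta) / z for p, r in zip(reference, rewards)]

def implicit_reward(i: int) -> float:
    """beta * log(pi_r / pi_ref) for candidate i: eq. (5) without the Z term."""
    return beta * math.log(optimal[i] / reference[i])

# Each implicit reward is off by the same prompt-only constant, beta * log Z ...
offsets = [rewards[i] - implicit_reward(i) for i in range(3)]
# ... so the difference between any two candidates matches the true reward gap.
gap_true = rewards[2] - rewards[1]
gap_implicit = implicit_reward(2) - implicit_reward(1)
print([round(o, 6) for o in offsets], round(beta * math.log(z), 6))
print(round(gap_true, 6), round(gap_implicit, 6))
```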

DPO: one classification loss

We are one substitution from the end. Recall the reward-model loss (2): a log-sigmoid of a reward difference, fit to the preference data. We just wrote that reward difference (6) purely in terms of the policy. So substitute. Instead of fitting a separate reward model and then optimizing a policy against it, parameterize the reward by the policy and fit it directly to the preferences. The reward-model loss becomes a loss on $\pi_\theta$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta\,\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\,\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right] \tag{7}$$

This is DPO. The entire method. It is the reward-model objective (2) with the reward replaced by $\beta\log(\pi_\theta/\pi_{\text{ref}})$, the policy's own implicit reward. There is no reward network, no sampling, no RL loop. You have a fixed dataset of preference pairs, and you minimize (7) by plain gradient descent, the same way you would train any classifier. The quantity

$$\hat{r}_\theta(x,y) = \beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \tag{8}$$

is the implicit reward: the reward model defined by the language model and the reference, never trained on its own but emerging the instant you optimize (7). That is the secret reward model of the title. When DPO finishes, you have an aligned policy and, baked into it, a reward function consistent with the human preferences, readable off the log-ratios whenever you want it.

It helps to picture the loss as a landscape. Forget the prompt-only offset and track two numbers: how much the policy raises the winner above the reference, $a_w = \log(\pi_\theta(y_w)/\pi_{\text{ref}}(y_w))$, and how much it raises the loser, $a_l = \log(\pi_\theta(y_l)/\pi_{\text{ref}}(y_l))$. The loss only cares about the margin $a_w - a_l$. It is low in the upper-left, where the policy lifts the winner and pushes down the loser, and high in the lower-right, where it has them backwards.

Figure 4 · the DPO loss landscape
The two axes are how far the policy lifts the winner and the loser above the reference. The loss depends only on the margin between them, so it falls toward the upper-left. The gradient arrow always points up and left, raise the winner and lower the loser, and it shrinks once the winner is safely ahead.

What the update does

The loss is tidy, but the gradient is where the intuition lives. Differentiate (7) and you get

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\,\beta\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x,y_l) - \hat{r}_\theta(x,y_w)\big)\,\big[\,\nabla_\theta \log\pi_\theta(y_w\mid x) - \nabla_\theta \log\pi_\theta(y_l\mid x)\,\big]\Big] \tag{9}$$

Two pieces. The bracket on the right is the direction: increase the log-probability of the winner, decrease the log-probability of the loser. Every preference pair pushes the policy the same way, up on $y_w$, down on $y_l$. The scalar in front is the interesting part. It is $\sigma(\hat{r}_\theta(x,y_l) - \hat{r}_\theta(x,y_w))$, the probability that the model's current implicit reward ranks the pair backwards. When the policy already scores the winner well above the loser, that sigmoid is near zero and the example is nearly ignored. When the policy has the pair wrong, scoring the loser higher, the weight approaches one and the example gets the full push.

This is the built-in safeguard that hand-tuned RLHF tricks try to bolt on. DPO spends its gradient on the examples it currently gets wrong and coasts past the ones it already has right, automatically, because the weight is the model's own error. Figure 5 plots the weight and the resulting push against the margin.

Figure 5 · the gradient weight
The update weight is σ(r̂_l − r̂_w), the chance the current model ranks the pair backwards. Examples the model already gets right (positive margin) are nearly ignored; examples it gets wrong get the full push. The right panel shows that push, scaled by the weight: raise the winner, lower the loser.
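One way to convince yourself of the weight in (9) is a finite-difference check on a single pair, treating the two log-ratios $a_w$ and $a_l$ as free variables. This is a sketch under that simplification (no network, no parameters); the function names and the example values are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(a_w: float, a_l: float, beta: float) -> float:
    """Per-pair DPO loss as a function of the two log-ratios a_w and a_l."""
    return -math.log(sigmoid(beta * (a_w - a_l)))

def gradient_weight(a_w: float, a_l: float, beta: float) -> float:
    """sigma(r_hat_l - r_hat_w): chance the implicit reward ranks the pair backwards."""
    return sigmoid(beta * (a_l - a_w))

beta, a_w, a_l, eps = 0.1, 4.0, -4.0, 1e-6

# Central finite-difference slope of the loss with respect to the winner's log-ratio.
numeric = (dpo_loss(a_w + eps, a_l, beta) - dpo_loss(a_w - eps, a_l, beta)) / (2 * eps)
# Analytic slope implied by eq. (9): minus beta times the weight.
analytic = -beta * gradient_weight(a_w, a_l, beta)
print(round(numeric, 6), round(analytic, 6))

# The weight fades for pairs already ranked correctly and saturates for wrong ones.
print(round(gradient_weight(50.0, -50.0, beta), 5), round(gradient_weight(-50.0, 50.0, beta), 5))
```

The slope with respect to the loser's log-ratio has the same magnitude and the opposite sign, which is the "raise the winner, lower the loser" direction.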

What it looks like in practice

Let me make the abstractions concrete with real shapes. A response $y$ is a sequence of tokens, say 20 of them. The model gives a probability to each next token, and $\log\pi_\theta(y\mid x)$ is the sum of those per-token log-probabilities over the whole response. (The official code sums, it does not average, by default. That detail matters: longer responses accumulate more negative log-prob, so the raw log-probs you compare are sums.)

# log pi(y|x) is the SUM of per-token log-probs over the response (average_log_prob=False)
# logits: (batch, seq, vocab); labels and loss_mask: (batch, seq), already aligned with logits
logp = logits.log_softmax(-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # per-token log-prob
logp = (logp * loss_mask).sum(-1)                              # sum over the response tokens
# so for a 20-token answer, log pi(y|x) is a sum of 20 negative numbers, e.g. about -34

So for each preference pair you run four forward passes: the policy and the reference, each on the winner and the loser, giving four scalar log-probs. The loss is built from those four numbers and nothing else.
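To see what summing rather than averaging does to the numbers being compared, here is a plain-Python sketch with no tensors. The helper name `sequence_logprob` and the per-token probabilities are made up for illustration; the official code does the equivalent on batched tensors.

```python
import math

def sequence_logprob(token_probs, loss_mask, average=False):
    """Log-prob of a response from per-token probabilities.

    token_probs: probability the model gave to each actual next token.
    loss_mask:   1 for response tokens, 0 for prompt tokens to be ignored.
    """
    kept = [math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    total = sum(kept)
    return total / len(kept) if average else total

# Hypothetical sequence: 3 prompt tokens (masked out) + 20 response tokens,
# each response token given probability exp(-1.7), about 0.18.
probs = [0.9] * 3 + [math.exp(-1.7)] * 20
mask = [0] * 3 + [1] * 20

summed = sequence_logprob(probs, mask)                  # about -34
averaged = sequence_logprob(probs, mask, average=True)  # about -1.7
print(round(summed, 2), round(averaged, 2))
```

The summed value scales with response length while the average does not, so the two conventions give different margins when the winner and loser differ in length.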

# trainers.py, simplified: the entire DPO loss (beta = 0.1 for dialogue)
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    reference_chosen_logps, reference_rejected_logps, beta):
    pi_logratios  = policy_chosen_logps    - policy_rejected_logps     # log[pi_theta(y_w) / pi_theta(y_l)]
    ref_logratios = reference_chosen_logps - reference_rejected_logps  # same log-ratio under pi_ref
    logits = pi_logratios - ref_logratios        # = a_w - a_l, the implicit reward margin divided by beta
    losses = -F.logsigmoid(beta * logits)        # eq (7): a binary classification loss
    # the implicit rewards, for logging only (detached):
    chosen_rewards   = beta * (policy_chosen_logps   - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
    return losses, chosen_rewards, rejected_rewards

Walk one example through it with $\beta = 0.1$. Suppose the reference assigns summed log-probs $\log\pi_{\text{ref}}(y_w) = -34$ and $\log\pi_{\text{ref}}(y_l) = -31$ (the reference slightly prefers the loser, which is why we are correcting it). After some training the policy has moved to $\log\pi_\theta(y_w) = -30$ and $\log\pi_\theta(y_l) = -35$. Then the policy log-ratio is $-30 - (-35) = 5$, the reference log-ratio is $-34 - (-31) = -3$, and $\text{logits} = 5 - (-3) = 8$. With $\beta = 0.1$ that is $\beta \cdot \text{logits} = 0.8$, so the loss is $-\log\sigma(0.8) \approx 0.37$. The implicit rewards are $\hat{r}_w = 0.1(-30-(-34)) = 0.4$ and $\hat{r}_l = 0.1(-35-(-31)) = -0.4$, so the policy now scores the winner above the loser by a margin of $0.8$. The gradient weight on this example, $\sigma(\hat{r}_l - \hat{r}_w) = \sigma(-0.8) \approx 0.31$, is already shrinking: the model mostly has this pair right and will spend less on it next time.
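The arithmetic of that walkthrough can be reproduced in a few lines of plain Python. This is only a scalar re-computation of the four log-probs given in the text, not the batched tensor code; the variable names are chosen here for readability.

```python
import math

beta = 0.1
ref_w, ref_l = -34.0, -31.0   # summed log-probs under the frozen reference
pol_w, pol_l = -30.0, -35.0   # summed log-probs under the current policy

pi_logratio = pol_w - pol_l            # 5
ref_logratio = ref_w - ref_l           # -3
logits = pi_logratio - ref_logratio    # 8

# Loss: -log sigma(beta * logits), written as log(1 + exp(-z)).
loss = math.log1p(math.exp(-beta * logits))

# Implicit rewards of eq. (8) for the winner and the loser.
reward_w = beta * (pol_w - ref_w)
reward_l = beta * (pol_l - ref_l)

# Gradient weight sigma(reward_l - reward_w) = 1 / (1 + exp(reward_w - reward_l)).
weight = 1.0 / (1.0 + math.exp(reward_w - reward_l))

print(round(loss, 2), round(reward_w, 2), round(reward_l, 2), round(weight, 2))
# prints: 0.37 0.4 -0.4 0.31
```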

Two practical notes the code pins down. First, where does $\pi_{\text{ref}}$ come from? It is the SFT checkpoint, kept frozen. If you do not have an SFT model for your data, the paper's fallback is to make one by fine-tuning the base model on the preferred responses alone, $\pi_{\text{ref}} = \arg\max_\pi \mathbb{E}[\log\pi(y_w\mid x)]$, so the reference is at least on-distribution. Second, $\beta$: the released config has no default and the README's example commands use $\beta = 0.1$ for dialogue; the paper's appendix reports $\beta = 0.1$ unless noted otherwise and $\beta = 0.5$ for TL;DR summarization. The paper did not meaningfully tune it, which is part of the "no significant hyperparameter tuning" claim.

The same file ships two optional variants worth knowing. Setting label_smoothing to a small $\varepsilon$ gives conservative DPO, which assumes a fraction $\varepsilon$ of the human labels are flipped and softens the loss so the model never tries to drive any margin to infinity. Setting reference_free drops $\pi_{\text{ref}}$ entirely (treats it as uniform), which is simpler but loses the leash. The default is plain DPO with neither.
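To illustrate why label smoothing stops the margin from running off to infinity, here is a scalar sketch of the usual label-smoothed form of the loss: a mix of the loss for the recorded label and the loss for the flipped label. The function name and the margins tried are assumptions for illustration.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def conservative_dpo_loss(margin: float, beta: float, label_smoothing: float) -> float:
    """DPO loss on one pair when a fraction `label_smoothing` of labels may be flipped.

    `margin` is a_w - a_l, the policy-minus-reference log-ratio difference.
    With label_smoothing = 0 this reduces to the plain DPO loss of eq. (7).
    """
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * margin)
            - label_smoothing * log_sigmoid(-beta * margin))

beta = 0.1
for margin in (8.0, 22.0, 100.0, 1000.0):
    plain = conservative_dpo_loss(margin, beta, 0.0)
    smoothed = conservative_dpo_loss(margin, beta, 0.1)
    print(margin, round(plain, 4), round(smoothed, 4))
```

The plain loss keeps falling as the margin grows, so the optimizer is always rewarded for pushing further. The smoothed loss bottoms out at a finite margin and then rises again, which is what keeps the update conservative.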

So what does it actually do

It matches or beats RLHF while being far simpler to run. On controlling the sentiment of generated text, where you can measure the true reward-versus-KL frontier exactly, DPO sits on the best frontier at every KL budget, ahead of PPO: more reward for the same drift from the reference. On TL;DR summarization, judged by GPT-4 against human reference summaries, DPO wins about 61% of the time at temperature 0, against PPO's 57%. On Anthropic's single-turn dialogue, DPO is the one efficient method that improves on the preferred completions in the data. And DPO holds up across sampling temperatures where PPO degrades, which matters because a brittle-to-temperature method is a brittle method.

The reasons it is easier to live with are structural. There is no reward model to train, store, or overfit. There is no sampling from the policy during training, so a training step is a forward and backward pass on fixed data, like any supervised job. There is no PPO, so the long list of RL knobs is gone; the main dial is $\beta$. The cost is four forward passes per pair instead of one, and you keep a frozen reference model in memory during training. At inference you ship one ordinary language model.

The limits are honest ones. DPO trains on a fixed set of preferences (it is off-policy), so it never sees the responses the improving model would actually generate, the way online PPO does. That can matter when the preference data is far from what the tuned model produces, and it has motivated online and iterative variants that refresh the pairs as the policy moves. DPO inherits Bradley-Terry's assumptions, including that a single scalar reward explains the comparisons. And like all of these methods it is only as aligned as the preferences it is fed; garbage preferences in, confidently-aligned garbage out. None of that has slowed it down. DPO is now a default first move for preference tuning, precisely because the reward model it removed was never a separate thing. It was a way of reading the policy. We just stopped fitting it twice.

Provenance · Verified against primary literature
DPO (2023) · Rafailov et al.: the reward reparameterization, the closed-form optimum, the DPO loss (eq 7), and the gradient analysis.
trainers.py (code) · Official implementation. preference_loss: logits = (chosen-rejected policy log-ratio) - (chosen-rejected ref log-ratio), loss = -logsigmoid(beta*logits). _get_batch_logps sums token log-probs by default.
Bradley & Terry (1952) · The logistic model of pairwise comparison: P(i beats j) depends on the difference of latent scores.
InstructGPT / RLHF (2022) · Ouyang et al.: the three-stage pipeline DPO collapses (SFT, reward model, PPO).
Control as inference · The KL-constrained reward objective has a known Boltzmann optimum, the exponential tilt of the reference policy.
correction · The released config ships beta unset (beta: ???) and the README's example commands use beta = 0.1, which is what we quote; the paper's appendix reports beta = 0.1 unless noted otherwise and beta = 0.5 for TL;DR summarization. Separately, eq (7) puts a beta inside the sigmoid on each log-ratio, while the code factors it out once (logits = pi-log-ratio - ref-log-ratio, then -logsigmoid(beta*logits)). These are algebraically identical; we teach eq (7) and show the code form in the listing.

Questions you might still have

If there is no separate reward model, what plays its role?
The policy itself. The quantity beta times log(pi_theta/pi_ref) is an implicit reward: train the policy with the DPO loss and that quantity becomes a reward model consistent with the preference data, for free. That is the title.

Why does the intractable partition function Z(x) not matter?
Preferences depend only on reward differences, and Z(x) depends only on the prompt x, so it is identical for the winner and the loser and cancels in the subtraction. DPO never has to compute it.

Does DPO still need a reference model at run time?
Only during training, to compute the log-ratio. You keep a frozen copy of the SFT model as pi_ref. At inference you ship only pi_theta, an ordinary language model with no extra machinery.

What is beta doing?
It is the same KL temperature as in RLHF. Large beta keeps the policy close to pi_ref (timid updates); small beta lets preferences move it hard. The released code's example commands use beta = 0.1 for dialogue; the paper reports beta = 0.5 for TL;DR summarization.

Footnotes & further reading

  1. The paper: Rafailov, Sharma, Mitchell, Ermon, Manning, Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Stanford, NeurIPS 2023). Code.
  2. The RLHF pipeline DPO collapses: Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT), and Stiennon et al., Learning to summarize from human feedback.
  3. The pairwise-comparison model: Bradley & Terry, Rank Analysis of Incomplete Block Designs (1952).
  4. The closed-form optimum of the KL-regularized reward objective (control as inference / the Gibbs variational principle): see e.g. Levine, Reinforcement Learning and Control as Probabilistic Inference.
  5. The conservative-DPO label-smoothing variant and the IPO objective both live in the same preference_loss function; IPO is Azar et al., A General Theoretical Paradigm to Understand Learning from Human Preferences.