Direct Preference Optimization
Chapter 23 showed that RLHF works — and how painful it is. You must train a separate reward model, juggle up to four models in memory, run a delicate reinforcement-learning loop with generation inside it, fight reward hacking with a finicky KL coefficient, and constantly distrust the reward-model score. It is powerful but fragile, slow, and hard. This chapter introduces a method that achieves the same goal — aligning a model with human preferences — while throwing away most of that machinery.
The Question DPO Asks
Direct Preference Optimization (DPO; Rafailov et al., 2023) starts from a bold question: do we actually NEED the reward model and the reinforcement learning? Both exist to solve one problem — turn preference pairs into a better policy. What if we could go DIRECTLY from preference pairs to the policy, with ordinary supervised learning, skipping the reward model and the RL loop entirely?
The astonishing answer, which we will derive in this chapter, is YES. DPO shows that the entire RLHF objective (reward model plus KL-penalized RL) can be rewritten as a SINGLE classification loss on preference pairs, trainable with the same simple machinery as SFT. No reward model. No reinforcement learning. No explicit reward model to hack. No generation in the loop.
| RLHF (Chapter 23) | DPO (this chapter) |
|---|---|
| Train a separate reward model | No reward model |
| Reinforcement learning (PPO/GRPO) | Supervised learning |
| Up to 4 models in memory | Just 2 (policy + reference) |
| Generation inside the training loop | Train directly on fixed pairs |
| Reward hacking is a constant danger | No explicit reward model to hack (overfitting to the pairs is still possible) |
| Delicate, unstable, many knobs | Stable, few knobs |
A Note on What We Keep
DPO does not throw away everything. It still needs the same PREFERENCE DATA — the (prompt, chosen, rejected) triples from Chapter 23 — and it still uses a frozen REFERENCE model to anchor the policy (the same role the reference played in the KL penalty). What it eliminates is the explicit reward model and the reinforcement-learning loop. The data and the 'stay close to the reference' idea survive; the heavy RL machinery does not.
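As a small illustration of the data DPO consumes, here is one way a preference triple could be represented. The class name `PreferencePair`, the helper `validate`, and the example text are hypothetical and chosen only for this sketch; they are not a format prescribed by the DPO paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One (prompt, chosen, rejected) triple; names are illustrative only."""
    prompt: str
    chosen: str     # the preferred response, y_w
    rejected: str   # the dispreferred response, y_l


def validate(pair):
    """A pair is usable only if it is non-empty and the two responses differ."""
    return bool(pair.prompt) and bool(pair.chosen) and pair.chosen != pair.rejected


pairs = [
    PreferencePair(
        prompt="Explain what a KL penalty does in one sentence.",
        chosen="It discourages the policy from drifting far from the reference model.",
        rejected="KL is a kind of loss.",
    ),
]

usable = [p for p in pairs if validate(p)]
print(len(usable), "usable pair(s)")
```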
The magic of DPO rests on one beautiful mathematical fact. To understand it, recall the KL-penalized RLHF objective from Chapter 23: maximize the reward while staying close to the reference model. It turns out this objective has a known CLOSED-FORM optimal solution — we can write down exactly what the best policy looks like for a given reward.
The Optimal RLHF Policy
For the KL-penalized objective, the optimal policy assigns each response a probability proportional to the reference model's probability times the exponentiated reward. Intuitively: start from the reference, then up-weight high-reward responses and down-weight low-reward ones, with the KL coefficient β controlling how aggressively.
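To see this tilt numerically, consider a toy case with only three candidate responses. The probabilities and rewards below are made-up numbers, and the helper name `tilt` is ours; the sketch simply multiplies each reference probability by the exponentiated reward and renormalizes.

```python
import math


def tilt(ref_probs, rewards, beta):
    """Reweight ref_probs by exp(reward / beta) and normalize to sum to 1."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # normalizer over the candidate responses
    return [w / z for w in weights]


ref_probs = [0.5, 0.3, 0.2]   # toy reference distribution (assumed)
rewards = [0.0, 1.0, 2.0]     # toy rewards (assumed)

gentle = tilt(ref_probs, rewards, beta=10.0)   # large beta: stays near the reference
strong = tilt(ref_probs, rewards, beta=0.5)    # small beta: chases reward hard

print("reference:", ref_probs)
print("beta=10.0:", [round(p, 3) for p in gentle])
print("beta=0.5 :", [round(p, 3) for p in strong])
```

With the large β the distribution barely moves; with the small β most of the probability mass shifts to the highest-reward response.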
π*(y|x) ∝ π_ref(y|x) · exp( r(x,y) / β )
# Start from the reference π_ref, tilt toward high reward r,
# with β controlling how hard. This is the policy RLHF is trying to reach.
Now Run It Backwards
Here is the pivotal move. The equation above expresses the optimal policy in terms of the reward. But we can ALGEBRAICALLY INVERT it to express the reward in terms of the policy. Solving for r, the reward is (up to a constant) the log-ratio between the policy and the reference, scaled by β:
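r(x,y) = β · log( π(y|x) / π_ref(y|x) ) + β · log Z(x)
# Z(x) is the normalizer of the optimal policy. It depends only on the
# prompt x, not on the response y, so it acts as a per-prompt constant.
# ANY policy implicitly defines a reward function!
# The 'reward' of a response is how much MORE likely the policy
# makes it, relative to the reference (scaled by β).
Substituting Into the Preference Model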
Recall the Bradley-Terry preference model from Chapter 23: the probability that the chosen response beats the rejected one is the sigmoid of their reward difference. DPO substitutes the IMPLICIT reward (the log-ratio above) into Bradley-Terry. The unknown reward function disappears, replaced entirely by the policy and the reference. What remains is a loss expressed purely in terms of the policy we are training — no reward model anywhere.
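The claim that the unknown reward disappears can be checked on a toy problem. The sketch below uses made-up numbers over three candidate responses: it builds the optimal policy from a known reward, reads the implicit reward back from the policy, and confirms that Bradley-Terry gives the same preference probability either way.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


beta = 0.5
ref_probs = [0.5, 0.3, 0.2]   # toy reference distribution (assumed)
rewards = [0.0, 1.0, 2.0]     # made-up "true" rewards (assumed)

# Optimal policy: tilt the reference by exp(r / beta), then normalize by Z.
weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
Z = sum(weights)
policy = [w / Z for w in weights]

# Implicit reward read back from the policy: beta * log(pi / pi_ref).
implicit = [beta * math.log(pi / pr) for pi, pr in zip(policy, ref_probs)]

# Every response is off from the true reward by the same amount, beta * log Z.
offsets = [r - h for r, h in zip(rewards, implicit)]

# Bradley-Terry only sees reward differences, so that shared offset cancels.
w, l = 2, 0   # indices of the "chosen" and "rejected" responses
p_true = sigmoid(rewards[w] - rewards[l])
p_implicit = sigmoid(implicit[w] - implicit[l])

print("offsets:", offsets, "beta*log Z:", beta * math.log(Z))
print("P(chosen beats rejected):", p_true, p_implicit)
```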
Let us put the pieces together into the DPO loss. We start from Bradley-Terry, substitute the implicit reward, and watch the reward model vanish. We will go step by step so the derivation is fully transparent — this is the heart of the chapter.
Step by Step
1. Bradley-Terry (Ch.23):
    P(y_w ≻ y_l | x) = σ( r(x,y_w) - r(x,y_l) )
2. Substitute the implicit reward r = β log(π/π_ref) + const:
    the constant depends only on the prompt x, so it cancels in the DIFFERENCE, leaving
    r(x,y_w) - r(x,y_l) =
        β log(π(y_w|x)/π_ref(y_w|x)) - β log(π(y_l|x)/π_ref(y_l|x))
3. The DPO loss = negative log-likelihood of the preferences:
    L_DPO = -E[ log σ( β log(π(y_w|x)/π_ref(y_w|x))
                     - β log(π(y_l|x)/π_ref(y_l|x)) ) ]
    y_w = chosen, y_l = rejected, π = policy, π_ref = frozen reference
# A simple classification loss. No reward model. No RL.
What This Loss Does, In Words
Strip away the symbols and the DPO loss says something intuitive. For each preference pair, it increases the policy's probability of the CHOSEN response and decreases its probability of the REJECTED response — but always MEASURED RELATIVE TO THE REFERENCE model. The β controls how strongly, and the reference anchors the policy so it does not drift into nonsense. It is the Bradley-Terry preference loss from Chapter 23, but applied directly to the policy's implicit reward.
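A few hand-picked numbers make the "relative to the reference" point concrete. The sketch below evaluates the loss directly from sequence log-probabilities; the log-prob values are invented, and `dpo_loss_from_logps` is a hypothetical helper name used only here.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss_from_logps(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probs."""
    margin = (pol_w - ref_w) - (pol_l - ref_l)   # difference of log-ratios
    return -log_sigmoid(beta * margin)


# Toy reference log-probs. Note the reference slightly prefers the REJECTED response.
ref_w, ref_l = -42.0, -40.0

at_init = dpo_loss_from_logps(ref_w, ref_l, ref_w, ref_l)    # policy == reference
improved = dpo_loss_from_logps(-38.0, -45.0, ref_w, ref_l)   # chosen up, rejected down
worse = dpo_loss_from_logps(-44.0, -39.0, ref_w, ref_l)      # moved the wrong way

print(at_init, improved, worse)
```

When the policy equals the reference the margin is zero and the loss is log 2, whatever the reference itself prefers. Only movement relative to the reference changes the loss.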
One of DPO's joys is how little code it takes. Because it is just a supervised loss, the implementation is close to ordinary fine-tuning. We need to compute log-probabilities of the chosen and rejected responses under both the policy and the frozen reference, then plug them into the DPO loss. Let us build it.
Computing Sequence Log-Probabilities
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Sum of log P(response | prompt) under `model`.

    Assumes 1-D LongTensors of token ids and a model whose output
    exposes `.logits` of shape (1, T, vocab).
    """
    ids = torch.cat([prompt_ids, response_ids])[None]   # (1, T)
    logits = model(ids[:, :-1]).logits                  # position t predicts token t+1
    logp = F.log_softmax(logits[0], dim=-1)             # (T-1, vocab)
    # Sum log-prob over the RESPONSE tokens only. The first response token
    # is predicted at position len(prompt_ids) - 1.
    start = len(prompt_ids) - 1
    token_logp = logp[start:].gather(-1, response_ids[:, None]).squeeze(-1)
    return token_logp.sum()
The DPO Loss
def dpo_loss(policy, ref, prompt, chosen, rejected, beta=0.1):
    """DPO loss for one preference pair."""
    # Log-probs of chosen & rejected under the POLICY (trainable)
    pol_chosen = sequence_logprob(policy, prompt, chosen)
    pol_rejected = sequence_logprob(policy, prompt, rejected)
    # Same under the REFERENCE (frozen, no gradients)
    with torch.no_grad():
        ref_chosen = sequence_logprob(ref, prompt, chosen)
        ref_rejected = sequence_logprob(ref, prompt, rejected)
    # The implicit reward = beta * (policy_logp - ref_logp)
    chosen_reward = beta * (pol_chosen - ref_chosen)
    rejected_reward = beta * (pol_rejected - ref_rejected)
    # DPO loss: -log sigma(chosen_reward - rejected_reward)
    return -F.logsigmoid(chosen_reward - rejected_reward)
# That's the whole algorithm. Compare to the four-model PPO loop of Ch.23!
# Training is just: for each pair, compute this loss, backprop, step.
The Training Loop
import torch

policy = load_sft_model().cuda()          # the model we train
ref = load_sft_model().cuda().eval()      # frozen copy; the anchor
for p in ref.parameters():
    p.requires_grad = False
opt = torch.optim.AdamW(policy.parameters(), lr=5e-7)   # small lr

# Each batch item is a triple of token-id tensors, which must live on the
# same device as the two models.
for prompt, chosen, rejected in preference_loader:
    loss = dpo_loss(policy, ref, prompt, chosen, rejected, beta=0.1)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
    opt.step()
    opt.zero_grad()
# Notice: NO reward model, NO value model, NO generation, NO RL.
# Just two models and a supervised loss on fixed preference pairs.
# This is why DPO trains about as easily as SFT.
DPO eliminated the reward model and the RL loop, but it kept two things from RLHF: the reference model and the β coefficient. Understanding their roles is essential to using DPO well, and clears up a common beginner confusion.
The Role of Beta
In RLHF, β was the KL coefficient — it controlled how far the policy could drift from the reference. In DPO, β plays the analogous role, but now it scales the implicit reward (the log-ratio). A small β lets the policy move far from the reference (strong preference optimization, more drift); a large β keeps it close (gentle optimization, less drift). It is the same 'leash tension' idea as the KL penalty, baked directly into the loss.
| Beta value | Effect |
|---|---|
| Small (e.g. 0.01) | Policy drifts far from reference — strong optimization, risk of degeneration |
| Medium (e.g. 0.1) | Balanced — the common default starting point |
| Large (e.g. 0.5) | Policy stays close to reference — gentle, conservative updates |
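One way to read the table is to ask how large a log-ratio margin the policy must build before the loss treats a pair as confidently ranked. The sketch below inverts the sigmoid for that margin; the 0.9 confidence level and the helper name `margin_for_confidence` are arbitrary choices for illustration.

```python
import math


def margin_for_confidence(beta, confidence=0.9):
    """Log-ratio margin m such that sigmoid(beta * m) equals `confidence`."""
    logit = math.log(confidence / (1.0 - confidence))
    return logit / beta


for beta in (0.01, 0.1, 0.5):
    m = margin_for_confidence(beta)
    print(f"beta={beta}: margin needed = {m:.1f} nats")
```

The required margin scales as 1/β: roughly 220 nats at β = 0.01 against about 4.4 at β = 0.5. A small β therefore asks the policy to move much further from the reference before the loss flattens out, which is the drift the table warns about.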
Why the Reference Model Matters
The reference model anchors DPO, just as it anchored the KL penalty in RLHF. Without it, DPO would have no notion of 'how far have we drifted' — it would push chosen-response probabilities up without restraint, potentially collapsing into degenerate outputs. The reference, usually the SFT model, defines the sensible starting point that the policy is contrasted against. The implicit reward is always measured RELATIVE to this anchor.
DPO is simpler than the RL methods of Chapter 23, but simpler is not always better. Understanding the trade-offs lets you choose well. Let us compare DPO against both PPO and GRPO across the dimensions that matter in practice.
| Dimension | DPO | PPO | GRPO |
|---|---|---|---|
| Reward model | None (implicit) | Required | Required (or verifier) |
| RL loop | No | Yes | Yes |
| Models in memory | 2 | 4 | 3 |
| Generation in loop | No | Yes | Yes |
| Stability | High | Low | Medium |
| Ease of use | High | Low | Medium |
| Online exploration | No (fixed data) | Yes | Yes |
| Best for | General alignment | Max performance | Reasoning/verifiable |
The Crucial Difference: Online vs Offline
The deepest distinction is this: DPO is OFFLINE — it trains on a FIXED dataset of preference pairs, never generating new responses. PPO and GRPO are ONLINE — they GENERATE fresh responses from the current policy and get feedback on those. Online methods can explore and improve beyond the fixed data; offline DPO is limited to the preferences it was given.
This online/offline gap is the main reason the RL methods sometimes outperform DPO. Because PPO and GRPO generate and evaluate fresh responses, they can find high-quality outputs that never appeared in any fixed dataset. DPO can only re-weight the responses it was shown. For pushing absolute peak performance — especially on reasoning, where GRPO can sample and verify many fresh solution attempts — the online methods retain an edge.
DPO has a known weakness: it can OVERFIT to the preferences, especially when the chosen and rejected responses are clearly different or when preferences are nearly deterministic. The model pushes the chosen/rejected reward gap toward infinity, which can hurt generation. IPO (Identity Preference Optimization; Azar et al., 2023) addresses this with a small but principled change to the loss.
The Overfitting Problem
Recall the DPO loss uses a sigmoid of the reward difference. When a preference is clear, the loss keeps pushing the reward gap larger and larger — there is no point at which it is 'satisfied'. With finite data this drives the model to over-confident extremes, drifting too far from the reference and degrading quality. DPO lacks a natural stopping point for how large the preference margin should become.
IPO's Fix
IPO replaces the sigmoid-based DPO loss with a SQUARED-ERROR loss that targets a SPECIFIC margin rather than pushing it to infinity. Instead of 'make the chosen reward as much bigger as possible', IPO says 'make the chosen reward bigger by a specific, regularized amount'. This gives the optimization a target to settle at, preventing the runaway over-optimization.
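The contrast shows up clearly in the gradients. The sketch below treats the log-ratio margin as a single number and compares the two losses as it grows, using the chapter's β notation for both; the function names are ours, and the margins are arbitrary test points.

```python
import math


def dpo_loss(margin, beta):
    """-log sigmoid(beta * margin), computed stably."""
    z = beta * margin
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dpo_grad(margin, beta):
    """d(loss)/d(margin) = -beta * sigmoid(-beta * margin); never exactly zero."""
    return -beta / (1.0 + math.exp(beta * margin))


def ipo_loss(margin, beta):
    """Squared distance from the target margin 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2


def ipo_grad(margin, beta):
    """d(loss)/d(margin); zero at the target, positive beyond it."""
    return 2.0 * (margin - 1.0 / (2.0 * beta))


beta = 0.1   # IPO target margin is then 1 / (2 * 0.1) = 5
for margin in (0.0, 5.0, 20.0, 100.0):
    print(margin, dpo_grad(margin, beta), ipo_grad(margin, beta))
```

A negative gradient means gradient descent keeps increasing the margin. The DPO gradient shrinks but stays negative at every margin, while the IPO gradient reaches zero at the target and turns positive beyond it, pulling an oversized margin back.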
DPO: -log σ(β · margin)          # pushes margin → ∞, can overfit
IPO: ( margin - 1/(2β) )²        # targets a finite margin, regularized
where margin = log(π(y_w|x)/π_ref(y_w|x)) - log(π(y_l|x)/π_ref(y_l|x))
# IPO's squared loss has a minimum at a finite margin → no runaway.
DPO and IPO both need PAIRED preference data (a chosen and a rejected response for the SAME prompt) and a separate REFERENCE model. Two further variants relax these requirements, making alignment even more practical in settings where paired data or a reference is inconvenient.
KTO: No Pairs Needed
KTO (Kahneman-Tversky Optimization; Ethayarajh et al., 2024) removes the need for PAIRED data. Instead of requiring a chosen AND rejected response for each prompt, KTO works with INDIVIDUAL responses each labeled simply 'good' or 'bad' — a binary thumbs-up/thumbs-down. This is far easier to collect: real-world feedback (a user liking or disliking a response) is naturally unpaired.
KTO draws on prospect theory from behavioural economics (the Kahneman-Tversky model of how humans weigh gains and losses) to define a loss over these individual labels. The practical upshot: if your feedback is unpaired thumbs-up/down signals — as most real product feedback is — KTO lets you use it directly, without constructing pairs.
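The difference in data shape is easy to see side by side. The record type below is a hypothetical schema for unpaired feedback, not a format defined by the KTO paper, and `unpair` is an illustrative helper showing that existing paired data can also be split into two individually labeled examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledResponse:
    """One response with a thumbs-up/down label; field names are illustrative."""
    prompt: str
    response: str
    desirable: bool


def unpair(prompt, chosen, rejected):
    """Split one (prompt, chosen, rejected) triple into two labeled examples."""
    return [
        LabeledResponse(prompt, chosen, True),
        LabeledResponse(prompt, rejected, False),
    ]


# Natively unpaired feedback: different prompts, one label each.
feedback = [
    LabeledResponse("Summarize the DPO loss.",
                    "It is a classification loss on preference pairs.", True),
    LabeledResponse("What does beta control?", "Nothing important.", False),
]

# Paired data can be reused by splitting each pair.
feedback += unpair("Define the reference model.",
                   "A frozen copy of the SFT model that anchors the policy.",
                   "A model.")

n_good = sum(ex.desirable for ex in feedback)
print(n_good, "desirable,", len(feedback) - n_good, "undesirable")
```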
ORPO: No Reference Model Needed
ORPO (Odds Ratio Preference Optimization; Hong et al., 2024) removes the need for a separate REFERENCE model AND combines SFT and preference optimization into a SINGLE stage. It adds a preference term (based on the odds ratio between chosen and rejected) directly to the standard SFT loss. The model learns to follow instructions AND prefer good responses in one training run, with no separate reference model to hold in memory.
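The shape of this combined objective can be sketched on plain numbers. The code below assumes each response is summarized by its average per-token log-probability and that the preference term is weighted by a coefficient `lam`; the value 0.1 and the function names are assumptions for illustration, not settings taken from the paper.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(z):
    """Stable log(1 + exp(z)); note -log(sigmoid(x)) == softplus(-x)."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))


def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """SFT term on the chosen response plus a weighted odds-ratio term."""
    sft = -avg_logp_chosen                                   # ordinary NLL
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    preference = softplus(-log_odds_ratio)                   # -log sigmoid(ratio)
    return sft + lam * preference


good = orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0)
bad = orpo_loss(avg_logp_chosen=-2.0, avg_logp_rejected=-0.5)
print(good, bad)
```

No reference log-probabilities appear anywhere in the computation: the SFT term plays the anchoring role, which is why only one model is needed.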
| Method | Needs pairs? | Needs reference? | Key advantage |
|---|---|---|---|
| DPO | Yes | Yes | The original; simple, effective |
| IPO | Yes | Yes | Resists overfitting |
| KTO | No | Yes | Uses unpaired good/bad labels |
| ORPO | Yes | No | One stage; no reference model |
Since DPO trains directly on preference pairs with no reward model to smooth things over, the QUALITY of the preference data matters even more than in RLHF. The reward model in RLHF could generalize beyond noisy individual labels; DPO is more directly exposed to whatever is in the pairs. This makes preference-data curation central to DPO success.
Sources of Preference Data
Common Pitfalls
| Pitfall | Why it hurts DPO |
|---|---|
| Noisy / inconsistent labels | DPO trusts each pair directly; noise pushes the policy wrong |
| Length bias | If chosen responses are systematically longer, DPO learns 'longer = better' |
| Distribution mismatch | Pairs from a different model than the policy give weak signal |
| Too-easy pairs | If chosen ≫ rejected trivially, the model learns little of value |
| Contamination of style | Models learn formatting quirks present in the chosen responses |
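Length bias in particular is cheap to check before training. The sketch below is a minimal diagnostic under two assumptions: pairs are plain-text (prompt, chosen, rejected) tuples, and whitespace word count is an acceptable proxy for token count. The function name and the toy pairs are illustrative.

```python
from statistics import mean


def length_bias_report(pairs):
    """Summarize chosen-vs-rejected lengths for a list of (prompt, chosen, rejected)."""
    chosen_len = [len(chosen.split()) for _, chosen, _ in pairs]
    rejected_len = [len(rejected.split()) for _, _, rejected in pairs]
    longer = sum(c > r for c, r in zip(chosen_len, rejected_len))
    return {
        "mean_chosen_words": mean(chosen_len),
        "mean_rejected_words": mean(rejected_len),
        "frac_chosen_longer": longer / len(pairs),
    }


toy_pairs = [
    ("q1", "a detailed four-word answer", "short answer"),
    ("q2", "three word answer", "no"),
    ("q3", "yes", "two words"),
]

report = length_bias_report(toy_pairs)
print(report)
```

A fraction far above 0.5, or a large gap between the two means, suggests the dataset rewards length and may need rebalancing or filtering before it is used.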
With DPO, its variants, and the RL methods of Chapter 23 all in hand, how do you choose? Here is a practical decision guide reflecting current best practice.
| Situation | Recommended method |
|---|---|
| Starting out / general alignment | DPO — simple, stable, effective |
| DPO is overfitting / degrading | IPO |
| Only unpaired good/bad labels | KTO |
| Want one-stage SFT+alignment, no reference | ORPO |
| Limited GPU memory | DPO + QLoRA |
| Reasoning / verifiable rewards | GRPO (Chapter 23) |
| Chasing absolute peak performance | PPO or GRPO (online) |
| Have lots of compute and expertise | Iterative DPO or online RL |
The Pragmatic Default
For the vast majority of practitioners and use cases, the answer is: START WITH DPO. It captures most of the benefit of preference alignment, trains as easily as SFT, is stable, and composes with LoRA/QLoRA for single-GPU use. Reach for the variants when you hit a specific problem (IPO for overfitting, KTO for unpaired data, ORPO to skip the reference), and reach for the online RL methods (GRPO, PPO) only when you need online exploration — most importantly, for reasoning with verifiable rewards.
DPO did more than provide a new algorithm — it shifted how the field thinks about and practices alignment. It is worth stepping back to see the broader change.
From RL to Supervised Learning
Before DPO, 'aligning a model with preferences' was synonymous with reinforcement learning — it required RL expertise, infrastructure, and tolerance for instability. DPO showed that the same goal could be reached with supervised learning. This reframing made alignment a mainstream skill rather than a specialized one, and it is a large part of why aligned open models proliferated after 2023.
The Theoretical Unification
DPO also clarified the THEORY. By showing that the RLHF objective and a simple classification loss are two views of the same thing (via the implicit-reward insight), it connected reinforcement learning, preference modeling, and supervised learning. The variants (IPO, KTO, ORPO) and even the connection to GRPO all flow from understanding this unified picture. Alignment became less of a bag of tricks and more of a coherent framework.
What Remains Hard
DPO simplified the ALGORITHM, but the hard parts of alignment remain: collecting good preference data, deciding what 'better' means (the values question), avoiding shortcuts like length bias, and the deeper challenge of specifying human values at all. DPO made the optimization easy; it did not make alignment easy. The remaining difficulty has moved from the algorithm to the data and the values — which is exactly where Chapter 26, on Constitutional AI and safety, will focus.
DPO Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| DPO motivation | Alignment without reward model or RL | Captures RLHF's benefit, far simpler |
| Implicit reward | r = β log(π/π_ref) | Reward is hidden in the policy |
| DPO loss | -log σ(βΔ of log-ratios) | Bradley-Terry on implicit reward |
| What it keeps | Reference model + β | 2 models, not 4 |
| Offline vs online | DPO offline; PPO/GRPO online | Online can explore beyond data |
| IPO | Squared loss to a target margin | Fixes DPO overfitting |
| KTO | Unpaired good/bad labels | No pairs needed |
| ORPO | SFT + preference, no reference | One stage, no reference model |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.
Further reading: “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (Rafailov et al., 2023) — the DPO paper, whose subtitle captures the implicit-reward insight. “A General Theoretical Paradigm to Understand Learning from Human Preferences” (Azar et al., 2023, IPO). “KTO: Model Alignment as Prospect Theoretic Optimization” (Ethayarajh et al., 2024). “ORPO: Monolithic Preference Optimization without Reference Model” (Hong et al., 2024). “Is DPO Superior to PPO for LLM Alignment?” (Xu et al., 2024) for the DPO-vs-RL debate. The Hugging Face TRL library implements DPO, IPO, KTO, ORPO, and GRPO behind a consistent interface.
Next → Chapter 25: Reasoning & Chain-of-Thought
You can now align a model with human preferences — by RL (Chapter 23) or directly (this chapter). But alignment to preferences mostly shapes HOW a model responds, not how well it THINKS through hard problems. Chapter 25 turns to reasoning: chain-of-thought prompting that lets models work step by step, the training methods that make reasoning a learned skill, and the test-time-compute paradigm where models 'think longer' to solve harder problems. Crucially, this is where the online RL methods — especially GRPO with verifiable rewards — return to center stage: by generating many solution attempts and rewarding the correct ones, models can be taught to reason. The preference optimization of Part V's first half meets the reasoning revolution.