Part V: Alignment & RLHF
Chapter 24

Direct Preference Optimization

DPO, IPO, and preference learning without RL
20 Exercises
24.1

Chapter 23 showed that RLHF works — and how painful it is. You must train a separate reward model, juggle up to four models in memory, run a delicate reinforcement-learning loop with generation inside it, fight reward hacking with a finicky KL coefficient, and constantly distrust the reward-model score. It is powerful but fragile, slow, and hard. This chapter introduces a method that achieves the same goal — aligning a model with human preferences — while throwing away most of that machinery.

The Question DPO Asks

Direct Preference Optimization (DPO; Rafailov et al., 2023) starts from a bold question: do we actually NEED the reward model and the reinforcement learning? Both exist to solve one problem — turn preference pairs into a better policy. What if we could go DIRECTLY from preference pairs to the policy, with ordinary supervised learning, skipping the reward model and the RL loop entirely?

The astonishing answer, which we will derive in this chapter, is YES. DPO shows that the entire RLHF objective (reward model plus KL-penalized RL) can be rewritten as a SINGLE classification loss on preference pairs, trainable with the same simple machinery as SFT. No reward model. No reinforcement learning. No learned reward score for the policy to exploit (although data shortcuts such as length bias remain, as Section 24.9 shows). No generation in the loop.

RLHF (Chapter 23) | DPO (this chapter)
Train a separate reward model | No reward model
Reinforcement learning (PPO/GRPO) | Supervised learning
Up to 4 models in memory | Just 2 (policy + reference)
Generation inside the training loop | Train directly on fixed pairs
Reward hacking is a constant danger | No reward to hack
Delicate, unstable, many knobs | Stable, few knobs
Pref Note: DPO Made Alignment Accessible
DPO had an effect on alignment similar to what LoRA had on fine-tuning: it took something that required a specialized team and expensive infrastructure and made it accessible to almost anyone. Because DPO is 'just' a supervised loss on preference pairs, you can run it with the same tools and roughly the same effort as SFT.
This is why a huge fraction of open-model alignment today uses DPO or one of its variants rather than PPO. It captures most of RLHF's benefit at a fraction of the complexity — which, for most practitioners, is exactly the right trade.

A Note on What We Keep

DPO does not throw away everything. It still needs the same PREFERENCE DATA — the (prompt, chosen, rejected) triples from Chapter 23 — and it still uses a frozen REFERENCE model to anchor the policy (the same role the reference played in the KL penalty). What it eliminates is the explicit reward model and the reinforcement-learning loop. The data and the 'stay close to the reference' idea survive; the heavy RL machinery does not.

24.2

The magic of DPO rests on one beautiful mathematical fact. To understand it, recall the KL-penalized RLHF objective from Chapter 23: maximize the reward while staying close to the reference model. It turns out this objective has a known CLOSED-FORM optimal solution — we can write down exactly what the best policy looks like for a given reward.

The Optimal RLHF Policy

For the KL-penalized objective, the optimal policy assigns each response a probability proportional to the reference model's probability times the exponentiated reward. Intuitively: start from the reference, then up-weight high-reward responses and down-weight low-reward ones, with the KL coefficient β controlling how aggressively.

The optimal KL-penalized policy
π*(y|x)  ∝  π_ref(y|x) · exp( r(x,y) / β )

# Start from the reference π_ref, tilt toward high reward r,
# with β controlling how hard. This is the policy RLHF is trying to reach.
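
To make the tilting concrete, here is a minimal numeric sketch in plain Python. The three candidate responses, their reference probabilities, and their rewards are made-up illustrative values, and `optimal_policy` is a hypothetical helper name rather than anything defined in this chapter.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Return pi*(y|x), proportional to pi_ref(y|x) * exp(r(x,y) / beta), normalized over y."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # the normalizer Z(x) for this prompt
    return [w / z for w in weights]

# Illustrative values (assumptions): three candidate responses to one prompt.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

for beta in (0.5, 1.0, 5.0):
    probs = optimal_policy(ref_probs, rewards, beta)
    print(beta, [round(p, 3) for p in probs])
```

With a small β most of the probability mass moves to the highest-reward response; with a large β the result stays close to the reference distribution.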

Now Run It Backwards

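Here is the pivotal move. The equation above expresses the optimal policy in terms of the reward. But we can ALGEBRAICALLY INVERT it to express the reward in terms of the policy. Solving for r, the reward is the log-ratio between the policy and the reference, scaled by β, plus a term β log Z(x) that comes from the normalizer. That term depends only on the prompt x, not on the response y, so it acts as a constant whenever we compare two responses to the same prompt: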

The implicit reward (the key insight)
r(x,y) = β · log( π(y|x) / π_ref(y|x) )  +  constant

# ANY policy implicitly defines a reward function!
# The 'reward' of a response is how much MORE likely the policy
# makes it, relative to the reference (scaled by β).
This Is the Whole Trick
Read the equation above slowly — it is the entire idea behind DPO. It says the reward model is REDUNDANT: any policy already implies a reward, namely how much it boosts a response's probability over the reference. We do not need to train a separate network to output rewards; the policy itself contains a reward function, readable as a log-probability-ratio.
So instead of (1) training a reward model and then (2) optimizing the policy against it, DPO collapses both steps into one: it directly trains the policy so that its IMPLICIT reward prefers the chosen responses over the rejected ones. Two stages become one.
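
The inversion can be checked numerically. The sketch below builds the optimal policy from a reward, reads the implicit reward back off the policy, and confirms that the two differ only by a single offset shared by every response. All numbers are illustrative assumptions.

```python
import math

beta = 0.5
ref_probs = [0.5, 0.3, 0.2]   # illustrative reference probabilities (assumption)
rewards = [0.0, 1.0, 2.0]     # illustrative rewards (assumption)

# Forward direction: build the optimal policy from the reward.
weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
z = sum(weights)
policy = [w / z for w in weights]

# Backward direction: read an implicit reward off the policy.
implicit = [beta * math.log(p / q) for p, q in zip(policy, ref_probs)]

# The true reward minus the implicit reward is the same offset for every response.
offsets = [r - i for r, i in zip(rewards, implicit)]
print(offsets)                 # every entry equals beta * log(z), up to rounding
print(beta * math.log(z))
```

Because the offset is shared, reward DIFFERENCES between responses to the same prompt are recovered exactly, and differences are all the preference model needs.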

Substituting Into the Preference Model

Recall the Bradley-Terry preference model from Chapter 23: the probability that the chosen response beats the rejected one is the sigmoid of their reward difference. DPO substitutes the IMPLICIT reward (the log-ratio above) into Bradley-Terry. The unknown reward function disappears, replaced entirely by the policy and the reference. What remains is a loss expressed purely in terms of the policy we are training — no reward model anywhere.

24.3

Let us put the pieces together into the DPO loss. We start from Bradley-Terry, substitute the implicit reward, and watch the reward model vanish. We will go step by step so the derivation is fully transparent — this is the heart of the chapter.

Step by Step

The DPO derivation
1. Bradley-Terry (Ch.23):
     P(y_w ≻ y_l) = σ( r(x,y_w) - r(x,y_l) )

2. Substitute the implicit reward r = β log(π/π_ref) + const:
     the constants cancel in the DIFFERENCE, leaving
     r(x,y_w) - r(x,y_l) =
        β log(π(y_w|x)/π_ref(y_w|x)) - β log(π(y_l|x)/π_ref(y_l|x))

3. The DPO loss = negative log-likelihood of the preferences:
The DPO loss
L_DPO = -E[ log σ( β log(π(y_w|x)/π_ref(y_w|x))
                    - β log(π(y_l|x)/π_ref(y_l|x)) ) ]

y_w = chosen,  y_l = rejected,  π = policy,  π_ref = frozen reference
# A simple classification loss. No reward model. No RL.

What This Loss Does, In Words

Strip away the symbols and the DPO loss says something intuitive. For each preference pair, it increases the policy's probability of the CHOSEN response and decreases its probability of the REJECTED response — but always MEASURED RELATIVE TO THE REFERENCE model. The β controls how strongly, and the reference anchors the policy so it does not drift into nonsense. It is the Bradley-Terry preference loss from Chapter 23, but applied directly to the policy's implicit reward.
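
A small worked example may help. The sketch below evaluates the loss for one preference pair from four summed log-probabilities. The helper name `dpo_loss_from_logps` and all the numbers are illustrative assumptions.

```python
import math

def dpo_loss_from_logps(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is 0.
start = dpo_loss_from_logps(-13.0, -14.0, -13.0, -14.0)

# After some training (made-up numbers): chosen up by 1 nat, rejected down by 1 nat.
later = dpo_loss_from_logps(-12.0, -15.0, -13.0, -14.0)

print(round(start, 4), round(later, 4))  # 0.6931 0.5981
```

The starting value is log 2: when the policy equals the reference, the implied preference probability is exactly 0.5, so a DPO run should begin near 0.693 and move down from there.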

Intuition: DPO Is 'Contrastive' Fine-Tuning
You can think of DPO as a contrastive version of SFT. Plain SFT says 'make the good response more likely.' DPO says 'make the chosen response more likely AND the rejected response less likely, relative to where the reference model started.' The rejected response provides a negative signal that SFT lacks — recovering one of the things RLHF had over imitation (Chapter 23), but without any RL.
This is why DPO often beats SFT on the same data: it uses the rejected responses as informative negatives, teaching the model what NOT to do, not just what to do.
Pref Note: The Reward Model Never Appears
Trace through the final loss: it contains only the policy π, the frozen reference π_ref, and the chosen/rejected responses. The reward function r that we started with has completely vanished — it was substituted away in step 2. This is the precise sense in which DPO needs 'no reward model': the reward is implicit in the policy, so it never has to be materialized as a separate network.
This vanishing act is what makes DPO so much simpler than RLHF. The two-stage 'train reward model, then optimize policy' pipeline collapses into a single supervised training run on the preference data.
24.4

One of DPO's joys is how little code it takes. Because it is just a supervised loss, the implementation is close to ordinary fine-tuning. We need to compute log-probabilities of the chosen and rejected responses under both the policy and the frozen reference, then plug them into the DPO loss. Let us build it.

Computing Sequence Log-Probabilities

Python: Log-probability of a response under a model
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Sum of log P(response | prompt) under `model`.

    Assumes 1-D LongTensors of token ids and a non-empty prompt.
    """
    ids = torch.cat([prompt_ids, response_ids])[None]   # (1, T)
    logits = model(ids[:, :-1]).logits                  # row j predicts token j + 1
    logp = F.log_softmax(logits[0], dim=-1)             # (T - 1, vocab)
    # Sum log-prob over the RESPONSE tokens only.
    # Row len(prompt_ids) - 1 is the one that predicts the first response token.
    start = len(prompt_ids) - 1
    rows = logp[start : start + len(response_ids)]
    return rows.gather(-1, response_ids[:, None]).sum()
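
The `start = len(prompt_ids) - 1` offset is the easiest part of `sequence_logprob` to get wrong. The sketch below mirrors the same indexing in plain Python against a hypothetical toy bigram "model" (`toy_next_token_logprobs` is an invented stand-in, not a real API) and checks the result against a manual sum.

```python
import math

VOCAB = 4

def toy_next_token_logprobs(prefix):
    """Hypothetical stand-in for a language model: log-probs over the next token.

    The distribution depends only on the last token of the prefix (a bigram model).
    """
    last = prefix[-1]
    scores = [1.0 if tok == (last + 1) % VOCAB else 0.0 for tok in range(VOCAB)]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return [s - log_z for s in scores]

def sequence_logprob_toy(prompt_ids, response_ids):
    ids = list(prompt_ids) + list(response_ids)
    # Row j scores the token at position j + 1, like logits from model(ids[:, :-1]).
    rows = [toy_next_token_logprobs(ids[: j + 1]) for j in range(len(ids) - 1)]
    start = len(prompt_ids) - 1  # the row that predicts the first response token
    return sum(rows[start + i][tok] for i, tok in enumerate(response_ids))

prompt, response = [0, 1], [2, 3]
manual = (toy_next_token_logprobs([0, 1])[2]
          + toy_next_token_logprobs([0, 1, 2])[3])
print(math.isclose(sequence_logprob_toy(prompt, response), manual))
```

Note that the indexing only makes sense when the prompt contains at least one token, since the first response token needs a preceding position to be predicted from.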

The DPO Loss

Python: The DPO loss from scratch
def dpo_loss(policy, ref, prompt, chosen, rejected, beta=0.1):
    """DPO loss for one preference pair."""
    # Log-probs of chosen & rejected under the POLICY (trainable)
    pol_chosen   = sequence_logprob(policy, prompt, chosen)
    pol_rejected = sequence_logprob(policy, prompt, rejected)

    # Same under the REFERENCE (frozen, no gradients)
    with torch.no_grad():
        ref_chosen   = sequence_logprob(ref, prompt, chosen)
        ref_rejected = sequence_logprob(ref, prompt, rejected)

    # The implicit reward = beta * (policy_logp - ref_logp)
    chosen_reward   = beta * (pol_chosen   - ref_chosen)
    rejected_reward = beta * (pol_rejected - ref_rejected)

    # DPO loss: -log sigma(chosen_reward - rejected_reward)
    return -F.logsigmoid(chosen_reward - rejected_reward)

# That's the whole algorithm. Compare to the four-model PPO loop of Ch.23!
# Training is just: for each pair, compute this loss, backprop, step.

The Training Loop

Python Code Lab: the full DPO training loop
import torch

# load_sft_model() and preference_loader are placeholders: supply your own
# checkpoint loader and an iterable of (prompt, chosen, rejected) token-id tensors.
policy = load_sft_model().cuda()            # the model we train
ref    = load_sft_model().cuda().eval()     # frozen copy; the anchor
ref.requires_grad_(False)

opt = torch.optim.AdamW(policy.parameters(), lr=5e-7)  # small lr

for prompt, chosen, rejected in preference_loader:
    # Token ids must live on the same device as the models.
    prompt, chosen, rejected = prompt.cuda(), chosen.cuda(), rejected.cuda()
    loss = dpo_loss(policy, ref, prompt, chosen, rejected, beta=0.1)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
    opt.step()
    opt.zero_grad()

# Notice: NO reward model, NO value model, NO generation, NO RL.
# Just two models and a supervised loss on fixed preference pairs.
# This is why DPO trains about as easily as SFT.
Pref Note: Combine DPO with LoRA
Because DPO is just supervised learning, it composes seamlessly with the parameter-efficient methods of Chapter 22. You can run DPO with LoRA adapters on a frozen, possibly quantized base — 'DPO + QLoRA' — to align a large model on a single GPU. The reference model can even be the same frozen base, with the policy being base+LoRA. This combination is one of the most popular and accessible alignment recipes today.
Contrast this with PPO, where adding LoRA on top of an already-complex four-model RL loop is far more involved. DPO's simplicity is what makes these clean combinations possible.
24.5

DPO eliminated the reward model and the RL loop, but it kept two things from RLHF: the reference model and the β coefficient. Understanding their roles is essential to using DPO well, and clears up common beginner confusion.

The Role of Beta

In RLHF, β was the KL coefficient — it controlled how far the policy could drift from the reference. In DPO, β plays the analogous role, but now it scales the implicit reward (the log-ratio). A small β lets the policy move far from the reference (strong preference optimization, more drift); a large β keeps it close (gentle optimization, less drift). It is the same 'leash tension' idea as the KL penalty, baked directly into the loss.

Beta value | Effect
Small (e.g. 0.01) | Policy drifts far from reference: strong optimization, risk of degeneration
Medium (e.g. 0.1) | Balanced: the common default starting point
Large (e.g. 0.5) | Policy stays close to reference: gentle, conservative updates
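
One way to see why a small β permits more drift is to ask how large the log-ratio gap between chosen and rejected must become before the loss treats a pair as confidently learned. The sketch below solves σ(β · gap) = 0.9 for the gap. The 0.9 target and the helper name are illustrative assumptions.

```python
import math

def required_log_ratio_gap(target_pref_prob, beta):
    """Log-ratio gap needed so that sigmoid(beta * gap) == target_pref_prob."""
    logit = math.log(target_pref_prob / (1.0 - target_pref_prob))
    return logit / beta

for beta in (0.01, 0.1, 0.5):
    gap = required_log_ratio_gap(0.9, beta)
    print(f"beta={beta}: gap of {gap:.1f} nats")
```

The required gap is roughly 220 nats for β = 0.01, 22 nats for β = 0.1, and 4.4 nats for β = 0.5. A smaller β therefore asks the policy to move much further from the reference before the loss flattens out.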

Why the Reference Model Matters

The reference model anchors DPO, just as it anchored the KL penalty in RLHF. Without it, DPO would have no notion of 'how far have we drifted' — it would push chosen-response probabilities up without restraint, potentially collapsing into degenerate outputs. The reference, usually the SFT model, defines the sensible starting point that the policy is contrasted against. The implicit reward is always measured RELATIVE to this anchor.

⚠️
A Subtle DPO Failure: Both Probabilities Can Drop
A surprising and well-documented DPO behaviour: the loss only cares about the DIFFERENCE between the chosen and rejected rewards (log-ratios), not their absolute values. So DPO can decrease the probability of the chosen response — as long as it decreases the rejected one MORE. In the extreme, the model can make BOTH chosen and rejected less likely, which can degrade overall generation quality.
This is a real, studied limitation. It is part of why variants like IPO and the careful use of β and the reference exist (Sections 24.7–24.8). Monitoring the absolute log-probability of chosen responses during DPO training, not just the loss, is a useful diagnostic to catch this.
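
The blindness of the loss to absolute values is easy to demonstrate. In the sketch below, two hypothetical training outcomes give the identical DPO loss, even though the chosen response became more likely in one and far less likely in the other. All log-probabilities are made-up numbers.

```python
import math

def dpo_loss_from_logps(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # -log(sigmoid(margin))

ref_chosen, ref_rejected = -13.0, -14.0   # made-up reference log-probs

# Healthy run: chosen goes up by 1 nat, rejected goes down by 1 nat.
healthy = dpo_loss_from_logps(-12.0, -15.0, ref_chosen, ref_rejected)

# Degenerate run: both are 5 nats lower than in the healthy run; the gap is unchanged.
both_drop = dpo_loss_from_logps(-17.0, -20.0, ref_chosen, ref_rejected)

print(math.isclose(healthy, both_drop))   # True: the loss cannot tell them apart
print(-12.0 - ref_chosen, -17.0 - ref_chosen)  # chosen log-ratio: the value to monitor
```

Logging the chosen log-ratio (policy minus reference) alongside the loss separates the two cases immediately: it is positive in the healthy run and negative in the degenerate one.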
24.6

DPO is simpler than the RL methods of Chapter 23, but simpler is not always better. Understanding the trade-offs lets you choose well. Let us compare DPO against both PPO and GRPO across the dimensions that matter in practice.

Dimension | DPO | PPO | GRPO
Reward model | None (implicit) | Required | Required (or verifier)
RL loop | No | Yes | Yes
Models in memory | 2 | 4 | 3
Generation in loop | No | Yes | Yes
Stability | High | Low | Medium
Ease of use | High | Low | Medium
Online exploration | No (fixed data) | Yes | Yes
Best for | General alignment | Max performance | Reasoning/verifiable

The Crucial Difference: Online vs Offline

The deepest distinction is this: DPO is OFFLINE — it trains on a FIXED dataset of preference pairs, never generating new responses. PPO and GRPO are ONLINE — they GENERATE fresh responses from the current policy and get feedback on those. Online methods can explore and improve beyond the fixed data; offline DPO is limited to the preferences it was given.

Offline (DPO)
Trains on a fixed, pre-collected dataset of preference pairs. Cannot explore new responses. Simple and stable, but limited by the data's coverage.
Online (PPO, GRPO)
Generates fresh responses during training and gets feedback on them. Can explore and discover better responses, but is more complex and unstable.

This online/offline gap is the main reason the RL methods sometimes outperform DPO. Because PPO and GRPO generate and evaluate fresh responses, they can find high-quality outputs that never appeared in any fixed dataset. DPO can only re-weight the responses it was shown. For pushing absolute peak performance — especially on reasoning, where GRPO can sample and verify many fresh solution attempts — the online methods retain an edge.

ML Connection: Why GRPO Won for Reasoning, DPO Won for General Alignment
The choice often comes down to whether you can generate-and-check. For general chat alignment, good preference data is available and offline DPO is simple and effective — so DPO dominates open-model alignment. For reasoning (Chapter 25), you can generate many solution attempts and VERIFY them automatically (does the math check out?), which is exactly the online setting GRPO excels at — so GRPO dominates reasoning training.
So the methods are complementary, not competing: DPO for accessible general alignment from fixed preferences, GRPO for online optimization where fresh samples can be scored. Many modern pipelines use BOTH at different stages.
Pref Note: Iterative / Online DPO
The offline limitation of DPO can be partly addressed with ITERATIVE DPO: train with DPO, generate fresh responses from the improved model, collect new preferences on those, and repeat. This reintroduces some of the exploration benefit of online methods while keeping DPO's per-round simplicity. It is a popular middle ground that narrows the gap to PPO/GRPO.
The general lesson: the offline/online distinction is a spectrum, not a binary. Iterative DPO sits between pure offline DPO and fully online RL, trading some simplicity for some exploration.
24.7

DPO has a known weakness: it can OVERFIT to the preferences, especially when the chosen and rejected responses are clearly different or when preferences are nearly deterministic. The model pushes the chosen/rejected reward gap toward infinity, which can hurt generation. IPO (Identity Preference Optimization; Azar et al., 2023) addresses this with a small but principled change to the loss.

The Overfitting Problem

Recall the DPO loss uses a sigmoid of the reward difference. When a preference is clear, the loss keeps pushing the reward gap larger and larger — there is no point at which it is 'satisfied'. With finite data this drives the model to over-confident extremes, drifting too far from the reference and degrading quality. DPO lacks a natural stopping point for how large the preference margin should become.

IPO's Fix

IPO replaces the sigmoid-based DPO loss with a SQUARED-ERROR loss that targets a SPECIFIC margin rather than pushing it to infinity. Instead of 'make the chosen reward as much bigger as possible', IPO says 'make the chosen reward bigger by a specific, regularized amount'. This gives the optimization a target to settle at, preventing the runaway over-optimization.

IPO vs DPO loss (schematically)
DPO:  -log σ(β · (margin))         # pushes margin → ∞, can overfit

IPO:  ( margin  -  1/(2β) )²         # targets a finite margin, regularized

where margin = log(π(y_w)/π_ref(y_w)) - log(π(y_l)/π_ref(y_l))
# IPO's squared loss has a minimum at a finite margin → no runaway.
Pref Note: IPO Is a One-Line Change
In implementation, switching from DPO to IPO is essentially changing the loss function — replacing the -log-sigmoid with a squared-error-to-target term. Everything else (the log-probability computations, the reference model, the training loop) stays identical. This is typical of the DPO variant family: they share DPO's structure and differ mainly in the exact loss form.
IPO is worth trying when you observe DPO overfitting — the telltale signs being the chosen/rejected gap exploding and generation quality degrading even as the loss looks great. It trades a little of DPO's aggressiveness for more stability.
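
The difference in behaviour shows up clearly when both schematic losses are evaluated as a function of the margin. This is a minimal sketch of the two formulas as written in this section; the function names and the choice β = 0.1 are illustrative assumptions.

```python
import math

def dpo_margin_loss(margin, beta):
    """-log sigmoid(beta * margin), written as log(1 + exp(-beta * margin))."""
    return math.log1p(math.exp(-beta * margin))

def ipo_margin_loss(margin, beta):
    """Squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
for margin in (0.0, 5.0, 10.0, 50.0):
    print(margin,
          round(dpo_margin_loss(margin, beta), 4),
          round(ipo_margin_loss(margin, beta), 2))
```

With β = 0.1 the IPO target margin is 5. The DPO column keeps shrinking as the margin grows, so the optimizer is always rewarded for pushing further, while the IPO column reaches zero at the target and then rises again, penalizing any overshoot.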
24.8

DPO and IPO both need PAIRED preference data (a chosen and a rejected response for the SAME prompt) and a separate REFERENCE model. Two further variants relax these requirements, making alignment even more practical in settings where paired data or a reference is inconvenient.

KTO: No Pairs Needed

KTO (Kahneman-Tversky Optimization; Ethayarajh et al., 2024) removes the need for PAIRED data. Instead of requiring a chosen AND rejected response for each prompt, KTO works with INDIVIDUAL responses each labeled simply 'good' or 'bad' — a binary thumbs-up/thumbs-down. This is far easier to collect: real-world feedback (a user liking or disliking a response) is naturally unpaired.

KTO draws on prospect theory from behavioural economics (the Kahneman-Tversky model of how humans weigh gains and losses) to define a loss over these individual labels. The practical upshot: if your feedback is unpaired thumbs-up/down signals — as most real product feedback is — KTO lets you use it directly, without constructing pairs.
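
To give a feel for what a loss over individual labels can look like, here is a simplified sketch of the per-example shape described in the KTO paper, as we read it. It is not a reference implementation: the reference point `z_ref` is passed in as a plain number, whereas the paper estimates it from the batch, and the function name, the weights, and the default β are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_example_loss(log_ratio, is_good, z_ref, beta=0.1, w_good=1.0, w_bad=1.0):
    """Loss for ONE labeled response (simplified sketch).

    log_ratio: log pi(y|x) - log pi_ref(y|x) for this response.
    z_ref:     reference point the log-ratio is compared against (passed in here).
    """
    if is_good:
        # Good responses gain value when their log-ratio rises above the reference point.
        return w_good - w_good * sigmoid(beta * (log_ratio - z_ref))
    # Bad responses gain value when their log-ratio falls below the reference point.
    return w_bad - w_bad * sigmoid(beta * (z_ref - log_ratio))

print(kto_example_loss(2.0, True, 0.0), kto_example_loss(-2.0, True, 0.0))
print(kto_example_loss(2.0, False, 0.0), kto_example_loss(-2.0, False, 0.0))
```

Each call uses a single response and a single binary label: raising the log-ratio lowers the loss for a 'good' response and raises it for a 'bad' one, so no chosen/rejected pairing is ever required.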

ORPO: No Reference Model Needed

ORPO (Odds Ratio Preference Optimization; Hong et al., 2024) removes the need for a separate REFERENCE model AND combines SFT and preference optimization into a SINGLE stage. It adds a preference term (based on the odds ratio between chosen and rejected) directly to the standard SFT loss. The model learns to follow instructions AND prefer good responses in one training run, with no separate reference model to hold in memory.
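
The structure of 'SFT loss plus an odds-ratio preference term' can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes each response is summarized by its average per-token log-probability under the policy (which must be negative), and the function names and the weight `lam` are illustrative assumptions.

```python
import math

def log_odds(avg_logp):
    """log(p / (1 - p)), where p = exp(avg_logp) is a length-normalized likelihood in (0, 1)."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_example_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    sft_term = -avg_logp_chosen  # ordinary negative log-likelihood on the chosen response
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    pref_term = math.log1p(math.exp(-log_odds_ratio))  # -log sigmoid(log odds ratio)
    return sft_term + lam * pref_term

# Same chosen likelihood, but the rejected response is less likely in the first case.
print(orpo_example_loss(-0.5, -2.0))
print(orpo_example_loss(-0.5, -0.6))
```

Only the policy's own probabilities appear: there is no π_ref term anywhere, which is what allows ORPO to skip the reference model. The first call gives the lower loss because the policy already separates chosen from rejected more clearly.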

Method | Needs pairs? | Needs reference? | Key advantage
DPO | Yes | Yes | The original; simple, effective
IPO | Yes | Yes | Resists overfitting
KTO | No | Yes | Uses unpaired good/bad labels
ORPO | Yes | No | One stage; no reference model
A Family United by One Idea
DPO, IPO, KTO, and ORPO all share the same core insight from Section 24.2: the reward is implicit in the policy, so preference alignment can be done with a direct supervised loss rather than RL. They differ only in the exact loss form and what data or models they require. Once you understand DPO's derivation, the variants are easy — each is a targeted modification for a specific practical need.
This is the signature of a mature subfield: a foundational idea (implicit reward) spawns a family of practical variants, each tuned for a different constraint (overfitting, unpaired data, no reference). Knowing the foundation lets you understand and even invent variants.
24.9

Since DPO trains directly on preference pairs with no reward model to smooth things over, the QUALITY of the preference data matters even more than in RLHF. The reward model in RLHF could generalize beyond noisy individual labels; DPO is more directly exposed to whatever is in the pairs. This makes preference-data curation central to DPO success.

Sources of Preference Data

Human comparisons: annotators pick the better of two responses (the classic source, highest quality, expensive).
AI feedback (RLAIF): a strong model judges which response is better, following a rubric — scalable and increasingly common.
Existing datasets: public preference datasets (e.g. from chat platforms or curated collections) provide ready-made pairs.
Synthetic / constructed: generate a good and a deliberately worse response (e.g. via a weaker model or by corruption) to form pairs.

Common Pitfalls

Pitfall | Why it hurts DPO
Noisy / inconsistent labels | DPO trusts each pair directly; noise pushes the policy wrong
Length bias | If chosen responses are systematically longer, DPO learns 'longer = better'
Distribution mismatch | Pairs from a different model than the policy give weak signal
Too-easy pairs | If chosen ≫ rejected trivially, the model learns little of value
Contamination of style | Models learn formatting quirks present in the chosen responses
⚠️
The Length-Bias Trap
A famous DPO (and RLHF) pitfall: if your chosen responses tend to be LONGER than the rejected ones — which happens naturally, since annotators often prefer thorough answers — the model can learn the shortcut 'longer is better' and become needlessly verbose. This is a form of reward hacking that DPO is just as susceptible to as RLHF, even without an explicit reward model.
Defenses include length-balancing the preference data, adding explicit length penalties, or using variants/regularizers designed to decorrelate length from preference. Always check whether your aligned model has simply become longer rather than genuinely better.
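
A quick audit of the preference data before training can reveal this bias. The sketch below is one possible check, under the simplifying assumption that whitespace-separated word counts are a good enough proxy for length; the function name and the toy pairs are invented for illustration.

```python
def length_bias_report(pairs):
    """pairs: list of (chosen_text, rejected_text). Length is counted in whitespace tokens."""
    diffs = [len(c.split()) - len(r.split()) for c, r in pairs]
    longer = sum(d > 0 for d in diffs)
    return {
        "frac_chosen_longer": longer / len(diffs),
        "mean_length_gap": sum(diffs) / len(diffs),
    }

# Toy data (assumption): in two of the three pairs the chosen response is longer.
toy_pairs = [
    ("Paris is the capital of France.", "Paris."),
    ("Use a set for O(1) membership checks.",
     "Try a list and loop over every element each time you check."),
    ("Yes, and here is a short reason why.", "Yes."),
]
print(length_bias_report(toy_pairs))
```

A fraction well above 0.5 or a large positive mean gap suggests the data could teach 'longer is better'. Running the same report on the model's outputs before and after alignment is a simple way to check whether it has merely become more verbose.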
Pref Note: On-Policy Data Helps DPO
DPO works best when the preference pairs come from responses generated by a model CLOSE to the one being trained (ideally the SFT model itself). If the pairs come from a very different model, the responses lie outside the policy's natural distribution, and DPO's signal is weaker. This is the offline-data version of the distribution-matching concern — and a key reason iterative DPO (regenerating pairs from the improving model) helps.
Practically: when possible, generate your candidate responses from your own SFT model, then collect preferences on those. This 'on-policy' preference data gives DPO the strongest, most relevant signal.
24.10

With DPO, its variants, and the RL methods of Chapter 23 all in hand, how do you choose? Here is a practical decision guide reflecting current best practice.

Situation | Recommended method
Starting out / general alignment | DPO: simple, stable, effective
DPO is overfitting / degrading | IPO
Only unpaired good/bad labels | KTO
Want one-stage SFT+alignment, no reference | ORPO
Limited GPU memory | DPO + QLoRA
Reasoning / verifiable rewards | GRPO (Chapter 23)
Chasing absolute peak performance | PPO or GRPO (online)
Have lots of compute and expertise | Iterative DPO or online RL

The Pragmatic Default

For the vast majority of practitioners and use cases, the answer is: START WITH DPO. It captures most of the benefit of preference alignment, trains as easily as SFT, is stable, and composes with LoRA/QLoRA for single-GPU use. Reach for the variants when you hit a specific problem (IPO for overfitting, KTO for unpaired data, ORPO to skip the reference), and reach for the online RL methods (GRPO, PPO) only when you need online exploration — most importantly, for reasoning with verifiable rewards.

Pref Note: Start Simple, Escalate Only as Needed
The recurring theme of Part V: start with the simplest method that could work, and add complexity only when a real limitation forces you to. SFT before preference optimization; DPO before RL; rejection sampling before PPO. Most of the value comes from the simple methods done well on good data. The exotic, heavyweight methods earn their complexity only in specific high-stakes settings.
This is not just pragmatism — it is good engineering discipline. Each layer of complexity you add is a layer that can break, that must be tuned, and that must be debugged. Earn it.
24.11

DPO did more than provide a new algorithm — it shifted how the field thinks about and practices alignment. It is worth stepping back to see the broader change.

From RL to Supervised Learning

Before DPO, 'aligning a model with preferences' was synonymous with reinforcement learning — it required RL expertise, infrastructure, and tolerance for instability. DPO showed that the same goal could be reached with supervised learning. This reframing made alignment a mainstream skill rather than a specialized one, and it is a large part of why aligned open models proliferated after 2023.

The Theoretical Unification

DPO also clarified the THEORY. By showing that the RLHF objective and a simple classification loss are two views of the same thing (via the implicit-reward insight), it connected reinforcement learning, preference modeling, and supervised learning. The variants (IPO, KTO, ORPO) and even the connection to GRPO all flow from understanding this unified picture. Alignment became less of a bag of tricks and more of a coherent framework.

ML Connection: The Ongoing Dialogue Between Simple and Powerful
The DPO-vs-RL story is a recurring pattern in machine learning: a powerful but complex method (PPO-RLHF) is followed by a simpler reformulation (DPO) that captures most of its value, which is then followed by recognition of where the powerful method still wins (online RL for reasoning, via GRPO). Neither fully replaces the other; they coexist, each suited to different needs.
The mature practitioner knows both: the simple method for the common case, the powerful method for the demanding case, and the judgment to tell them apart. This is the synthesis Part V has been building toward — and Chapter 25, on reasoning, is where the online RL methods come back into their own.

What Remains Hard

DPO simplified the ALGORITHM, but the hard parts of alignment remain: collecting good preference data, deciding what 'better' means (the values question), avoiding shortcuts like length bias, and the deeper challenge of specifying human values at all. DPO made the optimization easy; it did not make alignment easy. The remaining difficulty has moved from the algorithm to the data and the values — which is exactly where Chapter 26, on Constitutional AI and safety, will focus.

24.12

DPO Quick-Reference

Concept | Key idea | Remember
DPO motivation | Alignment without reward model or RL | Captures RLHF's benefit, far simpler
Implicit reward | r = β log(π/π_ref) | Reward is hidden in the policy
DPO loss | -log σ(βΔ of log-ratios) | Bradley-Terry on implicit reward
What it keeps | Reference model + β | 2 models, not 4
Offline vs online | DPO offline; PPO/GRPO online | Online can explore beyond data
IPO | Squared loss to a target margin | Fixes DPO overfitting
KTO | Unpaired good/bad labels | No pairs needed
ORPO | SFT + preference, no reference | One stage, no reference model

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

Exercise 1: Pen & Paper
List everything DPO eliminates compared to PPO-RLHF, and the two things it keeps. Why does keeping the reference model matter?
Exercise 2: Pen & Paper
Explain in your own words the key insight that 'reward is implicit in the policy'. What is the implicit reward of a response?
Exercise 3: Derive
Starting from the optimal KL-penalized policy π* ∝ π_ref · exp(r/β), solve for r to obtain the implicit reward r = β log(π/π_ref) + const.
Exercise 4: Derive
Substitute the implicit reward into the Bradley-Terry model and show that the constant cancels in the difference, yielding the DPO loss.
Exercise 5: Pen & Paper
Explain why DPO is a 'contrastive' form of SFT. What signal does it use that plain SFT does not?
Exercise 6: Pen & Paper
Describe the role of β in DPO. What happens to the policy as β → 0 and as β → large? Connect to the RLHF KL penalty.
Exercise 7: Pen & Paper
Explain the failure mode where DPO decreases BOTH chosen and rejected probabilities. Why can this happen, and what would you monitor to catch it?
Exercise 8: Pen & Paper
Compare offline (DPO) and online (PPO/GRPO) preference optimization. Why can online methods exceed the quality of any fixed dataset?
Exercise 9: Pen & Paper
Explain how IPO fixes DPO's overfitting. Why does a squared-error-to-target loss avoid the runaway margin of the sigmoid loss?
Exercise 10: Pen & Paper
Compare DPO, IPO, KTO, and ORPO on whether they need paired data and a reference model. Give a scenario where each is the best choice.
Exercise 11: Code
Implement sequence_logprob: the summed log-probability of a response under a model. Verify it against a manual computation on a tiny example.
Exercise 12: Code
Implement the DPO loss from scratch using your sequence_logprob. Verify that it decreases when the policy raises the chosen and lowers the rejected log-prob.
Exercise 13: Code Lab
Build the full DPO training loop with a frozen reference. Fine-tune a small SFT model on preference pairs and show before/after that it prefers chosen-style responses.
Exercise 14: Code
Track the absolute log-probabilities of chosen and rejected responses during DPO training. Demonstrate the 'both drop' failure mode by overtraining, and show how it correlates with quality loss.
Exercise 15: Code
Sweep β over several values in your DPO loop. Plot how far the policy drifts from the reference (KL divergence) as a function of β.
Exercise 16: Code
Implement the IPO loss (squared error to a target margin) and swap it into your training loop. Compare its overfitting behaviour to DPO on the same data.
Exercise 17: Code
Implement KTO-style training on UNPAIRED good/bad labeled responses. Show it can align a model without constructing preference pairs.
Exercise 18: Code
Demonstrate the length-bias trap: construct preference data where chosen responses are systematically longer, run DPO, and show the model becomes verbose. Then length-balance the data and show the effect shrinks.
Exercise 19: Code Lab
Combine DPO with LoRA: run DPO where the policy is base+LoRA and the reference is the frozen base. Confirm only the adapter trains and report the memory savings.
Exercise 20: Code (Challenge)
Build a head-to-head comparison: align the same SFT model with (a) DPO and (b) your GRPO loop from Chapter 23, on the same preference data / reward. Compare final quality (held-out judge), training stability, wall-clock time, and memory. Write up the offline-vs-online trade-off you observe, and recommend which to use for general alignment vs reasoning.

Further reading: “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (Rafailov et al., 2023) — the DPO paper, whose subtitle captures the implicit-reward insight. “A General Theoretical Paradigm to Understand Learning from Human Preferences” (Azar et al., 2023, IPO). “KTO: Model Alignment as Prospect Theoretic Optimization” (Ethayarajh et al., 2024). “ORPO: Monolithic Preference Optimization without Reference Model” (Hong et al., 2024). “Is DPO Superior to PPO for LLM Alignment?” (Xu et al., 2024) for the DPO-vs-RL debate. The Hugging Face TRL library implements DPO, IPO, KTO, ORPO, and GRPO behind a consistent interface.


Next → Chapter 25: Reasoning & Chain-of-Thought

You can now align a model with human preferences — by RL (Chapter 23) or directly (this chapter). But alignment to preferences mostly shapes HOW a model responds, not how well it THINKS through hard problems. Chapter 25 turns to reasoning: chain-of-thought prompting that lets models work step by step, the training methods that make reasoning a learned skill, and the test-time-compute paradigm where models 'think longer' to solve harder problems. Crucially, this is where the online RL methods — especially GRPO with verifiable rewards — return to center stage: by generating many solution attempts and rewarding the correct ones, models can be taught to reason. The preference optimization of Part V's first half meets the reasoning revolution.
