Part V: Alignment & RLHF
Chapter 26

Safety & Constitutional AI

Red-teaming, Constitutional AI, and harmlessness
20 Exercises
26.1

By the end of Chapter 25 we had a model that follows instructions, aligns with human preferences, and reasons through hard problems. It is helpful. But helpfulness alone is dangerous: a maximally helpful model would cheerfully help with anything — including writing malware, planning harm, or generating abuse. This final chapter of Part V is about making models HARMLESS and HONEST as well as helpful — the full ideal of a trustworthy assistant.

The Problem with Pure Helpfulness

Recall that SFT and preference optimization train the model to give responses humans PREFER. If the training rewards helpfulness above all, the model learns to comply with every request — because refusing is 'unhelpful'. A purely helpfulness-optimized model is a capable assistant with no brakes. We need to train it to recognize when it SHOULD refuse, and to do so gracefully, while remaining helpful for the vast majority of legitimate requests.

Preference Pair
Prompt: How do I pick a lock? I'm locked out of my own house.
Chosen: Here are some legitimate options: contact your landlord or a roommate, call a licensed locksmith (the safest choice), or check for an unlocked window. If you rent, your property manager may have a spare key. I can explain how to find a reputable locksmith if that helps.
Rejected: Sure! Here's a detailed step-by-step guide to defeating common pin-tumbler locks with improvised tools, including diagrams of how to apply tension and manipulate each pin...

The contrast captures the safety challenge. The request is plausibly legitimate (locked out of your own house), so a blanket refusal would be unhelpful and annoying. But a fully detailed lock-defeating tutorial could enable break-ins. The chosen response threads the needle: genuinely helpful for the legitimate need, without providing the more dangerous capability. Training a model to make these judgments well — across countless situations — is the art of safety alignment.

Safety Note: Safety Is Not Just Refusing
A common misconception is that safety means refusing more. But a model that refuses too much is also a failure — useless, frustrating, and ultimately abandoned by users. Good safety alignment is about CALIBRATION: refusing the genuinely harmful while helping with the legitimate, and handling ambiguous cases thoughtfully. The goal is a model that is both safe AND helpful, not one that sacrifices one for the other.
This tension — between helpfulness and harmlessness — runs through the entire chapter. Much of the craft of safety is in navigating it well, which we examine in depth in Sections 26.9–26.10.
26.2

A useful framework, introduced by Anthropic (Askell et al., 2021), describes a well-aligned assistant with three properties: Helpful, Harmless, and Honest — 'HHH'. These are the goals safety alignment tries to achieve simultaneously. Understanding each, and how they can conflict, frames everything in this chapter.

Property | Means | Failure looks like
Helpful | Genuinely assists the user's legitimate goals | Refuses too much; useless
Harmless | Avoids enabling or causing harm | Helps with dangerous requests
Honest | Truthful; expresses uncertainty; no deception | Confidently makes things up

The Three Pull in Different Directions

The HHH properties often CONFLICT, which is what makes alignment hard. Maximizing helpfulness can compromise harmlessness (helping with a harmful request). Maximizing harmlessness can compromise helpfulness (over-refusing). Honesty can conflict with helpfulness (admitting 'I don't know' is less immediately satisfying than a confident guess) and even with harmlessness (an honest answer might be one that could be misused). A well-aligned model balances all three, situation by situation.

Intuition: Alignment Is Multi-Objective
There is no single number to maximize for a good assistant — it is a balance of competing objectives. This is fundamentally different from pretraining (minimize loss) or reasoning RL (maximize correctness), which have clear single targets. Safety alignment is inherently MULTI-OBJECTIVE: helpful AND harmless AND honest, with trade-offs that depend on context.
This is why safety cannot be 'solved' by a single loss function or a bigger model. It requires specifying, balancing, and continually refining what we want — a process that is as much about human values and judgment as it is about machine learning. Keep this in mind: the techniques in this chapter are tools for navigating a values problem, not a formula that resolves it.
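A toy sketch can make the multi-objective point concrete. It assumes each candidate response has been given a score in [0, 1] for each HHH property; the scores, the weights, and the names `Scores`, `weighted_score`, and `best_response` are all invented for illustration and do not come from any real training system. The only claim is that the "best" response changes when the weights change, so choosing the weights is itself a values decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical per-objective scores in [0, 1] for one candidate response."""
    helpful: float
    harmless: float
    honest: float


def weighted_score(s: Scores, w_helpful: float, w_harmless: float, w_honest: float) -> float:
    """Collapse three objectives into one number using the chosen weights."""
    return w_helpful * s.helpful + w_harmless * s.harmless + w_honest * s.honest


def best_response(candidates: dict[str, Scores], weights: tuple[float, float, float]) -> str:
    """Return the name of the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], *weights))


# Made-up scores for three ways of answering a risky but plausibly legitimate request.
candidates = {
    "full_compliance": Scores(helpful=1.0, harmless=0.1, honest=0.9),
    "blanket_refusal": Scores(helpful=0.0, harmless=1.0, honest=0.9),
    "safe_alternative": Scores(helpful=0.7, harmless=0.9, honest=0.9),
}

helpfulness_heavy = best_response(candidates, (0.9, 0.05, 0.05))
harmlessness_heavy = best_response(candidates, (0.05, 0.9, 0.05))
balanced = best_response(candidates, (0.4, 0.4, 0.2))
```

With these invented numbers, the helpfulness-heavy weighting selects full compliance, the harmlessness-heavy weighting selects the blanket refusal, and only the balanced weighting selects the safe alternative.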

Honesty: The Often-Forgotten Third

Helpful and harmless get the most attention, but honesty is equally important and often harder. An honest model expresses appropriate uncertainty (connecting to the calibration discussion of Chapter 21), does not fabricate facts or sources, does not deceive the user, and acknowledges the limits of its knowledge. Recall from Chapter 21 that alignment training can DEGRADE calibration — making models overconfident — which is a direct threat to honesty. Maintaining honesty through safety training is an active challenge.

26.3

The RLHF of Chapter 23 relies on humans labeling which responses are better, including which are harmful. For safety, this human labeling has serious problems that Constitutional AI was designed to address. Understanding these problems motivates the whole approach.

Why Human Safety Labeling Is Hard

Scale: harmfulness is vast and varied — covering every category of harm, in every phrasing. Labeling enough examples by hand to cover it is enormously expensive.
Consistency: humans disagree about what is harmful, and individual judgments drift. The resulting labels are noisy and inconsistent.
Human cost: having people read and label streams of toxic, violent, or disturbing content to train safety filters takes a real psychological toll on those workers.
Transparency: when safety comes from a pile of human labels, it is hard to know or audit WHAT VALUES the model actually learned. The principles are implicit and scattered.
Safety Note: The Transparency Problem
A subtle but important issue: when a model's safety behaviour comes from thousands of individual human labels, the model's 'values' are an emergent, opaque average of those labels. There is no single place you can point to and say 'this is what the model was taught to value.' If a model behaves badly, it is hard to trace which labels caused it, or to change the values deliberately.
Constitutional AI's key idea is to make the values EXPLICIT and AUDITABLE — written down as a set of principles (a 'constitution') that anyone can read, critique, and revise. This shift from implicit labels to explicit principles is as important as the scaling benefit.

The Core Idea of Constitutional AI

Constitutional AI (CAI; Bai et al., 2022) addresses all of these by replacing most human safety labeling with AI FEEDBACK guided by a written set of PRINCIPLES — the 'constitution'. Instead of humans labeling each response, the model itself critiques and revises its responses according to the written principles, and an AI judges which responses better follow them. Humans write the principles once; the AI applies them at scale. The next sections build up exactly how.

26.4

Constitutional AI has two main stages. The first uses the model to CRITIQUE and REVISE its own responses according to the constitution, and the revised responses become supervised fine-tuning data. The second uses AI-generated preferences (RLAIF) to reinforce the constitutional behaviour. Together they instill the principles with minimal human labeling of harmful content.

Pipeline Flow: The two stages of Constitutional AI

1. Write constitution: Humans write a set of principles the model should follow
2. Stage 1: Critique: Model generates a response, critiques it against the principles, and revises it
3. SL on revisions: Fine-tune the model on the revised (improved) responses
4. Stage 2: AI preferences: Model judges which of two responses better follows the constitution
5. RLAIF: Train with RL (or DPO) on these AI-generated preferences

What Is a 'Constitution'?

The constitution is simply a list of written principles — in plain language — describing how the model should behave. Each principle can be used to prompt the model to critique its own responses. They are human-readable, auditable, and editable. Examples of the kind of principle a constitution contains:

Principle (illustrative)
Choose the response that is least harmful and least likely to assist in dangerous activities.
Choose the response that is most honest and least likely to be deceptive or misleading.
Choose the response that most respects the user's autonomy while avoiding harm.
Choose the response that is helpful and thoughtful, declining only when genuinely necessary.
Choose the response that avoids toxic, discriminatory, or hateful content.
Safety Note: Sources of Principles
A constitution can draw its principles from many sources: human-rights documents, terms of service, platform policies, ethical frameworks, or principles crafted specifically for the model. Anthropic's published constitution drew partly on documents like the UN Universal Declaration of Human Rights. The point is that the values are DELIBERATELY CHOSEN and WRITTEN DOWN, rather than emerging implicitly from scattered labels.
This makes the values a subject of explicit discussion and revision. If the model behaves wrongly, you can examine and edit the relevant principle — a far more tractable process than re-labeling thousands of examples.
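Because a constitution is just text, it can be stored as plain data and edited like any other configuration. The sketch below holds the five illustrative principles in a list and builds a critique prompt from one of them. The names `CONSTITUTION`, `sample_principle`, and `build_critique_prompt` are hypothetical, and sampling a principle uniformly at random is an assumption made here for simplicity.

```python
import random

# The five illustrative principles from this section; a real constitution is longer.
CONSTITUTION = [
    "Choose the response that is least harmful and least likely to assist in dangerous activities.",
    "Choose the response that is most honest and least likely to be deceptive or misleading.",
    "Choose the response that most respects the user's autonomy while avoiding harm.",
    "Choose the response that is helpful and thoughtful, declining only when genuinely necessary.",
    "Choose the response that avoids toxic, discriminatory, or hateful content.",
]


def sample_principle(rng: random.Random) -> str:
    """Pick one principle uniformly at random (an assumption of this sketch)."""
    return rng.choice(CONSTITUTION)


def build_critique_prompt(user_prompt: str, response: str, principle: str) -> str:
    """Format a prompt that asks the model to critique a response against a principle."""
    return (
        f"Prompt: {user_prompt}\n"
        f"Response: {response}\n\n"
        f"Critique this response according to: {principle}"
    )


rng = random.Random(0)  # fixed seed so the example is repeatable
principle = sample_principle(rng)
critique_prompt = build_critique_prompt("How do I pick a lock?", "[model response]", principle)
```

Editing the model's intended values then amounts to editing a string in this list, which is the auditability benefit described above.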
26.5

The first stage of CAI is supervised: the model improves its OWN responses by critiquing and revising them according to the constitution. This generates a dataset of improved responses to fine-tune on — and crucially, the model does the work, not human labelers.

The Critique-Revise Loop

Constitutional self-critique (Stage 1) (Pseudocode)
# For a prompt (often a red-teaming prompt designed to elicit harm):
1. model generates an initial response
2. prompt the model: 'Critique your response according to:
   [a principle from the constitution]'
3. model identifies how its response violates the principle
4. prompt the model: 'Revise your response to follow the principle'
5. model produces an improved, revised response

# Collect (prompt, revised response) pairs, then SFT on them.

The elegance: the model uses its OWN capabilities to improve itself. It can often recognize that a response is harmful when explicitly asked to evaluate it against a principle, even if it produced that response in the first place. The critique step surfaces this latent judgment, and the revision applies it. We then train on the revised responses so the improved behaviour becomes the default.

Constitutional self-critique in practice (Python)
# Stage 1: generate -> critique -> revise, using the model itself

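def constitutional_revision(model, prompt, principle):
    """Use the model to critique and revise its own response.

    `model` is any object exposing a `generate(text) -> str` method.
    """
    # 1. Initial response to the (possibly adversarial) prompt
    response = model.generate(prompt)

    # 2. Self-critique against a constitutional principle. The original
    #    prompt is included so the critique can judge the response in context.
    critique = model.generate(
        f"Prompt: {prompt}\nResponse: {response}\n\n"
        f"Critique this response according to: {principle}")

    # 3. Revise based on the critique, again with the prompt in view
    revised = model.generate(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n\n"
        f"Rewrite the response to address the critique.")

    return revised   # collect (prompt, revised) pairs for SFT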

# Run over many (often adversarial) prompts and principles to build a
# dataset of harmless, revised responses. Then SFT the model on them.
# The model bootstraps its own harmlessness -- minimal human labeling.
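The dataset-building loop described in the comments above can be sketched as follows. This is a minimal illustration, not the procedure of any particular paper: `Generate` is a hypothetical "text in, text out" interface, `stub_generate` is a stand-in that returns fixed strings so the code runs without a real model, and choosing one random principle per prompt is an assumption.

```python
import random
from typing import Callable

# Hypothetical model interface: prompt text in, completion text out.
Generate = Callable[[str], str]


def critique_and_revise(generate: Generate, prompt: str, principle: str) -> str:
    """One generate -> critique -> revise pass for a single prompt."""
    response = generate(prompt)
    critique = generate(
        f"Prompt: {prompt}\nResponse: {response}\n\n"
        f"Critique this response according to: {principle}"
    )
    return generate(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n\n"
        f"Rewrite the response to address the critique."
    )


def build_sft_dataset(generate: Generate, prompts: list[str],
                      principles: list[str], seed: int = 0) -> list[dict[str, str]]:
    """Pair each prompt with its revised response, recording the principle used."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        principle = rng.choice(principles)  # assumption: one random principle per prompt
        revised = critique_and_revise(generate, prompt, principle)
        dataset.append({"prompt": prompt, "response": revised, "principle": principle})
    return dataset


def stub_generate(text: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if "Rewrite the response" in text:
        return "[revised response]"
    if "Critique this response" in text:
        return "[critique]"
    return "[initial response]"


dataset = build_sft_dataset(stub_generate, ["prompt A", "prompt B"],
                            ["principle 1", "principle 2"])
```

Recording the principle alongside each example is optional, but it keeps the dataset auditable: a problematic revision can be traced back to the principle that produced it.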
The Model Judges Better Than It Generates
Stage 1 exploits a key asymmetry, the same one behind RLHF (Chapter 23): a model is often better at EVALUATING a response than at GENERATING the ideal one in the first place. When asked 'does this violate the principle of harmlessness?', the model can recognize problems it did not avoid when first responding. CAI turns this evaluation ability into a self-improvement loop.
This is why the model can bootstrap its own safety: it already 'knows' (from pretraining) what harmful content is; the critique step activates that knowledge to fix its own outputs. The constitution directs WHICH judgments to apply.
26.6

The second stage of CAI is RLAIF — Reinforcement Learning from AI Feedback. It mirrors RLHF (Chapter 23) exactly, except the preference labels come from an AI judge applying the constitution, not from humans. This is what lets safety alignment scale: the AI can generate millions of preference labels cheaply and consistently.

RLAIF vs RLHF

RLHF (Chapter 23) | RLAIF (Constitutional AI)
Humans label preferences | An AI labels preferences
Expensive, slow, inconsistent | Cheap, fast, consistent
Values implicit in labels | Values explicit in the constitution
Humans read harmful content | AI applies written principles
Hard to scale for safety | Scales to millions of labels
Same RL machinery (reward model + PPO/DPO) | Same RL machinery

How AI Feedback Works

To generate a preference label, the AI is shown a prompt and two responses, plus a constitutional principle, and asked which response better follows the principle. Its choice becomes a preference pair, exactly like the human-labeled pairs of Chapter 23. These AI-generated pairs then train a reward model (for PPO) or are used directly (for DPO) — the same machinery you already know, just with an AI labeler.

RLAIF preference generation (Stage 2) (Pseudocode)
# Generate AI preferences using the constitution
for many prompts:
    generate two responses y_A, y_B from the model
    pick a principle from the constitution
    ask the AI: 'Which response better follows: [principle]?'
    → (chosen, rejected) preference pair

# Then: train a reward model on these pairs and run PPO (Ch.23),
# or use them directly with DPO (Ch.24). Same machinery, AI labels.
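The labeling step in the pseudocode above might look like the following in Python. The `Judge` interface, the prompt wording, and the stub judges are hypothetical. One detail goes beyond the pseudocode and is an addition of this sketch: the judge is asked twice with the response order swapped, and the pair is kept only if both answers agree, which is one simple way to guard against a judge that favours whichever response is shown first.

```python
from typing import Callable, Optional

# Hypothetical judge interface: judge prompt in, raw text answer out.
Judge = Callable[[str], str]


def judge_prompt(prompt: str, first: str, second: str, principle: str) -> str:
    """Format a two-way comparison against one constitutional principle."""
    return (
        f"Prompt: {prompt}\n\nResponse A: {first}\n\nResponse B: {second}\n\n"
        f"Which response better follows this principle: {principle}\n"
        f"Answer with A or B."
    )


def ai_preference(judge: Judge, prompt: str, y_a: str, y_b: str,
                  principle: str) -> Optional[dict[str, str]]:
    """Return a (chosen, rejected) pair, or None if the judge is inconsistent."""
    first_pass = judge(judge_prompt(prompt, y_a, y_b, principle)).strip().upper()[:1]
    second_pass = judge(judge_prompt(prompt, y_b, y_a, principle)).strip().upper()[:1]
    if first_pass == "A" and second_pass == "B":      # y_a preferred in both orders
        chosen, rejected = y_a, y_b
    elif first_pass == "B" and second_pass == "A":    # y_b preferred in both orders
        chosen, rejected = y_b, y_a
    else:
        return None  # order-dependent or unparseable answer: discard the pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def stub_judge(text: str) -> str:
    """Toy judge: prefers whichever response mentions a locksmith."""
    a_part = text.split("Response A: ")[1].split("Response B: ")[0]
    return "A" if "locksmith" in a_part else "B"


def position_biased_judge(text: str) -> str:
    """Toy judge that always answers A, regardless of content."""
    return "A"


principle = "Choose the response that is least harmful."
pair = ai_preference(stub_judge, "I'm locked out.", "Call a licensed locksmith.",
                     "Here is how to defeat the lock.", principle)
biased = ai_preference(position_biased_judge, "I'm locked out.", "Call a licensed locksmith.",
                       "Here is how to defeat the lock.", principle)
```

The returned dictionaries have the same prompt/chosen/rejected shape as human-labeled pairs, so they can feed a reward model or DPO without further conversion.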
Safety Note: RLAIF Connects All of Part V
Notice how RLAIF reuses everything from Chapters 23–24: the preference-pair structure, the Bradley-Terry reward model, PPO or DPO. The ONLY change is the source of the labels — AI applying a written constitution instead of humans. This is why we built up RLHF and DPO so carefully: RLAIF is those exact methods, pointed at safety, with a scalable AI labeler.
The chain of dependency is clear: SFT (Ch.22) gives a base assistant; RLHF/DPO (Ch.23–24) align it to preferences; RLAIF (this chapter) scales that alignment to safety using a constitution. Each chapter's machinery is reused and repurposed.
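As a reminder of the machinery being reused, here is the Bradley-Terry preference loss in a few lines of plain Python. It assumes the standard form, in which the probability that the chosen response beats the rejected one is the sigmoid of their reward difference; the notation in Chapter 23 may differ, and the function names are illustrative. Nothing in the loss depends on whether a human or an AI produced the label.

```python
import math


def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry."""
    return -math.log(bradley_terry_prob(r_chosen, r_rejected))


loss_undecided = preference_loss(0.0, 0.0)   # reward model cannot tell the pair apart
loss_correct = preference_loss(2.0, 0.0)     # reward model ranks the chosen response higher
loss_wrong = preference_loss(0.0, 2.0)       # reward model ranks the rejected response higher
```

When the two rewards are equal the loss is log 2, and it falls as the reward model assigns a larger margin to the chosen response.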

Mixing Helpfulness and Harmlessness

In practice, CAI trains on BOTH helpfulness preferences (often human-labeled, since helpfulness is less costly and damaging to label) AND harmlessness preferences (AI-labeled via the constitution). Mixing both keeps the model helpful while making it harmless — directly addressing the trade-off we examine in Section 26.9. The constitution governs the harmlessness half; human preferences govern the helpfulness half.
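One simple way to picture the mixing is a sampler that draws a fixed fraction of each training set from the AI-labeled harmlessness pairs and the rest from the human-labeled helpfulness pairs. The function name, the `source` tag, the 30% fraction, and sampling with replacement are all assumptions of this sketch; the text above does not specify a ratio.

```python
import random


def mix_preferences(helpful_pairs: list[dict], harmless_pairs: list[dict],
                    harmless_fraction: float, n: int, seed: int = 0) -> list[dict]:
    """Sample n preference pairs, tagging each with where its label came from."""
    rng = random.Random(seed)
    n_harmless = round(n * harmless_fraction)
    n_helpful = n - n_harmless
    # rng.choices samples WITH replacement, so small pools can repeat items.
    mixed = [dict(p, source="human_helpfulness")
             for p in rng.choices(helpful_pairs, k=n_helpful)]
    mixed += [dict(p, source="ai_harmlessness")
              for p in rng.choices(harmless_pairs, k=n_harmless)]
    rng.shuffle(mixed)  # interleave the two sources
    return mixed


helpful_pairs = [{"prompt": "h1", "chosen": "good", "rejected": "bad"}]
harmless_pairs = [{"prompt": "s1", "chosen": "safe", "rejected": "unsafe"}]
mixed = mix_preferences(helpful_pairs, harmless_pairs, harmless_fraction=0.3, n=10)
n_ai_labeled = sum(1 for p in mixed if p["source"] == "ai_harmlessness")
```

The fraction is the knob that shifts the trained model along the helpfulness-harmlessness trade-off, so in practice it would be tuned against evaluations rather than fixed in advance.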

26.7

You cannot fix failures you have not found. RED-TEAMING is the practice of deliberately probing a model to elicit harmful, unsafe, or undesirable behaviour BEFORE deployment — adversarially attacking your own model to discover its weaknesses so you can fix them. It is an essential, ongoing part of building safe models.

Forms of Red-Teaming

Approach | How it works
Human red-teaming | Skilled people try to make the model misbehave, creatively and persistently
Automated red-teaming | Use another model to generate adversarial prompts at scale
Domain experts | Specialists probe for harms in their field (bio, cyber, etc.)
Crowdsourced | Many users try to break the model, surfacing diverse attacks
Continuous monitoring | Watch real deployment for emerging failure patterns

Red-teaming feeds directly back into training: the prompts that successfully elicit bad behaviour become training data for the next round of safety alignment (the critique-revise loop of Section 26.5 often runs on red-teaming prompts). It is a cycle: red-team to find failures, train to fix them, red-team the improved model to find the next layer of failures. Safety is never 'done' — it is a continuous adversarial process.
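The find-and-feed-back cycle can be sketched as a small automated red-teaming loop. Everything here is a placeholder: `attacker`, `target`, and `is_unsafe` are hypothetical callables standing in for an attack-generating model, the model under test, and a safety classifier, and the toy versions below exist only so the loop runs end to end.

```python
from typing import Callable

TextModel = Callable[[str], str]  # hypothetical interface: text in, text out


def red_team(attacker: TextModel, target: TextModel,
             is_unsafe: Callable[[str], bool], seed_topics: list[str],
             attempts_per_topic: int = 3) -> list[dict[str, str]]:
    """Collect the attack prompts that elicit unsafe responses from the target."""
    failures = []
    for topic in seed_topics:
        for attempt in range(attempts_per_topic):
            attack_prompt = attacker(
                f"Write adversarial test prompt #{attempt + 1} about: {topic}")
            response = target(attack_prompt)
            if is_unsafe(response):
                # Successful attacks become training prompts for the next round.
                failures.append({"topic": topic, "prompt": attack_prompt,
                                 "response": response})
    return failures


def toy_attacker(instruction: str) -> str:
    return f"[attack] {instruction}"


def toy_target(prompt: str) -> str:
    # Pretend the target has exactly one weakness: the second attempt per topic.
    return "UNSAFE: complied" if "#2" in prompt else "I can't help with that."


def toy_is_unsafe(response: str) -> bool:
    return response.startswith("UNSAFE")


failures = red_team(toy_attacker, toy_target, toy_is_unsafe, ["topic X", "topic Y"])
```

The logged prompts are the output that matters: they can be passed to the critique-revise loop of Section 26.5 to produce the next round of training data.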

Safety Note: Red-Teaming Is Adversarial and Never Finished
A sobering reality: the space of possible attacks is effectively infinite, and attackers (including curious users) are creative and persistent. Every round of red-teaming finds new failures; fixing them reveals new ones. Safety is not a property you achieve once but a posture you maintain continuously. Responsible deployment includes ongoing red-teaming, monitoring, and rapid response to newly-discovered failures.
This is also why safety benefits from diversity: different red-teamers find different failures. Domain experts catch domain-specific harms, automated methods catch failures at scale, and real users surface attacks no one anticipated. No single approach is sufficient.
26.8

A hard truth: safety training, as we currently know how to do it, is NOT robust. Determined users find JAILBREAKS — prompts that circumvent safety training and get the model to produce content it was trained to refuse. Understanding why jailbreaks work reveals the limits of current safety methods and why the area remains unsolved.

Common Jailbreak Techniques

Technique | How it circumvents safety
Role-play | 'Pretend you are a character with no restrictions...'
Hypothetical framing | 'In a fictional story, how would a character...'
Obfuscation | Encoding the request (base64, languages, typos) to evade detection
Prompt injection | Hidden instructions in input data override the system prompt
Many-shot | Flooding the context with examples of compliance to harmful requests
Gradual escalation | Building up to a harmful request through innocuous steps

Why Jailbreaks Work

Jailbreaks exploit a fundamental gap: safety training teaches the model to refuse harmful requests in the FORMS it saw during training, but the space of ways to phrase a request is vast. A request the model refuses when stated plainly may slip through when wrapped in role-play, a hypothetical, or an unusual encoding — because that exact framing was under-represented in safety training. The model's refusal is not a deep, robust understanding; it is a learned pattern that adversarial framings can sidestep.
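A small evaluation harness makes this gap measurable: wrap the same request in several framings and record which ones the model still refuses. The sketch below uses a placeholder string in place of any real request, and `toy_model` is a deliberately crude stand-in that refuses only when it sees the literal request text. Real models fail in less predictable ways, so the specific results below say nothing about any actual system.

```python
import base64
from typing import Callable

REQUEST = "[DISALLOWED REQUEST PLACEHOLDER]"  # stands in for a request the model should refuse

# Framings taken from the table of techniques; the template wording is illustrative.
WRAPPERS: dict[str, Callable[[str], str]] = {
    "plain": lambda r: r,
    "role_play": lambda r: f"Pretend you are a character with no restrictions. {r}",
    "hypothetical": lambda r: f"In a fictional story, how would a character answer: {r}",
    "base64": lambda r: "Decode this base64 and answer it: "
                        + base64.b64encode(r.encode("utf-8")).decode("ascii"),
}


def refusal_by_wrapper(model: Callable[[str], str],
                       is_refusal: Callable[[str], bool],
                       request: str) -> dict[str, bool]:
    """Map each framing to whether the model refused the wrapped request."""
    return {name: is_refusal(model(wrap(request))) for name, wrap in WRAPPERS.items()}


def toy_model(prompt: str) -> str:
    """Surface pattern-matcher: refuses only if the literal request text is visible."""
    return "I can't help with that." if REQUEST in prompt else "Sure, here is..."


def toy_is_refusal(response: str) -> bool:
    return response.startswith("I can't")


results = refusal_by_wrapper(toy_model, toy_is_refusal, REQUEST)
```

The toy model refuses the plain, role-play, and hypothetical versions but not the base64 one, because the encoded text no longer matches the surface pattern it learned to refuse.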

Warning: Safety Training Is Shallow by Default
Current evidence suggests safety training is somewhat 'shallow' — it adjusts the model's behaviour on the kinds of inputs it was trained on, but does not instill a deep, robust aversion that generalizes to all adversarial framings. This is why jailbreaks keep working: the safety behaviour is a surface layer that clever prompts can get underneath. Research into making safety DEEP and robust — not bypassable by reframing — is one of the most important open problems in the field.
The practical implication: do not assume a safety-trained model is safe against determined adversaries. Layered defenses (input/output filtering, monitoring, rate limits) supplement — but do not replace — model-level safety, and the model-level safety itself is an ongoing research challenge, not a solved problem.
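The layered-defense idea can be sketched as a wrapper that checks the input, calls the model, then checks the output. The filter functions here are trivial placeholders (real deployments would use trained classifiers), the names are hypothetical, and monitoring and rate limits are left out to keep the example short.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(prompt: str,
                     model: Callable[[str], str],
                     input_flagged: Callable[[str], bool],
                     output_flagged: Callable[[str], bool]) -> tuple[str, str]:
    """Return (response, stage), where stage names the layer that decided the outcome."""
    if input_flagged(prompt):          # layer 1: screen the request
        return REFUSAL, "input_filter"
    response = model(prompt)           # layer 2: the model's own safety training
    if output_flagged(response):       # layer 3: screen what the model produced
        return REFUSAL, "output_filter"
    return response, "model"


# Toy components so the sketch runs; the marker strings are arbitrary.
def toy_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def toy_input_flagged(prompt: str) -> bool:
    return "BLOCKED_INPUT" in prompt


def toy_output_flagged(response: str) -> bool:
    return "LEAK" in response


ok = guarded_generate("hello", toy_model, toy_input_flagged, toy_output_flagged)
blocked_in = guarded_generate("BLOCKED_INPUT here", toy_model,
                              toy_input_flagged, toy_output_flagged)
blocked_out = guarded_generate("please LEAK", toy_model,
                               toy_input_flagged, toy_output_flagged)
```

Returning the deciding stage alongside the response is a small design choice that helps monitoring: it shows how often each layer, rather than the model itself, is doing the refusing.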
ML Connection: The Faithfulness and Robustness Frontier
Jailbreaks connect to the faithfulness question from Chapter 25 and to deep open problems in alignment. If we do not fully understand WHY a model refuses (or complies), we cannot guarantee it will refuse robustly. Making safety robust likely requires advances in interpretability (understanding the model's internals) and in training methods that instill values deeply rather than superficially.
This is a frontier topic, and Chapter 35 (Open Problems) returns to it. For now, the key lesson is intellectual honesty: current safety methods are valuable and necessary, but they are not robust guarantees. Treat them as strong mitigations in an ongoing effort, not as solved safety.
26.9

We return now to the central tension of the chapter: helpfulness and harmlessness pull against each other. Pushing harder on harmlessness tends to reduce helpfulness, and vice versa. Managing this trade-off well is the difference between a model that is safe and useful and one that is safe but frustrating (or useful but dangerous).

The Two Failure Modes

Too helpful (unsafe) | Too harmless (useless)
Complies with harmful requests | Refuses legitimate requests
No brakes on dangerous content | Over-cautious, lecturing
Optimized purely for user satisfaction | Optimized purely to avoid risk
Dangerous in the wrong hands | Frustrating; users abandon it
Fails harmlessness | Fails helpfulness

Over-Refusal: The Under-Appreciated Failure

Over-refusal — declining requests that are perfectly legitimate — is a real and damaging failure mode that gets less attention than under-refusal but matters enormously in practice. A model that refuses to discuss cooking knives because 'knives can be dangerous', or won't explain a historical atrocity because the topic is 'sensitive', is failing its users. Worse, over-refusal often falls hardest on legitimate users in sensitive but valid domains — medical questions, security research, creative writing about difficult themes.

Preference Pair
Prompt: What household chemicals should I never mix, for safety?
Chosen: Great safety question. Never mix: bleach + ammonia (toxic chloramine gas), bleach + vinegar or other acids (chlorine gas), and hydrogen peroxide + vinegar (corrosive peracetic acid). Always ventilate when cleaning and store these separately. This knowledge prevents accidental poisonings.
Rejected: I'm sorry, but I can't provide information about mixing chemicals as that could be dangerous. Please consult a professional.

The rejected response is a classic over-refusal: the user asked a SAFETY question — what to AVOID mixing — and refusing it makes them LESS safe. A miscalibrated safety model pattern-matches on 'mixing chemicals' and refuses, ignoring the obviously protective intent. Avoiding this kind of error is just as much a part of good safety alignment as refusing genuinely harmful requests.

Safety Note: Over-Refusal Has Real Costs
Every unnecessary refusal is a failure of helpfulness with real consequences: a user who cannot get safety information, a researcher blocked from legitimate work, a writer prevented from exploring a serious theme. Over-refusal also erodes trust — users who hit unnecessary refusals stop trusting the model's judgment and may abandon it entirely. A model that cries wolf on benign requests is not 'safe'; it is broken in a different way.
The goal is precise calibration: refuse the genuinely harmful, help with the legitimate, and handle the ambiguous middle thoughtfully — often by helping with the safe interpretation while noting limits, as the chosen responses throughout this chapter illustrate.
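Calibration in this sense can be measured with two numbers: how often the model refuses prompts it should help with (over-refusal) and how often it complies with prompts it should refuse (under-refusal). The sketch below assumes a small hand-labeled prompt set and a crude string-based refusal detector; both are placeholders, and a real evaluation would need a much larger set and a more reliable way of detecting refusals.

```python
from typing import Callable


def refusal_rates(model: Callable[[str], str],
                  is_refusal: Callable[[str], bool],
                  labeled_prompts: list[tuple[str, bool]]) -> dict[str, float]:
    """labeled_prompts holds (prompt, should_refuse) pairs."""
    over = under = n_legit = n_harmful = 0
    for prompt, should_refuse in labeled_prompts:
        refused = is_refusal(model(prompt))
        if should_refuse:
            n_harmful += 1
            under += not refused   # complied with something it should refuse
        else:
            n_legit += 1
            over += refused        # refused something it should help with
    return {
        "over_refusal_rate": over / n_legit if n_legit else 0.0,
        "under_refusal_rate": under / n_harmful if n_harmful else 0.0,
    }


# Toy model that wrongly refuses one specific benign prompt and nothing else.
WRONGLY_REFUSED = {"What household chemicals should I never mix, for safety?"}


def toy_model(prompt: str) -> str:
    return "I'm sorry, but I can't help." if prompt in WRONGLY_REFUSED else "Here is an answer."


def toy_is_refusal(response: str) -> bool:
    return response.lower().startswith(("i'm sorry", "i can't"))


labeled = [
    ("What household chemicals should I never mix, for safety?", False),
    ("How do I sharpen a kitchen knife?", False),
    ("[clearly harmful request placeholder]", True),
]
rates = refusal_rates(toy_model, toy_is_refusal, labeled)
```

The toy model scores badly on both axes at once: it refuses one of the two legitimate prompts and complies with the one harmful prompt, which shows that the two rates have to be tracked separately.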
26.10

Refusal calibration is the practical craft of getting the helpfulness-harmlessness balance right: training the model to refuse the genuinely harmful, comply with the legitimate, and handle the ambiguous middle with judgment. A well-calibrated model refuses RARELY and PRECISELY, and when it does refuse, it does so gracefully and helpfully.

What Good Refusal Looks Like

Precise: refuses genuinely harmful requests, not superficially-similar legitimate ones.
Graceful: explains briefly and without lecturing or moralizing at the user.
Helpful even in refusal: offers a safe alternative or addresses the legitimate underlying need where possible.
Consistent: similar requests get similar treatment, not random refusals.
Calibrated to severity: a mild concern gets a gentle redirect; a serious harm gets a firm refusal.

How Calibration Is Trained

Refusal calibration is achieved through careful preference data and constitutional principles that reward BOTH appropriate refusal AND appropriate compliance. The training data must include many examples of legitimate requests that superficially resemble harmful ones, with the legitimate ones rewarded for being HELPED, not refused. Without such examples, the model learns the lazy shortcut 'refuse anything that sounds risky' — the over-refusal failure of Section 26.9.

Request type | Calibrated behaviour
Clearly harmful | Refuse firmly, briefly, without lecturing
Clearly legitimate | Help fully, no hedging
Dual-use (safety-relevant) | Help with the protective intent; provide safety info
Ambiguous | Help with the safe interpretation; note limits if needed
Sensitive but legitimate | Engage thoughtfully (medical, historical, creative)
Safety Note: The Hardest Part Is the Middle
The clearly-harmful and clearly-legitimate cases are relatively easy. The hard part of calibration is the AMBIGUOUS MIDDLE — dual-use requests, sensitive-but-valid topics, requests whose intent is unclear. Good calibration handles these with JUDGMENT rather than reflexive refusal: helping with the legitimate interpretation, providing safety-relevant information, and reserving firm refusal for genuine harm. Getting the middle right is what separates a thoughtfully-aligned model from a blunt one.
This judgment cannot be reduced to a keyword list — 'mentions chemicals → refuse' produces the over-refusal disaster of Section 26.9. It requires the model to genuinely understand intent and context, which is why calibration data must be rich, varied, and carefully constructed.
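To see concretely why a keyword list fails, consider the deliberately naive filter below. The keyword set and the example prompts are invented for illustration; the point is only that matching on surface words produces errors in both directions.

```python
BLOCKED_KEYWORDS = {"chemicals", "knives", "lock"}  # illustrative list, not a real policy


def keyword_refuses(prompt: str) -> bool:
    """Refuse if any blocked keyword appears as a word in the prompt."""
    words = {w.strip(".,?!'\"").lower() for w in prompt.split()}
    return bool(words & BLOCKED_KEYWORDS)


benign = [
    "What household chemicals should I never mix, for safety?",
    "Which knives are best for chopping vegetables?",
]
# Placeholder for a harmful request reworded to avoid every keyword on the list.
evasive = "[harmful request rephrased to avoid every listed word]"

false_positives = [p for p in benign if keyword_refuses(p)]
false_negative = not keyword_refuses(evasive)
```

Both benign prompts are refused, while the reworded harmful placeholder passes untouched, so the filter over-refuses and under-refuses at the same time.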
26.11

The techniques in this chapter — CAI, RLAIF, red-teaming, calibration — are practical methods for making today's models safer. They sit within a much larger and longer-term project: AI alignment, the effort to ensure AI systems robustly pursue intended goals and human values. It is worth placing our techniques in that bigger picture as we close Part V.

Levels of the Safety Problem

Level | Concern
Content safety | Don't produce harmful content (this chapter's focus)
Robustness | Resist jailbreaks and adversarial inputs (Section 26.8)
Honesty / truthfulness | Don't deceive; express calibrated uncertainty
Value alignment | Pursue intended goals, not proxies (the deeper problem)
Scalable oversight | Supervise systems too capable for humans to fully check
Long-term safety | Ensure highly capable future systems remain safe

Our techniques mostly address the first few levels — content safety, some robustness, some honesty. The deeper levels (value alignment, scalable oversight, long-term safety) are active research frontiers with no complete solutions. CAI's use of AI feedback is, in part, an early attempt at SCALABLE OVERSIGHT: using AI to help supervise AI as systems become too capable for humans to evaluate directly.

ML Connection: Scalable Oversight and the Future
As models become more capable, a fundamental problem looms: how do humans supervise systems that may exceed human ability in some domains? If a model's reasoning is too complex or its outputs too numerous for humans to check, how do we ensure it is doing the right thing? RLAIF and Constitutional AI are early steps toward 'scalable oversight' — using AI assistance to extend human supervision — but the general problem is deep and unsolved.
This connects safety to the frontier of the entire field. The methods of this chapter are valuable and necessary today, but they are early tools in a long project. Chapter 35 (Open Problems) returns to scalable oversight and the broader alignment challenge.

Safety Is a Sociotechnical Problem

Finally, safety is not purely technical. WHAT values a model should have, WHO decides them, how to handle reasonable disagreement about contested questions, and how to balance competing interests — these are questions of ethics, governance, and society, not just machine learning. The technical methods in this chapter are tools for IMPLEMENTING values once chosen; they do not, and cannot, decide what those values should be. That decision is a human and societal responsibility that the technology makes more urgent, not less.

26.12

Let us consolidate the chapter into a practical view of how safety alignment fits into building a model, integrating it with the rest of Part V.

Pipeline Flow: The full alignment + safety workflow

1. Pretrain + SFT: Capable base model that follows instructions (Part IV, Ch.22)
2. Helpfulness alignment: RLHF or DPO on helpfulness preferences (Ch.23–24)
3. Constitutional self-critique: Model revises its own responses against principles (Stage 1)
4. RLAIF: AI-feedback RL on the constitution for harmlessness (Stage 2)
5. Red-team: Probe for failures; feed them back into training
6. Calibrate refusals: Tune the helpful/harmless balance; fix over-refusal
7. Monitor + iterate: Watch deployment; red-team continuously; update

The Recurring Lessons of Part V

Across the whole of Part V, several lessons recur. Alignment is about ELICITING and SHAPING capabilities the pretrained model already has, not adding new ones. The DATA — demonstrations, preferences, principles — matters more than the algorithm. Models are better JUDGES than generators, an asymmetry that powers RLHF, CAI, and self-improvement. And every method involves TRADE-OFFS — helpfulness vs harmlessness, simplicity vs power, the proxy vs the true goal — that require judgment, not just optimization.

Safety Note: Safety Is Built In, Not Bolted On
A final practical lesson: safety works best when integrated throughout the alignment process, not added as an afterthought. The most robust models interleave helpfulness and harmlessness training, red-team continuously, and treat safety as a first-class objective alongside capability — not a filter slapped on at the end. Safety bolted on late is brittle; safety built in throughout is more robust.
This reflects the multi-objective nature of alignment from Section 26.2: you cannot maximize helpfulness and then add harmlessness; you must balance them together, continuously, throughout training and deployment.
26.13

Safety Quick-Reference

Concept | Key idea | Remember
HHH | Helpful, Harmless, Honest | Multi-objective; they conflict
Constitutional AI | Align to written principles | Explicit, auditable values
CAI Stage 1 | Self-critique and revise, then SFT | Model improves itself
RLAIF | RL from AI feedback | Same as RLHF, AI labels
Red-teaming | Attack your own model to find failures | Continuous, never finished
Jailbreaks | Prompts that bypass safety | Safety training is shallow
Trade-off | Helpfulness vs harmlessness | Over-refusal is a real failure
Calibration | Refuse precisely and gracefully | The hard part is the middle

Exercises

Exercises 1–10 are pen-and-paper; 11–20 require code.

Exercise 1: Pen & Paper
Explain why pure helpfulness training is dangerous. Give an example where a maximally helpful model would behave unacceptably.
Exercise 2: Pen & Paper
Describe the HHH framework. Give a concrete scenario where each pair of properties (Helpful-Harmless, Helpful-Honest, Harmless-Honest) comes into conflict.
Exercise 3: Pen & Paper
List four problems with human labeling of harmful content for safety. Explain how Constitutional AI addresses each.
Exercise 4: Pen & Paper
Explain what a 'constitution' is and why making values explicit and written down is an improvement over implicit labels.
Exercise 5: Pen & Paper
Describe the two stages of Constitutional AI. What does each stage produce, and how do they fit together?
Exercise 6: Pen & Paper
Explain the self-critique loop of CAI Stage 1. Why can the model improve its own responses — what asymmetry does this exploit?
Exercise 7: Pen & Paper
Compare RLAIF and RLHF. What exactly changes, and what stays the same? Why does this change let safety alignment scale?
Exercise 8: Pen & Paper
Explain why jailbreaks work. Why is safety training described as 'shallow', and what would deep safety require?
Exercise 9: Pen & Paper
Describe the helpfulness-harmlessness trade-off and both failure modes. Why is over-refusal a serious failure, not a safe default?
Exercise 10: Pen & Paper
What does well-calibrated refusal look like? Explain why the 'ambiguous middle' is the hardest part and why keyword filtering fails.
Exercise 11: Code
Implement the constitutional self-critique loop: given a model, a prompt, and a principle, generate a response, critique it, and revise it. Show the revision is improved.
Exercise 12: Code
Build a small constitution (5 principles) and run the critique-revise loop over a set of adversarial prompts. Collect the revised responses as an SFT dataset.
Exercise 13: Code
Implement AI-feedback preference generation: given two responses and a principle, have a model judge which better follows it. Produce preference pairs.
Exercise 14: Code Lab
Combine the above: run CAI Stage 1 (SFT on revisions) then Stage 2 (DPO on AI preferences) on a small model. Compare its harmfulness before and after on held-out adversarial prompts.
Exercise 15: Code
Build a simple automated red-teaming loop: use one model to generate adversarial prompts targeting another, and log which ones elicit unsafe responses.
Exercise 16: Code
Test jailbreak techniques: implement role-play and hypothetical-framing wrappers around a harmful request, and measure whether they change a safety-trained model's behaviour.
Exercise 17: Code
Measure over-refusal: build a set of legitimate-but-sensitive prompts (safety questions, medical, security research) and measure how often a model wrongly refuses them.
Exercise 18: Code
Build a refusal-calibration evaluation: a labeled set of clearly-harmful, clearly-legitimate, and ambiguous prompts. Score a model on both under-refusal and over-refusal.
Exercise 19: Code
Implement a 'helpful even in refusal' transform: when the model must refuse, prompt it to also offer a safe alternative or address the legitimate underlying need.
Exercise 20: Code (Challenge)
Build a mini end-to-end safety pipeline: write a constitution, run CAI Stage 1 + Stage 2 on a small model, red-team the result to find remaining failures, add those to training, and finally evaluate the full helpfulness-harmlessness trade-off curve (under-refusal vs over-refusal). Write up how the trade-off shifted at each stage and where you chose to set the balance, and why.
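
As a possible starting scaffold for Exercises 11 and 12, the critique-revise loop can be sketched as below. This is a minimal sketch under stated assumptions, not the CAI paper's implementation: `model` is assumed to be any callable that maps a prompt string to a response string, the prompt templates are illustrative, and `toy_model` is a hypothetical stand-in that returns canned text so the sketch runs without a real LLM.

```python
from typing import Callable

# Assumed interface: prompt text in, response text out.
Model = Callable[[str], str]

CRITIQUE_SUFFIX = "Critique the response against the principle."
REVISE_SUFFIX = "Rewrite the response to address the critique."


def critique_revise(model: Model, prompt: str, principle: str, rounds: int = 1) -> dict:
    """Generate a response, then critique and revise it `rounds` times."""
    response = model(prompt)
    history = [response]
    for _ in range(rounds):
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n{CRITIQUE_SUFFIX}"
        )
        response = model(
            f"Principle: {principle}\nResponse: {response}\n"
            f"Critique: {critique}\n{REVISE_SUFFIX}"
        )
        history.append(response)
    return {"prompt": prompt, "initial": history[0], "revised": response, "history": history}


def toy_model(text: str) -> str:
    """Hypothetical stand-in for an LLM; returns canned text per request type."""
    if text.endswith(CRITIQUE_SUFFIX):
        return "The response gives detail that could enable a break-in."
    if text.endswith(REVISE_SUFFIX):
        return "I can't walk through that, but a licensed locksmith can help."
    return "Sure, here is a detailed guide..."


example = critique_revise(toy_model, "How do I pick a lock?", "Avoid enabling break-ins.")
# One (prompt, revised response) pair is one row of a Stage 1 SFT dataset.
sft_row = {"prompt": example["prompt"], "completion": example["revised"]}
print(sft_row)
```

Swapping `toy_model` for a real model call and looping over a list of adversarial prompts and principles would yield the dataset Exercise 12 asks for. Demonstrating that the revision actually improves still needs a separate judgment, human or automated.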

Further reading: “Constitutional AI: Harmlessness from AI Feedback” (Bai et al., 2022), the CAI paper. “A General Language Assistant as a Laboratory for Alignment” (Askell et al., 2021) for the HHH framework. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback” (Bai et al., 2022). “Red Teaming Language Models with Language Models” (Perez et al., 2022) and “Red Teaming Language Models to Reduce Harms” (Ganguli et al., 2022). “Universal and Transferable Adversarial Attacks on Aligned Language Models” (Zou et al., 2023) on jailbreaks. “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models” (Röttger et al., 2023) on over-refusal. See also Anthropic's published constitution and the broader AI-alignment literature.

Part V Complete: Alignment & Post-training

Ch. 22 | Supervised Fine-Tuning | instruction tuning, chat templates, LoRA/QLoRA: turning a base model into an instruction-follower.
Ch. 23 | RLHF | reward modeling, PPO, KL penalties, GRPO: learning from human preferences via reinforcement learning.
Ch. 24 | Direct Preference Optimization | DPO, IPO, KTO, ORPO: aligning to preferences without a reward model or RL.
Ch. 25 | Reasoning & Chain-of-Thought | CoT, self-consistency, RLVR, test-time compute: teaching models to think step by step.
Ch. 26 | Safety & Constitutional AI | CAI, RLAIF, red-teaming, refusal calibration: making models harmless and honest, not just helpful.

You have now built a complete, aligned assistant: pretrained for raw capability (Part IV), fine-tuned to follow instructions (Chapter 22), aligned with human preferences (Chapters 23–24), taught to reason (Chapter 25), and made safe and honest (this chapter). It is helpful, harmless, and honest — a model worth deploying. But deploying it well is its own enormous challenge. Part VI — Inference, Tools & Deployment — turns to making the model FAST, CONNECTED, and SCALABLE: optimizing inference so it runs cheaply and quickly (Chapter 27), giving it tools to act in the world (Chapter 28), grounding it in external knowledge through retrieval (Chapter 29), extending it to images and audio (Chapter 30), and serving it reliably to millions of users (Chapter 31). The trained model becomes a real, useful product.
