Safety & Constitutional AI
By the end of Chapter 25 we had a model that follows instructions, aligns with human preferences, and reasons through hard problems. It is helpful. But helpfulness alone is dangerous: a maximally helpful model would cheerfully help with anything — including writing malware, planning harm, or generating abuse. This final chapter of Part V is about making models HARMLESS and HONEST as well as helpful — the full ideal of a trustworthy assistant.
The Problem with Pure Helpfulness
Recall that SFT and preference optimization train the model to give responses humans PREFER. If the training rewards helpfulness above all, the model learns to comply with every request — because refusing is 'unhelpful'. A purely helpfulness-optimized model is a capable assistant with no brakes. We need to train it to recognize when it SHOULD refuse, and to do so gracefully, while remaining helpful for the vast majority of legitimate requests.
The contrast captures the safety challenge. The request is plausibly legitimate (locked out of your own house), so a blanket refusal would be unhelpful and annoying. But a fully detailed lock-defeating tutorial could enable break-ins. The chosen response threads the needle: genuinely helpful for the legitimate need, without providing the more dangerous capability. Training a model to make these judgments well — across countless situations — is the art of safety alignment.
A useful framework, introduced by Anthropic (Askell et al., 2021), describes a well-aligned assistant with three properties: Helpful, Harmless, and Honest — 'HHH'. These are the goals safety alignment tries to achieve simultaneously. Understanding each, and how they can conflict, frames everything in this chapter.
| Property | Means | Failure looks like |
|---|---|---|
| Helpful | Genuinely assists the user's legitimate goals | Refuses too much; useless |
| Harmless | Avoids enabling or causing harm | Helps with dangerous requests |
| Honest | Truthful; expresses uncertainty; no deception | Confidently makes things up |
The Three Pull in Different Directions
The HHH properties often CONFLICT, which is what makes alignment hard. Maximizing helpfulness can compromise harmlessness (helping with a harmful request). Maximizing harmlessness can compromise helpfulness (over-refusing). Honesty can conflict with helpfulness (admitting 'I don't know' is less immediately satisfying than a confident guess) and even with harmlessness (an honest answer might be one that could be misused). A well-aligned model balances all three, situation by situation.
Honesty: The Often-Forgotten Third
Helpful and harmless get the most attention, but honesty is equally important and often harder. An honest model expresses appropriate uncertainty (connecting to the calibration discussion of Chapter 21), does not fabricate facts or sources, does not deceive the user, and acknowledges the limits of its knowledge. Recall from Chapter 21 that alignment training can DEGRADE calibration — making models overconfident — which is a direct threat to honesty. Maintaining honesty through safety training is an active challenge.
The RLHF of Chapter 23 relies on humans labeling which responses are better, including which are harmful. For safety, this human labeling has serious problems that Constitutional AI was designed to address. Understanding these problems motivates the whole approach.
Why Human Safety Labeling Is Hard
The Core Idea of Constitutional AI
Constitutional AI (CAI; Bai et al., 2022) addresses all of these by replacing most human safety labeling with AI FEEDBACK guided by a written set of PRINCIPLES — the 'constitution'. Instead of humans labeling each response, the model itself critiques and revises its responses according to the written principles, and an AI judges which responses better follow them. Humans write the principles once; the AI applies them at scale. The next sections build up exactly how.
Constitutional AI has two main stages. The first uses the model to CRITIQUE and REVISE its own responses according to the constitution, producing better training data via supervised learning. The second uses AI-generated preferences (RLAIF) to reinforce the constitutional behaviour. Together they instill the principles with minimal human labeling of harmful content.
Pipeline Flow: The two stages of Constitutional AI
| 1 | Write constitution | Humans write a set of principles the model should follow |
| 2 | Stage 1: Critique | Model generates a response, critiques it against the principles, and revises it |
| 3 | SL on revisions | Fine-tune the model on the revised (improved) responses |
| 4 | Stage 2: AI preferences | Model judges which of two responses better follows the constitution |
| 5 | RLAIF | Train with RL (or DPO) on these AI-generated preferences |
What Is a 'Constitution'?
The constitution is simply a list of written principles — in plain language — describing how the model should behave. Each principle can be used to prompt the model to critique its own responses. They are human-readable, auditable, and editable. Examples of the kind of principle a constitution contains:
| Principle (illustrative) |
|---|
| Choose the response that is least harmful and least likely to assist in dangerous activities. |
| Choose the response that is most honest and least likely to be deceptive or misleading. |
| Choose the response that most respects the user's autonomy while avoiding harm. |
| Choose the response that is helpful and thoughtful, declining only when genuinely necessary. |
| Choose the response that avoids toxic, discriminatory, or hateful content. |
The first stage of CAI is supervised: the model improves its OWN responses by critiquing and revising them according to the constitution. This generates a dataset of improved responses to fine-tune on — and crucially, the model does the work, not human labelers.
The Critique-Revise Loop
# For a prompt (often a red-teaming prompt designed to elicit harm):
1. model generates an initial response
2. prompt the model: 'Critique your response according to:
[a principle from the constitution]'
3. model identifies how its response violates the principle
4. prompt the model: 'Revise your response to follow the principle'
5. model produces an improved, revised response
# Collect (prompt, revised response) pairs, then SFT on them.The elegance: the model uses its OWN capabilities to improve itself. It is perfectly capable of recognizing that a response is harmful when explicitly asked to evaluate it against a principle — even if it produced that response in the first place. The critique step surfaces this latent judgment, and the revision applies it. We then train on the revised responses so the improved behaviour becomes the default.
# Stage 1: generate -> critique -> revise, using the model itself
def constitutional_revision(model, prompt, principle):
"""Use the model to critique and revise its own response."""
# 1. Initial response
response = model.generate(prompt)
# 2. Self-critique against a constitutional principle
critique = model.generate(
f"Response: {response}\n\n"
f"Critique this response according to: {principle}")
# 3. Revise based on the critique
revised = model.generate(
f"Response: {response}\nCritique: {critique}\n\n"
f"Rewrite the response to address the critique.")
return revised # collect these for SFT
# Run over many (often adversarial) prompts and principles to build a
# dataset of harmless, revised responses. Then SFT the model on them.
# The model bootstraps its own harmlessness -- minimal human labeling.The second stage of CAI is RLAIF — Reinforcement Learning from AI Feedback. It mirrors RLHF (Chapter 23) exactly, except the preference labels come from an AI judge applying the constitution, not from humans. This is what lets safety alignment scale: the AI can generate millions of preference labels cheaply and consistently.
RLAIF vs RLHF
| RLHF (Chapter 23) | RLAIF (Constitutional AI) |
|---|---|
| Humans label preferences | An AI labels preferences |
| Expensive, slow, inconsistent | Cheap, fast, consistent |
| Values implicit in labels | Values explicit in the constitution |
| Humans read harmful content | AI applies written principles |
| Hard to scale for safety | Scales to millions of labels |
| Same RL machinery (reward model + PPO/DPO) | Same RL machinery |
How AI Feedback Works
To generate a preference label, the AI is shown a prompt and two responses, plus a constitutional principle, and asked which response better follows the principle. Its choice becomes a preference pair, exactly like the human-labeled pairs of Chapter 23. These AI-generated pairs then train a reward model (for PPO) or are used directly (for DPO) — the same machinery you already know, just with an AI labeler.
# Generate AI preferences using the constitution
for many prompts:
generate two responses y_A, y_B from the model
pick a principle from the constitution
ask the AI: 'Which response better follows: [principle]?'
→ (chosen, rejected) preference pair
# Then: train a reward model on these pairs and run PPO (Ch.23),
# or use them directly with DPO (Ch.24). Same machinery, AI labels.Mixing Helpfulness and Harmlessness
In practice, CAI trains on BOTH helpfulness preferences (often human-labeled, since helpfulness is less costly and damaging to label) AND harmlessness preferences (AI-labeled via the constitution). Mixing both keeps the model helpful while making it harmless — directly addressing the trade-off we examine in Section 26.9. The constitution governs the harmlessness half; human preferences govern the helpfulness half.
You cannot fix failures you have not found. RED-TEAMING is the practice of deliberately probing a model to elicit harmful, unsafe, or undesirable behaviour BEFORE deployment — adversarially attacking your own model to discover its weaknesses so you can fix them. It is an essential, ongoing part of building safe models.
Forms of Red-Teaming
| Approach | How it works |
|---|---|
| Human red-teaming | Skilled people try to make the model misbehave, creatively and persistently |
| Automated red-teaming | Use another model to generate adversarial prompts at scale |
| Domain experts | Specialists probe for harms in their field (bio, cyber, etc.) |
| Crowdsourced | Many users try to break the model, surfacing diverse attacks |
| Continuous monitoring | Watch real deployment for emerging failure patterns |
Red-teaming feeds directly back into training: the prompts that successfully elicit bad behaviour become training data for the next round of safety alignment (the critique-revise loop of Section 26.5 often runs on red-teaming prompts). It is a cycle: red-team to find failures, train to fix them, red-team the improved model to find the next layer of failures. Safety is never 'done' — it is a continuous adversarial process.
A hard truth: safety training, as we currently know how to do it, is NOT robust. Determined users find JAILBREAKS — prompts that circumvent safety training and get the model to produce content it was trained to refuse. Understanding why jailbreaks work reveals the limits of current safety methods and why the area remains unsolved.
Common Jailbreak Techniques
| Technique | How it circumvents safety |
|---|---|
| Role-play | 'Pretend you are a character with no restrictions...' |
| Hypothetical framing | 'In a fictional story, how would a character...' |
| Obfuscation | Encoding the request (base64, languages, typos) to evade detection |
| Prompt injection | Hidden instructions in input data override the system prompt |
| Many-shot | Flooding the context with examples of compliance to harmful requests |
| Gradual escalation | Building up to a harmful request through innocuous steps |
Why Jailbreaks Work
Jailbreaks exploit a fundamental gap: safety training teaches the model to refuse harmful requests in the FORMS it saw during training, but the space of ways to phrase a request is vast. A request the model refuses when stated plainly may slip through when wrapped in role-play, a hypothetical, or an unusual encoding — because that exact framing was under-represented in safety training. The model's refusal is not a deep, robust understanding; it is a learned pattern that adversarial framings can sidestep.
We return now to the central tension of the chapter: helpfulness and harmlessness pull against each other. Pushing harder on harmlessness tends to reduce helpfulness, and vice versa. Managing this trade-off well is the difference between a model that is safe and useful and one that is safe but frustrating (or useful but dangerous).
The Two Failure Modes
| Too helpful (unsafe) | Too harmless (useless) |
|---|---|
| Complies with harmful requests | Refuses legitimate requests |
| No brakes on dangerous content | Over-cautious, lecturing |
| Optimized purely for user satisfaction | Optimized purely to avoid risk |
| Dangerous in the wrong hands | Frustrating; users abandon it |
| Fails harmlessness | Fails helpfulness |
Over-Refusal: The Under-Appreciated Failure
Over-refusal — declining requests that are perfectly legitimate — is a real and damaging failure mode that gets less attention than under-refusal but matters enormously in practice. A model that refuses to discuss cooking knives because 'knives can be dangerous', or won't explain a historical atrocity because the topic is 'sensitive', is failing its users. Worse, over-refusal often falls hardest on legitimate users in sensitive but valid domains — medical questions, security research, creative writing about difficult themes.
The rejected response is a classic over-refusal: the user asked a SAFETY question — what to AVOID mixing — and refusing it makes them LESS safe. A miscalibrated safety model pattern-matches on 'mixing chemicals' and refuses, ignoring the obviously protective intent. Avoiding this kind of error is just as much a part of good safety alignment as refusing genuinely harmful requests.
Refusal calibration is the practical craft of getting the helpfulness-harmlessness balance right: training the model to refuse the genuinely harmful, comply with the legitimate, and handle the ambiguous middle with judgment. A well-calibrated model refuses RARELY and PRECISELY, and when it does refuse, it does so gracefully and helpfully.
What Good Refusal Looks Like
How Calibration Is Trained
Refusal calibration is achieved through careful preference data and constitutional principles that reward BOTH appropriate refusal AND appropriate compliance. The training data must include many examples of legitimate requests that superficially resemble harmful ones, with the legitimate ones rewarded for being HELPED, not refused. Without such examples, the model learns the lazy shortcut 'refuse anything that sounds risky' — the over-refusal failure of Section 26.9.
| Request type | Calibrated behaviour |
|---|---|
| Clearly harmful | Refuse firmly, briefly, without lecturing |
| Clearly legitimate | Help fully, no hedging |
| Dual-use (safety-relevant) | Help with the protective intent; provide safety info |
| Ambiguous | Help with the safe interpretation; note limits if needed |
| Sensitive but legitimate | Engage thoughtfully (medical, historical, creative) |
The techniques in this chapter — CAI, RLAIF, red-teaming, calibration — are practical methods for making today's models safer. They sit within a much larger and longer-term project: AI alignment, the effort to ensure AI systems robustly pursue intended goals and human values. It is worth placing our techniques in that bigger picture as we close Part V.
Levels of the Safety Problem
| Level | Concern |
|---|---|
| Content safety | Don't produce harmful content (this chapter's focus) |
| Robustness | Resist jailbreaks and adversarial inputs (Section 26.8) |
| Honesty / truthfulness | Don't deceive; express calibrated uncertainty |
| Value alignment | Pursue intended goals, not proxies (the deeper problem) |
| Scalable oversight | Supervise systems too capable for humans to fully check |
| Long-term safety | Ensure highly capable future systems remain safe |
Our techniques mostly address the first few levels — content safety, some robustness, some honesty. The deeper levels (value alignment, scalable oversight, long-term safety) are active research frontiers with no complete solutions. CAI's use of AI feedback is, in part, an early attempt at SCALABLE OVERSIGHT: using AI to help supervise AI as systems become too capable for humans to evaluate directly.
Safety Is a Sociotechnical Problem
Finally, safety is not purely technical. WHAT values a model should have, WHO decides them, how to handle reasonable disagreement about contested questions, and how to balance competing interests — these are questions of ethics, governance, and society, not just machine learning. The technical methods in this chapter are tools for IMPLEMENTING values once chosen; they do not, and cannot, decide what those values should be. That decision is a human and societal responsibility that the technology makes more urgent, not less.
Let us consolidate the chapter into a practical view of how safety alignment fits into building a model, integrating it with the rest of Part V.
Pipeline Flow: The full alignment + safety workflow
| 1 | Pretrain + SFT | Capable base model that follows instructions (Parts IV, Ch.22) |
| 2 | Helpfulness alignment | RLHF or DPO on helpfulness preferences (Ch.23–24) |
| 3 | Constitutional self-critique | Model revises its own responses against principles (Stage 1) |
| 4 | RLAIF | AI-feedback RL on the constitution for harmlessness (Stage 2) |
| 5 | Red-team | Probe for failures; feed them back into training |
| 6 | Calibrate refusals | Tune the helpful/harmless balance; fix over-refusal |
| 7 | Monitor + iterate | Watch deployment; red-team continuously; update |
The Recurring Lessons of Part V
Across the whole of Part V, several lessons recur. Alignment is about ELICITING and SHAPING capabilities the pretrained model already has, not adding new ones. The DATA — demonstrations, preferences, principles — matters more than the algorithm. Models are better JUDGES than generators, an asymmetry that powers RLHF, CAI, and self-improvement. And every method involves TRADE-OFFS — helpfulness vs harmlessness, simplicity vs power, the proxy vs the true goal — that require judgment, not just optimization.
Safety Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| HHH | Helpful, Harmless, Honest | Multi-objective; they conflict |
| Constitutional AI | Align to written principles | Explicit, auditable values |
| CAI Stage 1 | Self-critique and revise, then SFT | Model improves itself |
| RLAIF | RL from AI feedback | Same as RLHF, AI labels |
| Red-teaming | Attack your own model to find failures | Continuous, never finished |
| Jailbreaks | Prompts that bypass safety | Safety training is shallow |
| Trade-off | Helpfulness vs harmlessness | Over-refusal is a real failure |
| Calibration | Refuse precisely and gracefully | The hard part is the middle |
Exercises
Exercises 1–10 are pen-and-paper; 11–20 require code.
Further reading: “Constitutional AI: Harmlessness from AI Feedback” (Bai et al., 2022) — the CAI paper. “A General Language Assistant as a Laboratory for Alignment” (Askell et al., 2021) for the HHH framework. “Training a Helpful and Harmless Assistant with RLHF” (Bai et al., 2022). “Red Teaming Language Models” (Perez et al., 2022) and “Red Teaming Language Models to Reduce Harms” (Ganguli et al., 2022). “Universal and Transferable Adversarial Attacks” (Zou et al., 2023) on jailbreaks. “XSTest” (Röttger et al., 2023) on over-refusal. Anthropic's published constitution and the broader AI-alignment literature.
Part V Complete: Alignment & Post-training
| Ch. 22 | Supervised Fine-Tuning | instruction tuning, chat templates, LoRA/QLoRA — turning a base model into an instruction-follower. |
| Ch. 23 | RLHF | reward modeling, PPO, KL penalties, GRPO — learning from human preferences via reinforcement learning. |
| Ch. 24 | Direct Preference Optimization | DPO, IPO, KTO, ORPO — aligning to preferences without a reward model or RL. |
| Ch. 25 | Reasoning & Chain-of-Thought | CoT, self-consistency, RLVR, test-time compute — teaching models to think step by step. |
| Ch. 26 | Constitutional AI & Safety | CAI, RLAIF, red-teaming, refusal calibration — making models harmless and honest, not just helpful. |
You have now built a complete, aligned assistant: pretrained for raw capability (Part IV), fine-tuned to follow instructions (Chapter 22), aligned with human preferences (Chapters 23–24), taught to reason (Chapter 25), and made safe and honest (this chapter). It is helpful, harmless, and honest — a model worth deploying. But deploying it well is its own enormous challenge. Part VI — Inference, Tools & Deployment — turns to making the model FAST, CONNECTED, and SCALABLE: optimizing inference so it runs cheaply and quickly (Chapter 27), giving it tools to act in the world (Chapter 28), grounding it in external knowledge through retrieval (Chapter 29), extending it to images and audio (Chapter 30), and serving it reliably to millions of users (Chapter 31). The trained model becomes a real, useful product.