Part V: Alignment & RLHF

Chapter 25

Reasoning & Chain-of-Thought

Teaching models to think step by step: chain-of-thought prompting, self-consistency, process reward models, search, and the test-time-compute scaling behind o1- and R1-style reasoning models.

22 Exercises

Learning Objectives

1.	Explain why direct next-token answering struggles with multi-step problems.
2.	Use chain-of-thought (CoT) prompting and understand why it works.
3.	Apply self-consistency to improve reasoning by sampling and voting.
4.	Understand test-time compute scaling as a new axis beyond training scale.
5.	Train reasoning with SFT on reasoning traces and with RL on verifiable rewards.
6.	Distinguish outcome reward models (ORM) from process reward models (PRM).
7.	Understand search methods over reasoning: best-of-N, beam, and MCTS.
8.	Explain how RL produced emergent reasoning in R1-style models.
9.	Reason about the cost, faithfulness, and limits of chain-of-thought.
10.	Connect reasoning training to the GRPO and RLVR methods of Chapters 23–24.

So far, our models answer in one shot: given a prompt, they generate a response directly. For many tasks this is fine. But for problems that require multiple steps of REASONING — a math word problem, a logic puzzle, a multi-hop question — direct answering fails surprisingly often, even for very capable models. This chapter is about giving models room to THINK before they answer, and the dramatic capability gains that follow.

The Core Difficulty

Consider asking a model: 'A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?' To answer correctly, you must compute 16/2 = 8 golf balls, then 8/2 = 4 blue golf balls. That is two sequential steps. A model forced to emit the answer immediately has to do BOTH steps in a single forward pass, with no room to write down the intermediate result — and it often gets it wrong.

✧

Preference Pair

Prompt: A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls?

Chosen: Let me work through this step by step. Half of 16 balls are golf balls: 16 / 2 = 8 golf balls. Half of those are blue: 8 / 2 = 4. So there are 4 blue golf balls.

Rejected: 8 blue golf balls.

The rejected answer (8) is a classic error: the model grabbed an intermediate number without finishing the computation. The chosen answer works through the steps and gets 4. The difference is not knowledge — both 'know' the arithmetic — it is whether the model was given room to perform the steps sequentially.

A Useful Analogy: System 1 and System 2

Psychologists describe two modes of human thinking: System 1 (fast, automatic, intuitive) and System 2 (slow, deliberate, step-by-step). Answering '2+2' uses System 1; working out a long-division problem uses System 2. A model answering directly is like a person forced to blurt out the first thing that comes to mind — pure System 1. Reasoning techniques give the model a System 2 mode: the ability to slow down and work through a problem deliberately.

✧

Intuition: A Forward Pass Is Fixed Computation

Here is the deep reason direct answering struggles. When a model produces the next token, it performs a FIXED amount of computation — one forward pass through its layers. That is a fixed budget of 'thinking' per token, no matter how hard the problem. A two-step problem and a ten-step problem get the same per-token compute. The model cannot 'try harder' on a harder problem if it must answer in one token.

The breakthrough idea of this chapter is simple: let the model use MORE TOKENS to solve harder problems. Each generated token is another forward pass — another increment of computation. By generating intermediate reasoning steps, the model spreads a hard problem across many forward passes, giving itself the compute it needs. Tokens become thinking time.

The simplest way to give a model room to reason is also one of the most important discoveries in prompting: chain-of-thought (CoT). Instead of asking for the answer directly, you prompt the model to produce its REASONING first, then the answer. Astonishingly, this single change dramatically improves performance on reasoning tasks — with no training at all.

Few-Shot Chain-of-Thought

Chain-of-thought prompting was introduced by Wei et al. (2022). The idea: in your few-shot examples (Chapter 21), show not just the answer but the step-by-step reasoning that leads to it. The model then imitates this pattern, producing its own reasoning steps before answering the new question.

text•Few-shot chain-of-thought prompt
# A few-shot example that DEMONSTRATES reasoning, not just the answer:
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 2 × 3 = 6 balls.
   5 + 6 = 11. The answer is 11.

# Now the model imitates the step-by-step pattern on a NEW question:
Q: A cafe had 23 apples. They used 20 and bought 6 more. How many now?
A: <- the model now generates reasoning before answering
   The cafe started with 23 apples. They used 20, leaving 23 - 20 = 3.
   They bought 6 more: 3 + 6 = 9. The answer is 9.
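In code, a few-shot CoT prompt is just string assembly. The sketch below is one way to build the prompt above; the helper name `build_cot_prompt` and the dictionary keys are illustrative assumptions, not part of any library.

```python
# Hypothetical helper: assemble a few-shot chain-of-thought prompt as plain text.
FEW_SHOT_EXAMPLES = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?",
        "reasoning": "Roger started with 5 balls. 2 cans of 3 balls each is 2 * 3 = 6 balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(examples, new_question):
    """Format worked examples (reasoning, then answer), followed by the new question."""
    parts = []
    for ex in examples:
        parts.append(
            f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # End with an open 'A:' so the model continues with its own reasoning.
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    FEW_SHOT_EXAMPLES,
    "A cafe had 23 apples. They used 20 and bought 6 more. How many now?",
)
print(prompt)
```

Ending every demonstration with the same 'The answer is X.' pattern is a deliberate choice: it makes the final answer easy to parse from the model's output later.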

Zero-Shot Chain-of-Thought

Even simpler: Kojima et al. (2022) found that just appending the phrase 'Let's think step by step' to a prompt triggers reasoning behaviour, with NO examples needed. This 'zero-shot CoT' is remarkably effective — a single magic phrase that flips the model from System-1 blurting to System-2 deliberation.

text•Zero-shot chain-of-thought
# Without the magic phrase -- model often answers wrong:
Q: A juggler has 16 balls, half golf balls, half of those blue. Blue golf balls?
A: -> '8' (wrong, grabbed an intermediate number)

# WITH the magic phrase -- model reasons and gets it right:
Q: A juggler has 16 balls, half golf balls, half of those blue. Blue golf balls?
A: Let's think step by step.
   -> '16 / 2 = 8 golf balls. 8 / 2 = 4 blue. The answer is 4.' (correct)

# The phrase costs nothing but unlocks step-by-step computation.
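Kojima et al. describe zero-shot CoT as a two-stage procedure: first elicit the reasoning, then prompt again to extract the final answer. A minimal sketch follows, assuming a `generate` callable that maps a prompt string to a completion string; the stand-in `fake_generate` and the exact wording of the second prompt are assumptions made for illustration.

```python
def zero_shot_cot(generate, question):
    """Two-stage zero-shot CoT: elicit reasoning, then extract the answer.

    `generate` is an assumed callable: prompt string -> completion string.
    """
    # Stage 1: the trigger phrase elicits step-by-step reasoning.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)
    # Stage 2: feed the reasoning back and ask for the final answer only.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt).strip()

def fake_generate(prompt):
    """Stand-in for a real model call, hard-coded for the juggler example."""
    if prompt.endswith("Therefore, the answer is"):
        return " 4"
    return "16 / 2 = 8 golf balls. 8 / 2 = 4 blue."

result = zero_shot_cot(
    fake_generate,
    "A juggler has 16 balls, half golf balls, half of those blue. Blue golf balls?",
)
print(result)
```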

✧

CoT Was Hiding in the Model All Along

A profound point: chain-of-thought does not teach the model anything new. The ability to reason step by step was already latent in the pretrained model — it saw millions of worked examples, proofs, and explanations during pretraining. CoT prompting simply ELICITS that latent ability by steering the model into the step-by-step mode. This echoes the superficial alignment hypothesis (Chapter 22): the capability is in the base model; the prompt just unlocks it.

This is also why CoT works better as models get larger: bigger models have more latent reasoning ability to elicit. CoT's benefit grows with scale, which is one reason reasoning became a frontier focus as models grew.

Why does generating reasoning steps help so much? The answer connects back to the intuition from Section 25.1 and reveals something deep about how these models compute. Understanding it makes everything later in the chapter clearer.

Reasoning Tokens Are Extra Computation

Recall that each generated token is one forward pass — one fixed increment of computation. When the model generates a chain of reasoning, each reasoning token is an additional forward pass, and crucially, each one can ATTEND to all the previous reasoning tokens. So the model builds up a solution incrementally: step one is computed and written down, step two attends to step one, step three attends to both, and so on. The reasoning chain is a SCRATCHPAD that holds intermediate results the model can build on.

text•Reasoning spreads computation across tokens
Direct answer:   1 forward pass must solve the whole problem.

CoT with k reasoning tokens:  k+1 forward passes,
    each attending to all prior reasoning — a working scratchpad.

More reasoning tokens  →  more total computation  →  harder problems solvable.

Decomposing Hard Problems Into Easy Steps

There is a second reason CoT works. A hard problem is often a chain of easy steps. Each individual step — '16 / 2 = 8' — is something the model can do reliably in one forward pass. By breaking the problem into these easy steps and writing each down, the model converts one hard problem (which it would fail) into a sequence of easy problems (which it can each solve). The scratchpad lets it tackle them one at a time.

✧

Intuition: Why You Can't Multiply Big Numbers in Your Head

Think about multiplying 347 × 89 in your head versus on paper. In your head (one 'forward pass') it is very hard. On paper, you break it into easy single-digit multiplications and additions, writing down intermediate results — and it becomes routine. The paper is your scratchpad; it holds the intermediate results so your limited working memory does not have to.

Chain-of-thought is exactly this paper for the model. It lets the model write down intermediate results instead of holding everything in a single forward pass. The model was always capable of each small step; CoT just gives it somewhere to put the intermediate work.

▶

ML Connection: A New Lever on Capability

CoT revealed something the field had underappreciated: you can make a model dramatically more capable WITHOUT changing its weights at all — just by letting it generate more reasoning. This is the seed of the test-time-compute revolution (Section 25.5): capability is not fixed by the trained weights; it can be bought at inference time with more thinking. This realization reshaped how the frontier thinks about scaling.

It also reframes evaluation: a model's 'capability' depends on how much thinking it is allowed to do. A model that fails when forced to answer immediately may succeed when allowed to reason — same weights, different compute budget.

Chain-of-thought generates ONE reasoning path. But a single path can go wrong — a careless slip early on dooms the whole chain. Self-consistency (Wang et al., 2022) improves reliability with a simple idea: generate MANY reasoning paths (by sampling), then take the most common final answer. It is the reasoning equivalent of asking many people and going with the majority.

How Self-Consistency Works

text•Self-consistency (Pseudocode)
# Generate diverse reasoning paths, then vote on the answer
1. sample N chain-of-thought responses (with temperature > 0)
2. extract the final answer from each path
3. return the MOST COMMON final answer (majority vote)

# Different paths may reason differently but converge on the truth;
# errors tend to be idiosyncratic and get out-voted.

The intuition: there are many correct ways to reason to the right answer, but each wrong path tends to err in its own idiosyncratic way. So the correct answer recurs across many paths (they converge on it), while the wrong answers scatter. Majority voting surfaces the convergent, correct answer and out-votes the scattered errors.

Python•Self-consistency from scratch
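import re
from collections import Counter

def extract_final_answer(cot):
    """Parse the number after 'The answer is'; return None if absent."""
    match = re.search(r"The answer is\s*(-?[\d.,]*\d)", cot)
    return match.group(1).replace(",", "") if match else None

def self_consistency(model, prompt, n_samples=16, temperature=0.7):
    """Sample N reasoning paths and majority-vote the final answer."""
    answers = []
    for _ in range(n_samples):
        # Sample a full chain-of-thought (temperature gives diversity).
        # `model.generate` stands in for whatever sampling API you use.
        cot = model.generate(prompt + " Let's think step by step.",
                             temperature=temperature)
        ans = extract_final_answer(cot)   # parse 'The answer is X'
        if ans is not None:               # drop paths with no parseable answer
            answers.append(ans)

    if not answers:
        return None
    # Majority vote over the final answers
    return Counter(answers).most_common(1)[0][0]

# More samples -> higher accuracy, up to a point. On math benchmarks,
# self-consistency with 16-40 paths can add many points over single CoT.
# This is our first taste of TRADING INFERENCE COMPUTE for accuracy.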
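To see why voting helps, it is useful to work out a simplified case exactly. The sketch below assumes each path is independently correct with probability p and that every wrong path lands on the same wrong answer, which is the worst case for voting (scattered errors only make the vote easier). Real samples from one model are correlated, so treat the numbers as an optimistic illustration rather than a prediction.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """P(a strict majority of n independent paths is correct), two possible answers.

    Assumes independence between paths and a single shared wrong answer.
    Use odd n to avoid ties.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# A model that is right 60% of the time per path, voted over more and more paths:
for n in (1, 5, 15, 41):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the voted accuracy climbs steadily with n whenever p is above one half, and falls with n whenever p is below it: voting amplifies whatever the model is more likely to say.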

✧

Reward Note: Self-Consistency Is Test-Time Compute in Disguise

Notice what self-consistency really is: spending MORE compute at inference (N generations instead of 1) to get a better answer. This is the first concrete example of the central theme of this chapter — test-time compute scaling. You are not changing the model; you are letting it do more work at answer time, and getting reliably better results.

Self-consistency only works when there is a clear final answer to vote on (a number, a multiple-choice letter). For open-ended generation there is nothing to vote on — which is why the more general test-time-compute methods (Section 25.5 onward) are needed.

We now reach the central idea of modern reasoning. Chapter 16's scaling laws were about TRAINING compute — bigger models, more data. But CoT and self-consistency hint at a SECOND scaling axis: TEST-TIME compute. Instead of (or in addition to) training a bigger model, you let the model think LONGER at inference. The o1 models from OpenAI (2024) and the R1 models from DeepSeek (2025) made this the defining paradigm of frontier reasoning.

Two Axes of Scaling

Train-time scaling (Chapter 16)	Test-time scaling (this chapter)
Bigger model, more data	More reasoning at inference
Expensive, one-time cost	Cost per query, controllable
Fixed capability after training	Capability scales with thinking
The 2018–2023 paradigm	The 2024+ reasoning paradigm
≈ 6ND training FLOPs (Chapter 16)	≈ 2N FLOPs per generated token × reasoning length
Better weights	Better use of the weights

The Test-Time Scaling Curve

The striking empirical finding: for reasoning tasks, accuracy improves smoothly and predictably as you allow more test-time compute (longer reasoning chains, more samples, more search). In published plots the gain is often close to linear in the logarithm of compute, so each further gain costs a multiple of the compute already spent. Just as training loss falls predictably with training compute (Chapter 16), task accuracy rises predictably with test-time compute. This gave the field a new, controllable dial: spend more per query to get a better answer.

text•The test-time scaling relationship
accuracy  ≈  increasing function of (test-time compute)

test-time compute ≈ (reasoning tokens) × (samples or search width)

# More thinking → higher accuracy, smoothly, up to a ceiling.
# You can 'buy' accuracy at inference, per query.
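The compute side of this relationship can be estimated with simple arithmetic. The sketch below uses the common approximation of about 2N FLOPs per generated token for a model with N parameters; it ignores prompt processing and the cost of attention over a growing context, so it is a rough lower bound rather than a measurement. The function name and the example sizes are illustrative.

```python
def inference_flops(n_params, reasoning_tokens, n_samples=1):
    """Rough test-time cost: ~2 * N FLOPs per generated token.

    Ignores prompt processing and attention over the growing context.
    """
    return 2 * n_params * reasoning_tokens * n_samples

# A 7B model thinking for 2,000 tokens costs about the same as
# a 70B model answering in 200 tokens.
small_long = inference_flops(7_000_000_000, 2_000)
large_short = inference_flops(70_000_000_000, 200)
print(small_long, large_short)

# Self-consistency with 16 samples multiplies the bill by 16.
print(inference_flops(7_000_000_000, 2_000, n_samples=16))
```

Which of the two equal-cost options gives the better answer is an empirical question; the arithmetic only shows that the budgets are comparable.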

✧

The o1/R1 Paradigm Shift

o1 and R1 are models trained to generate LONG internal reasoning chains — often thousands of tokens of 'thinking' — before producing a final answer. They explore, backtrack, check their work, and reconsider, all in the reasoning chain. The longer they think, the better they do on hard problems. This turned 'thinking time' into a primary capability lever, comparable in impact to model size.

The economic consequence is significant: a smaller model that thinks longer can outperform a larger model that answers immediately, on reasoning tasks. This reshapes the cost equation — you can trade expensive training for controllable inference compute, paying for extra thinking only on the hard queries that need it.

Why This Changes the Game

Test-time compute scaling matters because it partially sidesteps the data wall and the cost of ever-bigger pretraining (Chapter 16). Rather than spending more and more to train a bigger model, you can invest in teaching a model to USE inference compute well — to reason, search, and verify. The rest of this chapter is about how to TRAIN models to do this, since prompting alone (CoT, self-consistency) only goes so far.

Prompting (CoT, self-consistency) elicits reasoning that is already latent in the model. But we can do far better by TRAINING the model to reason — making step-by-step thinking a deeply learned default rather than something coaxed out by a prompt. There are two main training approaches, building on Part V's earlier methods.

Approach 1: SFT on Reasoning Traces

The most direct method is supervised fine-tuning (Chapter 22) on examples that include full reasoning traces. Collect problems paired with detailed step-by-step solutions, and fine-tune the model to produce reasoning-then-answer. This teaches the model to reason by default, even without a 'think step by step' prompt. The reasoning traces can come from humans, from a stronger model (distillation), or from the model's own correct solutions (filtered by whether they got the right answer).

Pipeline Flow: Building a reasoning dataset by rejection sampling

1	Sample	Generate many CoT solutions per problem from the model
2	Verify	Keep only solutions that reach the CORRECT final answer
3	Filter	Optionally keep only clean, concise correct traces
4	SFT	Fine-tune the model on these verified reasoning traces
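The pipeline above can be sketched in a few lines. This is a minimal illustration, assuming a `sample_solution` callable that returns a (reasoning, final answer) pair for a question; that interface, the `max_words` filter, and the toy sampler are all assumptions made for the example.

```python
import itertools

def build_reasoning_dataset(problems, sample_solution, n_samples=8, max_words=None):
    """Rejection sampling: keep only traces whose final answer is correct.

    `problems` is a list of (question, gold_answer) pairs. `sample_solution`
    is an assumed callable: question -> (reasoning_text, final_answer).
    """
    dataset = []
    for question, gold in problems:
        for _ in range(n_samples):                      # 1. Sample
            reasoning, answer = sample_solution(question)
            if answer != gold:                          # 2. Verify
                continue
            if max_words is not None and len(reasoning.split()) > max_words:
                continue                                # 3. Filter (optional)
            dataset.append({"prompt": question, "completion": reasoning})
    return dataset                                      # 4. Feed this to SFT

# Toy sampler that alternates between a correct and an incorrect trace.
_attempts = itertools.cycle([
    ("16 / 2 = 8. 8 / 2 = 4. The answer is 4.", "4"),
    ("Half of 16 is 8. The answer is 8.", "8"),
])

def toy_sampler(question):
    return next(_attempts)

data = build_reasoning_dataset(
    [("How many blue golf balls?", "4")], toy_sampler, n_samples=4
)
print(len(data), "verified traces kept")
```

Note that the verify step checks only the final answer, so a trace with flawed reasoning that happens to reach the right answer would still be kept.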

✧

SFT Note: Distilling Reasoning Into Smaller Models

A powerful and practical finding (from the DeepSeek-R1 work and others): you can take the long reasoning traces produced by a strong reasoning model and SFT a much SMALLER model on them. The small model learns to imitate the reasoning patterns and becomes a capable reasoner itself — far better than its size would suggest. This 'reasoning distillation' is how compact open reasoning models are made.

It connects directly to Chapter 22: reasoning distillation is just SFT where the demonstrations are high-quality reasoning traces. The superficial alignment hypothesis applies — the small model already has latent ability; the traces teach it the reasoning FORMAT and habit.

Approach 2: Reinforcement Learning for Reasoning

SFT on traces is limited by the quality of the traces — the same imitation ceiling we saw in Chapter 23. The more powerful approach is REINFORCEMENT LEARNING: let the model generate its own reasoning, reward it when it reaches correct answers, and let it discover effective reasoning strategies on its own. This is where the RL methods of Chapters 23–24 — especially GRPO — return to center stage, and it is the subject of the next section.

The most important advance in reasoning training is reinforcement learning with VERIFIABLE rewards (RLVR). The key realization: for many reasoning tasks — math, coding, logic — we can CHECK whether the answer is correct AUTOMATICALLY, with no human or learned reward model needed. Math answers can be verified; code can be run against tests. This gives a clean, cheap, hack-resistant reward signal, and it is the perfect setting for the RL methods of Chapter 23.

Verifiable Rewards: No Reward Model Needed

Recall the central weakness of RLHF (Chapter 23): the reward model is an imperfect, hackable proxy. For verifiable tasks, that weakness largely disappears. The reward is simply whether the answer is CORRECT, computed by a checker rather than a learned model. There is far less to hack: a wrong answer gets zero reward no matter how it is dressed up, provided the checker itself is sound (an incomplete test suite or a sloppy answer parser can still be gamed). This makes RL for reasoning far more robust than RLHF for general alignment.

text•Verifiable reward
r(response) = 1  if the final answer is correct
            = 0  otherwise

# For math: does the boxed answer match the ground truth?
# For code: do the unit tests pass?
# No learned reward model -> no reward hacking of the proxy.
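A math checker of this kind can be very small. The sketch below assumes the model is asked to put its final answer in a LaTeX `\boxed{...}` expression; the function names are illustrative, and the regular expression deliberately handles only answers without nested braces (so a boxed fraction written with `\frac` would be scored 0 here).

```python
import re

def extract_boxed(response):
    """Return the contents of the last boxed expression in the response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def math_reward(response, ground_truth):
    """1.0 if the boxed final answer matches the ground truth, else 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    try:
        # Compare numerically so that '4' and '4.0' count as the same answer.
        return float(abs(float(answer) - float(ground_truth)) < 1e-9)
    except ValueError:
        # Non-numeric answers fall back to exact string comparison.
        return float(answer == ground_truth.strip())

print(math_reward(r"16 / 2 = 8, 8 / 2 = 4, so \boxed{4}.", "4"))
print(math_reward(r"Half of 16 is \boxed{8}.", "4"))
```

Even a simple checker involves choices (which answer counts when there are several, how to compare formats), and those choices define what the RL is actually optimizing.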

GRPO for Reasoning

GRPO (Chapter 23) is ideally suited to this setting, which is exactly why DeepSeek developed it for reasoning. For each problem, the model generates a GROUP of solution attempts; each is verified (correct or not); the group-relative advantage rewards the attempts that succeeded relative to the group. No value model, no reward model — just generate a group, check correctness, and reinforce what worked. Recall the GRPO advantage from Chapter 23:

text•GRPO for reasoning (recap from Ch.23)
For a problem q, sample a group of G solution attempts {o₁..o_G}.
Verify each: rᵢ ∈ {0, 1}  (correct or not).

Advantage Aᵢ = (rᵢ - mean(r)) / std(r)

# Correct attempts get positive advantage, wrong ones negative.
# The group of attempts at the SAME problem is the baseline.
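The advantage formula above is easy to compute directly. The sketch below uses the population standard deviation and adds a small `eps` to the denominator; both are implementation choices assumed here, not something the formula specifies.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps) over one group of attempts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)   # population std; eps guards the all-equal case
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight attempts at one problem, two of them correct.
adv = group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0])
print([round(a, 2) for a in adv])

# If every attempt gets the same reward, all advantages are zero.
print(group_relative_advantages([1, 1, 1, 1]))
```

The second case matters in practice: a problem the model always solves, or never solves, produces zero advantage for every attempt and so contributes no learning signal for that batch.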

text•RLVR with GRPO for reasoning (Pseudocode)
# Models: policy (train), reference (frozen). NO reward model, NO value model.
for each batch of problems with known answers:
    for each problem q:
        sample a group of G solutions (with reasoning)
        verify each solution → reward 1 (correct) or 0 (wrong)
    compute group-relative advantages
    clipped GRPO update + KL penalty to reference
    # the model learns to produce reasoning that REACHES correct answers

✧

Reward Note: DeepSeek-R1: RL Producing a Reasoning Model

The DeepSeek-R1 work (2025) demonstrated this at scale. In its R1-Zero experiment, the team started from a base model and applied large-scale GRPO with rule-based, verifiable rewards (math and code correctness, plus a simple check on output format), and the model LEARNED to reason: it produced long chains of thought, checked its work, and backtracked. No reasoning traces were needed to teach the reasoning; the RL discovered it. (The released R1 model adds a small cold-start SFT stage before RL, mainly to make the reasoning more readable.) This was a landmark result: reasoning can EMERGE from reinforcement learning on verifiable rewards alone.

This is the culmination of the RL thread running through Part V. The GRPO algorithm from Chapter 23, applied with verifiable rewards instead of a learned reward model, became the engine of the reasoning-model revolution. We examine the emergence more closely in Section 25.10.

Verifiable rewards check only the FINAL answer. But this has a subtle problem: a model can reach the right answer through FLAWED reasoning — lucky guesses, errors that cancel out. And for many problems we cannot fully verify the final answer automatically. This motivates a distinction between two kinds of reward model for reasoning: outcome-based and process-based.

ORM vs PRM

Type	Rewards	Trade-off
Outcome RM (ORM)	Only the final answer's correctness	Simple, but also rewards flawed reasoning that happens to land on the right answer
Process RM (PRM)	Each individual reasoning step	Catches flawed steps; needs step labels

An OUTCOME reward model (ORM) judges only the end result: was the final answer right? A PROCESS reward model (PRM) judges each STEP of the reasoning: is this step valid given the previous ones? The PRM gives much denser feedback — it can tell the model exactly WHERE its reasoning went wrong, not just that the final answer was wrong.

Why Process Rewards Help

The influential paper 'Let's Verify Step by Step' (Lightman et al., 2023) showed that PRMs substantially outperform ORMs for guiding reasoning. The reason is the credit-assignment problem: with only an outcome reward, the model cannot tell which of its ten reasoning steps was the bad one — the reward is the same for all of them. A PRM pinpoints the faulty step, giving precise feedback. It is the difference between 'your answer is wrong' and 'your answer is wrong, specifically at step 4'.

✧

Preference Pair

Prompt: Reasoning trace being judged by a process reward model

Chosen: Step 1: 16 / 2 = 8 golf balls. ✓ (PRM: valid) Step 2: 8 / 2 = 4 blue golf balls. ✓ (PRM: valid) Answer: 4. ✓

Rejected: Step 1: 16 / 2 = 8 golf balls. ✓ (PRM: valid) Step 2: 8 × 2 = 16 blue golf balls. ✗ (PRM: INVALID step — caught here!) Answer: 16. ✗

Notice how the PRM catches the error at the exact step where it occurs (step 2 used multiplication instead of division). An outcome reward would only know the final answer (16) is wrong, with no indication of where. The step-level feedback is far more useful for teaching the model to reason correctly.
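A PRM outputs one score per step, so a whole trace needs an aggregation rule. The sketch below shows two common choices: the product of step probabilities (the chance that every step is valid, which is how Lightman et al. score a solution) and the minimum. The step probabilities are made-up numbers standing in for real PRM outputs, and the function names are illustrative.

```python
from math import prod

def solution_score(step_probs, how="product"):
    """Collapse per-step PRM probabilities into one score for the whole trace."""
    if how == "product":      # probability that every step is valid
        return prod(step_probs)
    if how == "min":          # a chain is only as strong as its weakest step
        return min(step_probs)
    raise ValueError(f"unknown aggregation: {how}")

def first_bad_step(step_probs, threshold=0.5):
    """1-based index of the first step the PRM considers invalid, else None."""
    for i, p in enumerate(step_probs, start=1):
        if p < threshold:
            return i
    return None

good = [0.98, 0.97]   # made-up PRM outputs for the 'chosen' trace
bad = [0.98, 0.05]    # step 2 multiplies instead of dividing

print(solution_score(good), first_bad_step(good))
print(solution_score(bad), first_bad_step(bad))
```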

The Cost of Process Rewards

PRMs are more powerful but more expensive: training one requires step-level labels — humans (or models) annotating whether each reasoning step is valid. This is far more labor-intensive than labeling only final answers. Recent work reduces this cost by generating step labels automatically (e.g. via how often a step leads to correct continuations), but the data cost remains the main barrier to PRMs.
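The automatic labeling idea can be sketched as follows: to label a step, sample several continuations from the reasoning prefix that ends at that step and record how often they reach the correct answer. The `rollout` interface and the deterministic toy stand-in are assumptions for illustration; a real rollout would sample completions from the model.

```python
def step_value(prefix, rollout, gold_answer, n_rollouts=8):
    """Fraction of sampled continuations of `prefix` that reach the gold answer.

    `rollout` is an assumed callable: list of steps -> final answer string.
    """
    hits = sum(rollout(prefix) == gold_answer for _ in range(n_rollouts))
    return hits / n_rollouts

def toy_rollout(prefix):
    """Deterministic stand-in: a prefix containing a bad step never recovers."""
    return "16" if any("*" in step for step in prefix) else "4"

good_prefix = ["16 / 2 = 8 golf balls."]
bad_prefix = ["16 / 2 = 8 golf balls.", "8 * 2 = 16 blue golf balls."]
labels = [step_value(p, toy_rollout, "4") for p in (good_prefix, bad_prefix)]
print(labels)
```

The trade is human labor for compute: every step label now costs `n_rollouts` model generations, and the label is only as good as the model's ability to finish the problem from that point.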

✧

Reward Note: PRMs and Search Go Together

Process reward models are especially powerful combined with SEARCH (next section). Because a PRM can score PARTIAL reasoning (after each step), it can guide a search through the space of reasoning paths — expanding promising partial solutions and pruning bad ones, step by step. An outcome reward, available only at the end, cannot guide search mid-reasoning. This synergy between PRMs and search underlies many of the strongest reasoning systems.

Keep this connection in mind as we turn to search: the PRM is the evaluator that tells the search which partial reasoning paths are worth pursuing.

Self-consistency (Section 25.4) was a simple form of search: sample many paths and vote. More sophisticated search methods explore the space of reasoning paths more cleverly, guided by a reward model (often a PRM). These methods spend test-time compute strategically, focusing it on promising reasoning rather than scattering it randomly.

A Spectrum of Search Strategies

Method	How it searches	Compute use
Best-of-N	Sample N full solutions, pick the best by reward	Simple, parallel
Self-consistency	Sample N, majority-vote the answer	Simple, no reward model
Beam search	Keep the top-k partial reasoning paths at each step	Guided, breadth-limited
MCTS	Tree search: expand, simulate, backpropagate value	Powerful, complex

Best-of-N: The Simplest Search

Best-of-N generates N complete solutions and uses a reward model to pick the best one (in contrast to self-consistency, which votes). With a good outcome reward model, best-of-N is a strong, simple baseline that scales smoothly with N — more samples, better odds of a great solution. It is 'search' in the loosest sense: sample broadly, then select.
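A minimal sketch of best-of-N follows, together with the simple formula for how often at least one of N samples is correct. The `sample` and `score` callables are assumed stand-ins for the policy and the reward model, and the formula assumes independent samples.

```python
def best_of_n(sample, score, n=8):
    """Draw n complete solutions and return the one the reward model prefers.

    `sample` (no arguments -> solution) and `score` (solution -> float) are
    assumed callables standing in for the policy and the reward model.
    """
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

def coverage(p, n):
    """P(at least one of n independent samples is correct)."""
    return 1 - (1 - p) ** n

# With a 20% per-sample success rate, 16 samples almost always contain a hit.
print(round(coverage(0.2, 16), 3))
```

Coverage is an upper bound on best-of-N accuracy: it is reached only if the reward model always picks a correct sample when one exists. The gap between coverage and actual accuracy measures how good the reward model is.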

Beam Search Over Steps

Beam search explores reasoning STEP BY STEP. At each step, it keeps the top-k most promising partial reasoning paths (scored by a PRM), expands each by generating possible next steps, then prunes back to the top-k again. This focuses compute on promising paths rather than committing to a single chain (like CoT) or sampling blindly (like self-consistency). The PRM is essential here — it scores the partial paths to decide which to keep.
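The loop described above can be sketched compactly. Here `expand`, `score`, and `is_done` are assumed callables standing in for the step generator, the PRM, and a completion check; the toy versions below are hard-coded for the juggler problem and are not a real model or reward model.

```python
def step_beam_search(expand, score, is_done, beam_width=2, max_steps=10):
    """Beam search over reasoning steps.

    Assumed callables: expand(path) -> list of candidate next steps,
    score(path) -> float (a PRM stand-in), is_done(path) -> bool.
    A path is a tuple of step strings.
    """
    beam = [()]
    for _ in range(max_steps):
        candidates = []
        for path in beam:
            if is_done(path):
                candidates.append(path)          # carry finished paths forward
            else:
                candidates.extend(path + (step,) for step in expand(path))
        if not candidates:
            break
        # Prune back to the top-k paths by PRM score.
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        if all(is_done(p) for p in beam):
            break
    return beam[0]

def expand(path):
    """Toy step generator: halve or double the current number."""
    x = 16 if not path else int(path[-1].split("=")[1])
    return [f"{x} / 2 = {x // 2}", f"{x} * 2 = {x * 2}"]

def toy_prm(path):
    """Toy PRM: for this problem, division steps are valid, others are not."""
    return sum(1 if "/" in step else -1 for step in path)

best = step_beam_search(expand, toy_prm, is_done=lambda p: len(p) == 2)
print(best)
```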

Monte Carlo Tree Search (MCTS)

MCTS, the search algorithm at the core of AlphaGo, is the most elaborate search method covered here. It treats reasoning as a tree where nodes are partial solutions and branches are possible next steps. It balances EXPLORATION (trying new branches) and EXPLOITATION (pursuing promising ones), using simulations to estimate which branches lead to correct answers. Applied to reasoning, MCTS can find solution paths that greedy or sampled approaches miss.

text•MCTS for reasoning (high level) (Pseudocode)
# Tree: nodes = partial reasoning states, edges = next steps
repeat for a compute budget:
    1. SELECT: walk the tree to a promising leaf (balance explore/exploit)
    2. EXPAND: generate candidate next reasoning steps
    3. SIMULATE: roll out to an answer (or score with a PRM/value model)
    4. BACKPROPAGATE: update value estimates up the tree
return the best reasoning path found
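The SELECT step needs a concrete rule for balancing exploration and exploitation. One common choice is UCT, sketched below; the pseudocode above does not say which rule is intended, so treat this as one option, and the exploration constant `c` as a tunable assumption.

```python
import math

def uct_score(child_value_sum, child_visits, parent_visits, c=1.4):
    """UCT: average value (exploit) plus a bonus for rarely tried branches (explore)."""
    if child_visits == 0:
        return math.inf                      # always try unvisited branches first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Two branches with the same average value (0.5); the less-visited one scores higher.
print(round(uct_score(5.0, 10, 100), 3))
print(round(uct_score(0.5, 1, 100), 3))
```

During selection, the search descends into the child with the highest score at each level. As a branch accumulates visits its bonus shrinks, so the search drifts toward branches whose value estimates hold up.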

⚠️

Search Is Powerful but Expensive — and Not Always Worth It

Sophisticated search (beam, MCTS) can substantially boost reasoning accuracy, but at a steep compute cost: many model calls per problem, plus a reward model to guide the search. It also adds significant engineering complexity. A striking finding from the R1 work is that large-scale RL (GRPO) can produce a model that reasons so well ON ITS OWN that elaborate inference-time search adds little: the reasoning is baked into the weights and generated in a single (long) chain. The R1 authors also report that their own experiments with process reward models and MCTS did not pay off at scale.

So there is a tension: invest compute in TRAINING the model to reason (RL), or in SEARCHING at inference (MCTS)? The field's center of gravity has shifted toward the former — train the reasoning in — with inference-time search reserved for squeezing out extra performance on the hardest problems. Match the method to the budget and the stakes.

One of the most remarkable findings in recent AI is that sophisticated reasoning behaviours can EMERGE from reinforcement learning on verifiable rewards — without ever being explicitly taught. The DeepSeek-R1 work documented this vividly, and it is worth understanding because it reveals something deep about how capabilities arise.

R1-Zero: Pure RL From a Base Model

DeepSeek's 'R1-Zero' experiment applied GRPO with verifiable rewards directly to a BASE model — no SFT on reasoning traces first, no demonstrations of how to reason. The only signal was: does the final answer match? Remarkably, over the course of RL training, the model spontaneously developed reasoning strategies: it began producing longer chains of thought, breaking problems into steps, and — most strikingly — CHECKING and CORRECTING its own work.

✧

The 'Aha Moment'

The DeepSeek-R1 paper described an 'aha moment' observed during training: the model spontaneously learned to PAUSE, RECONSIDER, and BACKTRACK — writing things like 'wait, let me reconsider this step' — without ever being shown such behaviour. This self-correction emerged purely because it led to more correct answers, and the RL reinforced whatever led to correct answers.

This is a profound demonstration: the RL did not teach the model HOW to reason; it rewarded CORRECT answers, and the model discovered effective strategies (decomposition, verification, backtracking) on its own. The capability was latent in the base model; verifiable-reward RL surfaced and sharpened it. It is the reasoning analogue of how RLHF surfaced helpfulness, but driven by a correctness signal that is far harder to game than a learned reward model.

Reasoning Length Grows On Its Own

A clean signature of this emergence: as RL training progresses, the model's reasoning chains get LONGER, with no explicit pressure to do so. The model learns that thinking more leads to more correct answers, so it naturally thinks more. This is test-time compute scaling (Section 25.5) arising endogenously from training — the model teaches ITSELF to use more inference compute because doing so earns more reward.

text•Emergent test-time scaling during RL
As RL training proceeds:
    reasoning chain length  ↑   (model learns to think more)
    accuracy on hard problems  ↑   (more thinking → more correct)

# No explicit length reward -- longer reasoning emerges because it WORKS.

▶

ML Connection: Why This Is a Big Deal

Emergent reasoning from RL matters for two reasons. First, practically: it means you can create powerful reasoning models without painstakingly collecting reasoning traces — the RL discovers the reasoning, given only answer-checking. Second, conceptually: it suggests that complex cognitive behaviours can arise from simple optimization pressure (reward correctness) applied to a capable base model, much as biological intelligence arose from the simple pressure of survival.

This connects to the emergence debate of Chapter 16: capabilities that look discontinuous and surprising can arise from smooth optimization. Reasoning, it turns out, is one such capability — not hand-built, but grown under the right reward.

Let us consolidate how the pieces fit into the reasoning models you can actually use, and the practical considerations of deploying them.

The Modern Reasoning Recipe

Pipeline Flow: How a modern reasoning model is built

1	Base + SFT	Start from a pretrained, instruction-tuned model
2	Cold-start traces	Optionally SFT on some high-quality reasoning traces first
3	RLVR (GRPO)	Large-scale RL with verifiable rewards — the reasoning engine
4	Distill	Distill the reasoning into smaller models via SFT on traces
5	Deploy	Serve with a controllable 'thinking' budget per query

Reasoning Tokens and the Thinking Budget

In deployment, reasoning models generate an (often hidden) chain of 'thinking' tokens before the final answer. The amount of thinking is a controllable budget: more thinking for hard problems, less for easy ones. Systems may let the user or the model itself decide how long to think. This is the test-time compute dial made practical: you pay for extra reasoning only when the problem warrants it.

When to Use a Reasoning Model

Use a reasoning model	A standard model is fine
Math, logic, multi-step problems	Simple factual lookup
Competitive coding, proofs	Casual conversation
Complex planning	Summarization, rewriting
Problems where correctness is critical	Quick, low-stakes responses
Worth paying for extra thinking	Latency and cost matter most

✧

Reward Note: Reasoning Has a Cost

Reasoning models are slower and more expensive per query, because they generate many thinking tokens before answering. For a simple question, this is wasteful — you pay for thousands of tokens of deliberation to answer something that needed none. The practical skill is matching the tool to the task: use reasoning models for genuinely hard problems, and faster standard models for everything else.

Many systems now route automatically — detecting hard queries and engaging more reasoning, while answering easy ones quickly. This 'adaptive compute' is the natural endpoint of test-time scaling: spend thinking where it pays off.
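A router of this kind can be sketched in a few lines. The version below is a toy under stated assumptions: the keyword cues, the weights, and the 0.4 threshold are invented for illustration, and a production router would more plausibly use a learned difficulty classifier than a regular expression.

```python
import re

# Toy cues that a query needs multi-step work. Purely illustrative.
HARD_CUES = re.compile(
    r"\b(prove|derive|how many|step by step|optimi[sz]e|debug|plan)\b",
    re.IGNORECASE,
)


def estimate_difficulty(query: str) -> float:
    """Return a rough score in [0, 1]; higher means 'probably needs reasoning'."""
    cue_hits = len(HARD_CUES.findall(query))
    has_numbers = bool(re.search(r"\d", query))
    score = 0.3 * cue_hits + (0.2 if has_numbers else 0.0)
    return min(score, 1.0)


def route(query: str, threshold: float = 0.4) -> str:
    """Pick which (hypothetical) backend should serve the query."""
    return "reasoning" if estimate_difficulty(query) >= threshold else "standard"


print(route("What is the capital of France?"))
print(route("A juggler has 16 balls. Half are golf balls, and half of the "
            "golf balls are blue. How many blue golf balls are there?"))
```

The first query carries no cues and goes to the standard model. The second matches a cue and contains numbers, so it is sent to the reasoning model. The threshold is the knob that trades cost against accuracy.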

Reasoning models are a major advance, but the area is young and full of open questions and limitations. A clear-eyed view of these is part of understanding the technology honestly.

Is the Reasoning Faithful?

A deep and unsettling question: does the model's written reasoning actually reflect HOW it reached its answer? Studies suggest the chain of thought is not always FAITHFUL — the model may reach an answer by other means and produce a plausible-looking rationalization, or its stated reasoning may not be what actually drove the conclusion. This matters enormously for trust and safety: if we rely on the reasoning chain to understand or verify the model's thinking, but the chain is not faithful, we are misled.

⚠️

Chain-of-Thought May Not Show the Real Reasoning

It is tempting to treat a model's reasoning chain as a window into its 'thought process'. But the chain is generated text, optimized to lead to good answers — not a guaranteed transcript of the model's actual computation. A model can produce correct-looking reasoning that does not match how it really arrived at the answer, or can be influenced by factors it does not mention. Faithfulness of CoT is an active and important research area, especially for safety (Chapter 26).

The practical caution: do not over-trust the reasoning chain as an explanation. It is useful and often illuminating, but it is not a verified account of the model's internal process. Treat it as a helpful artifact, not ground truth about the model's cognition.
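One way to probe faithfulness, in the spirit of the truncation tests used in the faithfulness literature, is to cut the chain short and see whether the final answer changes. The sketch below assumes a hypothetical `answer_fn(question, partial_steps)` that returns the model's answer when it is forced to respond after a given prefix of its reasoning. The two toy "models" are stand-ins so that the code runs.

```python
def early_answering_sensitivity(question, steps, answer_fn):
    """Fraction of truncated chains whose answer differs from the full-chain answer.

    Near 0.0: the answer barely depends on the written reasoning, a hint
    (not proof) that the chain is a post-hoc rationalization.
    Higher values: removing reasoning changes the answer.
    """
    full_answer = answer_fn(question, steps)
    # Truncation points: keep 0 steps, 1 step, ..., len(steps) - 1 steps.
    truncated_answers = [answer_fn(question, steps[:k]) for k in range(len(steps))]
    if not truncated_answers:
        return 0.0
    changed = sum(a != full_answer for a in truncated_answers)
    return changed / len(truncated_answers)


# Toy stand-ins for a model (assumptions for illustration only).
def uses_reasoning(question, partial_steps):
    # Answers with the last number written down so far, else a naive guess.
    return partial_steps[-1].split("=")[-1].strip() if partial_steps else "16"


def ignores_reasoning(question, partial_steps):
    # Always gives the same answer, whatever the chain says.
    return "4"


steps = ["16 / 2 = 8", "8 / 2 = 4"]
print(early_answering_sensitivity("blue golf balls?", steps, uses_reasoning))     # 1.0
print(early_answering_sensitivity("blue golf balls?", steps, ignores_reasoning))  # 0.0
```

Both toy models give the correct answer with the full chain, yet only the first one depends on it. A score like this is evidence, not a verdict: on an easy problem a model may simply not need the chain.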

Overthinking and Other Failure Modes

Issue	What happens
Overthinking	Model rambles for thousands of tokens on easy problems, wasting compute
Unfaithful CoT	Stated reasoning doesn't match the actual basis for the answer
Reasoning then ignoring it	Model produces good reasoning, then gives a contradicting answer
Verifiable-only	RLVR works for math/code; hard to extend to open-ended tasks
Cost	Long reasoning is slow and expensive per query
Reward hacking (subtle)	Even verifiable rewards can be gamed (e.g. guessing formats)
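The 'reasoning then ignoring it' row can be checked mechanically on arithmetic tasks. This is a minimal sketch built on two assumptions: that traces end with a line of the form `Answer: <number>`, and that the last `= <number>` in the reasoning is its conclusion. Both are conventions invented here, not a standard format.

```python
import re

_NUM = r"-?\d+(?:\.\d+)?"


def contradicts_own_reasoning(trace: str):
    """Compare the reasoning's last computed value with the stated answer.

    Returns True if they disagree, False if they agree, and None when the
    trace cannot be parsed under the assumed format.
    """
    answer_match = re.search(rf"Answer:\s*({_NUM})", trace)
    if answer_match is None:
        return None
    reasoning = trace[: answer_match.start()]
    conclusions = re.findall(rf"=\s*({_NUM})", reasoning)
    if not conclusions:
        return None
    return float(conclusions[-1]) != float(answer_match.group(1))


good = "16 / 2 = 8 golf balls.\n8 / 2 = 4 blue golf balls.\nAnswer: 4"
bad = "16 / 2 = 8 golf balls.\n8 / 2 = 4 blue golf balls.\nAnswer: 8"
print(contradicts_own_reasoning(good))  # False
print(contradicts_own_reasoning(bad))   # True
```

A check like this only catches the crudest contradictions. It says nothing about unfaithful chains that happen to end on the same number as the answer.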

The Frontier of Reasoning

Key open questions remain. Can verifiable-reward RL extend beyond math and code to open-ended reasoning where correctness is not checkable? How do we make reasoning faithful and trustworthy? How much can test-time compute scale before it plateaus? Can reasoning models reason about their own uncertainty and know when to think more? These questions are at the active frontier, and Chapter 35 (Open Problems) returns to several of them.

▶

ML Connection: Reasoning and the Future of Scaling

Test-time compute scaling has partly relieved the pressure of the training-time data wall (Chapter 16): instead of needing ever-more pretraining data, the field can invest in teaching models to reason and to use inference compute well. This has reshaped the trajectory of the field — the frontier of 2024–2025 was defined by reasoning, not just bigger pretraining.

Where this leads is one of the most important open questions in AI. Reasoning turned 'how long the model thinks' into a capability dial, and we are still discovering how far that dial can turn. It is a fitting note on which to approach the final alignment chapter and, later, the open problems of Part VII.

Reasoning Quick-Reference

Concept	Key idea	Remember
Chain-of-thought	Reason step by step before answering	Tokens = computation
Zero-shot CoT	'Let's think step by step'	Elicits latent reasoning
Self-consistency	Sample N paths, majority-vote	Test-time compute in disguise
Test-time scaling	Think longer for harder problems	New axis beyond model size
RLVR	RL with verifiable correctness rewards	No learned reward model to hack
GRPO for reasoning	Group of attempts, verify, reinforce	The R1 engine
ORM vs PRM	Final answer vs per-step rewards	PRMs give denser feedback
Search	Best-of-N, beam, MCTS	Spend compute strategically
Emergent reasoning	RL discovers reasoning unprompted	The R1 'aha moment'
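The RLVR and GRPO rows fit together in a few lines. The sketch below is not a training loop. It assumes an exact-match verifier on a final `Answer:` line, and it uses the usual group normalization (reward minus the group mean, divided by the group standard deviation). The parsing format, the epsilon, and the choice of population standard deviation are assumptions, and implementations differ on such details.

```python
import re
from statistics import mean, pstdev


def verify(completion: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 if the parsed final answer matches, else 0.0."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0
    return 1.0 if float(match.group(1)) == float(ground_truth) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of four sampled attempts (toy strings).
group = [
    "16 / 2 = 8, 8 / 2 = 4. Answer: 4",
    "Half of 16 is 8. Answer: 8",
    "8 golf balls, half are blue. Answer: 4",
    "I am not sure.",
]
rewards = [verify(c, "4") for c in group]
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Correct attempts receive positive advantages and incorrect ones negative. If every attempt in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.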

Exercises

Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.

✎

Exercise 1: Pen & Paper

Explain why a model forced to answer immediately struggles with multi-step problems. Connect your answer to the fixed computation of a single forward pass.

✎

Exercise 2: Pen & Paper

Explain the System 1 / System 2 analogy for reasoning. Which mode does direct answering use, and which does chain-of-thought provide?

✎

Exercise 3: Pen & Paper

Give two distinct reasons chain-of-thought improves accuracy (the computation argument and the decomposition argument). Illustrate each with an example.

✎

Exercise 4: Pen & Paper

Why does CoT not teach the model anything new? Connect to the superficial alignment hypothesis from Chapter 22.

✎

Exercise 5: Pen & Paper

Explain self-consistency. Why does majority voting over sampled reasoning paths beat a single greedy path? When does it NOT apply?

✎

Exercise 6: Pen & Paper

Describe test-time compute scaling and contrast it with train-time scaling (Chapter 16). Why is it a controllable, per-query dial?

✎

Exercise 7: Pen & Paper

Explain RLVR. Why does a verifiable reward avoid the reward-hacking problem that plagues RLHF (Chapter 23)?

✎

Exercise 8: Pen & Paper

Explain why GRPO is well-suited to reasoning with verifiable rewards. What does the group provide, and why is no value or reward model needed?

✎

Exercise 9: Pen & Paper

Compare outcome (ORM) and process (PRM) reward models. Explain the credit-assignment argument for why a PRM's per-step feedback is a more useful training signal than an ORM's single final-answer score.

✎

Exercise 10: Pen & Paper

Describe one search method (best-of-N, beam, or MCTS) over reasoning paths. How does a PRM guide step-level search, and why can't an ORM?

✎

Exercise 11: Pen & Paper

Explain the 'aha moment' and emergent reasoning in R1-style training. Why is it significant that reasoning emerged from RL without demonstrations?

✎

Exercise 12: Code

Implement zero-shot and few-shot CoT prompting for a small model on arithmetic word problems. Compare accuracy with and without the reasoning prompt.

✎

Exercise 13: Code

Implement self-consistency: sample N CoT paths, parse the final answers, and majority-vote. Plot accuracy as a function of N.

✎

Exercise 14: Code

Build a test-time scaling curve: measure accuracy on a reasoning benchmark as you increase reasoning length and/or sample count. Check whether accuracy rises with compute, and note where the gains flatten.

✎

Exercise 15: Code

Implement a verifiable reward for arithmetic: parse the model's final answer and check it against ground truth. Return 1/0. Test it on a set of generated solutions.

✎

Exercise 16: Code Lab

Build a rejection-sampling reasoning dataset: generate many CoT solutions per problem, keep only verified-correct ones, and SFT a model on them. Show improved reasoning accuracy.

✎

Exercise 17: Code Lab

Implement RLVR with your GRPO loop from Chapter 23: for each problem, sample a group, verify correctness, compute group-relative advantages, and update. Track accuracy and reasoning length over training.

✎

Exercise 18: Code

Implement a simple process reward model interface: score each step of a reasoning trace as valid/invalid. Use it to re-rank a set of candidate solutions and compare to outcome-only scoring.

✎

Exercise 19: Code

Implement best-of-N with an outcome reward model: generate N solutions, score each, return the best. Compare its accuracy to self-consistency at the same N.

✎

Exercise 20: Code

Implement step-level beam search over reasoning: at each step keep the top-k partial paths scored by a (mock) PRM. Compare to greedy CoT on a multi-step task.

✎

Exercise 21: Code

Measure overthinking: on a set of easy and hard problems, compare reasoning length and accuracy. Show the model wastes tokens on easy problems and propose a stopping heuristic.

✎

Exercise 22: Code (Challenge)

Build a complete mini reasoning-model pipeline: (1) start from a small instruction-tuned model, (2) build a verifier for a math task, (3) train with GRPO + verifiable rewards, (4) observe whether reasoning length grows and accuracy improves over training, and (5) compare the RL-trained model against pure self-consistency and best-of-N at matched inference compute. Write up whether the RL 'baked in' the reasoning and how it traded against inference-time search.

Further reading: “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al., 2022). “Large Language Models are Zero-Shot Reasoners” (Kojima et al., 2022). “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (Wang et al., 2022). “Let's Verify Step by Step” (Lightman et al., 2023), for process reward models. “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters” (Snell et al., 2024). The DeepSeek-R1 report (2025), for GRPO-driven emergent reasoning and the 'aha moment'. The OpenAI o1 system card, for the test-time-compute paradigm. “Measuring Faithfulness in Chain-of-Thought Reasoning” (Lanham et al., 2023).

Next → Chapter 26: Constitutional AI & Safety

You have now built a model that follows instructions (Chapter 22), aligns with human preferences (Chapters 23–24), and reasons through hard problems (this chapter). The final piece of Part V is making it SAFE and trustworthy. Chapter 26 covers Constitutional AI — using a written set of principles and AI feedback (RLAIF) to make a model harmless without exhaustive human labeling — along with red-teaming, jailbreak resistance, and the broader project of aligning models with human values. It closes Part V by turning a capable, helpful, reasoning assistant into one that is also honest and harmless — the full 'helpful, harmless, honest' ideal.
