Reasoning & Chain-of-Thought
So far, our models answer in one shot: given a prompt, they generate a response directly. For many tasks this is fine. But for problems that require multiple steps of REASONING — a math word problem, a logic puzzle, a multi-hop question — direct answering fails surprisingly often, even for very capable models. This chapter is about giving models room to THINK before they answer, and the dramatic capability gains that follow.
The Core Difficulty
Consider asking a model: 'A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. How many blue golf balls are there?' To answer correctly, you must compute 16/2 = 8 golf balls, then 8/2 = 4 blue golf balls. That is two sequential steps. A model forced to emit the answer immediately has to do BOTH steps in a single forward pass, with no room to write down the intermediate result — and it often gets it wrong.
Answering 8 is a classic error: the model grabbed an intermediate number without finishing the computation. A response that works through the steps gets 4. The difference is not knowledge (both responses 'know' the arithmetic); it is whether the model was given room to perform the steps sequentially.
A Useful Analogy: System 1 and System 2
Psychologists describe two modes of human thinking: System 1 (fast, automatic, intuitive) and System 2 (slow, deliberate, step-by-step). Answering '2+2' uses System 1; working out a long-division problem uses System 2. A model answering directly is like a person forced to blurt out the first thing that comes to mind — pure System 1. Reasoning techniques give the model a System 2 mode: the ability to slow down and work through a problem deliberately.
The simplest way to give a model room to reason is also one of the most important discoveries in prompting: chain-of-thought (CoT). Instead of asking for the answer directly, you prompt the model to produce its REASONING first, then the answer. Astonishingly, this single change dramatically improves performance on reasoning tasks — with no training at all.
Few-Shot Chain-of-Thought
Chain-of-thought prompting was introduced by Wei et al. (2022). The idea: in your few-shot examples (Chapter 21), show not just the answer but the step-by-step reasoning that leads to it. The model then imitates this pattern, producing its own reasoning steps before answering the new question.
# A few-shot example that DEMONSTRATES reasoning, not just the answer:
Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 2 × 3 = 6 balls.
5 + 6 = 11. The answer is 11.
# Now the model imitates the step-by-step pattern on a NEW question:
Q: A cafe had 23 apples. They used 20 and bought 6 more. How many now?
A: <- the model now generates reasoning before answering
The cafe started with 23 apples. They used 20, leaving 23 - 20 = 3.
They bought 6 more: 3 + 6 = 9. The answer is 9.
Zero-Shot Chain-of-Thought
Even simpler: Kojima et al. (2022) found that just appending the phrase 'Let's think step by step' to a prompt triggers reasoning behaviour, with NO examples needed. This 'zero-shot CoT' is remarkably effective — a single magic phrase that flips the model from System-1 blurting to System-2 deliberation.
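Both prompting styles come down to how the prompt string is assembled. The sketch below is a minimal illustration, not a fixed recipe: the function name `build_cot_prompt` and the `Q:`/`A:` layout are assumptions chosen to mirror the examples in this section.

```python
def build_cot_prompt(question, examples=None):
    """Assemble a chain-of-thought prompt (hypothetical helper).

    examples: optional list of (question, worked_solution) pairs.
    With examples    -> few-shot CoT: the model imitates the worked solutions.
    Without examples -> zero-shot CoT: append the trigger phrase instead.
    """
    if examples:
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
        return f"{shots}\n\nQ: {question}\nA:"
    return f"Q: {question}\nA: Let's think step by step."


# One worked demonstration, taken from the few-shot example above.
roger = (
    "Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?",
    "Roger started with 5 balls. 2 cans of 3 balls each is 2 × 3 = 6 balls. "
    "5 + 6 = 11. The answer is 11.",
)

few_shot = build_cot_prompt(
    "A cafe had 23 apples. They used 20 and bought 6 more. How many now?",
    examples=[roger],
)
zero_shot = build_cot_prompt(
    "A juggler has 16 balls, half golf balls, half of those blue. Blue golf balls?"
)

print(few_shot)
print("---")
print(zero_shot)
```

The few-shot prompt ends with a bare `A:` so the model continues with its own reasoning; the zero-shot prompt ends with the trigger phrase so the reasoning starts immediately after it.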
# Without the magic phrase -- model often answers wrong:
Q: A juggler has 16 balls, half golf balls, half of those blue. Blue golf balls?
A: -> '8' (wrong, grabbed an intermediate number)
# WITH the magic phrase -- model reasons and gets it right:
Q: A juggler has 16 balls, half golf balls, half of those blue. Blue golf balls?
A: Let's think step by step.
-> '16 / 2 = 8 golf balls. 8 / 2 = 4 blue. The answer is 4.' (correct)
# The phrase costs nothing but unlocks step-by-step computation.
Why does generating reasoning steps help so much? The answer connects back to the intuition from Section 25.1 and reveals something deep about how these models compute. Understanding it makes everything later in the chapter clearer.
Reasoning Tokens Are Extra Computation
Recall that each generated token is one forward pass — one fixed increment of computation. When the model generates a chain of reasoning, each reasoning token is an additional forward pass, and crucially, each one can ATTEND to all the previous reasoning tokens. So the model builds up a solution incrementally: step one is computed and written down, step two attends to step one, step three attends to both, and so on. The reasoning chain is a SCRATCHPAD that holds intermediate results the model can build on.
Direct answer: 1 forward pass must solve the whole problem.
CoT with k reasoning tokens: k+1 forward passes,
each attending to all prior reasoning — a working scratchpad.
More reasoning tokens → more total computation → harder problems solvable.
Decomposing Hard Problems Into Easy Steps
There is a second reason CoT works. A hard problem is often a chain of easy steps. Each individual step — '16 / 2 = 8' — is something the model can do reliably in one forward pass. By breaking the problem into these easy steps and writing each down, the model converts one hard problem (which it would fail) into a sequence of easy problems (which it can each solve). The scratchpad lets it tackle them one at a time.
Chain-of-thought generates ONE reasoning path. But a single path can go wrong — a careless slip early on dooms the whole chain. Self-consistency (Wang et al., 2022) improves reliability with a simple idea: generate MANY reasoning paths (by sampling), then take the most common final answer. It is the reasoning equivalent of asking many people and going with the majority.
How Self-Consistency Works
# Generate diverse reasoning paths, then vote on the answer
1. sample N chain-of-thought responses (with temperature > 0)
2. extract the final answer from each path
3. return the MOST COMMON final answer (majority vote)
# Different paths may reason differently but converge on the truth;
# errors tend to be idiosyncratic and get out-voted.
The intuition: there are many correct ways to reason to the right answer, but each wrong path tends to err in its own idiosyncratic way. So the correct answer recurs across many paths (they converge on it), while the wrong answers scatter. Majority voting surfaces the convergent, correct answer and out-votes the scattered errors.
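This intuition can be checked with a small simulation. The toy model below is an assumption made purely for illustration, not measured data: each sampled path is correct with probability `p_correct`, and otherwise lands uniformly on one of `n_wrong` distinct wrong answers. Even when a single path is right less than half the time, the vote becomes reliable as the number of paths grows, because the wrong answers split their votes.

```python
import random
from collections import Counter


def vote_accuracy(n_paths, p_correct=0.4, n_wrong=10, trials=2000, seed=0):
    """Fraction of trials in which majority voting returns the right answer.

    Toy assumption: a path is correct with probability p_correct; otherwise
    it picks one of n_wrong distinct wrong answers uniformly at random.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        answers = []
        for _ in range(n_paths):
            if rng.random() < p_correct:
                answers.append("correct")
            else:
                answers.append(f"wrong-{rng.randrange(n_wrong)}")
        # Majority vote over this trial's sampled answers
        if Counter(answers).most_common(1)[0][0] == "correct":
            wins += 1
    return wins / trials


for n in (1, 5, 20, 40):
    print(n, vote_accuracy(n))
```

The effect depends on the scattering assumption: if wrong paths tended to agree on the same wrong answer, voting would reinforce the error rather than remove it.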
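import re
from collections import Counter
def extract_final_answer(cot):
    """Parse 'The answer is X' from a reasoning trace; return None if absent."""
    match = re.search(r"The answer is\s*(\S+?)\.?(?:\s|$)", cot)
    return match.group(1) if match else None
def self_consistency(model, prompt, n_samples=16, temperature=0.7):
    """Sample N reasoning paths and majority-vote the final answer."""
    # `model` is any object exposing generate(prompt, temperature=...) -> str
    answers = []
    for _ in range(n_samples):
        # Sample a full chain-of-thought (temperature gives diversity)
        cot = model.generate(prompt + " Let's think step by step.",
                             temperature=temperature)
        ans = extract_final_answer(cot)  # parse 'The answer is X'
        if ans is not None:              # skip paths with no parsable answer
            answers.append(ans)
    # Majority vote over the final answers (None if nothing could be parsed)
    return Counter(answers).most_common(1)[0][0] if answers else None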
# More samples -> higher accuracy, up to a point. On math benchmarks,
# self-consistency with 16-40 paths can add many points over single CoT.
# This is our first taste of TRADING INFERENCE COMPUTE for accuracy.
We now reach the central idea of modern reasoning. Chapter 16's scaling laws were about TRAINING compute: bigger models, more data. But CoT and self-consistency hint at a SECOND scaling axis: TEST-TIME compute. Instead of (or in addition to) training a bigger model, you let the model think LONGER at inference. The o1 models from OpenAI (2024) and the R1 models from DeepSeek (2025) made this the defining paradigm of frontier reasoning.
Two Axes of Scaling
| Train-time scaling (Chapter 16) | Test-time scaling (this chapter) |
|---|---|
| Bigger model, more data | More reasoning at inference |
| Expensive, one-time cost | Cost per query, controllable |
| Fixed capability after training | Capability scales with thinking |
| The 2018–2023 paradigm | The 2024+ reasoning paradigm |
| 6ND FLOPs (Chapter 16) | FLOPs × reasoning length |
| Better weights | Better use of the weights |
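The FLOPs row of the table can be made concrete with a rough accounting. The training estimate is the 6ND rule from Chapter 16. The per-token inference cost of about 2N FLOPs is a standard approximation assumed here (a forward pass only, ignoring attention overhead), and the symbols T (reasoning tokens per sample) and S (samples per query) are introduced just for this sketch.

```latex
% N = parameters, D = training tokens,
% T = reasoning tokens per sample, S = samples (or search width) per query
C_{\text{train}} \approx 6\,N\,D
\qquad\text{(paid once)}
\\[4pt]
C_{\text{test}} \approx 2\,N \cdot T \cdot S
\qquad\text{(paid per query)}
\\[4pt]
% Illustrative numbers: N = 7\times10^{9},\; T = 1000,\; S = 16
C_{\text{test}} \approx 2 \cdot 7\times10^{9} \cdot 1000 \cdot 16
 = 2.24\times10^{14}\ \text{FLOPs per query}
```

Under these assumptions, doubling either the reasoning length or the number of samples doubles the cost of that one query and leaves the weights untouched, which is what makes test-time compute a per-query dial.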
The Test-Time Scaling Curve
The striking empirical finding: for reasoning tasks, accuracy improves smoothly and predictably as you allow more test-time compute, whether through longer reasoning chains, more samples, or more search. Just as training loss falls predictably with training compute (Chapter 16), task accuracy rises predictably with test-time compute. This gave the field a new, controllable dial: spend more per query to get a better answer.
accuracy ≈ increasing function of (test-time compute)
test-time compute ≈ (reasoning tokens) × (samples or search width)
# More thinking → higher accuracy, smoothly, up to a ceiling.
# You can 'buy' accuracy at inference, per query.
Why This Changes the Game
Test-time compute scaling matters because it partially sidesteps the data wall and the cost of ever-bigger pretraining (Chapter 16). Rather than spending more and more to train a bigger model, you can invest in teaching a model to USE inference compute well — to reason, search, and verify. The rest of this chapter is about how to TRAIN models to do this, since prompting alone (CoT, self-consistency) only goes so far.
Prompting (CoT, self-consistency) elicits reasoning that is already latent in the model. But we can do far better by TRAINING the model to reason — making step-by-step thinking a deeply learned default rather than something coaxed out by a prompt. There are two main training approaches, building on Part V's earlier methods.
Approach 1: SFT on Reasoning Traces
The most direct method is supervised fine-tuning (Chapter 22) on examples that include full reasoning traces. Collect problems paired with detailed step-by-step solutions, and fine-tune the model to produce reasoning-then-answer. This teaches the model to reason by default, even without a 'think step by step' prompt. The reasoning traces can come from humans, from a stronger model (distillation), or from the model's own correct solutions (filtered by whether they got the right answer).
Pipeline Flow: Building a reasoning dataset by rejection sampling
| Step | Stage | What happens |
|---|---|---|
| 1 | Sample | Generate many CoT solutions per problem from the model |
| 2 | Verify | Keep only solutions that reach the CORRECT final answer |
| 3 | Filter | Optionally keep only clean, concise correct traces |
| 4 | SFT | Fine-tune the model on these verified reasoning traces |
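The four pipeline steps can be sketched in a few lines. Everything named here is hypothetical: `sample_fn` stands in for sampling a chain of thought from the model, `extract_fn` for the answer parser, and the "keep the shortest two traces" rule is just one possible reading of "clean, concise". The toy sampler at the bottom exists only so the sketch runs without a real model.

```python
import random


def build_reasoning_dataset(problems, sample_fn, extract_fn,
                            n_samples=8, max_chars=2000, keep_per_problem=2):
    """Rejection-sample reasoning traces, keeping only verified-correct ones.

    problems:   list of (question, ground_truth_answer) pairs
    sample_fn:  hypothetical callable, question -> one sampled CoT string
    extract_fn: hypothetical callable, CoT string -> final answer (or None)
    """
    dataset = []
    for question, truth in problems:
        # 1. Sample: draw several reasoning traces for the same problem
        traces = [sample_fn(question) for _ in range(n_samples)]
        # 2. Verify: keep only traces whose final answer matches the truth
        correct = [t for t in traces if extract_fn(t) == truth]
        # 3. Filter: drop duplicates and over-long traces, prefer short ones
        concise = sorted({t for t in correct if len(t) <= max_chars}, key=len)
        # 4. Emit prompt/completion pairs ready for SFT
        dataset.extend({"prompt": question, "completion": t}
                       for t in concise[:keep_per_problem])
    return dataset


# Toy stand-ins so the sketch is runnable without a model.
rng = random.Random(0)


def toy_sampler(question):
    # Pretend the model is right about 60% of the time.
    if rng.random() < 0.6:
        return "16 / 2 = 8 golf balls. 8 / 2 = 4 blue. The answer is 4"
    return "16 / 2 = 8. The answer is 8"


def toy_extract(trace):
    marker = "The answer is "
    return trace.rsplit(marker, 1)[-1].strip() if marker in trace else None


problems = [("A juggler has 16 balls, half golf balls, half of those blue. "
             "Blue golf balls?", "4")]
data = build_reasoning_dataset(problems, toy_sampler, toy_extract)
print(data)
```

Note that the filter only checks the final answer, so a trace that reaches the right answer through faulty steps would still pass.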
Approach 2: Reinforcement Learning for Reasoning
SFT on traces is limited by the quality of the traces — the same imitation ceiling we saw in Chapter 23. The more powerful approach is REINFORCEMENT LEARNING: let the model generate its own reasoning, reward it when it reaches correct answers, and let it discover effective reasoning strategies on its own. This is where the RL methods of Chapters 23–24 — especially GRPO — return to center stage, and it is the subject of the next section.
The most important advance in reasoning training is reinforcement learning with VERIFIABLE rewards (RLVR). The key realization: for many reasoning tasks — math, coding, logic — we can CHECK whether the answer is correct AUTOMATICALLY, with no human or learned reward model needed. Math answers can be verified; code can be run against tests. This gives a clean, cheap, hack-resistant reward signal, and it is the perfect setting for the RL methods of Chapter 23.
Verifiable Rewards: No Reward Model Needed
Recall the central weakness of RLHF (Chapter 23): the reward model is an imperfect, hackable proxy. For verifiable tasks, that weakness largely disappears. The reward is simply whether the answer is CORRECT, computed by a checker rather than a learned model. There is no learned proxy to exploit: a wrong answer gets zero reward no matter how it is dressed up, provided the checker itself is sound. This makes RL for reasoning far more robust than RLHF for general alignment.
r(response) = 1 if the final answer is correct
= 0 otherwise
# For math: does the boxed answer match the ground truth?
# For code: do the unit tests pass?
# No learned reward model -> no reward hacking of the proxy.
GRPO for Reasoning
GRPO (Chapter 23) is ideally suited to this setting, which is exactly why DeepSeek developed it for reasoning. For each problem, the model generates a GROUP of solution attempts; each is verified (correct or not); the group-relative advantage rewards the attempts that succeeded relative to the group. No value model, no reward model — just generate a group, check correctness, and reinforce what worked. Recall the GRPO advantage from Chapter 23:
For a problem q, sample a group of G solution attempts {o₁..o_G}.
Verify each: rᵢ ∈ {0, 1} (correct or not).
Advantage Aᵢ = (rᵢ - mean(r)) / std(r)
# Correct attempts get positive advantage, wrong ones negative.
# The group of attempts at the SAME problem is the baseline.
# Models: policy (train), reference (frozen). NO reward model, NO value model.
for each batch of problems with known answers:
    for each problem q:
        sample a group of G solutions (with reasoning)
        verify each solution → reward 1 (correct) or 0 (wrong)
        compute group-relative advantages
    clipped GRPO update + KL penalty to reference
# the model learns to produce reasoning that REACHES correct answers
Verifiable rewards check only the FINAL answer. But this has a subtle problem: a model can reach the right answer through FLAWED reasoning, such as lucky guesses or errors that cancel out. And for many problems we cannot fully verify the final answer automatically. This motivates a distinction between two kinds of reward model for reasoning: outcome-based and process-based.
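To see what an outcome-only signal looks like in practice, the sketch below computes the binary verifiable reward and the group-relative advantage Aᵢ = (rᵢ - mean(r)) / std(r) for one group of attempts. Two details are assumptions rather than something the text specifies: the use of the population standard deviation, and the small `eps` added to avoid division by zero when every attempt in the group gets the same reward.

```python
from statistics import mean, pstdev


def verifiable_reward(final_answer, ground_truth):
    """Binary outcome reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer == ground_truth else 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - mean(r)) / (std(r) + eps).

    eps is an assumed safeguard: when all rewards in the group are equal,
    std(r) = 0 and every advantage comes out as 0 (no learning signal).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Final answers extracted from G = 4 attempts at the juggler problem.
answers = ["4", "8", "4", "16"]
rewards = [verifiable_reward(a, "4") for a in answers]
advantages = group_advantages(rewards)

for a, r, adv in zip(answers, rewards, advantages):
    print(f"answer={a:>2}  reward={r}  advantage={adv:+.3f}")
```

Each attempt receives a single scalar, so every step inside it is credited or blamed equally. A group in which all attempts are correct (or all wrong) contributes no gradient at all, which is why problems of intermediate difficulty carry most of the training signal.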
ORM vs PRM
| Type | Rewards | Trade-off |
|---|---|---|
| Outcome RM (ORM) | Only the final answer's correctness | Simple, but can reward flawed reasoning that reaches the right answer by luck |
| Process RM (PRM) | Each individual reasoning step | Catches flawed steps; needs step labels |
An OUTCOME reward model (ORM) judges only the end result: was the final answer right? A PROCESS reward model (PRM) judges each STEP of the reasoning: is this step valid given the previous ones? The PRM gives much denser feedback — it can tell the model exactly WHERE its reasoning went wrong, not just that the final answer was wrong.
Why Process Rewards Help
The influential paper 'Let's Verify Step by Step' (Lightman et al., 2023) showed that PRMs substantially outperform ORMs for guiding reasoning. The reason is the credit-assignment problem: with only an outcome reward, the model cannot tell which of its ten reasoning steps was the bad one — the reward is the same for all of them. A PRM pinpoints the faulty step, giving precise feedback. It is the difference between 'your answer is wrong' and 'your answer is wrong, specifically at step 4'.
Consider a trace whose second step uses multiplication instead of division and so arrives at a final answer of 16. A PRM flags the error at the exact step where it occurs (step 2). An outcome reward would only know that the final answer (16) is wrong, with no indication of where. The step-level feedback is far more useful for teaching the model to reason correctly.
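The contrast between the two signals can be made concrete with a toy trace. The example below assumes the juggler problem from earlier in the chapter, with a second step that multiplies instead of dividing. The step checker is a hand-written stand-in that compares against a reference solution; a real PRM is a learned model, and the tuple format for steps is an assumption of this sketch.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# Each step is (left, op, right, claimed_result).
reference = [(16, "/", 2, 8), (8, "/", 2, 4)]   # correct solution
attempt = [(16, "/", 2, 8), (8, "*", 2, 16)]    # step 2 multiplies by mistake


def outcome_reward(steps, truth):
    """ORM-style signal: a single bit for the whole trace."""
    return 1 if steps[-1][3] == truth else 0


def process_rewards(steps, reference_steps):
    """PRM-style signal: one label per step.

    Toy rule: a step is valid if its arithmetic holds AND it matches the
    corresponding reference step (valid arithmetic can still be the wrong move).
    """
    labels = []
    for step, ref in zip(steps, reference_steps):
        left, op, right, claimed = step
        arithmetic_ok = OPS[op](left, right) == claimed
        labels.append(1 if arithmetic_ok and step == ref else 0)
    return labels


def first_bad_step(labels):
    """1-based index of the first invalid step, or None if all are valid."""
    for i, label in enumerate(labels, start=1):
        if label == 0:
            return i
    return None


labels = process_rewards(attempt, reference)
print("outcome reward:", outcome_reward(attempt, 4))
print("process rewards:", labels)
print("first bad step:", first_bad_step(labels))
```

Note that step 2 of the faulty attempt is arithmetically correct (8 × 2 is 16); it is wrong only as a move in this problem, which is exactly the kind of error a final-answer check cannot localize.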
The Cost of Process Rewards
PRMs are more powerful but more expensive: training one requires step-level labels — humans (or models) annotating whether each reasoning step is valid. This is far more labor-intensive than labeling only final answers. Recent work reduces this cost by generating step labels automatically (e.g. via how often a step leads to correct continuations), but the data cost remains the main barrier to PRMs.
Self-consistency (Section 25.4) was a simple form of search: sample many paths and vote. More sophisticated search methods explore the space of reasoning paths more cleverly, guided by a reward model (often a PRM). These methods spend test-time compute strategically, focusing it on promising reasoning rather than scattering it randomly.
A Spectrum of Search Strategies
| Method | How it searches | Compute use |
|---|---|---|
| Best-of-N | Sample N full solutions, pick the best by reward | Simple, parallel |
| Self-consistency | Sample N, majority-vote the answer | Simple, no reward model |
| Beam search | Keep the top-k partial reasoning paths at each step | Guided, breadth-limited |
| MCTS | Tree search: expand, simulate, backpropagate value | Powerful, complex |
Best-of-N: The Simplest Search
Best-of-N generates N complete solutions and uses a reward model to pick the best one (in contrast to self-consistency, which votes). With a good outcome reward model, best-of-N is a strong, simple baseline that scales smoothly with N — more samples, better odds of a great solution. It is 'search' in the loosest sense: sample broadly, then select.
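How quickly do the odds improve with N? A simple upper bound comes from assuming the N samples are independent, each correct with probability p, and that the selector is perfect (it always picks a correct solution if one exists). Under those assumptions, which a real reward model only approximates, the chance that best-of-N succeeds is 1 - (1 - p)^N.

```python
def coverage(p, n):
    """Probability that at least one of n independent samples is correct.

    p: per-sample probability of a correct solution (assumed constant).
    This is an upper bound on best-of-N accuracy: an imperfect reward
    model may fail to select the correct sample even when one exists.
    """
    return 1.0 - (1.0 - p) ** n


for n in (1, 4, 16, 64):
    print(n, round(coverage(0.2, n), 3))
```

The curve rises steeply at first and then flattens, which matches the "up to a ceiling" behaviour of test-time scaling; in practice the gap between this bound and achieved accuracy measures how good the selector is.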
Beam Search Over Steps
Beam search explores reasoning STEP BY STEP. At each step, it keeps the top-k most promising partial reasoning paths (scored by a PRM), expands each by generating possible next steps, then prunes back to the top-k again. This focuses compute on promising paths rather than committing to a single chain (like CoT) or sampling blindly (like self-consistency). The PRM is essential here — it scores the partial paths to decide which to keep.
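The expand, score, prune loop described above can be sketched generically. In this illustration the function names and the toy task are assumptions: `expand` stands in for the model proposing next reasoning steps, and `score` stands in for a PRM. The toy task (reach 22 from 1 using +3, ×2, or -1, scored by distance to the target) is only there to make the sketch runnable.

```python
def beam_search(initial, expand, score, is_done, beam_width=3, max_steps=6):
    """Step-level beam search over partial reasoning paths.

    expand(path)  -> list of candidate next steps (the model's proposals)
    score(path)   -> float, higher is more promising (a PRM in the text)
    is_done(path) -> True once the path has reached a final answer
    """
    beam = [initial]
    for _ in range(max_steps):
        candidates = []
        for path in beam:
            if is_done(path):
                candidates.append(path)  # carry finished paths forward
            else:
                candidates.extend(path + [step] for step in expand(path))
        # Prune back to the top-k partial paths
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        if all(is_done(p) for p in beam):
            break
    return max(beam, key=score)


# Toy task: each "step" is the next running value.
TARGET = 22


def expand(path):
    x = path[-1]
    return [x + 3, x * 2, x - 1]


def score(path):
    return -abs(path[-1] - TARGET)  # stand-in for a PRM score


def is_done(path):
    return path[-1] == TARGET


best = beam_search([1], expand, score, is_done)
print(best)
```

With `beam_width=1` this reduces to greedy step-by-step decoding under the scorer; widening the beam keeps alternatives alive in case the currently best-looking path turns out to be a dead end.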
Monte Carlo Tree Search (MCTS)
MCTS, the tree-search algorithm used in AlphaGo, is the most elaborate search method in this spectrum. It treats reasoning as a tree where nodes are partial solutions and branches are possible next steps. It balances EXPLORATION (trying new branches) and EXPLOITATION (pursuing promising ones), using simulations to estimate which branches lead to correct answers. Applied to reasoning, MCTS can find solution paths that greedy or sampled approaches miss.
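The balance between exploration and exploitation is usually expressed as a selection score. One common choice is the UCT rule, sketched below; it is shown as an illustration of the idea rather than as the rule any particular reasoning system uses (AlphaGo-style systems, for instance, use a prior-weighted variant). The node representation and the constant `c=1.4` are assumptions.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCT score for one child: average value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited branches are always tried first
    exploit = value_sum / visits                              # mean value so far
    explore = c * math.sqrt(math.log(parent_visits) / visits)  # shrinks with visits
    return exploit + explore


def select_child(children, parent_visits):
    """Return the index of the child with the highest UCT score.

    children: list of dicts with 'value_sum' and 'visits' (assumed format).
    """
    scores = [uct_score(ch["value_sum"], ch["visits"], parent_visits)
              for ch in children]
    return scores.index(max(scores))


children = [
    {"value_sum": 6.0, "visits": 10},  # well explored, decent average
    {"value_sum": 1.0, "visits": 1},   # barely explored, looked good once
    {"value_sum": 0.0, "visits": 0},   # never tried
]
print(select_child(children, parent_visits=11))
```

The bonus term shrinks as a branch accumulates visits, so the search gradually shifts from trying everything to concentrating on branches whose estimated value stays high.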
# Tree: nodes = partial reasoning states, edges = next steps
repeat for a compute budget:
    1. SELECT: walk the tree to a promising leaf (balance explore/exploit)
    2. EXPAND: generate candidate next reasoning steps
    3. SIMULATE: roll out to an answer (or score with a PRM/value model)
    4. BACKPROPAGATE: update value estimates up the tree
return the best reasoning path found
One of the most remarkable findings in recent AI is that sophisticated reasoning behaviours can EMERGE from reinforcement learning on verifiable rewards, without ever being explicitly taught. The DeepSeek-R1 work documented this vividly, and it is worth understanding because it reveals something deep about how capabilities arise.
R1-Zero: Pure RL From a Base Model
DeepSeek's 'R1-Zero' experiment applied GRPO with verifiable rewards directly to a BASE model — no SFT on reasoning traces first, no demonstrations of how to reason. The only signal was: does the final answer match? Remarkably, over the course of RL training, the model spontaneously developed reasoning strategies: it began producing longer chains of thought, breaking problems into steps, and — most strikingly — CHECKING and CORRECTING its own work.
Reasoning Length Grows On Its Own
A clean signature of this emergence: as RL training progresses, the model's reasoning chains get LONGER, with no explicit pressure to do so. The model learns that thinking more leads to more correct answers, so it naturally thinks more. This is test-time compute scaling (Section 25.5) arising endogenously from training — the model teaches ITSELF to use more inference compute because doing so earns more reward.
As RL training proceeds:
reasoning chain length ↑ (model learns to think more)
accuracy on hard problems ↑ (more thinking → more correct)
# No explicit length reward -- longer reasoning emerges because it WORKS.
Let us consolidate how the pieces fit into the reasoning models you can actually use, and the practical considerations of deploying them.
The Modern Reasoning Recipe
Pipeline Flow: How a modern reasoning model is built
| Step | Stage | What happens |
|---|---|---|
| 1 | Base + SFT | Start from a pretrained, instruction-tuned model |
| 2 | Cold-start traces | Optionally SFT on some high-quality reasoning traces first |
| 3 | RLVR (GRPO) | Large-scale RL with verifiable rewards (the reasoning engine) |
| 4 | Distill | Distill the reasoning into smaller models via SFT on traces |
| 5 | Deploy | Serve with a controllable 'thinking' budget per query |
Reasoning Tokens and the Thinking Budget
In deployment, reasoning models generate an (often hidden) chain of 'thinking' tokens before the final answer. The amount of thinking is a controllable budget: more thinking for hard problems, less for easy ones. Systems may let the user or the model itself decide how long to think. This is the test-time compute dial made practical: you pay for extra reasoning only when the problem warrants it.
When to Use a Reasoning Model
| Use a reasoning model | A standard model is fine |
|---|---|
| Math, logic, multi-step problems | Simple factual lookup |
| Competitive coding, proofs | Casual conversation |
| Complex planning | Summarization, rewriting |
| Problems where correctness is critical | Quick, low-stakes responses |
| Worth paying for extra thinking | Latency and cost matter most |
Reasoning models are a major advance, but the area is young and full of open questions and limitations. A clear-eyed view of these is part of understanding the technology honestly.
Is the Reasoning Faithful?
A deep and unsettling question: does the model's written reasoning actually reflect HOW it reached its answer? Studies suggest the chain of thought is not always FAITHFUL — the model may reach an answer by other means and produce a plausible-looking rationalization, or its stated reasoning may not be what actually drove the conclusion. This matters enormously for trust and safety: if we rely on the reasoning chain to understand or verify the model's thinking, but the chain is not faithful, we are misled.
Overthinking and Other Failure Modes
| Issue | What happens |
|---|---|
| Overthinking | Model rambles for thousands of tokens on easy problems, wasting compute |
| Unfaithful CoT | Stated reasoning doesn't match the actual basis for the answer |
| Reasoning then ignoring it | Model produces good reasoning, then gives a contradicting answer |
| Verifiable-only | RLVR works for math/code; hard to extend to open-ended tasks |
| Cost | Long reasoning is slow and expensive per query |
| Reward hacking (subtle) | Even verifiable rewards can be gamed (e.g. by exploiting how the checker parses answer formats) |
The Frontier of Reasoning
Key open questions remain. Can verifiable-reward RL extend beyond math and code to open-ended reasoning where correctness is not checkable? How do we make reasoning faithful and trustworthy? How much can test-time compute scale before it plateaus? Can reasoning models reason about their own uncertainty and know when to think more? These questions are at the active frontier, and Chapter 35 (Open Problems) returns to several of them.
Reasoning Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Chain-of-thought | Reason step by step before answering | Tokens = computation |
| Zero-shot CoT | 'Let's think step by step' | Elicits latent reasoning |
| Self-consistency | Sample N paths, majority-vote | Test-time compute in disguise |
| Test-time scaling | Think longer for harder problems | New axis beyond model size |
| RLVR | RL with verifiable correctness rewards | No reward model to hack |
| GRPO for reasoning | Group of attempts, verify, reinforce | The R1 engine |
| ORM vs PRM | Final answer vs per-step rewards | PRMs give denser feedback |
| Search | Best-of-N, beam, MCTS | Spend compute strategically |
| Emergent reasoning | RL discovers reasoning unprompted | The R1 'aha moment' |
Exercises
Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.
Further reading: “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al., 2022) and “Large Language Models are Zero-Shot Reasoners” (Kojima et al., 2022). “Self-Consistency Improves Chain of Thought Reasoning” (Wang et al., 2022). “Let's Verify Step by Step” (Lightman et al., 2023) for process reward models. “Scaling LLM Test-Time Compute” (Snell et al., 2024). The DeepSeek-R1 report (2025) for GRPO-driven emergent reasoning and the 'aha moment'. The OpenAI o1 system card for the test-time-compute paradigm. “Measuring Faithfulness in Chain-of-Thought Reasoning” (Lanham et al., 2023).
Next → Chapter 26: Constitutional AI & Safety
You have now built a model that follows instructions (Chapter 22), aligns with human preferences (Chapters 23–24), and reasons through hard problems (this chapter). The final piece of Part V is making it SAFE and trustworthy. Chapter 26 covers Constitutional AI — using a written set of principles and AI feedback (RLAIF) to make a model harmless without exhaustive human labeling — along with red-teaming, jailbreak resistance, and the broader project of aligning models with human values. It closes Part V by turning a capable, helpful, reasoning assistant into one that is also honest and harmless — the full 'helpful, harmless, honest' ideal.