Reasoning & Chain-of-Thought
Detailed solutions for the exercises in Chapter 25. Try solving them yourself before checking the answers.
Solution
A single forward pass applies a FIXED amount of computation (one pass through the layers) before emitting the next token, so a problem needing many sequential steps of reasoning cannot be solved in that fixed budget if the answer must come immediately. Chain-of-thought lets the model spread the computation across many generated tokens — each token is another forward pass — effectively giving it more serial computation for harder problems. Forcing immediacy caps the usable compute per problem.
Solution
System 1 is fast, automatic, intuitive; System 2 is slow, deliberate, step-by-step (Kahneman). Direct answering is System 1 — a single intuitive leap to the answer. Chain-of-thought is System 2 — working through intermediate steps deliberately. CoT lets the model engage a 'slow thinking' mode for problems where intuition alone is unreliable, mirroring how humans switch to careful reasoning for hard problems.
Solution
(1) Computation: each reasoning token is another forward pass, so CoT grants more total compute for the problem — e.g. carrying out a long multiplication digit by digit rather than guessing the product. (2) Decomposition: breaking a problem into sub-steps makes each step easy and lets later steps condition on earlier results — e.g. a word problem solved by first extracting quantities, then forming an equation, then solving it. More compute AND easier sub-problems both raise accuracy.
Solution
CoT prompting elicits reasoning the model is already CAPABLE of (from pretraining) — it changes the inference-time process, not the weights, so no new knowledge is added. This mirrors the superficial alignment hypothesis (Chapter 22): capabilities live in the pretrained model, and prompting/alignment merely surfaces them. CoT unlocks latent reasoning ability rather than creating it.
Solution
Self-consistency samples multiple independent CoT paths and takes the majority answer. It beats greedy decoding because different paths make different errors, but correct reasoning tends to converge on the same answer, so the majority is more reliable than any single (possibly flawed) chain. It does NOT apply when there is no well-defined final answer to vote on (open-ended generation) or when all paths share the same systematic bias (they'd agree on the wrong answer).
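The voting step can be sketched in a few lines. This is a minimal illustration, assuming the final answers have already been parsed out of each sampled chain; the function name `self_consistency` and the sample data are hypothetical.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and its vote share from sampled final answers."""
    # Chains whose answer could not be parsed (None) do not get a vote.
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0.0
    answer, votes = counts.most_common(1)[0]
    return answer, votes / sum(counts.values())

# Final answers parsed from five hypothetical sampled chains.
sampled = ["42", "42", "41", "42", None]
print(self_consistency(sampled))  # ('42', 0.75)
```

The vote share doubles as a rough confidence signal: a low share suggests the chains disagree and the majority may not be trustworthy.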
Solution
Test-time scaling spends more compute at INFERENCE — longer reasoning, more samples, search — to improve a single answer, whereas train-time scaling (Chapter 16) spends compute once to make a better model. Test-time compute is a per-query dial: you can choose to think longer on a hard question and answer cheaply on an easy one, allocating compute where it helps. It trades inference cost for accuracy, controllable on the fly.
Solution
RLVR (RL from Verifiable Rewards) trains on tasks where correctness can be checked programmatically (math answers, passing unit tests), giving a reward of 1/0 from ground truth rather than from a learned reward model. Because the reward measures the actual objective (correctness) rather than an imperfect learned proxy, there is far less to 'hack': with a sound verifier, the only way to score is to be genuinely correct. This largely sidesteps the Goodhart/reward-hacking problem (Chapter 23) that plagues learned reward models, although a weak verifier (sparse unit tests, lenient answer parsing) can still be gamed.
Solution
With verifiable rewards, each sampled solution gets a clean 0/1 score. GRPO samples a GROUP of solutions per problem and uses the group's mean reward as the baseline, so a solution's advantage is simply how it compares to its peers. The group baseline replaces the learned value model, and the verifier replaces the learned reward model. This fits reasoning well: the group of attempts at one problem provides the baseline, the verifier provides the reward, and group-relative advantages reinforce the attempts that solved the problem over those that did not.
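The advantage computation can be sketched as follows. This is an illustration only, not a full GRPO implementation: the function name is hypothetical, and dividing by the group's standard deviation is an assumption taken from common GRPO formulations, not from the text above, which mentions only the mean baseline.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sampled solution relative to its own group."""
    baseline = mean(rewards)        # group mean acts as the baseline
    scale = pstdev(rewards) + eps   # assumed normalization; eps avoids division by zero
    return [(r - baseline) / scale for r in rewards]

# Four attempts at one problem, scored 1/0 by a verifier.
print(group_relative_advantages([1, 0, 0, 1]))
# A group where every attempt succeeds carries no learning signal.
print(group_relative_advantages([1, 1, 1]))
```

Correct attempts get positive advantages and incorrect ones negative. When all attempts in a group receive the same reward, every advantage is zero, so problems that are always solved or never solved contribute nothing to the update.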
Solution
An Outcome Reward Model scores only the FINAL answer; a Process Reward Model scores each intermediate STEP. PRMs help more because of credit assignment: when a long reasoning chain fails, an ORM only knows the whole thing was wrong, not WHERE; a PRM pinpoints which step went wrong, giving denser, more localized signal. This lets training (or search) reward correct intermediate steps and catch errors early, rather than only judging the end.
Solution
Beam search over reasoning keeps the top-k partial reasoning paths at each step, expanding and re-scoring them. A PRM can score each PARTIAL path step-by-step, so it guides which partial paths to keep BEFORE the answer exists — enabling step-level pruning. An ORM only scores complete answers, so it has nothing to say about partial paths and cannot guide the search until the end — which is why step-level search needs a process reward.
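A toy version of step-level search makes the role of the process score concrete. Everything here is a stand-in: `expand` plays the model proposing next steps, `score` plays the PRM, and the task (reach a target sum) is chosen only so the example runs without a model.

```python
def prm_beam_search(expand, score, is_done, beam_width=2, max_steps=5):
    """Keep the top-k partial reasoning paths at each step, ranked by `score`."""
    beams = [[]]  # each beam is a list of reasoning steps
    for _ in range(max_steps):
        candidates = []
        for path in beams:
            if is_done(path):
                candidates.append(path)  # finished paths are carried forward
            else:
                candidates.extend(path + [step] for step in expand(path))
        # The process score ranks PARTIAL paths, before any final answer exists.
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
        if all(is_done(p) for p in beams):
            break
    return beams[0]

# Toy task: choose steps from {1, 3, 5} whose running total reaches 10.
TARGET = 10
best = prm_beam_search(
    expand=lambda path: [1, 3, 5],                # hypothetical step proposals
    score=lambda path: -abs(TARGET - sum(path)),  # stand-in for a PRM
    is_done=lambda path: sum(path) == TARGET,
)
print(best)  # [5, 5]
```

An outcome-only scorer could not fill the `score` role here, because it has nothing to say about a path until `is_done` is true.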
Solution
In R1-style RLVR training, models spontaneously began producing longer reasoning, backtracking, and self-correction (the 'aha moment') without being shown any reasoning demonstrations — the behavior emerged purely from optimizing for correct answers. This is significant because it shows sophisticated reasoning can arise from a simple verifiable-reward signal alone, not requiring expensive human reasoning traces — reasoning was elicited and amplified by RL, suggesting the capability was latent and reward-shapable.
Solution
Adding 'Let's think step by step' (zero-shot CoT) or worked examples (few-shot CoT) substantially raises accuracy on multi-step arithmetic versus answering directly — demonstrating the computation/decomposition benefits of Exercise 3 in practice.
Solution
Majority-voting over N sampled chains improves accuracy over a single path, with accuracy rising as N grows and then saturating (Exercise 5). The curve quantifies the diminishing returns of more samples — a form of test-time compute scaling.
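The saturating shape can be reproduced with an idealized model. The sketch below assumes each chain is independently correct with probability p and that the vote succeeds only when a strict majority is correct (a pessimistic case in which all wrong chains agree with each other). Both assumptions are simplifications introduced here, not measurements.

```python
from math import comb

def majority_accuracy(p, n):
    """P(strict majority of n independent chains is correct), each correct w.p. p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Odd n avoids ties; accuracy climbs quickly at first, then flattens.
for n in (1, 3, 5, 11, 21):
    print(n, round(majority_accuracy(0.6, n), 3))
```

Under this model voting helps only when p > 0.5. If the chains share a systematic bias, the independence assumption fails and the real curve saturates much earlier.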
Solution
Plotting accuracy against the inference compute spent (longer reasoning or more samples) shows it rising with compute — the controllable per-query dial of Exercise 6, demonstrating that thinking longer measurably helps on reasoning tasks.
Solution
Parsing the model's final answer and comparing it to ground truth yields a clean 1/0 reward (Exercise 7). Testing it on correct, incorrect, and malformed outputs confirms that it rewards only genuinely correct solutions; this check is what makes RLVR resistant to reward hacking, and it is only as reliable as the answer parser.
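A minimal numeric verifier might look like the sketch below. The "last number in the text" extraction rule and both function names are assumptions made for illustration; a real verifier would usually require a fixed answer format so that parsing is unambiguous.

```python
import re

def extract_final_answer(text):
    """Return the last number in the text as a string, or None if there is none."""
    # Thousands separators are removed first so "1,200" parses as one number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def verifiable_reward(completion, ground_truth, tol=1e-9):
    """1 if the parsed final answer matches the ground truth numerically, else 0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0  # unparseable output earns no reward
    return int(abs(float(answer) - float(ground_truth)) <= tol)

print(verifiable_reward("3 * 4 = 12, so the answer is 12.", "12"))  # 1
print(verifiable_reward("The answer is 13", "12"))                  # 0
print(verifiable_reward("I am not sure", "12"))                     # 0
```

The last-number heuristic is exactly the kind of lenient parsing that can be exploited, for example by a completion that ends with an unrelated number, which is why the reward function itself deserves tests.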
Solution
Generating many CoT solutions, keeping only verified-correct ones, and SFT-ing on them distills correct reasoning into the model, improving its accuracy — a simple, stable alternative to RL (the reasoning analogue of best-of-N from Chapter 23).
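The data-building step of this recipe can be sketched as a filter. The sampler and checker below are canned stand-ins for a model and a verifier, and the `prompt`/`completion` record layout is an assumed format, not one prescribed by the chapter.

```python
def build_sft_dataset(problems, sample_solutions, is_correct, n_samples=4):
    """Keep only verified-correct, de-duplicated solutions as SFT examples."""
    dataset = []
    for problem in problems:
        seen = set()
        for solution in sample_solutions(problem["question"], n_samples):
            if is_correct(solution, problem["answer"]) and solution not in seen:
                seen.add(solution)
                dataset.append({"prompt": problem["question"], "completion": solution})
    return dataset

def fake_sampler(question, n):
    # Stand-in for sampling n chains of thought from a model.
    canned = ["2 + 3 = 5. Answer: 5", "2 + 3 = 6. Answer: 6", "2 + 3 = 5. Answer: 5"]
    return canned[:n]

def answer_matches(solution, answer):
    # Stand-in verifier: the solution must end with the expected answer line.
    return solution.strip().endswith(f"Answer: {answer}")

problems = [{"question": "What is 2 + 3?", "answer": "5"}]
data = build_sft_dataset(problems, fake_sampler, answer_matches, n_samples=3)
print(data)
```

Of the three canned samples, one is wrong and one is a duplicate, so a single example survives. The resulting list would then be passed to an ordinary supervised fine-tuning run.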
Solution
Running GRPO with the verifiable reward (Exercises 7–8) shows accuracy improving and, often, reasoning length growing over training as the model learns that more deliberation yields more correct answers — reproducing the emergent-reasoning trend of Exercise 11 in miniature.
Solution
Scoring each reasoning step and aggregating to re-rank candidate solutions typically beats outcome-only re-ranking, because the PRM catches solutions with flawed intermediate steps that happen to reach a plausible answer — the credit-assignment advantage of Exercise 9.
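The re-ranking step depends on how step scores are aggregated. The sketch below uses the minimum or the product of per-step scores, two common choices assumed here for illustration; the candidate data and score values are invented to show the effect.

```python
import math

def aggregate_step_scores(step_scores, how="min"):
    """Collapse per-step scores into one solution-level score."""
    if not step_scores:
        return 0.0
    if how == "min":
        return min(step_scores)        # a chain is as strong as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # treats scores as per-step correctness probabilities
    raise ValueError(f"unknown aggregation: {how}")

def rerank_by_process(candidates, how="min"):
    return sorted(
        candidates,
        key=lambda c: aggregate_step_scores(c["step_scores"], how),
        reverse=True,
    )

candidates = [
    # Plausible final answer, but one step looks wrong to the step scorer.
    {"name": "A", "outcome_score": 0.90, "step_scores": [0.90, 0.20, 0.95]},
    {"name": "B", "outcome_score": 0.80, "step_scores": [0.80, 0.85, 0.80]},
]
by_outcome = max(candidates, key=lambda c: c["outcome_score"])["name"]
by_process = rerank_by_process(candidates)[0]["name"]
print(by_outcome, by_process)  # A B
```

Outcome-only ranking prefers A because its answer looks plausible, while either process aggregation demotes it for the weak middle step.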
Solution
Best-of-N (generate N, pick the candidate with the highest ORM score) and self-consistency (majority vote) both improve over a single answer; comparing them at matched N shows which strategy wins on a given task. ORM-based selection can beat voting when the reward model is good, but voting needs no reward model at all. This is a useful empirical comparison of two test-time strategies.
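A small harness for the matched-N comparison could look like this. The sample answers and ORM scores are invented so that the two strategies disagree; in a real run they would come from the model and the reward model.

```python
from collections import Counter

def best_of_n(samples):
    """Pick the answer of the single highest-scored sample."""
    return max(samples, key=lambda s: s["score"])["answer"]

def majority_vote(samples):
    """Pick the most frequent answer, ignoring scores."""
    return Counter(s["answer"] for s in samples).most_common(1)[0][0]

def compare(problems):
    """Accuracy of both strategies on the same N samples per problem."""
    n = len(problems)
    return {
        "best_of_n": sum(best_of_n(p["samples"]) == p["truth"] for p in problems) / n,
        "majority_vote": sum(majority_vote(p["samples"]) == p["truth"] for p in problems) / n,
    }

problems = [
    {"truth": "7", "samples": [  # ORM overrates a wrong outlier; voting wins
        {"answer": "7", "score": 0.4}, {"answer": "7", "score": 0.5}, {"answer": "9", "score": 0.9}]},
    {"truth": "12", "samples": [  # most chains share an error; the ORM wins
        {"answer": "10", "score": 0.3}, {"answer": "10", "score": 0.2}, {"answer": "12", "score": 0.8}]},
]
print(compare(problems))  # {'best_of_n': 0.5, 'majority_vote': 0.5}
```

Both strategies consume the same N samples, so any accuracy gap reflects the selection rule and not the compute budget.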
Solution
Keeping the top-k partial reasoning paths scored by a PRM at each step (Exercise 10) outperforms greedy single-path CoT on multi-step tasks by exploring and pruning intelligently — demonstrating PRM-guided search, which an ORM could not provide.
Solution
Models often generate long reasoning even for easy problems (wasting tokens) with no accuracy gain. Measuring length vs accuracy across difficulty shows the waste; a stopping heuristic — e.g. stop when the model's answer stabilizes or its confidence is high — saves compute without hurting accuracy, addressing the inference-cost side of test-time scaling.
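One way to realize the "stop when the answer stabilizes" heuristic is to sample chains one at a time and stop once a single answer dominates. The thresholds and the `sample_answer` callback are hypothetical choices for this sketch, and other stopping rules (for example, confidence-based ones) would fit the same slot.

```python
from collections import Counter

def adaptive_vote(sample_answer, max_samples=16, min_samples=3, threshold=0.75):
    """Sample answers one at a time; stop early once one answer dominates."""
    counts = Counter()
    answer = None
    for n in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        answer, votes = counts.most_common(1)[0]
        if n >= min_samples and votes / n >= threshold:
            return answer, n  # the answer has stabilized; stop spending compute
    return answer, max_samples

# Easy problem: the chains agree immediately, so sampling stops at the minimum.
easy = iter(["4", "4", "4", "4", "4"])
print(adaptive_vote(lambda: next(easy)))  # ('4', 3)

# Harder problem: an early disagreement forces one extra sample.
hard = iter(["4", "5", "4", "4", "4"])
print(adaptive_vote(lambda: next(hard)))  # ('4', 4)
```

The number of samples used becomes a per-question quantity, so easy questions cost less than hard ones while the voting rule stays the same.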
Solution
Training a small instruction model with GRPO + a math verifier and watching reasoning length and accuracy grow demonstrates RL 'baking in' reasoning. Comparing the RL-trained model against pure self-consistency and best-of-N at MATCHED inference compute shows the trade-off: RL internalizes reasoning (cheaper per query at inference) while search spends compute at test time — the central reasoning-vs-search trade-off of the chapter.