Solutions Appendix
Chapter 25

Reasoning & Chain-of-Thought

22 Solutions

Detailed solutions for the exercises in Chapter 25. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Why does forcing an immediate answer struggle on multi-step problems? Connect to fixed per-pass computation.

Solution

A single forward pass applies a FIXED amount of computation (one pass through the layers) before emitting the next token, so a problem needing many sequential steps of reasoning cannot be solved in that fixed budget if the answer must come immediately. Chain-of-thought lets the model spread the computation across many generated tokens — each token is another forward pass — effectively giving it more serial computation for harder problems. Forcing immediacy caps the usable compute per problem.

Exercise 2 (Pen & Paper)
Explain the System 1 / System 2 analogy; which mode is direct answering, which is CoT?

Solution

System 1 is fast, automatic, intuitive; System 2 is slow, deliberate, step-by-step (Kahneman). Direct answering is System 1 — a single intuitive leap to the answer. Chain-of-thought is System 2 — working through intermediate steps deliberately. CoT lets the model engage a 'slow thinking' mode for problems where intuition alone is unreliable, mirroring how humans switch to careful reasoning for hard problems.

Exercise 3 (Pen & Paper)
Two reasons CoT improves accuracy (computation and decomposition); illustrate each.

Solution

(1) Computation: each reasoning token is another forward pass, so CoT grants more total compute for the problem — e.g. carrying out a long multiplication digit by digit rather than guessing the product. (2) Decomposition: breaking a problem into sub-steps makes each step easy and lets later steps condition on earlier results — e.g. a word problem solved by first extracting quantities, then forming an equation, then solving it. More compute AND easier sub-problems both raise accuracy.

Exercise 4 (Pen & Paper)
Why does CoT teach the model nothing new? Connect to the superficial alignment hypothesis.

Solution

CoT prompting elicits reasoning the model is already CAPABLE of (from pretraining) — it changes the inference-time process, not the weights, so no new knowledge is added. This mirrors the superficial alignment hypothesis (Chapter 22): capabilities live in the pretrained model, and prompting/alignment merely surfaces them. CoT unlocks latent reasoning ability rather than creating it.

Exercise 5 (Pen & Paper)
Explain self-consistency; why does majority voting beat a single greedy path? When does it NOT apply?

Solution

Self-consistency samples multiple independent CoT paths and takes the majority answer. It beats greedy decoding because different paths make different errors, but correct reasoning tends to converge on the same answer, so the majority is more reliable than any single (possibly flawed) chain. It does NOT apply when there is no well-defined final answer to vote on (open-ended generation) or when all paths share the same systematic bias (they'd agree on the wrong answer).
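
The voting step itself is small. A minimal sketch, assuming each sampled chain has already been parsed into a final-answer string (or None when parsing failed); the function name and the sample values are illustrative only:

```python
from collections import Counter
from typing import List, Optional

def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most common parsed answer, ignoring unparseable (None) samples."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None  # nothing to vote on
    # most_common breaks ties in favor of the answer seen first.
    return Counter(valid).most_common(1)[0][0]

# Illustrative parsed answers from six sampled chains (made up for the example).
sampled = ["42", "41", "42", None, "42", "17"]
print(majority_vote(sampled))  # 42
```

Note that the vote only works because the answers are directly comparable strings, which is exactly the condition that fails for open-ended generation.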

Exercise 6 (Pen & Paper)
Describe test-time compute scaling; contrast with train-time scaling; why a per-query dial?

Solution

Test-time scaling spends more compute at INFERENCE — longer reasoning, more samples, search — to improve a single answer, whereas train-time scaling (Chapter 16) spends compute once to make a better model. Test-time compute is a per-query dial: you can choose to think longer on a hard question and answer cheaply on an easy one, allocating compute where it helps. It trades inference cost for accuracy, controllable on the fly.

Exercise 7 (Pen & Paper)
Explain RLVR; why does a verifiable reward avoid RLHF's reward hacking?

Solution

RLVR (RL from Verifiable Rewards) trains on tasks where correctness can be checked programmatically (math answers, passing unit tests), giving a reward of 1/0 from ground truth rather than from a learned reward model. Because the reward is computed from the actual objective (correctness) rather than an imperfect learned proxy, there is far less to 'hack': as long as the verifier is sound, the only way to score is to be genuinely correct. This largely sidesteps the Goodhart/reward-hacking problem (Chapter 23) that plagues learned reward models, though a weak checker (loose answer parsing, thin unit tests) can still be gamed.

Exercise 8 (Pen & Paper)
Why is GRPO well-suited to reasoning with verifiable rewards? What does the group provide?

Solution

With verifiable rewards, each sampled solution gets a clean 0/1 score. GRPO samples a GROUP of solutions per problem and uses the group's mean reward as the baseline (typically also dividing by the group's standard deviation), so a solution's advantage is simply how it compares to its peers. No learned value model is needed, and because the verifier supplies the reward, no learned reward model is needed either. This fits reasoning well: the group of attempts at one problem provides the baseline, the verifier provides the reward, and group-relative advantages reward the solutions that solved it among many attempts.
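
The group-relative advantage can be sketched in a few lines. This is an illustration under stated assumptions, not a full GRPO implementation: the function name is hypothetical, the rewards are made-up 0/1 verifier outputs, and dividing by the group's standard deviation is a common normalization assumed here.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sample relative to its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight attempts at one problem; two were verified correct (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
for r, a in zip(rewards, group_relative_advantages(rewards)):
    print(r, round(a, 3))
```

When every attempt in a group gets the same reward (all correct or all wrong), every advantage is zero and that problem contributes no learning signal, which is one reason problem difficulty matters in practice.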

Exercise 9 (Pen & Paper)
Compare ORM and PRM; the credit-assignment argument for PRMs.

Solution

An Outcome Reward Model scores only the FINAL answer; a Process Reward Model scores each intermediate STEP. PRMs help more because of credit assignment: when a long reasoning chain fails, an ORM only knows the whole thing was wrong, not WHERE; a PRM pinpoints which step went wrong, giving denser, more localized signal. This lets training (or search) reward correct intermediate steps and catch errors early, rather than only judging the end.

Exercise 10 (Pen & Paper)
Describe one search method over reasoning paths; how does a PRM guide step-level search, why can't an ORM?

Solution

Beam search over reasoning keeps the top-k partial reasoning paths at each step, expanding and re-scoring them. A PRM can score each PARTIAL path step-by-step, so it guides which partial paths to keep BEFORE the answer exists — enabling step-level pruning. An ORM only scores complete answers, so it has nothing to say about partial paths and cannot guide the search until the end — which is why step-level search needs a process reward.

Exercise 11 (Pen & Paper)
Explain the 'aha moment' / emergent reasoning in R1-style training; why is it significant?

Solution

In R1-style RLVR training, models spontaneously began producing longer reasoning, backtracking, and self-correction (the 'aha moment') without being shown any reasoning demonstrations — the behavior emerged purely from optimizing for correct answers. This is significant because it shows sophisticated reasoning can arise from a simple verifiable-reward signal alone, not requiring expensive human reasoning traces — reasoning was elicited and amplified by RL, suggesting the capability was latent and reward-shapable.

Exercise 12 (Code)
Implement zero-shot and few-shot CoT on arithmetic; compare accuracy with/without the reasoning prompt.

Solution

Adding 'Let's think step by step' (zero-shot CoT) or worked examples (few-shot CoT) substantially raises accuracy on multi-step arithmetic versus answering directly — demonstrating the computation/decomposition benefits of Exercise 3 in practice.
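
The three prompting conditions differ only in how the prompt string is built. A minimal sketch, assuming a plain "Q:/A:" text format; the function names, the format, and the worked example are illustrative choices, and the model call itself is left out:

```python
from typing import List, Tuple

COT_TRIGGER = "Let's think step by step."

def direct_prompt(question: str) -> str:
    """Baseline: ask for the answer immediately."""
    return f"Q: {question}\nA: The answer is"

def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: end the prompt with the reasoning trigger phrase."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def few_shot_cot_prompt(question: str, examples: List[Tuple[str, str]]) -> str:
    """Few-shot CoT: prepend worked (question, reasoning) examples."""
    shots = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in examples)
    return f"{shots}\n\nQ: {question}\nA:"

# One hand-written worked example (illustrative).
examples = [("What is 12 + 7 * 2?", "7 * 2 = 14. 12 + 14 = 26. The answer is 26.")]
print(few_shot_cot_prompt("What is 3 + 4 * 5?", examples))
```

Running the same question set through all three builders and scoring the parsed answers gives the with/without comparison the exercise asks for.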

Exercise 13 (Code)
Implement self-consistency: sample N CoT paths, parse answers, majority-vote; plot accuracy vs N.

Solution

Majority-voting over N sampled chains improves accuracy over a single path, with accuracy rising as N grows and then saturating (Exercise 5). The curve quantifies the diminishing returns of more samples — a form of test-time compute scaling.
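
The shape of the curve can be previewed analytically before running any model. The sketch below rests on strong simplifying assumptions that real chains do not satisfy exactly: samples are independent, each is correct with the same probability p, and there are only two candidate answers, so the vote is a binomial majority. The function name and p = 0.6 are illustrative.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """P(more than half of n independent samples are correct), for odd n."""
    if n % 2 == 0:
        raise ValueError("use odd n so the majority is never tied")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Accuracy rises with N and flattens out as it approaches 1.
for n in (1, 3, 5, 11, 21, 41):
    print(n, round(majority_accuracy(0.6, n), 3))
```

Under these assumptions voting only helps when p > 0.5; with p < 0.5 the same formula shows the majority getting worse as N grows, which is the systematic-bias failure case from Exercise 5.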

Exercise 14 (Code)
Build a test-time scaling curve: accuracy vs reasoning length / sample count.

Solution

Plotting accuracy against the inference compute spent (longer reasoning or more samples) typically shows accuracy rising with compute, with diminishing returns at the high end. This is the controllable per-query dial of Exercise 6, and it demonstrates that thinking longer measurably helps on reasoning tasks.

Exercise 15 (Code)
Implement a verifiable arithmetic reward (parse + check); test on generated solutions.

Solution

Parsing the model's final answer and comparing it to ground truth yields a clean 1/0 reward (Exercise 7). Testing it confirms that it rewards only genuinely correct solutions, which is what makes RLVR resistant to reward hacking.
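
One possible parse-and-check reward, as a sketch. Treating the last number in the text as the final answer is an assumed convention (a stricter format such as a required "The answer is N" line would be harder to game), and the function names are illustrative:

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse_final_answer(text: str) -> Optional[float]:
    """Take the last number in the text as the final answer (assumed convention)."""
    matches = _NUMBER.findall(text.replace(",", ""))  # drop thousands separators
    return float(matches[-1]) if matches else None

def arithmetic_reward(solution: str, ground_truth: float, tol: float = 1e-6) -> int:
    """Return 1 if the parsed answer matches ground truth within tol, else 0."""
    answer = parse_final_answer(solution)
    if answer is None:
        return 0  # unparseable output earns no reward
    return int(abs(answer - ground_truth) <= tol)

print(arithmetic_reward("4 * 5 = 20. 3 + 20 = 23. The answer is 23.", 23))  # 1
print(arithmetic_reward("3 + 4 = 7. 7 * 5 = 35. The answer is 35.", 23))    # 0
print(arithmetic_reward("I am not sure.", 23))                              # 0
```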

Exercise 16 (Code Lab)
Build a rejection-sampling reasoning dataset (keep verified-correct CoT); SFT; show improvement.

Solution

Generating many CoT solutions, keeping only verified-correct ones, and SFT-ing on them distills correct reasoning into the model, improving its accuracy — a simple, stable alternative to RL (the reasoning analogue of best-of-N from Chapter 23).
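
The filtering stage of that pipeline might look like the sketch below. The sampler and verifier are passed in as callables; here they are stand-ins (canned strings and a suffix check) so the example runs without a model, and all names and record fields are hypothetical.

```python
from typing import Callable, Dict, List

def build_rejection_dataset(
    problems: List[Dict],
    sample_fn: Callable[[str], str],
    verify_fn: Callable[[str, int], bool],
    n_samples: int = 8,
) -> List[Dict[str, str]]:
    """Sample n_samples solutions per problem and keep only the verified-correct ones."""
    dataset = []
    for problem in problems:
        for _ in range(n_samples):
            solution = sample_fn(problem["question"])
            if verify_fn(solution, problem["answer"]):
                dataset.append({"prompt": problem["question"], "completion": solution})
    return dataset

# Stand-ins for a real model and verifier (illustrative only).
canned = iter([
    "2 + 3 = 5. The answer is 5",
    "2 + 3 = 6. The answer is 6",
    "The answer is 5",
])
fake_sampler = lambda question: next(canned)
verify = lambda solution, answer: solution.strip().endswith(f"The answer is {answer}")

data = build_rejection_dataset(
    [{"question": "What is 2 + 3?", "answer": 5}], fake_sampler, verify, n_samples=3
)
print(len(data))  # 2
```

The resulting prompt/completion pairs are what the SFT step would train on; deduplicating near-identical chains before training is a reasonable refinement not shown here.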

Exercise 17 (Code Lab)
Implement RLVR with GRPO: sample a group, verify, group-relative advantages, update; track accuracy and length.

Solution

Running GRPO with the verifiable reward (Exercises 7–8) shows accuracy improving and, often, reasoning length growing over training as the model learns that more deliberation yields more correct answers — reproducing the emergent-reasoning trend of Exercise 11 in miniature.

Exercise 18 (Code)
Implement a PRM interface (score each step valid/invalid); re-rank candidates; compare to outcome-only.

Solution

Scoring each reasoning step and aggregating to re-rank candidate solutions typically beats outcome-only re-ranking, because the PRM catches solutions with flawed intermediate steps that happen to reach a plausible answer — the credit-assignment advantage of Exercise 9.
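
A sketch of step-level re-ranking with a mock PRM. The mock simply checks integer arithmetic of the form "a op b = c" and stands in for a learned step scorer; aggregating with min (a chain is only as good as its weakest step) is one common choice and is an assumption here, as are all names.

```python
import re
from typing import Callable, List

_STEP = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")

def mock_step_scorer(step: str) -> float:
    """Mock PRM: 1.0 if the arithmetic in the step is right, 0.0 if wrong, 0.5 if unparseable."""
    m = _STEP.match(step)
    if not m:
        return 0.5
    a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return 1.0 if result == c else 0.0

def prm_score(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Aggregate per-step scores with min."""
    return min(step_scorer(s) for s in steps)

def rerank(candidates: List[List[str]], step_scorer: Callable[[str], float]) -> List[List[str]]:
    """Sort candidate chains from best to worst aggregated step score."""
    return sorted(candidates, key=lambda steps: prm_score(steps, step_scorer), reverse=True)

# Both chains end in 23 for "3 + 4 * 5", but the first gets there via two cancelling errors.
lucky = ["4 * 5 = 21", "3 + 21 = 23"]
sound = ["4 * 5 = 20", "3 + 20 = 23"]
print(rerank([lucky, sound], mock_step_scorer)[0])  # ['4 * 5 = 20', '3 + 20 = 23']
```

An outcome-only check sees the same final answer in both chains and cannot separate them; the step-level score can.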

Exercise 19 (Code)
Best-of-N with an ORM vs self-consistency at the same N; compare accuracy.

Solution

Best-of-N (generate N, pick the candidate with the highest ORM score) and self-consistency (majority vote) both improve over a single answer. Comparing them at matched N shows which selection rule works better on the task: ORM-based selection can beat voting when the reward model is good, while voting needs no extra model. A useful empirical comparison of two test-time strategies.
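
The two selection rules can be applied to the same N candidates, as in this sketch. Candidates are assumed to be (solution text, parsed answer) pairs; the ORM is mocked by a lookup table of made-up scores, and the function names are illustrative.

```python
from collections import Counter
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (solution text, parsed final answer)

def best_of_n(candidates: List[Candidate], orm_score: Callable[[str], float]) -> str:
    """Return the answer of the single highest-scoring candidate."""
    return max(candidates, key=lambda c: orm_score(c[0]))[1]

def self_consistency(candidates: List[Candidate]) -> str:
    """Return the most frequent answer, ignoring scores entirely."""
    return Counter(answer for _, answer in candidates).most_common(1)[0][0]

candidates = [
    ("chain A", "23"), ("chain B", "35"), ("chain C", "35"),
    ("chain D", "23"), ("chain E", "35"),
]
# Mock ORM scores, invented for illustration.
mock_orm = {"chain A": 0.9, "chain B": 0.4, "chain C": 0.3, "chain D": 0.7, "chain E": 0.5}.get

print(best_of_n(candidates, mock_orm))  # 23
print(self_consistency(candidates))     # 35
```

The two rules disagree here: voting follows the most frequent answer, while best-of-N trusts the scorer. Which one is right on real data depends on how well the reward model's scores track correctness.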

Exercise 20 (Code)
Step-level beam search with a (mock) PRM; compare to greedy CoT.

Solution

Keeping the top-k partial reasoning paths scored by a PRM at each step (Exercise 10) outperforms greedy single-path CoT on multi-step tasks by exploring and pruning intelligently — demonstrating PRM-guided search, which an ORM could not provide.
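
A toy version of step-level beam search, with greedy CoT as the beam_width=1 special case. The step generator and the PRM are both mocks: steps are the letters "a" and "b", and the PRM is a lookup table of invented scores chosen so that the best first step looks slightly worse than its rival. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]

def step_beam_search(
    expand: Callable[[Path], List[str]],
    prm_score: Callable[[Path], float],
    beam_width: int,
    max_steps: int,
) -> Path:
    """Keep the beam_width best partial paths (by PRM score) after each step."""
    beams: List[Path] = [()]
    for _ in range(max_steps):
        candidates = [path + (step,) for path in beams for step in expand(path)]
        if not candidates:
            break
        candidates.sort(key=prm_score, reverse=True)
        beams = candidates[:beam_width]  # prune before any final answer exists
    return beams[0]

# Mock PRM scores for partial paths (invented for illustration).
SCORES: Dict[Path, float] = {
    ("a",): 0.6, ("b",): 0.5,
    ("a", "a"): 0.1, ("a", "b"): 0.2,
    ("b", "a"): 0.9, ("b", "b"): 0.3,
}
mock_prm = lambda path: SCORES.get(path, 0.0)
expand = lambda path: ["a", "b"]

print(step_beam_search(expand, mock_prm, beam_width=1, max_steps=2))  # ('a', 'b')
print(step_beam_search(expand, mock_prm, beam_width=2, max_steps=2))  # ('b', 'a')
```

Greedy commits to "a" after the first step and can never recover; a beam of two keeps "b" alive long enough to find the higher-scoring path.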

Exercise 21 (Code)
Measure overthinking: reasoning length and accuracy on easy vs hard problems; propose a stopping heuristic.

Solution

Models often generate long reasoning even for easy problems (wasting tokens) with no accuracy gain. Measuring length vs accuracy across difficulty shows the waste; a stopping heuristic — e.g. stop when the model's answer stabilizes or its confidence is high — saves compute without hurting accuracy, addressing the inference-cost side of test-time scaling.
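
The "stop when the answer stabilizes" heuristic can be sketched as follows. It assumes the model can be probed for its current answer after each reasoning chunk, which is an assumption about the setup rather than something any particular API provides; the function name, the patience value, and the answer sequences are illustrative.

```python
from typing import Iterable, Optional, Tuple

def stop_when_stable(
    answers: Iterable[Optional[str]], patience: int = 3
) -> Tuple[Optional[str], int]:
    """Return (answer, chunks_used); stop once the same answer repeats `patience` times in a row."""
    last: Optional[str] = None
    streak = 0
    used = 0
    for used, answer in enumerate(answers, start=1):
        if answer is not None and answer == last:
            streak += 1
        else:
            # New answer (or no answer yet): restart the streak.
            last = answer
            streak = 1 if answer is not None else 0
        if streak >= patience:
            break
    return last, used

# Illustrative probes: an easy problem settles at once, a hard one wanders first.
easy = ["8", "8", "8", "8", "8", "8"]
hard = [None, "12", "15", "15", "14", "14", "14", "14"]
print(stop_when_stable(easy))  # ('8', 3)
print(stop_when_stable(hard))  # ('14', 7)
```

The easy problem stops after three chunks instead of six, while the hard one is allowed to keep reasoning until its answer settles. If the answer never stabilizes, the function returns the last answer seen and the full chunk count.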

Exercise 22 (Code, Challenge)
Full mini reasoning-model pipeline: verifier + GRPO RLVR; observe length/accuracy growth; compare to self-consistency and best-of-N at matched compute.

Solution

Training a small instruction model with GRPO + a math verifier and watching reasoning length and accuracy grow demonstrates RL 'baking in' reasoning. Comparing the RL-trained model against pure self-consistency and best-of-N at MATCHED inference compute shows the trade-off: RL internalizes reasoning (cheaper per query at inference) while search spends compute at test time — the central reasoning-vs-search trade-off of the chapter.