Solutions Appendix
Chapter 2

Calculus & Optimization

16 Solutions

Detailed solutions for the exercises in Chapter 2. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Find the derivative of f(x)=x³−5x²+3x+7. At which x is it zero?

Solution

f'(x) = 3x² − 10x + 3. Setting it to zero: x = (10 ± √(100−36))/6 = (10 ± 8)/6, giving x = 3 and x = 1/3. These are the critical points. The second derivative f''(x) = 6x − 10 classifies them: f''(1/3) = −8 < 0 gives a local max at x = 1/3, and f''(3) = 8 > 0 gives a local min at x = 3.
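
The result can be double-checked numerically. The following is a minimal sketch in plain Python; the helper names `f_prime` and `f_double_prime` are illustrative and not part of the chapter's code.

```python
import math

def f_prime(x):
    # Derivative of f(x) = x^3 - 5x^2 + 3x + 7
    return 3 * x**2 - 10 * x + 3

def f_double_prime(x):
    return 6 * x - 10

# Roots of 3x^2 - 10x + 3 = 0 via the quadratic formula.
a, b, c = 3.0, -10.0, 3.0
disc = b * b - 4 * a * c
roots = sorted([(-b - math.sqrt(disc)) / (2 * a),
                (-b + math.sqrt(disc)) / (2 * a)])

for x in roots:
    kind = "local max" if f_double_prime(x) < 0 else "local min"
    print(f"x = {x:.4f}: f'(x) = {f_prime(x):.1e}, f''(x) = {f_double_prime(x):+.1f} -> {kind}")
```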

Exercise 2 (Pen & Paper)
Differentiate g(x)=e^(sin(x²)); show all chain-rule steps.

Solution

Three nested functions: outer e^u with u=sin(v), inner v=x². By the chain rule g'(x) = e^{sin(x²)} · cos(x²) · 2x. Each factor is the derivative of one layer: d(e^u)/du = e^u, d(sin v)/dv = cos v, d(x²)/dx = 2x — exactly how backprop chains local derivatives.
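
A central-difference check is a quick way to confirm the three-factor product. This is a sketch only; the test point x0 = 0.7 and the step h are arbitrary choices.

```python
import math

def g(x):
    return math.exp(math.sin(x**2))

def g_prime(x):
    # outer layer * middle layer * inner layer
    return math.exp(math.sin(x**2)) * math.cos(x**2) * (2 * x)

def central_diff(fn, x, h=1e-6):
    return (fn(x + h) - fn(x - h)) / (2 * h)

x0 = 0.7
analytic = g_prime(x0)
numeric = central_diff(g, x0)
print(analytic, numeric)
```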

Exercise 3 (Pen & Paper)
For f(x,y)=x²y+y³, find ∂f/∂x, ∂f/∂y, and the gradient at (2,1).

Solution

∂f/∂x = 2xy and ∂f/∂y = x² + 3y². At (2,1): ∂f/∂x = 4, ∂f/∂y = 4 + 3 = 7, so ∇f(2,1) = [4, 7]. The gradient points in the direction of steepest increase of f at that point.
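
The same gradient can be confirmed with finite differences. The sketch below uses plain Python; `numeric_grad` is a hypothetical helper written for this check.

```python
def f(x, y):
    return x**2 * y + y**3

def grad_f(x, y):
    # Analytic partial derivatives (df/dx, df/dy)
    return (2 * x * y, x**2 + 3 * y**2)

def numeric_grad(fn, x, y, h=1e-6):
    dfdx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    dfdy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

analytic = grad_f(2.0, 1.0)
numeric = numeric_grad(f, 2.0, 1.0)
print(analytic, numeric)
```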

Exercise 4 (Pen & Paper)
Apply the chain rule to z=log(sigmoid(wx+b)); find ∂z/∂w.

Solution

Let s = σ(wx+b). Then dz/ds = 1/s, ds/d(wx+b) = s(1−s), and d(wx+b)/dw = x. Multiplying: ∂z/∂w = (1/s)·s(1−s)·x = (1−σ(wx+b))·x. This is exactly the gradient of the log-likelihood term for a positive example (label y = 1) in logistic regression.
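
A short numerical check of the closed form, as a sketch. The values of w, x and b are arbitrary test inputs.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def z(w, x, b):
    return math.log(sigmoid(w * x + b))

def dz_dw(w, x, b):
    # Closed form from the chain rule: (1 - sigmoid(wx + b)) * x
    return (1.0 - sigmoid(w * x + b)) * x

w, x, b = 0.5, -1.3, 0.2
h = 1e-6
numeric = (z(w + h, x, b) - z(w - h, x, b)) / (2 * h)
analytic = dz_dw(w, x, b)
print(analytic, numeric)
```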

Exercise 5 (Pen & Paper)
Derive the backward pass for f(a,b)=(a+b)×(a−b).

Solution

Algebraically f = a² − b², so ∂f/∂a = 2a and ∂f/∂b = −2b. Via the graph (p=a+b, q=a−b, f=pq): ∂f/∂p = q, ∂f/∂q = p; then ∂f/∂a = q·1 + p·1 = p+q = 2a, and ∂f/∂b = q·1 + p·(−1) = q−p = −2b. The two routes agree — the local gradients accumulate through both paths from a (and b).
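
The graph route can be written out step by step. This is an illustrative sketch: the function name `diff_of_squares_graph` is hypothetical, and the variable names mirror the p, q notation of the solution.

```python
def diff_of_squares_graph(a, b):
    # Forward pass
    p = a + b
    q = a - b
    f = p * q
    # Backward pass: local gradients of the product node
    df_dp = q
    df_dq = p
    # a and b each reach f through two paths, so contributions are summed
    df_da = df_dp * 1.0 + df_dq * 1.0
    df_db = df_dp * 1.0 + df_dq * (-1.0)
    return f, df_da, df_db

f, df_da, df_db = diff_of_squares_graph(3.0, 2.0)
print(f, df_da, df_db)
```

For a = 3 and b = 2 the graph result can be compared against 2a and −2b directly.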

Exercise 6 (Pen & Paper)
Show the Adam update is invariant to rescaling gradients (g → cg). Why useful?

Solution

Adam tracks exponential moving averages m of g and v of g². Scaling g by c > 0 scales m by c and v by c². The update η·m̂/(√v̂+ε) then has c in the numerator and √(c²) = c in the denominator, which cancel when ε is negligible next to √v̂. So the step is invariant to the overall gradient magnitude (exactly for ε = 0, approximately otherwise). This makes Adam robust to loss scaling and to per-layer gradient-scale differences: the step size stays sensible regardless of raw gradient magnitude.
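
The cancellation is easy to observe on a scalar gradient sequence. The sketch below sets ε = 0 so that the invariance is exact up to rounding; `adam_updates` and the gradient values are made up for illustration.

```python
def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the Adam step taken at each iteration for a scalar parameter."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        updates.append(lr * m_hat / (v_hat**0.5 + eps))
    return updates

grads = [0.3, -1.2, 0.8, 0.05, -0.4]
base = adam_updates(grads)
scaled = adam_updates([1000.0 * g for g in grads])  # g -> c*g with c = 1000
max_diff = max(abs(u - s) for u, s in zip(base, scaled))
print(max_diff)
```

With a nonzero ε the two sequences differ slightly, and the difference grows when the gradients are tiny compared with ε.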

Exercise 7 (Pen & Paper)
For L = (Wx−y)²/N (linear MSE), derive ∂L/∂W analytically.

Solution

Let r = Wx − y (the residual). Then L = (1/N)‖r‖² and ∂L/∂W = (2/N)·r·xᵀ (an outer product matching W's shape). The gradient is the residual scaled by the input — the workhorse of least-squares and the linear layer's backward pass.
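
The outer-product formula can be checked entry by entry with NumPy. This is a sketch under stated assumptions: a single input vector x, random test data, and N taken to be the number of output components.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = rng.normal(size=2)
N = y.size  # assumption: N counts the output components

def loss(W_):
    r_ = W_ @ x - y
    return (r_ @ r_) / N

r = W @ x - y
analytic = (2.0 / N) * np.outer(r, x)  # same shape as W

# Central differences, one entry of W at a time
numeric = np.zeros_like(W)
h = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = h
        numeric[i, j] = (loss(W + E) - loss(W - E)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```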

Exercise 8 (Pen & Paper)
Why is gradient vanishing worse for sigmoid than ReLU? Use the derivative values.

Solution

The sigmoid derivative is σ(1−σ), which peaks at 0.25 (at input 0) and is smaller everywhere else. Backprop multiplies one such factor per layer, so after L layers the activation derivatives alone scale the gradient by at most 0.25^L, which vanishes exponentially with depth (for L = 10 that is already below 10⁻⁶). The ReLU derivative is exactly 1 for positive inputs, so gradients pass through active units undiminished; the cost is 'dead' units, whose derivative is 0 when the input is negative. This is why ReLU largely cured the vanishing-gradient problem that plagued deep sigmoid networks.
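
To put numbers on the bound, the sketch below compares the best-case sigmoid factor with the ReLU factor at a few depths. It ignores the weight matrices, which also scale the gradient in a real network.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_grad(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def relu_grad(t):
    return 1.0 if t > 0 else 0.0

peak = sigmoid_grad(0.0)  # largest value the sigmoid derivative can take
for depth in (5, 10, 20):
    # Best case for sigmoid (every unit at its peak) vs. an active ReLU path
    print(depth, peak**depth, relu_grad(1.0)**depth)
```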

Exercise 9 (Pen & Paper)
In AdamW, weight decay multiplies weights by (1−ηλ) each step. After T steps with no gradient, what does the weight converge to?

Solution

With no gradient signal, the weight follows w_T = w_0·(1−ηλ)^T. Since 0 < 1−ηλ < 1 for reasonable ηλ, this is geometric decay toward 0. Decoupled weight decay thus pulls unused weights smoothly toward zero, independent of the adaptive gradient scaling — the property that distinguishes AdamW from L2 inside Adam.
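
A small simulation of the decay-only update agrees with the closed form. This is a sketch; the values of w0, lr, weight_decay and T are arbitrary.

```python
def decay_only(w0, lr, weight_decay, steps):
    """Apply the decoupled weight-decay factor with zero gradient."""
    w = w0
    for _ in range(steps):
        w *= (1.0 - lr * weight_decay)
    return w

w0, lr, weight_decay, T = 2.0, 1e-3, 0.1, 10_000
simulated = decay_only(w0, lr, weight_decay, T)
closed_form = w0 * (1.0 - lr * weight_decay) ** T
print(simulated, closed_form)
```

Because lr·weight_decay·T = 1 here, the weight has shrunk by roughly a factor of e after T steps.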

Exercise 10 (Pen & Paper)
Draw the graph for L=(a×b+c)² and write the backward pass.

Solution

Nodes: u = a·b, v = u + c, L = v². Backward: ∂L/∂v = 2v; ∂v/∂u = 1 and ∂v/∂c = 1; ∂u/∂a = b and ∂u/∂b = a. Chaining: ∂L/∂a = 2v·b, ∂L/∂b = 2v·a, ∂L/∂c = 2v, where v = ab+c. Each gradient is the upstream gradient times the local derivative — the essence of reverse-mode autodiff.
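
Written as code, the forward and backward passes look like the sketch below. The function name `squared_affine_graph` is hypothetical; the node names follow the solution.

```python
def squared_affine_graph(a, b, c):
    # Forward pass through the three nodes
    u = a * b
    v = u + c
    L = v * v
    # Backward pass: upstream gradient times local derivative at each node
    dL_dv = 2.0 * v
    dL_du = dL_dv * 1.0
    dL_dc = dL_dv * 1.0
    dL_da = dL_du * b
    dL_db = dL_du * a
    return L, (dL_da, dL_db, dL_dc)

L, grads = squared_affine_graph(2.0, 3.0, -1.0)
print(L, grads)
```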

Exercise 11 (Code)
Numerically verify ∂(softmax)/∂x_i against the analytic Jacobian J_ij = s_i(δ_ij − s_j).

Solution

Compute softmax s, build the analytic Jacobian J = diag(s) − s sᵀ, and compare to a finite-difference estimate (perturb each x_i by ε and measure the change in each output). They agree to ~1e−6 in float64.

s = softmax(x)
J = np.diag(s) - np.outer(s, s)
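
A fuller version of the check might look like the following sketch. The `softmax` helper (with the usual max-subtraction for stability), the test vector and the step h are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # shift for numerical stability
    return z / z.sum()

def analytic_jacobian(x):
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)  # J[i, j] = s_i * (delta_ij - s_j)

def numeric_jacobian(x, h=1e-6):
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        # Column j holds the change in every output when x_j is perturbed
        J[:, j] = (softmax(x + e) - softmax(x - e)) / (2 * h)
    return J

x = np.array([1.0, -0.5, 2.0, 0.3])
J_analytic = analytic_jacobian(x)
max_err = np.max(np.abs(J_analytic - numeric_jacobian(x)))
print(max_err)
```

Each column of the Jacobian sums to zero because the softmax outputs always sum to 1, which is a useful extra sanity check.
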
Exercise 12 (Code Lab 2.1)
Extend the scalar autograd engine with __sub__, __pow__, log(), tanh(); verify with gradient_check.

Solution

Each op defines a forward value and a local-gradient closure: subtraction passes +1/−1 to its inputs; for x^p the local grad is p·x^{p−1}; for log(x) it is 1/x; for tanh(x) it is 1−tanh²(x). gradient_check (finite differences) should match analytic grads to ~1e−5, confirming each backward is correct.
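
The sketch below shows one way these ops could look. The `Value` class is a hypothetical micrograd-style stand-in, not the chapter's actual engine, and `gradient_check` here is an assumed central-difference helper; `__pow__` supports constant exponents only.

```python
import math

class Value:
    """Minimal scalar autograd node (hypothetical stand-in for the chapter's engine)."""
    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data - other.data, (self, other))
        def _backward():
            self.grad += out.grad   # d(a - b)/da = +1
            other.grad -= out.grad  # d(a - b)/db = -1
        out._backward = _backward
        return out

    def __pow__(self, p):
        out = Value(self.data ** p, (self,))
        def _backward():
            self.grad += p * self.data ** (p - 1) * out.grad
        out._backward = _backward
        return out

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _backward():
            self.grad += out.grad / self.data
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    build(parent)
                topo.append(node)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):  # output first, inputs last
            node._backward()

def gradient_check(fn, inputs, h=1e-6):
    """Largest |analytic - central difference| over all inputs of fn."""
    nodes = [Value(x) for x in inputs]
    fn(*nodes).backward()
    worst = 0.0
    for i, node in enumerate(nodes):
        hi = [x + h if j == i else x for j, x in enumerate(inputs)]
        lo = [x - h if j == i else x for j, x in enumerate(inputs)]
        numeric = (fn(*map(Value, hi)).data - fn(*map(Value, lo)).data) / (2 * h)
        worst = max(worst, abs(node.grad - numeric))
    return worst

err = gradient_check(lambda a, b: ((a - b) ** 3).tanh() + (a * b).log(), [1.5, 0.8])
print(err)
```

The test expression uses every new op at least once and reuses both inputs, so it also exercises gradient accumulation across paths.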

Exercise 13 (Code)
Train a 2-layer MLP on XOR (4 samples) with your autograd engine; report final loss and weights.

Solution

A 2–2–1 network with a nonlinearity (tanh/ReLU) and MSE or BCE loss solves XOR; linear models cannot. Trained to convergence the loss approaches ~0 and the hidden layer learns to carve the input plane so the two XOR-positive points become linearly separable in hidden space.
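
To see why two hidden units suffice, it helps to look at one exact solution. The weights below are hand-picked for illustration and are not the result of training; a trained network will generally land on different values.

```python
def relu(t):
    return max(0.0, t)

def xor_net(x1, x2):
    h1 = relu(x1 + x2)        # number of active inputs
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are 1
    return h1 - 2.0 * h2      # linear readout in hidden space

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
outputs = [xor_net(*x) for x, _ in data]
mse = sum((xor_net(*x) - y) ** 2 for x, y in data) / len(data)
print(outputs, mse)
```

In hidden space the two XOR-positive inputs both map to (h1, h2) = (1, 0), while the negatives map to (0, 0) and (2, 1), so a single line separates the classes.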

Exercise 14 (Code Lab 2.2)
Apply gradient_check to a custom cross-entropy loss; confirm it matches F.cross_entropy.

Solution

Implement CE = −log(softmax(logits)[target]); its analytic gradient w.r.t. logits is p − y_onehot. Finite-difference checking confirms the gradient, and the loss value matches PyTorch's F.cross_entropy (which fuses log-softmax and NLL for stability).
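
The gradient part of the check can be sketched in NumPy as follows. The function names and test logits are illustrative; the comparison against F.cross_entropy itself requires PyTorch and is left out of this sketch.

```python
import numpy as np

def cross_entropy(logits, target):
    shifted = logits - np.max(logits)  # stable log-softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

def cross_entropy_grad(logits, target):
    shifted = logits - np.max(logits)
    p = np.exp(shifted) / np.sum(np.exp(shifted))
    onehot = np.zeros_like(logits)
    onehot[target] = 1.0
    return p - onehot

logits = np.array([2.0, -1.0, 0.5])
target = 2
h = 1e-6
numeric = np.array([
    (cross_entropy(logits + h * e, target) - cross_entropy(logits - h * e, target)) / (2 * h)
    for e in np.eye(logits.size)
])
analytic = cross_entropy_grad(logits, target)
print(analytic, numeric)
```

The gradient entries sum to zero, since both p and the one-hot vector sum to 1.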

Exercise 15 (Code)
Plot ‖g‖₂ over 1000 steps on an MLP. Does it grow, shrink, or stabilize? Repeat with clipping.

Solution

Without clipping, the gradient norm typically spikes early and then settles, but it can still show occasional large spikes (a sign of instability). With gradient clipping, the norm of the gradient actually applied is capped at max_norm, which removes the spikes from the update and yields a smoother, more stable trajectory. This is why clipping is standard in transformer training (Chapter 15).
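
Clipping by global norm can be sketched in a few lines of NumPy. The helper `clip_by_global_norm` is hypothetical and only similar in spirit to torch.nn.utils.clip_grad_norm_; the small constant in the denominator is an assumption to avoid division by zero.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g * g) for g in clipped)))
print(norm_before, norm_after)
```

For the plot, logging `norm_before` at every step shows the raw spikes, while `norm_after` shows what the optimizer actually receives.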

Exercise 16 (Code, Challenge)
Implement Adam from scratch and train linear regression; compare to torch.optim.Adam.

Solution

Maintain biased first/second moments m,v; bias-correct as m̂ = m/(1−β₁^t), v̂ = v/(1−β₂^t); update θ −= η·m̂/(√v̂+ε). With matching hyperparameters and seed, your loss curve overlays torch.optim.Adam's almost exactly, validating the implementation.

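# beta1, beta2, theta, lr, eps stand for β₁, β₂, θ, η, ε above; t is the 1-based step count
m = beta1*m + (1-beta1)*g
v = beta2*v + (1-beta2)*g**2
mhat = m/(1-beta1**t)
vhat = v/(1-beta2**t)
theta -= lr * mhat / (vhat**0.5 + eps)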
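
Put together, a from-scratch run on linear regression might look like the sketch below. The synthetic data, the learning rate of 0.05 and the 500-step budget are assumptions chosen for illustration; the comparison against torch.optim.Adam is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)
m = np.zeros(3)  # first-moment estimate
v = np.zeros(3)  # second-moment estimate
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

losses = []
for t in range(1, 501):
    r = X @ w - y
    losses.append(float(np.mean(r**2)))
    g = (2.0 / len(y)) * (X.T @ r)  # gradient of the MSE w.r.t. w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)  # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(losses[0], losses[-1], w)
```

To compare with torch.optim.Adam, start both runs from the same initial weights and use the same lr, betas and eps, then overlay the two loss curves.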