Calculus & Optimization
Detailed solutions for the exercises in Chapter 2. Try solving them yourself before checking the answers.
Solution
f'(x) = 3x² − 10x + 3. Setting it to zero: x = (10 ± √(100−36))/6 = (10 ± 8)/6, giving x = 3 and x = 1/3. These are the critical points (a local max at 1/3 and a local min at 3, since f'' = 6x−10 changes sign).
Solution
Three nested functions: outer e^u with u=sin(v), inner v=x². By the chain rule g'(x) = e^{sin(x²)} · cos(x²) · 2x. Each factor is the derivative of one layer: d(e^u)/du = e^u, d(sin v)/dv = cos v, d(x²)/dx = 2x — exactly how backprop chains local derivatives.
Solution
∂f/∂x = 2xy and ∂f/∂y = x² + 3y². At (2,1): ∂f/∂x = 4, ∂f/∂y = 4 + 3 = 7, so ∇f(2,1) = [4, 7]. The gradient points in the direction of steepest increase of f at that point.
Solution
Let s = σ(wx+b). Then dz/ds = 1/s, ds/d(wx+b) = s(1−s), and d(wx+b)/dw = x. Multiplying: ∂z/∂w = (1/s)·s(1−s)·x = (1−σ(wx+b))·x. This is exactly the gradient of the log-likelihood term in logistic regression.
Solution
Algebraically f = a² − b², so ∂f/∂a = 2a and ∂f/∂b = −2b. Via the graph (p=a+b, q=a−b, f=pq): ∂f/∂p = q, ∂f/∂q = p; then ∂f/∂a = q·1 + p·1 = p+q = 2a, and ∂f/∂b = q·1 + p·(−1) = q−p = −2b. The two routes agree — the local gradients accumulate through both paths from a (and b).
Solution
Adam tracks m ≈ mean(g) and v ≈ mean(g²). Scaling g by c scales m by c and v by c². The update η·m̂/(√v̂+ε) then has c in the numerator and √(c²)=c in the denominator, which cancel (ignoring ε). So the step size is invariant to the overall gradient magnitude. This makes Adam robust to loss scaling and to per-layer gradient-scale differences — it adapts a sensible step regardless of raw magnitude.
Solution
Let r = Wx − y (the residual). Then L = (1/N)‖r‖² and ∂L/∂W = (2/N)·r·xᵀ (an outer product matching W's shape). The gradient is the residual scaled by the input — the workhorse of least-squares and the linear layer's backward pass.
Solution
The sigmoid derivative is σ(1−σ), which peaks at 0.25 and is below 1 everywhere. Backprop multiplies these across layers, so after L layers gradients can shrink by up to 0.25^L — vanishing exponentially. The ReLU derivative is exactly 1 for positive inputs, so gradients pass through undiminished (the cost is 'dead' units where the input is negative). This is why ReLU largely cured the vanishing-gradient problem that plagued deep sigmoid networks.
Solution
With no gradient signal, the weight follows w_T = w_0·(1−ηλ)^T. Since 0 < 1−ηλ < 1 for reasonable ηλ, this is geometric decay toward 0. Decoupled weight decay thus pulls unused weights smoothly toward zero, independent of the adaptive gradient scaling — the property that distinguishes AdamW from L2 inside Adam.
Solution
Nodes: u = a·b, v = u + c, L = v². Backward: ∂L/∂v = 2v; ∂v/∂u = 1 and ∂v/∂c = 1; ∂u/∂a = b and ∂u/∂b = a. Chaining: ∂L/∂a = 2v·b, ∂L/∂b = 2v·a, ∂L/∂c = 2v, where v = ab+c. Each gradient is the upstream gradient times the local derivative — the essence of reverse-mode autodiff.
Solution
Compute softmax s, build the analytic Jacobian J = diag(s) − s sᵀ, and compare to a finite-difference estimate (perturb each x_i by ε and measure the change in each output). They agree to ~1e−6 in float64.
s = softmax(x); J = np.diag(s) - np.outer(s, s)Solution
Each op defines a forward value and a local-gradient closure: subtraction passes +1/−1 to its inputs; for x^p the local grad is p·x^{p−1}; for log(x) it is 1/x; for tanh(x) it is 1−tanh²(x). gradient_check (finite differences) should match analytic grads to ~1e−5, confirming each backward is correct.
Solution
A 2–2–1 network with a nonlinearity (tanh/ReLU) and MSE or BCE loss solves XOR; linear models cannot. Trained to convergence the loss approaches ~0 and the hidden layer learns to carve the input plane so the two XOR-positive points become linearly separable in hidden space.
Solution
Implement CE = −log(softmax(logits)[target]); its analytic gradient w.r.t. logits is p − y_onehot. Finite-difference checking confirms the gradient, and the loss value matches PyTorch's F.cross_entropy (which fuses log-softmax and NLL for stability).
Solution
Without clipping the gradient norm typically spikes early then settles, but can show occasional large spikes (instability). With gradient clipping the norm is capped at max_norm, removing the spikes and yielding a smoother, more stable trajectory — demonstrating why clipping is standard in transformer training (Chapter 15).
Solution
Maintain biased first/second moments m,v; bias-correct as m̂ = m/(1−β₁^t), v̂ = v/(1−β₂^t); update θ −= η·m̂/(√v̂+ε). With matching hyperparameters and seed, your loss curve overlays torch.optim.Adam's almost exactly, validating the implementation.
m = β₁*m + (1-β₁)*g; v = β₂*v + (1-β₂)*g**2
mhat = m/(1-β₁**t); vhat = v/(1-β₂**t)
θ -= lr * mhat / (vhat**0.5 + eps)