Numerical Methods & Stability
Detailed solutions for the exercises in Chapter 5. Try solving them yourself before checking the answers.
Solution
Machine epsilon ≈ 2^−23 ≈ 1.19×10⁻⁷ for float32 (the smallest relative gap between representable numbers). float64 has 52 mantissa bits, giving ≈ 2^−52 ≈ 2.2×10⁻¹⁶ — about nine orders of magnitude more precise. This is why accumulations (sums, norms) are often done in float32/float64 even when storage is float16.
Solution
softmax(z)_i = e^{z_i}/Σ_j e^{z_j}. Subtracting c from every logit multiplies numerator and denominator by e^{−c}: e^{z_i−c}/Σ_j e^{z_j−c} = (e^{−c}e^{z_i})/(e^{−c}Σ e^{z_j}), and e^{−c} cancels. The probabilities are unchanged, but with c = max(z) the largest exponent becomes e^0 = 1, preventing overflow.
Solution
Σ e^{z_i} = e^c Σ e^{z_i−c}; taking logs gives logΣ e^{z_i} = c + logΣ e^{z_i−c}. Choosing c = max(z) makes the largest shifted exponent e^0 = 1 and all others ≤ 1, so no term overflows and the dominant term never underflows — the standard log-sum-exp trick.
Solution
Yes — 5×10⁻⁹ < 6×10⁻⁸, so it flushes to zero in fp16. To lift it above the minimum, scale the loss (and hence gradients) by S with S·5×10⁻⁹ ≥ 6×10⁻⁸, i.e. S ≥ 12. In practice a power of two such as S = 16 (or much larger, e.g. 1024, with dynamic loss scaling) is used; gradients are unscaled by S before the optimizer step. This is the core of mixed-precision training (Chapter 20).
Solution
Maintaining the running mean μ_n = μ_{n−1} + (x_n−μ_{n−1})/n, one can show the sum of squared deviations M_n = Σ_{i≤n}(x_i−μ_n)² satisfies M_n = M_{n−1} + (x_n−μ_{n−1})(x_n−μ_n). The cross terms from updating μ cancel exactly, so the recurrence accumulates the centered sum of squares in one pass, numerically stably — avoiding the catastrophic cancellation of the naive E[X²]−E[X]² formula (Exercise 10).
Solution
Truncation error of forward difference ∝ h ≈ 10⁻⁵; central difference ∝ h² ≈ 10⁻¹⁰. Central differencing is far more accurate for the same step (though both are eventually limited by floating-point rounding, which grows as h shrinks — there is an optimal h balancing truncation and rounding).
Solution
Naive softmax overflows when e^{z} exceeds the float max: in fp32 (max ≈ 3.4×10³⁸) this happens around z ≈ 88; in fp16 (max ≈ 65504) around z ≈ 11. The stable version (subtract max first) never overflows at any magnitude, producing correct probabilities where the naive one returns NaN.
Solution
Finite-difference checking compares analytic gradients to (f(x+ε)−f(x−ε))/2ε. Correct ReLU (grad 1 for x>0, else 0) and GELU pass within ~1e−5. Flipping the sign of the ReLU derivative makes the analytic and numeric gradients disagree by ~2× the true value, which the check flags immediately — demonstrating gradient_check as a bug detector.
Solution
Subtract the max before exponentiating: lse(z) = max(z) + logΣ e^{z−max(z)}. This matches scipy.special.logsumexp on random vectors and survives edge cases (all −inf → −inf; very large values → no overflow; very small → no underflow) where a naive log(Σ exp) fails.
Solution
The naive E[X²]−E[X]² subtracts two huge, nearly-equal float32 numbers (~10¹⁶), losing all significant digits — the result can be wildly wrong or even negative. The two-pass (or Welford) method subtracts the mean first, working with the small deviations [1,2,3], and recovers the true variance (≈0.667). This is the textbook example of catastrophic cancellation.
Solution
With a fixed seed and deterministic algorithms, the two runs produce bit-for-bit identical weights and losses. Enabling nondeterministic GPU kernels (e.g. atomic-add reductions whose order varies) makes the runs diverge in the low-order bits and then, through chaotic amplification, in visible digits — illustrating why reproducibility requires deterministic settings.
Solution
Mixed precision stores activations and (often) a working copy of weights in fp16, roughly halving activation and gradient memory, though a master fp32 copy of weights and optimizer state remains. Measured with torch.cuda.max_memory_allocated(), peak memory typically drops to ~0.5–0.65× of fp32, with the exact ratio depending on how much of the footprint is activations (which halve) versus optimizer state (which may not).