Solutions Appendix
Chapter 9

Sequence Models: RNNs & LSTMs

18 Solutions

Detailed solutions for the exercises in Chapter 9. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Write the full BPTT gradient ∂L/∂W_hh for a 3-step RNN; identify the vanishing-gradient term.

Solution

∂L/∂W_hh = Σ_{t} ∂L/∂h_t · (Π_{k<≤t} ∂h_ℓ/∂h_{ℓ−1}) · ∂h_k/∂W_hh, summed over time steps t and their contributing earlier steps k. The dangerous factor is the product of Jacobians Π ∂h_ℓ/∂h_{ℓ−1} = Π diag(tanh')·W_hh. Repeated multiplication of these terms shrinks the gradient toward zero (vanishing) or blows it up (exploding) as the gap t−k grows — the core RNN training problem.

Exercise 2Pen & Paper
Show tanh' = 1−tanh² is at most 1 and →0 as |a|→∞; how does this compound vanishing?

Solution

Since tanh ranges in (−1,1), tanh² ∈ [0,1), so 1−tanh² ∈ (0,1], maximized at 1 when a=0 and →0 as |a|→∞ (saturation). In BPTT each backward step multiplies by these ≤ 1 factors (and by W_hh); over many steps the product of sub-1 derivatives drives the gradient exponentially toward zero, so distant time steps receive almost no learning signal — the vanishing-gradient mechanism.

Exercise 3Pen & Paper
For c_t = f_t⊙c_{t−1} + i_t⊙c̃_t, compute ∂c_t/∂c_{t−1}; why does it protect gradients when f_t≈1?

Solution

The direct path gives ∂c_t/∂c_{t−1} = f_t (element-wise), ignoring the smaller indirect contributions through the gates. When the forget gate f_t ≈ 1, this derivative is ≈ 1, so gradients flow back through the cell state essentially undiminished — the 'constant error carousel'. Contrast the vanilla RNN's ∂h_t/∂h_{t−1} = diag(tanh')W_hh, whose sub-1 factors cause vanishing. The additive, gated cell state is the LSTM's gradient highway.

Exercise 4Pen & Paper
Count LSTM vs GRU parameters; verify the GRU has 25% fewer recurrent parameters.

Solution

An LSTM has 4 gates (input, forget, output, candidate), each with a W (H×D), a U (H×H), and a bias: total 4(HD + H² + H). A GRU has 3 gates (reset, update, candidate): 3(HD + H² + H). Comparing the recurrent (H²) terms, 3H² vs 4H² is exactly 25% fewer — the GRU's efficiency comes from merging the LSTM's cell/hidden state and dropping one gate.

Exercise 5Pen & Paper
If the forget gate is always 1 and input gate always 0, show c_t = c_0. What does this achieve?

Solution

Then c_t = 1·c_{t−1} + 0·c̃_t = c_{t−1}, so by induction c_t = c_0 for all t — the cell state is perfectly preserved across arbitrarily many steps. This is the 'constant error carousel': information (and gradient) can be carried forward indefinitely without decay, which is precisely how LSTMs retain long-range dependencies that vanilla RNNs forget.

Exercise 6Pen & Paper
In Bahdanau attention, prove the context vector lies in the convex hull of encoder states.

Solution

The alignment weights α_{tj} = softmax(e_{tj}) are non-negative and sum to 1 over j. The context c_t = Σ_j α_{tj} h_j is therefore a convex combination of the encoder states {h_j}, which by definition lies in their convex hull. So attention can only produce blends 'between' the encoder representations — it reweights existing information rather than inventing new directions.

Exercise 7Pen & Paper
Compare path length between positions i and j in an RNN, a biRNN, and a Transformer. Why does it matter?

Solution

Vanilla RNN: O(|i−j|) — information must traverse every intermediate step. Bidirectional RNN: still O(|i−j|) per direction (though both ends are reachable). Transformer self-attention: O(1) — any two positions interact directly in one layer. Shorter path length means fewer multiplicative Jacobian factors between distant tokens, so gradients survive and long-range dependencies are far easier to learn — a key reason Transformers outperform RNNs.

Exercise 8Pen & Paper
Argue information-theoretically why fixed-vector seq2seq quality degrades as T grows.

Solution

The encoder must compress a T-token sentence into a single fixed H-dimensional vector of bounded capacity. As T grows, the amount of information to store exceeds what H dimensions can faithfully represent, so detail is inevitably lost — the fixed bottleneck cannot scale with input length. Translation quality therefore degrades for long sentences, which is exactly the bottleneck attention removes by letting the decoder revisit all encoder states.

Exercise 9Pen & Paper
Derive the gradient of Luong dot-product score sᵀh w.r.t. s and h; why is it cheaper than Bahdanau?

Solution

For the score a = sᵀh, ∂a/∂s = h and ∂a/∂h = s — trivial gradients with no extra parameters. Bahdanau's additive score vᵀtanh(W_s s + W_h h) requires learned matrices W_s, W_h, a vector v, and a tanh nonlinearity, so it is more expensive in both compute and parameters. The dot-product form (later central to Transformer attention) is cheaper because alignment is just an inner product.

Exercise 10Pen & Paper
Why can't truncated BPTT with length k learn dependencies longer than k within one backward pass?

Solution

Truncated BPTT only backpropagates gradients k steps before cutting the graph. A dependency spanning more than k steps has no gradient path connecting the distant cause to the effect within a single backward pass, so the weights responsible for carrying that information receive no learning signal for it. The forward hidden state still passes information forward indefinitely, but learning to USE long-range information requires a gradient path, which truncation severs beyond k.

Exercise 11Code
Vanilla RNN char-LM in NumPy; show it fails to balance parentheses opened >~15 chars earlier.

Solution

Train a small char-level RNN to predict the next character on bracketed text. It learns local statistics but, due to vanishing gradients, fails to close parentheses opened more than ~15 characters back — the closing-bracket prediction degrades with distance. This empirically demonstrates the limited effective memory of vanilla RNNs.

Exercise 12Code
Implement the LSTM cell forward and backward from scratch; verify with finite differences.

Solution

Implement the four gate equations forward, then derive the backward pass (gradients through the gates and the additive cell-state path). The Chapter-5 finite-difference checker should confirm all gate and weight gradients to ~1e−5, validating the hand-derived LSTM backprop — including the f_t gradient highway from Exercise 3.

Exercise 13Code
Implement a GRU; on a copy task compare GRU vs vanilla RNN as delay N grows 5→100.

Solution

On the copy task (reproduce a sequence after an N-step delay), the GRU maintains high accuracy to much larger N than the vanilla RNN, whose accuracy collapses once N exceeds its short effective memory (~10–15). The plot shows the gating mechanism extending usable memory dramatically — the practical payoff of gated recurrence.

Exercise 14Code Lab
Build a seq2seq model WITHOUT attention for a toy task; plot accuracy vs source length.

Solution

A plain encoder–decoder solves short sequences but its accuracy falls off as source length grows, because the single fixed context vector saturates (Exercise 8). The declining accuracy-vs-length curve is the visible signature of the fixed-bottleneck problem.

Exercise 15Code Lab
Add Bahdanau attention; re-plot accuracy vs length; visualize the attention matrix.

Solution

With attention the decoder can look back at all encoder states, so the accuracy-vs-length curve flattens — long sequences no longer degrade. The attention heatmap shows a roughly diagonal alignment (and meaningful re-orderings for tasks like reversal), visually confirming that attention dissolves the bottleneck by learning which source positions to attend to at each output step.

Exercise 16Code
Gradient highway: train LSTM vs vanilla RNN on an 80-step dependency; plot gradient norm reaching t=1.

Solution

For a task whose output depends on the input at t=1 across an 80-step gap, the LSTM maintains a non-negligible gradient norm reaching t=1 (thanks to the cell-state highway), while the vanilla RNN's gradient at t=1 decays to near zero. The plot makes the vanishing-vs-preserved gradient contrast concrete.

Exercise 17Code
Bidirectional LSTM POS tagger; compare uni- vs bidirectional F1; which tags benefit from right context?

Solution

A bidirectional LSTM reads the sentence both ways, so each tag decision sees future context. F1 improves over the unidirectional model, with the largest gains on tags that depend on what follows — e.g. disambiguating a word as noun vs verb often requires the next word. Right-context-dependent categories benefit most.

Exercise 18Code (Challenge)
Full attention seq2seq translator on a small real dataset; teacher forcing, beam search, attention viz; then reflect.

Solution

Build encoder–decoder with Bahdanau/Luong attention, train with teacher forcing, decode with beam search, and visualize alignments; report BLEU. The reflection should note the limitations encountered — sequential (non-parallel) training, limited long-range modeling, and the lingering compression in the recurrent decoder — and explain how the Transformer addresses each: full parallelism over positions, O(1) attention path length, and no recurrent bottleneck (Chapters 12–13).