Long Context & Memory
Detailed solutions for the exercises in Chapter 33. Try solving them yourself before checking the answers.
Solution
The context window holds everything the model can attend to right now — its short-term working memory, distinct from the long-term knowledge in its weights. Tasks it enables: (1) whole-book understanding (the entire text must fit to reason across it); (2) large-codebase reasoning (attending to many files together to understand cross-file dependencies); (3) long multi-turn conversations or agent trajectories (remembering the full history without forgetting). Each needs more information held simultaneously than a short window allows.
Solution
Every token attends to every other token, so T tokens produce T×T = T² pairwise interactions (the attention matrix). T=1K → 10⁶; T=100K → 10¹⁰; T=1M → 10¹² (a trillion). Quadratic growth is punishing because each doubling of context QUADRUPLES the cost, so reaching very long contexts naively becomes astronomically expensive — the wall that every long-context technique works around.
Solution
Position encodings tell the model where each token is. A model trained only up to position 4096 has never seen positions 4097–8192, so its position encodings there are out of distribution — the model has no learned meaning for them and produces garbage. The failure is about UNSEEN POSITIONS, not compute: even with infinite compute, feeding positions the model never trained on breaks it, which is why context extension is fundamentally a position-encoding problem.
Solution
RoPE encodes position by ROTATING the query and key vectors by an angle proportional to their position, so the attention dot product depends on the relative offset between tokens. Its rotational structure enables extension because position is a smooth, continuous angle: you can mathematically 'stretch' or reinterpret those angles to cover longer ranges than trained (interpolation), something impossible with learned absolute embeddings that have a fixed vector per discrete position. RoPE's continuity is what context-extension methods exploit.
Solution
To handle 8K with a 4K-trained model, position interpolation SQUEEZES the longer range into the trained range — scale every position by 0.5 so positions run 0–4096 (in-distribution) instead of 0–8192. Ruler analogy: rather than extending a 4096-mark ruler past its end (extrapolation — no markings exist there, the model is lost), reinterpret the existing 4096 marks to span 8192 units (interpolation — every measurement still falls within the familiar range). Interpolation succeeds because the model only ever sees position values it understands; the cost is slightly reduced positional precision, recovered by brief fine-tuning.
Solution
Plain interpolation scales all RoPE frequencies UNIFORMLY, which blurs fine local positional detail. NTK-aware scaling interpolates LOW frequencies (long-range position) more than HIGH frequencies (local position), preserving local precision while still extending range. YaRN refines this with a careful per-frequency scaling plus a temperature adjustment, achieving strong extension with minimal fine-tuning. Scaling different frequencies differently achieves the best of both: extend the long-range positions without sacrificing the fine-grained local positional resolution that uniform scaling degrades.
Solution
FlashAttention computes EXACT attention but tiles the computation to avoid materializing the full T×T matrix in slow memory — it changes only the MEMORY (and traffic), not the O(T²) compute complexity. Sparse attention has each token attend to only a SUBSET of others (window, global, block patterns), which reduces the COMPUTE complexity toward O(T) but is an APPROXIMATION (it may miss some long-range interactions). FlashAttention = exact, less memory; sparse = approximate, less compute.
Solution
Standard attention forms QKᵀ first — a T×T matrix — then multiplies by V: O(T²). Linear attention approximates the softmax with a kernel feature map φ, giving φ(Q)φ(K)ᵀV. By associativity, compute φ(K)ᵀV FIRST:
The inner product φ(K)ᵀV is a small fixed (dimension×dimension) matrix, so the whole computation costs O(T·d²) — LINEAR in T. Reordering the matrix multiplication, enabled by replacing softmax with a kernel, removes the quadratic T×T matrix entirely.
Solution
State-space models process a sequence by maintaining a fixed-size STATE updated token-by-token (like an RNN), giving linear time and constant memory — no quadratic wall, no growing KV cache. Classical SSMs had a FIXED update (content-independent); Mamba's selectivity makes the update DEPEND ON THE INPUT, so the model can choose what to remember or forget dynamically — closing much of the quality gap with attention. The core trade-off: attention keeps every past token for precise RANDOM ACCESS (perfect recall, but O(T²) and growing cache), while an SSM COMPRESSES the past into a fixed state (cheap and constant memory, but may lose specific distant details). Random access vs compression.
Solution
MemGPT treats the context window like RAM and an external store like disk: the model manages a small fast context and 'pages' information in and out of a large external memory as needed, giving effectively unbounded memory with a finite window. It relates to RAG (Chapter 29) — both retrieve relevant information into the context on demand — but external memory retrieves from the model's OWN history/working memory rather than a knowledge base. Use it over a bigger window when the information is effectively unbounded (an endless conversation, a huge corpus) that no window could hold, or when keeping the context small is far cheaper than processing a giant one.
Solution
Timing a full-attention forward pass at increasing T and plotting on log-log axes yields a slope of ~2 (each doubling of T roughly quadruples the time) — the empirical confirmation of the O(T²) cost (Exercise 2) that motivates every technique in the chapter.
Solution
Rotating Q and K by position-dependent angles and checking that scores between positions m and n depend only on m−n (shifting both positions leaves the score unchanged) verifies RoPE's relative-position property (Exercise 4) — the structure that makes extension possible.
Solution
Scaling positions to squeeze a longer range into the trained range (Exercise 5) and testing on sequences beyond the original length shows the model now produces sensible output where naive extrapolation failed — demonstrating interpolation as the key to cheap context extension.
Solution
Applying per-frequency (NTK/YaRN) scaling and measuring perplexity on long sequences typically beats uniform linear interpolation (Exercise 6), because preserving local positional precision while extending range yields better long-context modeling — the empirical edge of YaRN.
Solution
A local window plus a few global tokens (Exercise 7) reduces attention cost markedly while retaining most quality on long-sequence tasks, since most dependencies are local and the global tokens carry long-range information — demonstrating sparse attention's speed/completeness trade-off.
Solution
The reordered φ(Q)(φ(K)ᵀV) form (Exercise 8) scales linearly with T (confirmed by timing), at some quality cost versus exact softmax attention on a toy task — illustrating the efficiency-vs-fidelity trade of approximating attention with a kernel.
Solution
A fixed-size state with an input-dependent update (Exercise 9) processes a long sequence in linear time with CONSTANT memory per step — versus attention's growing KV cache. Comparing memory use demonstrates the SSM's efficiency advantage, and its selectivity lets it choose what to retain.
Solution
Hiding a fact at various positions in a long context and measuring whether the model retrieves it reproduces the lost-in-the-middle effect (Chapter 29): accuracy is high at the start and end and dips in the middle — showing that a large window does not guarantee uniform use of all of it.
Solution
Storing conversation history externally and retrieving relevant pieces into a small context on demand (Exercise 10) lets the model answer a question referencing information from far earlier than the window holds — demonstrating effectively unbounded memory via retrieval, the paging analogy in code.
Solution
Combining YaRN position scaling, brief long-sequence fine-tuning, sliding-window+global sparse attention, and KV-cache management produces a working long-context model. Evaluating needle-in-a-haystack (across positions) and a multi-hop reasoning task, and comparing to a RAG baseline, reveals the trade-offs: long context wins when information is bounded and reasoning must span it all; RAG wins on cost and unbounded knowledge; and lost-in-the-middle hurts long context where key facts sit in the middle — the integrated lesson of the chapter.