Solutions Appendix

Chapter 33

Long Context & Memory

20 Solutions

Detailed solutions for the exercises in Chapter 33. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper

Why is context length 'working memory'? Three tasks it enables and why.

Solution

The context window holds everything the model can attend to right now — its short-term working memory, distinct from the long-term knowledge in its weights. Tasks it enables: (1) whole-book understanding (the entire text must fit to reason across it); (2) large-codebase reasoning (attending to many files together to understand cross-file dependencies); (3) long multi-turn conversations or agent trajectories (remembering the full history without forgetting). Each needs more information held simultaneously than a short window allows.

Exercise 2Pen & Paper

Why is attention O(T²)? Pairwise interactions at T=1K, 100K, 1M; why is quadratic punishing?

Solution

Every token attends to every other token, so T tokens produce T×T = T² pairwise interactions (the attention matrix). T=1K → 10⁶; T=100K → 10¹⁰; T=1M → 10¹² (a trillion). Quadratic growth is punishing because each doubling of context QUADRUPLES the cost, so reaching very long contexts naively becomes astronomically expensive — the wall that every long-context technique works around.

Exercise 3Pen & Paper

Why does a 4K-trained model break at 8K, ignoring compute? Connect to position encodings.

Solution

Position encodings tell the model where each token is. A model trained only up to position 4096 has never seen positions 4097–8192, so its position encodings there are out of distribution — the model has no learned meaning for them and produces garbage. The failure is about UNSEEN POSITIONS, not compute: even with infinite compute, feeding positions the model never trained on breaks it, which is why context extension is fundamentally a position-encoding problem.

Exercise 4Pen & Paper

Explain RoPE at a high level; why does its rotational structure enable context extension?

Solution

RoPE encodes position by ROTATING the query and key vectors by an angle proportional to their position, so the attention dot product depends on the relative offset between tokens. Its rotational structure enables extension because position is a smooth, continuous angle: you can mathematically 'stretch' or reinterpret those angles to cover longer ranges than trained (interpolation), something impossible with learned absolute embeddings that have a fixed vector per discrete position. RoPE's continuity is what context-extension methods exploit.

Exercise 5Pen & Paper

Explain position interpolation via the ruler analogy; why interpolation succeeds where extrapolation fails.

Solution

To handle 8K with a 4K-trained model, position interpolation SQUEEZES the longer range into the trained range — scale every position by 0.5 so positions run 0–4096 (in-distribution) instead of 0–8192. Ruler analogy: rather than extending a 4096-mark ruler past its end (extrapolation — no markings exist there, the model is lost), reinterpret the existing 4096 marks to span 8192 units (interpolation — every measurement still falls within the familiar range). Interpolation succeeds because the model only ever sees position values it understands; the cost is slightly reduced positional precision, recovered by brief fine-tuning.

Exercise 6Pen & Paper

Compare position interpolation, NTK-aware scaling, and YaRN; what does per-frequency scaling achieve?

Solution

Plain interpolation scales all RoPE frequencies UNIFORMLY, which blurs fine local positional detail. NTK-aware scaling interpolates LOW frequencies (long-range position) more than HIGH frequencies (local position), preserving local precision while still extending range. YaRN refines this with a careful per-frequency scaling plus a temperature adjustment, achieving strong extension with minimal fine-tuning. Scaling different frequencies differently achieves the best of both: extend the long-range positions without sacrificing the fine-grained local positional resolution that uniform scaling degrades.

Exercise 7Pen & Paper

Distinguish FlashAttention from sparse attention; which changes complexity, which only memory?

Solution

FlashAttention computes EXACT attention but tiles the computation to avoid materializing the full T×T matrix in slow memory — it changes only the MEMORY (and traffic), not the O(T²) compute complexity. Sparse attention has each token attend to only a SUBSET of others (window, global, block patterns), which reduces the COMPUTE complexity toward O(T) but is an APPROXIMATION (it may miss some long-range interactions). FlashAttention = exact, less memory; sparse = approximate, less compute.

Exercise 8Derive

Show linear attention reorders softmax(QKᵀ)V to φ(Q)(φ(K)ᵀV); why O(T²)→O(T)?

Solution

Standard attention forms QKᵀ first — a T×T matrix — then multiplies by V: O(T²). Linear attention approximates the softmax with a kernel feature map φ, giving φ(Q)φ(K)ᵀV. By associativity, compute φ(K)ᵀV FIRST:

φ(Q)·(φ(K)ᵀ V) — the inner term is d×d, independent of T

The inner product φ(K)ᵀV is a small fixed (dimension×dimension) matrix, so the whole computation costs O(T·d²) — LINEAR in T. Reordering the matrix multiplication, enabled by replacing softmax with a kernel, removes the quadratic T×T matrix entirely.

Exercise 9Pen & Paper

Explain SSMs and Mamba's 'selectivity'; the random-access vs compression trade-off.

Solution

State-space models process a sequence by maintaining a fixed-size STATE updated token-by-token (like an RNN), giving linear time and constant memory — no quadratic wall, no growing KV cache. Classical SSMs had a FIXED update (content-independent); Mamba's selectivity makes the update DEPEND ON THE INPUT, so the model can choose what to remember or forget dynamically — closing much of the quality gap with attention. The core trade-off: attention keeps every past token for precise RANDOM ACCESS (perfect recall, but O(T²) and growing cache), while an SSM COMPRESSES the past into a fixed state (cheap and constant memory, but may lose specific distant details). Random access vs compression.

Exercise 10Pen & Paper

Explain external memory (MemGPT) via the OS RAM/disk analogy; relation to RAG; when over a bigger window?

Solution

MemGPT treats the context window like RAM and an external store like disk: the model manages a small fast context and 'pages' information in and out of a large external memory as needed, giving effectively unbounded memory with a finite window. It relates to RAG (Chapter 29) — both retrieve relevant information into the context on demand — but external memory retrieves from the model's OWN history/working memory rather than a knowledge base. Use it over a bigger window when the information is effectively unbounded (an endless conversation, a huge corpus) that no window could hold, or when keeping the context small is far cheaper than processing a giant one.

Exercise 11Code

Measure attention's quadratic cost empirically; plot time vs T.

Solution

Timing a full-attention forward pass at increasing T and plotting on log-log axes yields a slope of ~2 (each doubling of T roughly quadruples the time) — the empirical confirmation of the O(T²) cost (Exercise 2) that motivates every technique in the chapter.

Exercise 12Code

Implement RoPE; verify the attention score depends only on relative distance.

Solution

Rotating Q and K by position-dependent angles and checking that scores between positions m and n depend only on m−n (shifting both positions leaves the score unchanged) verifies RoPE's relative-position property (Exercise 4) — the structure that makes extension possible.

Exercise 13Code

Implement position interpolation; test whether it extends beyond the trained length.

Solution

Scaling positions to squeeze a longer range into the trained range (Exercise 5) and testing on sequences beyond the original length shows the model now produces sensible output where naive extrapolation failed — demonstrating interpolation as the key to cheap context extension.

Exercise 14Code

Implement YaRN-style per-frequency scaling; compare long-context perplexity to linear interpolation.

Solution

Applying per-frequency (NTK/YaRN) scaling and measuring perplexity on long sequences typically beats uniform linear interpolation (Exercise 6), because preserving local positional precision while extending range yields better long-context modeling — the empirical edge of YaRN.

Exercise 15Code

Implement sliding-window + global attention; compare cost and quality to full attention.

Solution

A local window plus a few global tokens (Exercise 7) reduces attention cost markedly while retaining most quality on long-sequence tasks, since most dependencies are local and the global tokens carry long-range information — demonstrating sparse attention's speed/completeness trade-off.

Exercise 16Code

Implement linear attention; verify linear scaling; compare quality to full attention.

Solution

The reordered φ(Q)(φ(K)ᵀV) form (Exercise 8) scales linearly with T (confirmed by timing), at some quality cost versus exact softmax attention on a toy task — illustrating the efficiency-vs-fidelity trade of approximating attention with a kernel.

Exercise 17Code Lab

Implement a minimal selective SSM (input-dependent state update); process a long sequence; compare memory to attention.

Solution

A fixed-size state with an input-dependent update (Exercise 9) processes a long sequence in linear time with CONSTANT memory per step — versus attention's growing KV cache. Comparing memory use demonstrates the SSM's efficiency advantage, and its selectivity lets it choose what to retain.

Exercise 18Code Lab

Build a needle-in-a-haystack test; measure retrieval accuracy by position; show lost-in-the-middle.

Solution

Hiding a fact at various positions in a long context and measuring whether the model retrieves it reproduces the lost-in-the-middle effect (Chapter 29): accuracy is high at the start and end and dips in the middle — showing that a large window does not guarantee uniform use of all of it.

Exercise 19Code

Build a MemGPT-style external memory; retrieve distant history into a small window; answer a referencing question.

Solution

Storing conversation history externally and retrieving relevant pieces into a small context on demand (Exercise 10) lets the model answer a question referencing information from far earlier than the window holds — demonstrating effectively unbounded memory via retrieval, the paging analogy in code.

Exercise 20Code (Challenge)

Extend a short-context model end-to-end (YaRN + fine-tune + sparse attention + KV management); evaluate needle-in-a-haystack and multi-hop vs a RAG baseline.

Solution

Combining YaRN position scaling, brief long-sequence fine-tuning, sliding-window+global sparse attention, and KV-cache management produces a working long-context model. Evaluating needle-in-a-haystack (across positions) and a multi-hop reasoning task, and comparing to a RAG baseline, reveals the trade-offs: long context wins when information is bounded and reasoning must span it all; RAG wins on cost and unbounded knowledge; and lost-in-the-middle hurts long context where key facts sit in the middle — the integrated lesson of the chapter.

←

ReturnAppendix Index

ReviewBack to Chapter 33

→