Solutions Appendix
Chapter 29

Retrieval-Augmented Generation

22 Solutions

Detailed solutions for the exercises in Chapter 29. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Problems RAG solves that a model alone can't; the open-book-exam analogy; why retrieval quality dominates.

Solution

RAG addresses: stale knowledge (post-cutoff facts), private/enterprise data the model never trained on, hallucination (answers are grounded in sources), and provenance (citations). Open-book-exam analogy: instead of answering from memory alone (closed book, the model by itself), the model looks up relevant passages and answers from them (open book). Retrieval quality dominates because if the right passage isn't retrieved, even a perfect model cannot answer from the context. Garbage in, garbage out: the answer can only be as good as the retrieved context.

Exercise 2 (Pen & Paper)
Offline and online phases of RAG; why suspect retrieval first when answers are wrong?

Solution

Offline (indexing): documents are chunked, embedded, and stored in a vector index. Online (query time): the query is embedded, the top-k chunks retrieved, inserted into the prompt, and the model generates a grounded answer. When answers are wrong, suspect RETRIEVAL first because the most common failure is that the relevant chunk wasn't retrieved — if the context lacks the answer, the generation step cannot succeed. Check recall before blaming the model.

Exercise 3 (Pen & Paper)
Explain dense retrieval; why can it match text sharing no words; what does cosine measure?

Solution

Dense retrieval embeds the query and the documents into vectors that capture MEANING, then retrieves by vector similarity. It can match text sharing no words because semantically similar phrases ('car' vs 'automobile') map to nearby vectors regardless of surface form. Cosine similarity is the cosine of the angle between two embeddings, q·d / (‖q‖‖d‖): it is 1 when they point the same way, 0 when they are orthogonal, and it ignores vector length. It therefore scores conceptual relatedness, not lexical overlap (unlike keyword search).

Exercise 4 (Pen & Paper)
Dense vs sparse (keyword) retrieval; a query where each wins and one where you want both.

Solution

Dense wins on paraphrase/synonymy ('how to fix a flat' → 'repairing a punctured tire'). Sparse (BM25) wins on exact terms, rare identifiers, or codes ('error E1738', a product SKU, a person's exact name) where embeddings may blur specifics. You want BOTH (hybrid) for queries mixing concepts and exact terms — e.g. 'side effects of drug XR-450' needs semantic understanding of 'side effects' AND exact matching of 'XR-450'.

Exercise 5 (Pen & Paper)
Why is exact vector search too slow at scale? Explain ANN and the recall/speed trade-off.

Solution

Exact (brute-force) search compares the query to EVERY vector: O(N·d) work per query for N vectors of dimension d, which becomes too slow for interactive use at millions or billions of vectors. Approximate Nearest Neighbor (ANN) methods (HNSW, IVF) build an index that finds the likely-nearest vectors without checking them all, in sublinear time (roughly logarithmic for HNSW's graph search; IVF scans only a few clusters). The trade-off: ANN may miss some true neighbors (recall < 100%) in exchange for large speedups. Index parameters trade recall against latency, so you accept slightly imperfect retrieval for tractable speed.
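
The recall/speed trade-off can be seen in a toy sketch. This is not FAISS and not a production index: it is a simplified IVF-style partition in pure Python, and it assumes randomly sampled centroids instead of trained ones. All names (`build_buckets`, `approx_search`, `nprobe`) are illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(query, vectors, k):
    """Brute force: compute the distance to every vector."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))
    return order[:k]

def build_buckets(vectors, centroids):
    """Toy index: assign each vector id to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets

def approx_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in buckets[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)

rng = random.Random(0)  # fixed seed so runs are repeatable
dim, n, n_centroids, k = 8, 2000, 20, 10
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
centroids = rng.sample(vectors, n_centroids)  # assumption: untrained, randomly chosen centroids
buckets = build_buckets(vectors, centroids)
query = [rng.gauss(0, 1) for _ in range(dim)]

truth = set(exact_search(query, vectors, k))
for nprobe in (1, 5, n_centroids):
    found, scanned = approx_search(query, vectors, centroids, buckets, k, nprobe)
    recall = len(truth & set(found)) / k
    print(f"nprobe={nprobe:2d} scanned={scanned:4d}/{n} recall={recall:.2f}")
```

Probing every bucket scans all N vectors and reproduces the exact result; probing fewer buckets scans less and may miss true neighbors.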

Exercise 6 (Pen & Paper)
Chunking trade-off (too small vs too large); why overlap helps; why natural boundaries beat fixed-size.

Solution

Too-small chunks lose context (a sentence without its surroundings is ambiguous); too-large chunks dilute relevance (the embedding averages many topics, and irrelevant text crowds the context). Overlap helps because an answer spanning a chunk boundary is preserved in at least one chunk. Natural-boundary chunking (by paragraph/section) beats fixed-size because it keeps semantically coherent units intact, producing cleaner embeddings and more self-contained retrieved passages.
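
A minimal sketch of two of these strategies, assuming whitespace-separated words as the unit of size and blank lines as paragraph boundaries (real pipelines often count tokens instead). The function names are illustrative.

```python
def chunk_words(text, size, overlap):
    """Fixed-size chunks of `size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def chunk_paragraphs(text):
    """Natural-boundary chunking: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

print(chunk_words("a b c d e f g h", size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h']
print(chunk_words("a b c d e f g h", size=4, overlap=0))
# ['a b c d', 'e f g h']
```

With no overlap, the phrase "d e" is split across two chunks; with an overlap of 2 it survives whole in the second chunk.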

Exercise 7 (Pen & Paper)
Explain hybrid search and reciprocal rank fusion; why does RRF use ranks not raw scores?

Solution

Hybrid search runs both dense and sparse retrieval and combines their results. Reciprocal Rank Fusion (RRF) scores each document by Σ 1/(k + rank), summed over the retrievers that returned it, where rank is the document's 1-based position in that retriever's list and k is a smoothing constant (60 is a common choice), then re-sorts by that score. RRF uses RANKS rather than raw scores because dense (cosine) and sparse (BM25) scores live on incomparable scales, so fusing raw scores would let one method's larger numeric range dominate. Ranks are comparable across methods, so RRF combines them fairly without score normalization.
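
A short sketch of the fusion step, assuming each retriever returns a list of document ids ordered best first. The default k=60 is a commonly used value, not something this chapter prescribes, and the ids are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first id lists by summing 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_ranking = ["d1", "d2", "d3"]   # hypothetical dense (embedding) results
sparse_ranking = ["d3", "d1", "d4"]  # hypothetical sparse (BM25) results
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that both retrievers rank well (d1, d3) rise above documents only one retriever found, and no raw score ever enters the computation.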

Exercise 8 (Pen & Paper)
Explain reranking; why is a cross-encoder more accurate than a bi-encoder; why only top candidates?

Solution

Reranking re-scores the retrieved candidates with a more powerful model to reorder them. A bi-encoder embeds query and document SEPARATELY (fast, enables indexing) but never lets them interact. A cross-encoder feeds query and document TOGETHER through a Transformer, so attention models their fine-grained interaction — far more accurate, but too slow to run over the whole corpus. So it is applied only to the top candidates from first-stage retrieval: cheap retrieval narrows to dozens, then the expensive cross-encoder picks the best few.

Exercise 9 (Pen & Paper)
Describe lost-in-the-middle and two mitigations; why is more context not always better?

Solution

Lost-in-the-middle: models attend best to the start and end of the context and worst to the middle, so a relevant chunk buried in the middle may be ignored. Mitigations: (1) reorder retrieved chunks to place the most relevant at the start/end; (2) retrieve fewer, higher-quality chunks (rerank and trim). More context is not always better because adding marginally-relevant chunks pushes key information into the neglected middle and distracts the model — a focused context often beats a larger one.
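
Mitigation (1) can be sketched as a simple interleaving, assuming the chunks arrive sorted most relevant first. This is one possible ordering heuristic, and the function name is illustrative.

```python
def order_for_long_context(chunks_by_relevance):
    """Place the strongest chunks at the edges and the weakest in the middle.

    Input must be sorted most relevant first.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: 1st, 3rd, 5th... fill from the start; 2nd, 4th... from the end.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(order_for_long_context([1, 2, 3, 4, 5]))  # 1 = most relevant
# [1, 3, 5, 4, 2]
```

The two most relevant chunks land at the first and last positions, and the least relevant chunk sits in the middle.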

Exercise 10 (Pen & Paper)
Compare RAG, fine-tuning, long context for injecting knowledge; when is each best; how do RAG and fine-tuning combine?

Solution

RAG: best for large, changing, or private knowledge needing citations — update the index, not the model. Fine-tuning: best for teaching STYLE, FORMAT, or skills (behavior), not volatile facts. Long context: best when the relevant info is bounded and fits the window, and you want simplicity. They combine well: fine-tune the model to use retrieved context effectively and adopt the right style, while RAG supplies the up-to-date facts — fine-tuning for HOW, RAG for WHAT.

Exercise 11 (Code)
Implement dense retrieval from scratch: embed chunks and query, return top-k by cosine.

Solution

Embedding the chunks and query with a sentence encoder, computing cosine similarities, and returning the top-k implements the core retrieval step (Exercise 3) — the minimal RAG retriever, matching meaning rather than keywords.
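
A minimal sketch of the scoring and top-k step. The 3-dimensional vectors below are hand-written stand-ins for illustration only; in a real solution they would come from a sentence encoder, which is not shown here.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    """Return (chunk_index, similarity) for the k most similar chunks, best first."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]

chunks = [
    "The automobile would not start.",
    "Bake the bread for 30 minutes.",
    "My car has a flat tire.",
]
# Hypothetical embeddings (assumption): a real encoder would produce these.
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]  # stand-in embedding for the query "car trouble"

for i, score in top_k(query_vec, chunk_vecs, k=2):
    print(f"{score:.3f}  {chunks[i]}")
```

With these stand-in vectors the two vehicle-related chunks are returned, including the one that says 'automobile' and shares no word with the query.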

Exercise 12 (Code)
Build a FAISS index; compare Flat (exact) to HNSW (approximate) on speed and recall.

Solution

A Flat index gives exact search but scales linearly; HNSW gives near-exact results far faster. Comparing them shows HNSW achieving large speedups at slightly reduced recall — the ANN trade-off of Exercise 5 made concrete.

Exercise 13 (Code)
Implement three chunking strategies; compare retrieval quality.

Solution

Comparing fixed-size, fixed-with-overlap, and sentence-based chunking on a document set typically shows overlap and natural-boundary strategies retrieving more relevant, self-contained chunks (Exercise 6) — demonstrating that chunking choices materially affect retrieval quality.

Exercise 14 (Code)
Implement BM25; combine with dense via RRF; show a query where hybrid wins.

Solution

BM25 captures exact-term matches; fusing it with dense retrieval via RRF (Exercise 7) recovers queries that pure dense misses (rare identifiers) and pure sparse misses (paraphrases). A query with both an exact code and a paraphrased concept shows hybrid beating either alone (Exercise 4).
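
A compact sketch of the BM25 half, assuming lowercase whitespace tokenization (no stemming), the common defaults k1=1.5 and b=0.75, and the Lucene-style IDF variant that stays non-negative. The documents are made up; fusing the resulting ranking with a dense one would follow the RRF formula from Exercise 7.

```python
import math
from collections import Counter

def tokenize(text):
    """Assumption: lowercase and split on whitespace; no stemming or punctuation handling."""
    return text.lower().split()

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in tokens]
        self.avg_len = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in tokens]  # term frequencies per document
        df = Counter(term for t in tokens for term in set(t))  # document frequencies
        n = len(docs)
        # Rare terms get a high IDF; the 1 + ... inside the log keeps it non-negative.
        self.idf = {term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        total = 0.0
        length_norm = 1 - self.b + self.b * self.doc_len[i] / self.avg_len
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Document indices, best first (ties broken by index)."""
        return sorted(range(len(self.tf)), key=lambda i: (-self.score(query, i), i))

docs = [
    "restart the router to clear error e1738",
    "common router errors and how to fix them",
    "how to bake sourdough bread",
]
bm25 = BM25(docs)
print(bm25.rank("error E1738"))
```

The exact identifier 'e1738' appears only in the first document, so it ranks first; the second document mentions 'errors' but, without stemming, shares no token with the query.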

Exercise 15 (Code)
Add a cross-encoder reranker (retrieve 50, rerank to 5); compare to raw top-5.

Solution

Retrieving 50 candidates cheaply then reranking with a cross-encoder to the top 5 (Exercise 8) yields more relevant final passages than retrieval's raw top 5, because the cross-encoder models query-document interaction — the accuracy gain that justifies the two-stage retrieve-then-rerank design.
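
The control flow of the two-stage design can be sketched without any model. Both scorers below are toy stand-ins (assumptions): `word_overlap` plays the cheap first-stage retriever and `phrase_aware` plays the cross-encoder, which in a real solution would be a trained model scoring each (query, document) pair.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score, n_candidates=50, k=5):
    """Stage 1: cheap scorer over the whole corpus. Stage 2: expensive scorer on survivors only."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)[:k]

def word_overlap(query, doc):
    """Toy first-stage scorer: number of distinct shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

calls = {"expensive": 0}  # counts how often the costly scorer runs

def phrase_aware(query, doc):
    """Toy stand-in for a cross-encoder: sees query and document together, rewards the exact phrase."""
    calls["expensive"] += 1
    return word_overlap(query, doc) + (2 if query.lower() in doc.lower() else 0)

docs = [
    "your password reset request was denied",
    "to reset your password open settings",
    "password policy for contractors",
    "quarterly sales report",
    "office seating chart",
]
query = "reset your password"
raw_top = sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)[:1]
reranked_top = retrieve_then_rerank(query, docs, word_overlap, phrase_aware, n_candidates=3, k=1)
print("raw top-1:     ", raw_top)
print("reranked top-1:", reranked_top)
print("expensive calls:", calls["expensive"], "of", len(docs), "documents")
```

The first two documents tie under the cheap scorer, so the raw top-1 is whichever comes first; the second-stage scorer separates them, and it runs on only the 3 candidates, not on all 5 documents.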

Exercise 16 (Code)
Build a grounded RAG prompt with citations; show 'say I don't know' prevents hallucination when no chunk is relevant.

Solution

Instructing the model to answer only from the retrieved context, cite sources, and say 'I don't know' when the context lacks the answer prevents it from fabricating when retrieval fails — demonstrating that grounding instructions, plus an abstention clause, curb hallucination (Exercise 1).
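
One way to assemble such a prompt, as a sketch: the wording of the instructions, the `[n]` citation format, and the source names are illustrative assumptions, and the call to the model itself is left out.

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt from (source_id, text) pairs with numbered, citable context."""
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the numbered context below.\n"
        "Cite the passages you used, like [1].\n"
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

retrieved = [  # hypothetical retrieved chunks
    ("handbook.md", "Employees accrue 1.5 vacation days per month."),
    ("faq.md", "Unused vacation days expire on March 31."),
]
prompt = build_grounded_prompt("When do unused vacation days expire?", retrieved)
print(prompt)
```

To test the abstention clause, pass chunks that are unrelated to the question and check whether the model's reply is the abstention string instead of a fabricated answer.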

Exercise 17 (Code)
Demonstrate lost-in-the-middle: answer chunk at start/middle/end of a long context.

Solution

Placing the answer-bearing chunk at different positions and measuring whether the model uses it reproduces the lost-in-the-middle effect (Exercise 9): accuracy is highest when the chunk is at the start or end and drops in the middle — motivating relevance-aware context ordering.

Exercise 18 (Code Lab)
Build a complete basic RAG system (chunk→embed→index→retrieve→ground→generate); inspect retrieved chunks.

Solution

Assembling the full offline+online pipeline (Exercise 2) and answering questions while inspecting the retrieved chunks shows the system grounding its answers in real passages — and makes debugging easy, since you can see whether a wrong answer came from bad retrieval or bad generation.

Exercise 19 (Code Lab)
Implement RAG evaluation: retrieval recall@k and answer faithfulness, measured separately.

Solution

Measuring recall@k (did the right chunk get retrieved?) and faithfulness (does the answer only claim what sources support?) SEPARATELY localizes failures — low recall is a retrieval problem, low faithfulness with good recall is a generation problem (Exercise 2). Separate metrics are essential for diagnosing and improving RAG.
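
The retrieval half of the evaluation can be sketched as below, assuming each test question is labeled with the ids of the chunks that contain its answer. Faithfulness is not covered here, since it needs a judge (human or model) to compare answer claims against the sources. The chunk ids are made up.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear among the first k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("each question needs at least one relevant chunk id")
    return len(relevant & set(retrieved[:k])) / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per question."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

runs = [
    (["c3", "c7", "c1"], {"c1"}),        # answer chunk retrieved at rank 3
    (["c2", "c9", "c4"], {"c2", "c5"}),  # one of two answer chunks retrieved
]
print(mean_recall_at_k(runs, k=2))  # 0.25
print(mean_recall_at_k(runs, k=3))  # 0.75
```

If recall@k is low, the generator never saw the answer and no prompt change will help; if it is high and answers are still wrong, look at generation.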

Exercise 20 (Code)
Implement query rewriting; show it improves recall on under-specified questions.

Solution

Using the model to expand or clarify a terse query before retrieval (adding context/synonyms) improves recall on under-specified questions — the retriever gets a richer query to match against, surfacing relevant chunks the original terse query missed.

Exercise 21 (Code Lab)
Build agentic RAG: give the model a retrieval tool and let it decide when/what to retrieve; test multi-hop.

Solution

Treating retrieval as a tool (Chapter 28) lets the model decide when and what to search, issuing multiple retrievals for a multi-hop question (retrieve fact A, then use it to retrieve fact B). This adaptive retrieval handles questions a single fixed retrieval cannot — the bridge from RAG to agents.

Exercise 22 (Code, Challenge)
Full production RAG (overlap+natural chunking, hybrid+RRF, cross-encoder rerank, middle-aware ordering, grounded citations); evaluate and ablate each stage.

Solution

Building the full pipeline and then DEGRADING each stage in turn (bad chunking, no reranking, no hybrid) while measuring recall and faithfulness quantifies each component's contribution — typically reranking and hybrid retrieval give large gains, and good chunking is foundational. The ablation shows RAG quality is the product of many stages, each worth getting right (Exercise 2's 'suspect retrieval first' at system scale).