Solutions Appendix
Chapter 30

Multimodal LLMs

20 Solutions

Detailed solutions for the exercises in Chapter 30. Try solving them yourself before checking the answers.

Exercise 1: Pen & Paper
Core idea that makes multimodal models possible; why is the Transformer 'modality-agnostic'?

Solution

The core idea: convert ANY modality into a sequence of TOKENS (vectors), and the Transformer processes them identically — it operates on sequences of embeddings regardless of where they came from. The Transformer is 'modality-agnostic' because attention and FFNs act on vectors without caring whether they encode text, image patches, or audio frames. So if you can tokenize a modality into the model's embedding space, the same architecture handles it — the key enabler of multimodal models.

Exercise 2: Pen & Paper
How is an image turned into tokens (ViT patches)? For 336×336 with 14×14 patches, how many tokens?

Solution

A Vision Transformer splits the image into fixed-size patches, flattens each patch, and linearly projects it into an embedding — each patch becomes a token. For a 336×336 image with 14×14 patches: (336/14)×(336/14) = 24×24 = 576 patch tokens. The image becomes a 576-token sequence the Transformer can process like text.

Exercise 3: Pen & Paper
Why do patch tokens need position embeddings? What goes wrong without them?

Solution

Like text tokens, patch tokens are processed by permutation-equivariant attention (Chapter 13), which is blind to order — so without position embeddings the model cannot tell WHERE each patch sits in the image (top-left vs bottom-right). Spatial relationships (above, beside, the layout of a scene) would be lost, making the model unable to reason about image structure. Position embeddings restore the 2-D spatial arrangement.
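The order-blindness can be checked numerically. The sketch below is a minimal single-head self-attention in NumPy; the function name, sizes, random weights, and the fixed permutation are all illustrative assumptions, not part of the chapter.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the rows of x, with no position information."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
x = rng.normal(size=(n_tokens, dim))            # stand-in patch embeddings
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
perm = np.array([3, 0, 5, 1, 4, 2])             # a fixed shuffle of the patches

# Without position embeddings: shuffling the patches only shuffles the outputs.
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)
equivariant = np.allclose(out[perm], out_shuffled)

# With position embeddings tied to grid slots, the same shuffle changes the features.
pos = rng.normal(size=(n_tokens, dim))
out_pos = self_attention(x + pos, wq, wk, wv)
out_pos_shuffled = self_attention(x[perm] + pos, wq, wk, wv)
position_aware = not np.allclose(out_pos[perm], out_pos_shuffled)

print(equivariant, position_aware)
```

The first check shows that attention alone cannot distinguish an image from a scrambled copy of its patches; the second shows that adding a per-slot embedding makes the arrangement visible to the model.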

Exercise 4: Pen & Paper
Explain CLIP's contrastive objective; why does it produce a shared image-text space?

Solution

CLIP trains an image encoder and a text encoder so that matching (image, caption) pairs have high cosine similarity and mismatched pairs have low similarity — a contrastive loss over a batch (InfoNCE, Chapter 4). Because the objective pulls each image's embedding toward its caption's embedding (and pushes apart non-matches), both encoders learn to map images and text into the SAME space where semantically corresponding items align — a shared image-text embedding space enabling cross-modal comparison.
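One common way to write this objective is sketched below. The notation is assumed here rather than taken from the chapter: N is the batch size, u_i and v_j are the L2-normalized image and text embeddings, and τ is a temperature.

```latex
s_{ij} = \frac{u_i^\top v_j}{\tau}, \qquad
\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left[
  -\log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}
  \;-\; \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}
\right]
```

The first term classifies the correct caption for image i among all captions in the batch; the second classifies the correct image for caption i among all images. Both are minimized when the matched similarity s_ii dominates its row and column.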

Exercise 5: Pen & Paper
Why are CLIP vision features well-suited as LLM input vs raw pixels?

Solution

CLIP's vision features are already SEMANTIC — trained against language, they encode high-level meaning aligned with text concepts, not low-level pixels. Feeding raw pixels would force the LLM to learn vision from scratch; feeding CLIP features gives it representations that already 'speak the language' of concepts, so a small projector can bridge them into the LLM's space. CLIP features are semantically rich and text-aligned, making them an ideal interface.

Exercise 6: Pen & Paper
Describe the LLaVA architecture (encoder, projector, LLM); role of the projector; why small?

Solution

LLaVA = a pretrained vision encoder (e.g. CLIP, kept frozen) → a projector (small MLP) → a pretrained LLM. The projector maps the vision encoder's feature vectors into the LLM's token-embedding space, so image features become 'tokens' the LLM can attend to alongside text. It can be small because both the encoder (semantic vision features) and the LLM (language understanding) are already powerful pretrained models: the projector only needs to learn a simple, close-to-linear mapping between two existing spaces, a comparatively easy job.

Exercise 7: Pen & Paper
Explain the two-stage VLM training; what is frozen/trained in each and why?

Solution

Stage 1 (alignment): freeze the vision encoder AND the LLM, train ONLY the projector on image-caption pairs, cheaply learning to map vision features into the LLM's space without disturbing either pretrained component. Stage 2 (instruction tuning): unfreeze the LLM and keep training the projector (the vision encoder typically stays frozen) on image-question-answer data so the model learns to USE the visual tokens to follow instructions. Stage 1 builds the bridge; Stage 2 teaches multimodal task behavior. Freezing first protects the expensive pretrained weights while the cheap projector aligns.

Exercise 8: Pen & Paper
Explain the image-token budget; why are high-res images expensive; how does a resampler help?

Solution

Each image becomes many patch tokens (Exercise 2), which consume context and compute like text tokens. High-resolution images produce far more patches: the patch count grows with the square of the side length, so doubling the resolution quadruples the tokens, and attention is quadratic in total tokens on top of that. A resampler (e.g. a Perceiver resampler or Q-Former) compresses the many patch tokens into a small FIXED number of query tokens via cross-attention, so the LLM sees a constant, manageable number of visual tokens regardless of image resolution. This decouples the LLM's cost from resolution; the vision encoder and the resampler itself still process every patch.
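A quick back-of-the-envelope sketch of the budget. The helper name, the 14-pixel patch size, and the choice of resolutions are assumptions for illustration; "attention pairs" counts only image-to-image token pairs and ignores text tokens.

```python
def visual_token_budget(side: int, patch: int = 14) -> dict:
    """Patch-token count and pairwise-attention count for a square image."""
    tokens = (side // patch) ** 2
    return {"side": side, "tokens": tokens, "attention_pairs": tokens * tokens}

budgets = [visual_token_budget(side) for side in (336, 672, 1344)]
for b in budgets:
    print(f"{b['side']}x{b['side']}: {b['tokens']} tokens, "
          f"{b['attention_pairs']:,} attention pairs")
```

Each doubling of the side length multiplies the token count by 4 and the pairwise attention count by 16, which is why a fixed-size resampler output is attractive.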

Exercise 9: Pen & Paper
Show the audio pipeline mirrors vision; what fundamentally changes?

Solution

Audio: convert the waveform to a spectrogram (a 2-D time-frequency image), patchify it, and feed the patches as tokens — exactly mirroring the vision pipeline (patchify → embed → Transformer). The only fundamental change is the FRONT-END that turns the raw signal into a 2-D representation (spectrogram for audio vs pixel grid for images); after that, the tokenize-and-process machinery is the same — underscoring the modality-agnostic point of Exercise 1.

Exercise 10: Pen & Paper
Why does a SHARED embedding space (not separate per-modality) enable cross-modal reasoning? Example.

Solution

A shared space lets tokens from different modalities be compared and combined directly by attention — the model can relate a word to an image region because both live in the same space. Separate per-modality spaces would require translation at every interaction. Example requiring it: answering 'What color is the car in the image?' needs the text token 'car' to attend to and align with the visual patches depicting the car — only possible if words and patches share a space where 'car-ness' is comparable across modalities.

Exercise 11: Code
Implement patchify: image tensor → sequence of flattened patch tokens; verify token counts.

Solution

Reshaping an image into non-overlapping patches and flattening each gives the patch-token sequence; verifying the count = (H/p)×(W/p) for various sizes confirms Exercise 2 (e.g. 576 for 336×336 with 14×14) — the first step of turning images into Transformer input.
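One possible implementation, sketched in NumPy. It assumes a channels-first (C, H, W) layout and row-by-row patch ordering; the function name `patchify` and the test image are illustrative.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split a (C, H, W) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, C * patch * patch), with patches
    ordered row by row across the image grid.
    """
    c, h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (grid_row, grid_col, C, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)

image = np.arange(3 * 336 * 336, dtype=np.float32).reshape(3, 336, 336)
tokens = patchify(image, 14)
print(tokens.shape)  # (576, 588)
```

Each of the 576 rows holds the 3 × 14 × 14 = 588 raw values of one patch; a learned linear projection would then map each row to the model's embedding dimension.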

Exercise 12: Code
Build a minimal ViT encoder (patch embed + position embed + blocks); get patch features.

Solution

Embedding patches, adding position embeddings (Exercise 3), and passing through a few Transformer blocks yields contextualized patch features — a working vision encoder demonstrating that the text-Transformer machinery applies unchanged to image patches.
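A forward-pass-only sketch in NumPy. Everything here is a simplifying assumption: random untrained weights, single-head attention, a ReLU MLP, layer norm without learned scale or shift, and toy dimensions. It illustrates the data flow rather than a usable encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_block(dim, hidden, scale=0.02):
    shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
              "wo": (dim, dim), "w1": (dim, hidden), "w2": (hidden, dim)}
    return {name: rng.normal(0, scale, shape) for name, shape in shapes.items()}

def block(x, p):
    """Pre-norm Transformer block: self-attention, then an MLP, both residual."""
    h = layer_norm(x)
    attn = softmax((h @ p["wq"]) @ (h @ p["wk"]).T / np.sqrt(x.shape[-1]))
    x = x + attn @ (h @ p["wv"]) @ p["wo"]
    h = layer_norm(x)
    return x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]

def vit_encode(patches, w_embed, pos_embed, blocks):
    x = patches @ w_embed + pos_embed  # patch embedding + position embedding
    for p in blocks:
        x = block(x, p)
    return layer_norm(x)               # one contextualized feature per patch

num_patches, patch_dim, dim = 16, 48, 32
patches = rng.normal(size=(num_patches, patch_dim))  # stand-in flattened patches
w_embed = rng.normal(0, 0.02, (patch_dim, dim))
pos_embed = rng.normal(0, 0.02, (num_patches, dim))
blocks = [init_block(dim, 4 * dim) for _ in range(2)]

features = vit_encode(patches, w_embed, pos_embed, blocks)
print(features.shape)  # (16, 32)
```

Nothing in `block` is specific to images: the same function would process text-token embeddings unchanged.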

Exercise 13: Code
Implement CLIP's contrastive loss; verify it pulls matches together and pushes mismatches apart.

Solution

Computing the image-text similarity matrix for a batch and applying the symmetric InfoNCE loss (Exercise 4) gives the training objective. Checking that training raises matched-pair similarity and lowers mismatched-pair similarity confirms that the contrastive objective builds the shared space.
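A minimal NumPy sketch of the loss. The function names, the fixed temperature of 0.07, and the synthetic embeddings are assumptions for illustration. Instead of running a training loop, it compares the loss on well-aligned pairs with the loss on unrelated pairs.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each input is a matched pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N), diagonal = matches
    image_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    text_to_image = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (image_to_text + text_to_image)

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
aligned_images = text_emb + 0.05 * rng.normal(size=text_emb.shape)
unrelated_images = rng.normal(size=text_emb.shape)

loss_aligned = clip_loss(aligned_images, text_emb)
loss_unrelated = clip_loss(unrelated_images, text_emb)
print(loss_aligned < loss_unrelated)
```

Because the loss is lower when matched pairs dominate their row and column of the similarity matrix, gradient descent on it moves encoders toward exactly that configuration.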

Exercise 14: Code
Use pretrained CLIP for zero-shot classification: score an image against text labels.

Solution

Embedding the candidate label strings and the image, then picking the label with the highest cosine similarity, classifies the image without any task-specific training. This is zero-shot classification, and it demonstrates the power of CLIP's shared image-text space (Exercise 4).
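The scoring step can be sketched as follows. To stay self-contained, the example does not load a pretrained model: the embeddings are random toy stand-ins for what CLIP's text and image encoders would return, and the function name and label prompts are hypothetical.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most cosine-similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = label_embs @ image_emb  # one cosine similarity per label
    return labels[int(np.argmax(sims))], sims

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Toy stand-ins: in practice these come from the pretrained text and image encoders.
rng = np.random.default_rng(0)
label_embs = rng.normal(size=(len(labels), 64))
image_emb = label_embs[1] + 0.1 * rng.normal(size=64)  # an image "near" the dog label

best_label, sims = zero_shot_classify(image_emb, label_embs, labels)
print(best_label)
```

With a real CLIP checkpoint only the two embedding steps change; the normalize, dot-product, and argmax logic stays the same.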

Exercise 15: Code
Implement the LLaVA projector (MLP) mapping vision features to the LLM's embedding dim; confirm shapes.

Solution

A small MLP projecting vision features to the LLM's hidden size (Exercise 6) produces vectors shaped like token embeddings; confirming the output dimension matches the LLM's embeddings verifies the projector correctly bridges the two spaces.
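A shape-checking sketch in NumPy. The two-layer design with a GELU in between, the class name, and the deliberately tiny dimensions are assumptions; real vision and LLM hidden sizes are much larger.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class Projector:
    """Two-layer MLP mapping vision features to the LLM's embedding dimension."""

    def __init__(self, vision_dim, llm_dim, rng):
        self.w1 = rng.normal(0, 0.02, (vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, feats):
        return gelu(feats @ self.w1 + self.b1) @ self.w2 + self.b2

rng = np.random.default_rng(0)
num_patches, vision_dim, llm_dim = 576, 64, 128   # toy sizes for illustration
patch_features = rng.normal(size=(num_patches, vision_dim))

projector = Projector(vision_dim, llm_dim, rng)
visual_tokens = projector(patch_features)
print(visual_tokens.shape)  # (576, 128)
```

The token count is unchanged; only the feature dimension moves from the encoder's width to the LLM's width, which is what lets the outputs sit alongside text-token embeddings.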

Exercise 16: Code Lab
Connect a frozen vision encoder to a small LLM via the projector; run a combined image+text sequence.

Solution

Projecting image features into token embeddings and concatenating them with text token embeddings, then running the combined sequence through the LLM, produces a model that attends jointly over image and text — the LLaVA forward pass (Exercise 6) assembled.
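The assembly step can be sketched like this. The LLM itself is omitted; the embedding table, the linear projector, the token ids, and all sizes are hypothetical stand-ins chosen to show how the combined sequence and its causal mask are built.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, llm_dim, vision_dim = 100, 32, 16

token_embedding = rng.normal(size=(vocab_size, llm_dim))  # stand-in LLM embedding table
w_proj = rng.normal(0, 0.02, (vision_dim, llm_dim))       # linear projector, for brevity

patch_features = rng.normal(size=(9, vision_dim))  # frozen-encoder output for 9 patches
text_ids = np.array([5, 17, 42, 8])                # hypothetical prompt token ids

image_tokens = patch_features @ w_proj             # (9, llm_dim)
text_tokens = token_embedding[text_ids]            # (4, llm_dim)

# Image tokens first, then text: the LLM receives one ordinary sequence of embeddings.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)

# Causal mask for a decoder-only LLM: position i may attend to positions <= i.
n = sequence.shape[0]
causal_mask = np.tril(np.ones((n, n), dtype=bool))
text_sees_all_image = bool(causal_mask[len(image_tokens):, :len(image_tokens)].all())

print(sequence.shape, text_sees_all_image)
```

Placing the image tokens before the text means every text position can attend to every visual token under causal masking.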

Exercise 17: Code Lab
Implement Stage-1 alignment: freeze encoder and LLM, train ONLY the projector on captions.

Solution

Training just the projector (everything else frozen) on image-caption pairs (Exercise 7) teaches the model to describe images — a cheap alignment stage that bridges vision to language without touching the expensive pretrained weights, demonstrating the two-stage recipe's efficiency.
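A toy sketch of the freezing pattern only, under heavy assumptions: the "encoder" is a fixed random layer, the caption targets are synthetic vectors standing in for embeddings in the frozen LLM's space, and a squared-error loss replaces the real captioning (next-token) loss. The point is that gradient updates touch the projector and nothing else.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pixel_dim, vision_dim, llm_dim = 64, 20, 12, 8

# Frozen "vision encoder": a fixed random layer (hypothetical stand-in).
w_encoder = rng.normal(size=(pixel_dim, vision_dim))
encoder_snapshot = w_encoder.copy()

images = rng.normal(size=(n, pixel_dim))
feats = np.tanh(images @ w_encoder)  # computed once, since the encoder never changes

# Synthetic stand-in for caption targets in the frozen LLM's embedding space.
targets = feats @ rng.normal(size=(vision_dim, llm_dim))

w_proj = np.zeros((vision_dim, llm_dim))  # the ONLY trainable parameter

def loss_fn(w):
    return float(np.mean((feats @ w - targets) ** 2))

initial_loss = loss_fn(w_proj)
lr = 0.5
for _ in range(300):
    grad = 2.0 * feats.T @ (feats @ w_proj - targets) / targets.size
    w_proj -= lr * grad  # update the projector; no other weights are touched
final_loss = loss_fn(w_proj)

print(initial_loss, final_loss)
```

Because the encoder is frozen, its features can be cached once and reused every step, which is a large part of why Stage 1 is cheap.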

Exercise 18: Code
Implement a resampler compressing N patch tokens into K query tokens; show token count is resolution-independent.

Solution

A cross-attention resampler with K learned queries attends over the N patch tokens, outputting exactly K tokens regardless of N (Exercise 8). Showing the LLM's visual-token count stays K as image resolution (and thus N) varies demonstrates how resamplers cap the image-token budget.
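A single-layer sketch in NumPy. The class name, K = 64 queries, the 32-dimensional features, and the random (untrained) weights are assumptions; practical resamplers stack several such layers with MLPs and normalization.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class Resampler:
    """One cross-attention layer: K queries attend over N patch tokens."""

    def __init__(self, num_queries, dim, rng):
        self.queries = rng.normal(0, 0.02, (num_queries, dim))  # learned in practice
        self.wq = rng.normal(0, 0.02, (dim, dim))
        self.wk = rng.normal(0, 0.02, (dim, dim))
        self.wv = rng.normal(0, 0.02, (dim, dim))

    def __call__(self, patch_tokens):
        q = self.queries @ self.wq                        # (K, dim)
        k = patch_tokens @ self.wk                        # (N, dim)
        v = patch_tokens @ self.wv                        # (N, dim)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (K, N)
        return attn @ v                                   # (K, dim), whatever N is

rng = np.random.default_rng(0)
resampler = Resampler(num_queries=64, dim=32, rng=rng)

shapes = {}
for side in (224, 336, 672):
    n_patches = (side // 14) ** 2
    patch_tokens = rng.normal(size=(n_patches, 32))  # stand-in encoder output
    shapes[n_patches] = resampler(patch_tokens).shape

print(shapes)
```

The output shape depends only on the number of queries, so the LLM's visual-token count stays at K while the attention matrix inside the resampler absorbs the growth in N.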

Exercise 19: Code
Build an audio tokenization pipeline: waveform → spectrogram → patchify → audio tokens.

Solution

Converting a waveform to a spectrogram and patchifying it like an image (Exercise 9) produces audio tokens for the LLM — demonstrating that, after the spectrogram front-end, audio reuses the entire vision pipeline, the modality-agnostic principle in action.
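A minimal NumPy sketch of that pipeline. The function names, the STFT parameters (256-sample Hann window, hop of 128), the 16 × 16 patch size, and the synthetic 440 Hz test tone are all assumptions; production systems often use log-mel spectrograms instead of a plain magnitude STFT.

```python
import numpy as np

def spectrogram(wave, n_fft=256, hop=128):
    """Magnitude STFT with a Hann window; returns (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def patchify_2d(spec, patch):
    """Crop to a multiple of the patch size, then flatten non-overlapping patches."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch
    x = spec[:f, :t].reshape(f // patch, patch, t // patch, patch)
    return x.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

sample_rate = 16000
time = np.arange(sample_rate) / sample_rate   # one second of audio
wave = np.sin(2 * np.pi * 440.0 * time)       # synthetic 440 Hz tone

spec = spectrogram(wave)                      # (129, 124)
audio_tokens = patchify_2d(spec, 16)
print(spec.shape, audio_tokens.shape)         # (129, 124) (56, 256)
```

From `audio_tokens` onward the steps match the image case: a linear projection to the model dimension, position embeddings, and Transformer blocks.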

Exercise 20: Code (Challenge)
Build a minimal end-to-end VLM: vision encoder + trainable projector + small LLM; Stage 1 then Stage 2; probe fine-detail failures.

Solution

Training the projector on captions (Stage 1) then instruction-tuning on image-QA (Stage 2) yields a working VLM that answers held-out visual questions. Probing fine-detail tasks (counting many objects, reading small text) reveals failures explained by patch tokenization and the token budget: small details fall below patch resolution or get averaged within a patch, and the limited visual-token count discards fine information — connecting the model's limits directly to Exercises 2 and 8.