Solutions Appendix
Chapter 8

Embeddings & Representation Learning

18 Solutions

Detailed solutions for the exercises in Chapter 8. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Show cosine between distinct one-hot vectors is 0 and their Euclidean distance is always √2.

Solution

Two distinct one-hot vectors have their single 1s in different positions, so their dot product is 0 → cosine 0 (orthogonal). Each has L2 norm 1, and their squared Euclidean distance is ‖1‖1‖² = 1 + 1 − 2·0 = 2, so the distance is √2 regardless of which words. This is the core defect of one-hot encoding: every pair of distinct words is equidistant and unrelated — no notion of similarity.

Exercise 2Pen & Paper
Compute the exact SGNS speedup over full softmax for V=10⁶, d=300, k=5.

Solution

Full softmax costs O(V·d) = 10⁶·300 per pair (normalizing over the whole vocabulary). SGNS computes only the positive plus k negatives: O((k+1)·d) = 6·300. The speedup is V/(k+1) = 10⁶/6 ≈ 166,667×. Negative sampling replaces a vocabulary-sized normalization with a handful of binary classifications — the trick that made word2vec scalable.

Exercise 3Pen & Paper
Derive the gradients of the SGNS objective w.r.t. v_w, u_c, and each u_n.

Solution

With J = logσ(u_c·v_w) + Σ_n logσ(−u_n·v_w):

∂J/∂v_w = (1−σ(u_c·v_w))u_c − Σ_n σ(u_n·v_w)u_n

∂J/∂u_c = (1−σ(u_c·v_w))·v_w and ∂J/∂u_n = −σ(u_n·v_w)·v_w. The positive context is pushed toward v_w with weight (1−σ) (its 'error'), and each negative is pushed away with weight σ — a contrastive pull-together/push-apart, the ancestor of modern contrastive learning.

Exercise 4Pen & Paper
Show SGNS implicitly factorizes M = PMI − log k. What role does k play?

Solution

Levy & Goldberg (2014) showed the optimal SGNS solution satisfies u_c·v_w = PMI(w,c) − log k. So SGNS implicitly factorizes the (shifted) pointwise-mutual-information matrix into the word and context embedding matrices. The number of negatives k sets a global downward shift of that matrix: larger k subtracts more, sparsifying which pairs end up with positive association — effectively a threshold on how much co-occurrence counts as 'related'.

Exercise 5Pen & Paper
Compare sampling probabilities of count-10⁶ vs count-10² words under exponents 1.0, 0.75, 0.

Solution

The probability ratio is (10⁶/10²)^α = 10^{4α}. At α=1.0 the ratio is 10⁴ (frequent words sampled 10,000× more); at α=0.75 it is 10³; at α=0 it is 1 (uniform — most equal). The 0.75 exponent is a compromise: it down-weights very frequent words relative to α=1 (so rare words get more representation as negatives) without going all the way to uniform, which empirically gives the best embeddings.

Exercise 6Pen & Paper
In GloVe, what happens to a pair with X_ij=0? Why is this a feature?

Solution

GloVe's weighted least-squares loss uses a weighting function f(X_ij) with f(0)=0, so pairs that never co-occur contribute zero to the loss — and the problematic log(X_ij)=log(0) is never evaluated. This is a feature, not a bug: most word pairs never co-occur, and forcing the model to fit those (uninformative) zeros would waste capacity and destabilize training. f(0)=0 cleanly skips them.

Exercise 7Pen & Paper
How many 3-to-6-grams does fastText extract from 'learning'? List the 3-grams.

Solution

With boundary markers the token is (10 characters). The character 3-grams are: — 8 of them. fastText also extracts all 4-, 5-, and 6-grams plus the full token , summing to roughly 8+7+6+5+1 = 27 subword features. Representing a word as the sum of its subword vectors is what gives fastText its morphology-awareness and out-of-vocabulary handling (Exercise 15).

Exercise 8Pen & Paper
If v(king)−v(man)+v(woman) is closest to v(king), how does analogy eval still report 'queen'?

Solution

Standard analogy evaluation excludes the three input words (king, man, woman) from the candidate answers. So even though the raw nearest neighbor of the computed vector is often king itself, it is masked out, and the next-nearest — queen — is returned. This reveals that the parallelogram structure is approximate: the analogy offset is small relative to the dominant 'king-ness', so the vector lands near king, and the gender shift only becomes decisive once the inputs are excluded.

Exercise 9Pen & Paper
Why does a small window put 'Hogwarts' near 'Sunnydale' but a large window near 'Dumbledore'/'wizard'?

Solution

Small context windows (±2) capture words that are mutually substitutable — 'Hogwarts' and 'Sunnydale' appear in the same syntactic slots (fictional school names), so they end up near each other: this is paradigmatic SIMILARITY. Large windows (±10) capture topical co-occurrence — 'Hogwarts' appears in passages with 'Dumbledore' and 'wizard', so they cluster: this is RELATEDNESS. Window size tunes the similarity–relatedness dial.

Exercise 10Pen & Paper
Polysemy superposition v=0.9v₁+0.1v₂ (orthonormal). Compute cos(v, v₂). Implication?

Solution

With v₁ ⊥ v₂ and unit norms, v·v₂ = 0.1 and |v| = √(0.9²+0.1²) = √0.82 ≈ 0.906, so cos(v, v₂) = 0.1/0.906 ≈ 0.110. The rare sense has very low cosine with the blended vector, so a single static embedding effectively buries low-frequency senses under the dominant one — which is exactly why contextual embeddings (BERT) were needed to separate senses (Exercise 18).

Exercise 11Code
Build a co-occurrence matrix from a public-domain book; LSA via truncated SVD (k=50); show neighbors.

Solution

Count word co-occurrences within a window, weight (e.g. PPMI), and take the top-50 left singular vectors as embeddings. Nearest neighbors by cosine reveal semantically related words (e.g. for 'sea': 'ocean', 'waves', 'ship'). LSA shows that even a simple count-matrix factorization recovers meaningful semantic geometry — the classical precursor to word2vec/GloVe.

Exercise 12Code
Implement Skip-gram with full softmax on a toy corpus; verify gradients via finite differences.

Solution

Implement the forward softmax over the small vocabulary and the analytic gradients from Exercise 3 (full-softmax version), then confirm them against the Chapter-5 finite-difference checker (agreement ~1e−5). This validates the embedding gradients before scaling up with negative sampling.

Exercise 13Code
Add negative sampling (k=5); plot per-epoch time vs vocabulary 100→5000.

Solution

With full softmax, epoch time grows linearly with vocabulary V (the normalization). With negative sampling it stays roughly flat (cost depends on k, not V). The two curves diverge sharply as V grows — the empirical demonstration of the ~V/(k+1) speedup derived in Exercise 2.

Exercise 14Code Lab
Implement GloVe from scratch on text8 (10MB); 50-d, 25 epochs; evaluate on 20 analogies.

Solution

Build the co-occurrence counts, then minimize the weighted least-squares objective Σ f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)² with AdaGrad. After 25 epochs the 50-d vectors solve a fair fraction of the 20 hand-picked analogies (capital-country, plural, gender), demonstrating the linear-structure property emerging from co-occurrence factorization.

Exercise 15Code
Use fastText to demonstrate OOV handling: query 10 unseen words including 2 typos.

Solution

Because fastText composes a word vector from its character n-grams (Exercise 7), it produces sensible vectors for words never seen in training and for typos (which share most n-grams with the correct spelling). Querying neighbors of unseen/misspelled words returns morphologically and semantically related words — something word2vec/GloVe (which have no vector for OOV tokens) cannot do.

Exercise 16Code Lab
Measure anisotropy of GloVe-100d; apply all-but-the-top with k∈{1,3,5,10}; Spearman ρ on WordSim-353.

Solution

Raw embeddings are anisotropic: random pairs have high mean cosine (vectors share a few dominant directions). 'All-but-the-top' removes the top k principal components, recentering the space; this typically improves the correlation with human similarity judgments (Spearman ρ on WordSim-353). A small k (around 2–3) usually gives the best ρ — removing the dominant nuisance directions sharpens semantic geometry.

Exercise 17Code
Visualize 60 words from 6 categories with PCA vs t-SNE; which preserves clusters better?

Solution

PCA (linear) preserves global structure but often overlaps categories in 2-D; t-SNE (nonlinear, neighborhood-preserving) produces tighter, well-separated clusters for the 6 categories. t-SNE usually shows clearer cluster structure, though it distorts global distances — the standard trade-off between PCA's faithfulness to global geometry and t-SNE's local cluster separation.

Exercise 18Code (Challenge)
Polysemy: BERT embeddings for two senses of 'bank'; k-means purity; nearest static GloVe neighbor.

Solution

Contextual BERT embeddings of the target word cluster cleanly by sense (river vs finance), so k-means (k=2) achieves high cluster purity — demonstrating that contextual models separate senses that static embeddings conflate. Averaging all 20 contextual vectors and finding its nearest static GloVe word usually lands on the dominant sense's vocabulary, echoing the superposition result of Exercise 10.