Embeddings & Representation Learning
Detailed solutions for the exercises in Chapter 8. Try solving them yourself before checking the answers.
Solution
Two distinct one-hot vectors have their single 1s in different positions, so their dot product is 0 → cosine 0 (orthogonal). Each has L2 norm 1, and their squared Euclidean distance is ‖1‖1‖² = 1 + 1 − 2·0 = 2, so the distance is √2 regardless of which words. This is the core defect of one-hot encoding: every pair of distinct words is equidistant and unrelated — no notion of similarity.
Solution
Full softmax costs O(V·d) = 10⁶·300 per pair (normalizing over the whole vocabulary). SGNS computes only the positive plus k negatives: O((k+1)·d) = 6·300. The speedup is V/(k+1) = 10⁶/6 ≈ 166,667×. Negative sampling replaces a vocabulary-sized normalization with a handful of binary classifications — the trick that made word2vec scalable.
Solution
With J = logσ(u_c·v_w) + Σ_n logσ(−u_n·v_w):
∂J/∂u_c = (1−σ(u_c·v_w))·v_w and ∂J/∂u_n = −σ(u_n·v_w)·v_w. The positive context is pushed toward v_w with weight (1−σ) (its 'error'), and each negative is pushed away with weight σ — a contrastive pull-together/push-apart, the ancestor of modern contrastive learning.
Solution
Levy & Goldberg (2014) showed the optimal SGNS solution satisfies u_c·v_w = PMI(w,c) − log k. So SGNS implicitly factorizes the (shifted) pointwise-mutual-information matrix into the word and context embedding matrices. The number of negatives k sets a global downward shift of that matrix: larger k subtracts more, sparsifying which pairs end up with positive association — effectively a threshold on how much co-occurrence counts as 'related'.
Solution
The probability ratio is (10⁶/10²)^α = 10^{4α}. At α=1.0 the ratio is 10⁴ (frequent words sampled 10,000× more); at α=0.75 it is 10³; at α=0 it is 1 (uniform — most equal). The 0.75 exponent is a compromise: it down-weights very frequent words relative to α=1 (so rare words get more representation as negatives) without going all the way to uniform, which empirically gives the best embeddings.
Solution
GloVe's weighted least-squares loss uses a weighting function f(X_ij) with f(0)=0, so pairs that never co-occur contribute zero to the loss — and the problematic log(X_ij)=log(0) is never evaluated. This is a feature, not a bug: most word pairs never co-occur, and forcing the model to fit those (uninformative) zeros would waste capacity and destabilize training. f(0)=0 cleanly skips them.
Solution
With boundary markers the token is
Solution
Standard analogy evaluation excludes the three input words (king, man, woman) from the candidate answers. So even though the raw nearest neighbor of the computed vector is often king itself, it is masked out, and the next-nearest — queen — is returned. This reveals that the parallelogram structure is approximate: the analogy offset is small relative to the dominant 'king-ness', so the vector lands near king, and the gender shift only becomes decisive once the inputs are excluded.
Solution
Small context windows (±2) capture words that are mutually substitutable — 'Hogwarts' and 'Sunnydale' appear in the same syntactic slots (fictional school names), so they end up near each other: this is paradigmatic SIMILARITY. Large windows (±10) capture topical co-occurrence — 'Hogwarts' appears in passages with 'Dumbledore' and 'wizard', so they cluster: this is RELATEDNESS. Window size tunes the similarity–relatedness dial.
Solution
With v₁ ⊥ v₂ and unit norms, v·v₂ = 0.1 and |v| = √(0.9²+0.1²) = √0.82 ≈ 0.906, so cos(v, v₂) = 0.1/0.906 ≈ 0.110. The rare sense has very low cosine with the blended vector, so a single static embedding effectively buries low-frequency senses under the dominant one — which is exactly why contextual embeddings (BERT) were needed to separate senses (Exercise 18).
Solution
Count word co-occurrences within a window, weight (e.g. PPMI), and take the top-50 left singular vectors as embeddings. Nearest neighbors by cosine reveal semantically related words (e.g. for 'sea': 'ocean', 'waves', 'ship'). LSA shows that even a simple count-matrix factorization recovers meaningful semantic geometry — the classical precursor to word2vec/GloVe.
Solution
Implement the forward softmax over the small vocabulary and the analytic gradients from Exercise 3 (full-softmax version), then confirm them against the Chapter-5 finite-difference checker (agreement ~1e−5). This validates the embedding gradients before scaling up with negative sampling.
Solution
With full softmax, epoch time grows linearly with vocabulary V (the normalization). With negative sampling it stays roughly flat (cost depends on k, not V). The two curves diverge sharply as V grows — the empirical demonstration of the ~V/(k+1) speedup derived in Exercise 2.
Solution
Build the co-occurrence counts, then minimize the weighted least-squares objective Σ f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)² with AdaGrad. After 25 epochs the 50-d vectors solve a fair fraction of the 20 hand-picked analogies (capital-country, plural, gender), demonstrating the linear-structure property emerging from co-occurrence factorization.
Solution
Because fastText composes a word vector from its character n-grams (Exercise 7), it produces sensible vectors for words never seen in training and for typos (which share most n-grams with the correct spelling). Querying neighbors of unseen/misspelled words returns morphologically and semantically related words — something word2vec/GloVe (which have no vector for OOV tokens) cannot do.
Solution
Raw embeddings are anisotropic: random pairs have high mean cosine (vectors share a few dominant directions). 'All-but-the-top' removes the top k principal components, recentering the space; this typically improves the correlation with human similarity judgments (Spearman ρ on WordSim-353). A small k (around 2–3) usually gives the best ρ — removing the dominant nuisance directions sharpens semantic geometry.
Solution
PCA (linear) preserves global structure but often overlaps categories in 2-D; t-SNE (nonlinear, neighborhood-preserving) produces tighter, well-separated clusters for the 6 categories. t-SNE usually shows clearer cluster structure, though it distorts global distances — the standard trade-off between PCA's faithfulness to global geometry and t-SNE's local cluster separation.
Solution
Contextual BERT embeddings of the target word cluster cleanly by sense (river vs finance), so k-means (k=2) achieves high cluster purity — demonstrating that contextual models separate senses that static embeddings conflate. Averaging all 20 contextual vectors and finding its nearest static GloVe word usually lands on the dominant sense's vocabulary, echoing the superposition result of Exercise 10.