Part II: Classical ML & Representations
Chapter 8

Sequence Models & RNNs

Recurrent networks, LSTMs, and sequence-to-sequence
18 Exercises
8.1

How should a computer represent the word “cat”? The obvious answer — a one-hot vector with a 1 in position 8,432 of a 50,000-dimensional vector — turns out to be nearly useless. This section explains why, and motivates the dense representations that power all of modern NLP.

The Problem with One-Hot Vectors

One-hot encodingDense embedding
Dimension = vocabulary size (50k–200k)Dimension = 50–1024 (chosen freely)
Every pair of words is orthogonalSimilar words are close in space
cos(cat, kitten) = 0 = cos(cat, carburettor)cos(cat, kitten) ≈ 0.8 >> cos(cat, carburettor)
Sparse: 1 non-zero entryDense: every dimension carries information
No generalisation between wordsStatistical strength shared across similar words
Parameters grow with vocabularyParameters fixed by embedding dim

The fatal flaw of one-hot vectors: every pair of distinct words has dot product zero. The encoding contains no information about meaning whatsoever. A model trained on “the cat sat on the mat” learns nothing that transfers to “the kitten sat on the rug”, even though the sentences are nearly synonymous.

PythonOne-hot vs. dense — the generalisation gap
import numpy as np

# One-hot: every word pair is orthogonal
V = 50000                       # vocabulary size
cat     = np.zeros(V); cat[8432]     = 1
kitten  = np.zeros(V); kitten[8433]  = 1
carb    = np.zeros(V); carb[31007]   = 1

print(cat @ kitten)   # 0.0 -- 'cat' and 'kitten' unrelated?!
print(cat @ carb)     # 0.0 -- same similarity as 'carburettor'

# Dense embeddings (here: pre-trained GloVe-style vectors, illustrative)
emb = {
  'cat':    np.array([0.61, 0.21, -0.43, 0.18]),
  'kitten': np.array([0.58, 0.25, -0.39, 0.22]),
  'carb':   np.array([-0.31, 0.74, 0.12, -0.55]),
}

def cos(a, b): return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"cos(cat, kitten) = {cos(emb['cat'], emb['kitten']):.3f}")
print(f"cos(cat, carb)   = {cos(emb['cat'], emb['carb']):.3f}")
# cos(cat, kitten) = 0.995   <- semantically similar = geometrically close
# cos(cat, carb)   = -0.468  <- unrelated = far apart
Distributed Representations
In a one-hot encoding, each word owns one dimension exclusively — a 'local' representation. In a dense embedding, meaning is distributed across all dimensions: no single dimension means anything alone, but together they place each word at a meaningful point in space.
This is the same principle behind every hidden layer in a neural network: concepts are encoded as directions and regions in a continuous space, not as individual units. Geoff Hinton called this the most important idea in deep learning.
8.2

Every embedding method in this chapter — and every language model since — rests on a single linguistic insight: words that occur in similar contexts have similar meanings. “Cat” and “kitten” both appear near “purr”, “fur”, “meow”, and “vet”; that contextual overlap is what the embedding captures.

History: From Firth to GPT
J.R. Firth articulated the distributional hypothesis in 1957. Zellig Harris formalised it in 1954. It lay mostly dormant in NLP until Latent Semantic Analysis (1990) operationalised it with SVD on term-document matrices. Word2Vec (2013) made it scale. GPT (2018–present) is its ultimate realisation: predicting words from context at internet scale.
Every step of this lineage uses the same supervision signal — free, unlimited, self-generated from raw text. No human labels required. This is why self-supervised learning won.

Co-occurrence Counts: The Raw Signal

The simplest way to operationalise the hypothesis: build a co-occurrence matrix X where X[i][j] counts how many times word j appears within a window of w words around word i, summed over a corpus.

PythonBuilding a co-occurrence matrix
import numpy as np
from collections import defaultdict, Counter

corpus = """the cat sat on the mat . the dog sat on the rug .
            the cat chased the dog . the kitten purred softly .""".split()

def build_cooccurrence(tokens, window=2):
    vocab  = sorted(set(tokens))
    w2i    = {w: i for i, w in enumerate(vocab)}
    V      = len(vocab)
    X      = np.zeros((V, V))
    for i, word in enumerate(tokens):
        for j in range(max(0, i-window), min(len(tokens), i+window+1)):
            if i != j:
                X[w2i[word], w2i[tokens[j]]] += 1
    return X, vocab, w2i

X, vocab, w2i = build_cooccurrence(corpus, window=2)

# 'cat' and 'dog' co-occur with similar words -> similar rows
def cos(a, b): return a@b / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-9)
print(f"cos(cat, dog)    = {cos(X[w2i['cat']], X[w2i['dog']]):.3f}")
print(f"cos(cat, softly) = {cos(X[w2i['cat']], X[w2i['softly']]):.3f}")
# cos(cat, dog)    = 0.667  -- both are 'sat on / chased' animals
# cos(cat, softly) = 0.000  -- no shared context in this tiny corpus

From Counts to Embeddings: Three Strategies

StrategyMethodRepresentative
Count + factoriseBuild co-occurrence matrix, reduce with SVDLSA (1990), HAL (1996)
Predict contextTrain a classifier to predict context wordsWord2Vec (2013), fastText (2016)
Count + weighted fitFit log co-occurrence ratios directlyGloVe (2014)

The remarkable result (Levy & Goldberg, 2014): these three families are mathematically connected. Skip-gram with negative sampling implicitly factorises a shifted PMI (pointwise mutual information) matrix. Counting and predicting converge on the same geometry.

8.3

Word2Vec (Mikolov et al., 2013) reframes embedding learning as a prediction task: train a shallow network to predict context words from a target word (Skip-gram) or a target word from its context (CBOW). The embeddings are the learned weights — the prediction task itself is discarded after training.

Two Architectures

Skip-gram: predict context from wordCBOW: predict word from context
Input: centre word wInput: average of context words
Output: each context word c in windowOutput: centre word w
One training pair per (w, c) combinationOne training pair per centre position
Better for rare words (more updates each)Faster training (fewer updates)
Better on semantic analogy tasksBetter on syntactic tasks, small data
Default choice in practiceUsed when speed matters most

The Skip-gram Objective

Given a corpus of T tokens, maximise the log-probability of context words within a window of size m around each position:

J(θ) = (1/T) Σₜ Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t)
P(c | w) = exp(u_c · v_w) / Σ_{c'∈V} exp(u_{c'} · v_w) ← softmax over the whole vocabulary!

Note the two embedding matrices: v (input/centre embeddings) and u (output/context embeddings). Each word has two vectors during training; typically only v is kept (or v and u are averaged).

PythonSkip-gram with full softmax — from scratch (small vocab)
import numpy as np

class SkipGram:
    def __init__(self, V, dim=50, lr=0.05, seed=0):
        rng    = np.random.default_rng(seed)
        self.V_in  = rng.normal(0, 0.1, (V, dim))  # centre embeddings v_w
        self.V_out = rng.normal(0, 0.1, (V, dim))  # context embeddings u_c
        self.lr = lr

    def step(self, centre, context):  # one (w, c) training pair
        v = self.V_in[centre]              # (dim,)
        scores = self.V_out @ v            # (V,) logits over vocab
        scores -= scores.max()             # stable softmax
        p = np.exp(scores); p /= p.sum()   # (V,) probabilities

        # Gradient of cross-entropy: p - onehot(context)
        grad = p.copy(); grad[context] -= 1.0

        # Backprop into both embedding matrices
        dV_in  = self.V_out.T @ grad       # (dim,)
        dV_out = np.outer(grad, v)         # (V, dim)
        self.V_in[centre] -= self.lr * dV_in
        self.V_out        -= self.lr * dV_out
        return -np.log(p[context] + 1e-12)  # loss for monitoring

# Generate training pairs from corpus
def make_pairs(tokens, w2i, window=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i-window), min(len(tokens), i+window+1)):
            if i != j: pairs.append((w2i[w], w2i[tokens[j]]))
    return pairs

pairs = make_pairs(corpus, w2i)
sg    = SkipGram(V=len(vocab), dim=20)
for epoch in range(200):
    loss = np.mean([sg.step(w, c) for w, c in pairs])
    if epoch % 50 == 0: print(f"epoch {epoch}: loss {loss:.3f}")
# epoch 0:   loss 2.973
# epoch 150: loss 1.213   -- 'cat' and 'dog' now have cos ≈ 0.7
⚠️
Pitfall: The Softmax Bottleneck
The denominator of the softmax sums over the entire vocabulary — every training pair requires O(V·dim) work. With V = 1 million and billions of training pairs, full-softmax Word2Vec would take months on 2013 hardware.
This is the single bottleneck that negative sampling (next section) eliminates, reducing per-pair cost from O(V) to O(k) where k ≈ 5–20 negative samples.
8.4

Negative sampling (Mikolov et al., 2013b) sidesteps the softmax entirely. Instead of asking “which word in V is the context?”, it asks a binary question: “is this (word, context) pair real or random?” For each true pair, draw k random “negative” contexts and train a logistic classifier to distinguish them.

The SGNS Objective

J = log σ(u_c · v_w) + Σᵢ₌₁ᵏ E_{cᵢ~Pₙ} [ log σ(−u_{cᵢ} · v_w) ]

The first term pushes the true pair's dot product up; the k negative terms push random pairs' dot products down. The noise distribution Pₙ(w) ∝ count(w)^{3/4} — the 3/4 power down-weights very frequent words like “the” while still sampling them more than rare words.

textSkip-gram with Negative Sampling (SGNS) (Pseudocode)
for each (centre w, context c) pair in corpus:

  # Positive update: this pair is real
  g  ←  σ(u_c · v_w) − 1        # gradient of −log σ(score)
  u_c ← u_c − lr · g · v_w
  acc ← g · u_c                # accumulate gradient for v_w

  # Negative updates: k random fake pairs
  for i = 1…k:
    n  ←  sample from Pₙ(w) ∝ count(w)^0.75
    g  ←  σ(u_n · v_w)           # want score → −∞, so σ → 0
    u_n ← u_n − lr · g · v_w
    acc ← acc + g · u_n

  v_w ← v_w − lr · acc          # single update for the centre word
PythonSGNS from scratch — the production Word2Vec algorithm
import numpy as np
from collections import Counter

class SGNS:
    def __init__(self, tokens, dim=100, k=5, lr=0.025, seed=0):
        self.vocab = sorted(set(tokens))
        self.w2i   = {w:i for i,w in enumerate(self.vocab)}
        V          = len(self.vocab)
        rng        = np.random.default_rng(seed)
        self.W_in  = rng.uniform(-0.5/dim, 0.5/dim, (V, dim))
        self.W_out = np.zeros((V, dim))  # standard word2vec init
        self.k, self.lr = k, lr

        # Unigram^0.75 noise distribution
        counts = Counter(tokens)
        freqs  = np.array([counts[w] for w in self.vocab], dtype=float)
        self.noise = freqs**0.75; self.noise /= self.noise.sum()
        self.rng = rng

    def _sigmoid(self, x): return 1 / (1 + np.exp(-np.clip(x, -20, 20)))

    def step(self, centre, context):
        v   = self.W_in[centre]
        acc = np.zeros_like(v)

        # --- Positive pair ---
        u = self.W_out[context]
        g = self._sigmoid(u @ v) - 1.0       # ∂/∂score of -log σ(score)
        acc += g * u
        self.W_out[context] -= self.lr * g * v

        # --- k negative samples ---
        negs = self.rng.choice(len(self.vocab), self.k, p=self.noise)
        for n in negs:
            if n == context: continue
            u = self.W_out[n]
            g = self._sigmoid(u @ v)           # want σ → 0 for fakes
            acc += g * u
            self.W_out[n] -= self.lr * g * v

        self.W_in[centre] -= self.lr * acc

    def most_similar(self, word, topn=5):
        v = self.W_in[self.w2i[word]]
        sims = self.W_in @ v / (np.linalg.norm(self.W_in, axis=1) * np.linalg.norm(v) + 1e-9)
        best = np.argsort(-sims)[1:topn+1]  # skip self
        return [(self.vocab[i], sims[i]) for i in best]

# Cost per pair: O(k·dim) instead of O(V·dim).
# For V=1M, k=5: a 200,000x speedup. This is what made Word2Vec feasible.

Frequent Word Subsampling

A second crucial trick: very frequent words (“the”, “of”, “and”) provide little signal but dominate the training pairs. Word2Vec discards each occurrence of word w with probability:

P(discard w) = 1 − √(t / f(w)) with t ≈ 10⁻⁵ and f(w) the word frequency

This both speeds up training and improves embedding quality — rare-word vectors get relatively more updates, and context windows effectively widen (discarded words don't block more distant content words).

ML Connection: Negative Sampling Everywhere
The negative-sampling pattern — contrast a true pair against random fakes — reappears throughout modern ML: InfoNCE in contrastive learning (Chapter 4), CLIP's image-text matching, two-tower retrieval models, and the candidate sampling used to train recommendation systems.
Even RLHF's reward model training is a cousin: it contrasts a preferred response against a rejected one using the same logistic objective.
8.5

Word2Vec streams through the corpus pair by pair, never aggregating global statistics. GloVe (Pennington, Socher & Manning, 2014) takes the opposite approach: first build the full co-occurrence matrix X, then fit embeddings so that dot products reproduce the log co-occurrence counts.

The Key Insight: Ratios of Co-occurrence Probabilities

Consider words i = “ice” and j = “steam”. The ratio P(k|ice)/P(k|steam) discriminates meaning: for k = “solid” the ratio is large; for k = “gas” it is small; for k = “water” (related to both) or k = “fashion” (related to neither), the ratio is near 1. GloVe constructs embeddings whose differences encode these ratios.

The GloVe Objective

J = Σᵢⱼ f(Xᵢⱼ) · ( wᵢ · w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ )²

Each term is a weighted squared error: the dot product of word vector wᵢ and context vector w̃ⱼ (plus biases) should equal the log co-occurrence count. The weighting function f caps the influence of extremely frequent pairs:

f(x) = (x / x_max)^α if x < x_max, else 1 (typically α = 0.75, x_max = 100)
PythonGloVe from scratch
import numpy as np

class GloVe:
    def __init__(self, X, dim=50, x_max=100, alpha=0.75, lr=0.05, seed=0):
        V = X.shape[0]
        rng = np.random.default_rng(seed)
        # Two embedding matrices + two bias vectors
        self.W  = rng.normal(0, 0.1, (V, dim))   # word vectors
        self.Wc = rng.normal(0, 0.1, (V, dim))   # context vectors
        self.b  = np.zeros(V); self.bc = np.zeros(V)
        self.X, self.x_max, self.alpha, self.lr = X, x_max, alpha, lr
        # Pre-compute the non-zero co-occurrence entries
        self.nz = np.argwhere(X > 0)

    def _weight(self, x):
        return np.minimum((x / self.x_max)**self.alpha, 1.0)

    def train_epoch(self):
        total = 0.0
        np.random.shuffle(self.nz)
        for i, j in self.nz:
            xij  = self.X[i, j]
            # Prediction error: w_i·w̃_j + b_i + b̃_j − log X_ij
            diff = self.W[i] @ self.Wc[j] + self.b[i] + self.bc[j] - np.log(xij)
            fw   = self._weight(xij)
            total += fw * diff**2

            # Gradients (weighted squared loss)
            g = 2 * fw * diff
            gW, gWc = g * self.Wc[j], g * self.W[i]
            self.W[i]  -= self.lr * gW
            self.Wc[j] -= self.lr * gWc
            self.b[i]  -= self.lr * g
            self.bc[j] -= self.lr * g
        return total / len(self.nz)

    @property
    def embeddings(self):
        return self.W + self.Wc   # GloVe paper: sum the two matrices

glove = GloVe(X, dim=20)
for epoch in range(100):
    loss = glove.train_epoch()
    if epoch % 25 == 0: print(f"epoch {epoch}: weighted loss {loss:.4f}")
# epoch 0:  weighted loss 1.8042
# epoch 75: weighted loss 0.0031   -- dot products now ≈ log co-occurrences

Word2Vec vs. GloVe

Word2Vec (SGNS)GloVe
Streaming: one pair at a timeBatch: pre-computed co-occurrence matrix
Implicit matrix factorisation (shifted PMI)Explicit weighted factorisation of log X
Memory: O(V·dim) — corpus streamedMemory: O(nnz(X)) — matrix must fit
Local context onlyGlobal corpus statistics
Stochastic; sensitive to pair orderDeterministic given X (up to init)
Slightly better on analogy tasksSlightly better on similarity tasks

In practice, both produce embeddings of comparable quality on most benchmarks. The Levy & Goldberg (2015) systematic study found that hyperparameters (window size, subsampling, vector dimension) matter more than the choice between SGNS and GloVe.

8.6

Word2Vec and GloVe share a blind spot: each word is an atomic unit. “Run”, “running”, and “runner” get independent vectors that share no parameters — the model cannot exploit their obvious morphological relationship. Worse, any word not seen in training has no vector at all (the OOV problem).

fastText (Bojanowski et al., 2016) fixes both problems with one idea: represent each word as the sum of its character n-gram embeddings.

The Subword Decomposition

The word “where” with n-grams of length 3–6, including boundary markers < and >:

PythonCharacter n-gram decomposition
def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style subword decomposition with boundary markers."""
    w = '<' + word + '>'
    grams = []
    for n in range(n_min, min(n_max, len(w)) + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i+n])
    grams.append(w)        # the full word is also a 'gram'
    return grams

print(char_ngrams('where', 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>',          <- 3-grams
#  '<whe', 'wher', 'here', 'ere>',             <- 4-grams
#  '<where>']                                   <- full word

# Word vector = sum of n-gram vectors:
#   v(where) = z(<wh) + z(whe) + z(her) + ... + z(<where>)
# 'her' inside 'where' shares the SAME z(her) used in 'her', 'here', 'there'

Why Subwords Win

ProblemWord2Vec / GloVefastText
OOV wordsNo vector — fail or use <UNK>Sum the n-gram vectors → usable embedding
Rare wordsPoor vectors (few updates)Share n-grams with frequent words
Morphology (run/running)Independent vectorsShared subwords link related forms
Typos (recieve)No vectorMostly-overlapping n-grams → close to 'receive'
Agglutinative languagesVocabulary explosionSubwords keep vocab manageable
PythonUsing fastText — OOV handling demo
# pip install fasttext  (or gensim's FastText)
from gensim.models import FastText

# Train on a small corpus (in practice: Wikipedia, Common Crawl)
sentences = [s.split() for s in [
    'the cat sat on the mat',
    'the kitten purred softly',
    'dogs and cats are running in the park',
]]

ft = FastText(sentences, vector_size=50, window=3, min_count=1,
              min_n=3, max_n=6, epochs=100)

# OOV word: 'catlike' was NEVER in training data
v_oov = ft.wv['catlike']  # works! built from <ca, cat, atl, ... n-grams
print(ft.wv.similarity('catlike', 'cat'))
# 0.71  -- shares 'cat' n-grams, lands near 'cat' in embedding space

# Typo robustness
print(ft.wv.similarity('runing', 'running'))
# 0.89  -- overlapping n-grams make typos land near correct spellings
ML Connection: From fastText to BPE Tokenization
fastText's subword idea evolved into the tokenization schemes of modern LLMs. Byte-Pair Encoding (BPE, used by GPT) and WordPiece (BERT) learn a vocabulary of subword units from data, so that rare words decompose into known pieces: 'tokenization' → ['token', 'ization'].
The key difference: fastText sums n-gram vectors into one word vector; BPE tokenizers feed each subword as a separate token to the Transformer, which composes them contextually. Chapter 14 covers tokenization in depth.
8.7

The most celebrated property of word embeddings: vector arithmetic captures semantic relationships. The canonical example — v(king) − v(man) + v(woman) ≈ v(queen) — suggests that the difference vector v(king) − v(man) encodes a reusable “royalty minus maleness” direction.

The Analogy Mechanism

answer = argmax_{w ∉ {a,b,c}} cos( v_w , v_b − v_a + v_c ) a:b :: c:?
PythonAnalogies with pre-trained GloVe vectors
import numpy as np
import gensim.downloader as api

# Load 100-d GloVe trained on Wikipedia + Gigaword (~130k vocab)
glv = api.load('glove-wiki-gigaword-100')

# --- The classic ---
print(glv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
# [('queen', 0.770), ('monarch', 0.684), ('throne', 0.683)]

# --- Syntactic analogies work too ---
print(glv.most_similar(positive=['walking', 'swam'], negative=['walked'], topn=1))
# [('swimming', 0.789)]   walked:walking :: swam:swimming

# --- Geographic ---
print(glv.most_similar(positive=['paris', 'italy'], negative=['france'], topn=1))
# [('rome', 0.842)]   france:paris :: italy:rome

# --- Nearest neighbours reveal local structure ---
print(glv.most_similar('transformer', topn=5))
# [('transformers',...), ('amplifier',...), ('voltage',...)]
# 2014 vectors: 'transformer' is electrical, not neural. Embeddings freeze
# the corpus's snapshot of meaning -- a key limitation.
⚠️
The Analogy Caveat
The standard analogy evaluation excludes the input words a, b, c from candidates. Without that exclusion, the nearest vector to v(king) − v(man) + v(woman) is usually v(king) itself! The celebrated parallelogram structure is real but weaker than the demos suggest (Linzen 2016; Nissim et al. 2020).
Embedding arithmetic also surfaces corpus biases: v(doctor) − v(man) + v(woman) famously returned 'nurse' in early embeddings — the geometry faithfully encodes the statistics of the training text, including its stereotypes. Debiasing methods exist but only mask the measured direction (Gonen & Goldberg, 2019).

Anisotropy: The Narrow Cone Problem

Trained embeddings are not spread uniformly over the sphere — they concentrate in a narrow cone, making all cosine similarities positive and inflated. Post-processing helps: mean-centering plus removing the top principal components (the “all-but-the-top” method, Mu & Viswanath 2018) measurably improves similarity benchmarks.

PythonMeasuring and fixing anisotropy
import numpy as np

E = glv.vectors                       # (V, 100) all embeddings

# Average pairwise cosine of random word pairs (should be ~0 if isotropic)
rng   = np.random.default_rng(0)
idx   = rng.choice(len(E), (2000, 2))
def cos_rows(A, B):
    return np.sum(A*B, axis=1) / (np.linalg.norm(A,axis=1)*np.linalg.norm(B,axis=1) + 1e-9)
avg_cos = cos_rows(E[idx[:,0]], E[idx[:,1]]).mean()
print(f"Mean random-pair cosine (raw):       {avg_cos:.3f}")
# 0.318  -- far from 0: vectors live in a narrow cone

# All-but-the-top: centre, then remove top-k principal components
E_c   = E - E.mean(axis=0)
U, S, Vt = np.linalg.svd(E_c, full_matrices=False)
k     = 3                                  # remove top 3 PCs
E_iso = E_c - (E_c @ Vt[:k].T) @ Vt[:k]
avg_cos_iso = cos_rows(E_iso[idx[:,0]], E_iso[idx[:,1]]).mean()
print(f"Mean random-pair cosine (corrected): {avg_cos_iso:.3f}")
# 0.011  -- near-isotropic; similarity scores are now meaningful
8.8

How do you know if one set of embeddings is better than another? Two evaluation families exist, and they frequently disagree — a fact with important practical consequences.

TypeBenchmarkWhat it measures
IntrinsicWordSim-353, SimLex-999Correlation of cos(w₁,w₂) with human similarity ratings
IntrinsicGoogle analogy set (19,544 questions)Accuracy of a:b :: c:? vector arithmetic
IntrinsicMEN, RareWord (RW)Similarity on frequent / rare word pairs
ExtrinsicText classification (e.g., SST-2)Accuracy when embeddings feed a downstream classifier
ExtrinsicNamed-entity recognition (CoNLL)F1 with embeddings as input features
ExtrinsicMachine translation (BLEU)Quality when used to initialise encoder embeddings
PythonIntrinsic evaluation: word similarity correlation
import numpy as np
from scipy.stats import spearmanr

# SimLex-999 style: (word1, word2, human_score 0-10)
pairs = [
    ('old', 'new', 1.58),    # antonyms: related but NOT similar
    ('smart', 'intelligent', 9.20),
    ('hard', 'difficult', 8.77),
    ('happy', 'cheerful', 9.55),
    ('coast', 'shore', 9.00),
    ('movie', 'popcorn', 2.50),  # associated, not similar
]

model_sims = [glv.similarity(a, b) for a, b, _ in pairs]
human_sims = [h for _, _, h in pairs]

rho, p = spearmanr(model_sims, human_sims)
print(f"Spearman ρ = {rho:.3f}  (p={p:.3f})")
# Spearman ρ = 0.829
# Note: GloVe scores 'old'/'new' as quite similar (they share contexts!)
# but humans rate them dissimilar. Distributional similarity conflates
# similarity with relatedness -- a fundamental limitation.
⚠️
Pitfall: Intrinsic ≠ Extrinsic
Embeddings that win on analogy benchmarks do not reliably win on downstream tasks (Faruqui et al., 2016). Window size illustrates why: small windows (2) capture syntactic/functional similarity; large windows (10+) capture topical relatedness. Which is 'better' depends entirely on the downstream task.
Rule of practice: choose embeddings by validating on YOUR task, not by leaderboard position on intrinsic benchmarks.
8.9

Every method in this chapter assigns each word exactly one vector. But “bank” means a financial institution, a river edge, a turning manoeuvre, and a pool shot. A single static vector must average these senses into one point — a point that may be close to none of them.

The Superposition of Senses

Arora et al. (2018) showed that a polysemous word's static vector is approximately a frequency-weighted linear combination of its sense vectors:

v(bank) ≈ α₁·v(bank_finance) + α₂·v(bank_river) + … with αᵢ ∝ sense frequency

Remarkably, the individual senses can be partially recovered from the combined vector using sparse coding — evidence that the information is superimposed, not destroyed. But for any specific sentence, the static vector still cannot tell you which sense is active.

PythonDemonstrating the polysemy problem
# Static embeddings: one vector regardless of context
print(glv.most_similar('bank', topn=6))
# [('banks',.79), ('banking',.74), ('credit',.69),
#  ('financial',.68), ('lending',.67), ('lender',.66)]
# The financial sense dominates -- river banks are invisible.

# Contextual embeddings (BERT): different vector per occurrence
import torch
from transformers import AutoTokenizer, AutoModel

tok   = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased').eval()

def word_vec_in_context(sentence, word):
    """Return BERT's contextual embedding of `word` inside `sentence`."""
    enc = tok(sentence, return_tensors='pt')
    with torch.no_grad():
        h = model(**enc).last_hidden_state[0]  # (T, 768)
    w_id = tok.convert_tokens_to_ids(word)
    pos  = (enc.input_ids[0] == w_id).nonzero()[0, 0]
    return h[pos]

v_fin   = word_vec_in_context('I deposited cash at the bank', 'bank')
v_river = word_vec_in_context('We fished from the river bank', 'bank')
v_fin2  = word_vec_in_context('The bank approved my loan', 'bank')

cos = lambda a, b: torch.cosine_similarity(a, b, dim=0).item()
print(f"finance vs finance: {cos(v_fin, v_fin2):.3f}")
print(f"finance vs river:   {cos(v_fin, v_river):.3f}")
# finance vs finance: 0.85   <- same sense, similar vectors
# finance vs river:   0.46   <- different senses, BERT separates them

The Conceptual Bridge to Part III

Contextual embeddings are not a different idea — they are the same distributional hypothesis, applied dynamically. A Transformer computes a fresh embedding for every token occurrence, conditioned on its entire surrounding sentence, by repeatedly mixing each token's vector with its context through attention. The static embedding table of Word2Vec becomes merely the input layer; the real representation emerges through the network.

EraMethodRepresentation
1990LSAOne vector per word; SVD of term-document counts
2013Word2VecOne vector per word; learned by context prediction
2016fastTextOne vector per word; composed from subword n-grams
2018ELMoOne vector per occurrence; bidirectional LSTM states
2018BERT / GPTOne vector per occurrence per layer; Transformer attention
8.10

Method Quick-Reference

MethodObjectiveStrengthWeakness
Word2Vec SGNSBinary: real pair vs k noise pairsFast, scalable, strong analogiesOOV; one vector per word
GloVeWeighted fit of wᵢ·w̃ⱼ ≈ log XᵢⱼGlobal statistics; deterministicMatrix memory; OOV
fastTextSGNS over char n-gram sumsOOV + morphology + typosSlightly blurrier vectors
LSATruncated SVD of countsSimple; interpretable factorsLinear; weaker quality
BERT (preview)Masked LM over TransformersContextual; sense-awareHeavy; needs Part III

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

Exercise 1: Pen & Paper
Show that the cosine similarity between any two distinct one-hot vectors is exactly 0, and that their Euclidean distance is always √2 regardless of which words they encode.
Exercise 2: Pen & Paper
The Skip-gram softmax requires O(V·d) per training pair. With V = 10⁶, d = 300, k = 5 negatives, compute the exact speedup factor of SGNS over the full softmax.
Exercise 3: Pen & Paper
Derive the gradient of the SGNS objective J = log σ(u_c·v_w) + Σ log σ(−u_n·v_w) with respect to v_w, u_c, and each u_n.
Exercise 4: Pen & Paper
Pointwise mutual information is PMI(w,c) = log [P(w,c)/(P(w)P(c))]. Show that SGNS implicitly factorises the matrix M = PMI − log k (Levy & Goldberg, 2014). What role does k play in the shift?
Exercise 5: Pen & Paper
The noise distribution uses count^0.75. Compare the sampling probability of a word with count 10⁶ vs. count 10² under exponents 1.0, 0.75, and 0. Which exponent treats them most equally?
Exercise 6: Pen & Paper
In the GloVe objective, what happens to a word pair with Xᵢⱼ = 0? How does the loss handle it, and why is this a feature rather than a bug?
Exercise 7: Pen & Paper
How many character 3-to-6-grams (including boundaries and the full token) does fastText extract from the word 'learning'? List the 3-grams explicitly.
Exercise 8: Pen & Paper
If v(king) − v(man) + v(woman) is closest to v(king) itself, what does the standard analogy evaluation do to still report 'queen'? Discuss what this implies about the parallelogram structure.
Exercise 9: Pen & Paper
Explain why a small context window (±2) produces embeddings where 'Hogwarts' is near 'Sunnydale' (other fictional schools), while a large window (±10) places it near 'Dumbledore' and 'wizard'. Which is similarity and which is relatedness?
Exercise 10: Pen & Paper
A word has two senses with corpus frequencies 90% and 10%. Using the superposition model v = 0.9·v₁ + 0.1·v₂ with ‖v₁‖=‖v₂‖=1 and v₁ ⊥ v₂, compute cos(v, v₂). What does this say about retrieving rare senses?
Exercise 11: Code
Build a co-occurrence matrix from any public-domain book (e.g., Project Gutenberg). Apply truncated SVD (LSA) with k = 50. Report the 5 nearest neighbours of 3 content words of your choice.
Exercise 12: Code
Implement Skip-gram with full softmax on a toy corpus (vocab ≤ 200). Verify the gradient with the finite-difference checker from Chapter 5.
Exercise 13: Code
Extend exercise 12 with negative sampling (k=5). Plot wall-clock time per epoch for both versions as vocabulary grows from 100 to 5,000 words.
Exercise 14: Code Lab
Implement GloVe from scratch on text8 (first 10MB). Train 50-d vectors for 25 epochs. Evaluate on 20 hand-picked analogy questions and report accuracy.
Exercise 15: Code
Use gensim FastText to demonstrate OOV handling: train on a small corpus, then query 10 words NOT in the corpus (including 2 deliberate typos). Report nearest neighbours for each.
Exercise 16: Code Lab
Download GloVe-100d via gensim. Measure anisotropy (mean random-pair cosine), apply all-but-the-top with k ∈ {1, 3, 5, 10}, and report Spearman ρ on WordSim-353 for each k. Which k is best?
Exercise 17: Code
Visualise embedding geometry: select 60 words from 6 categories (animals, countries, verbs, colours, professions, numbers), reduce GloVe vectors to 2-D with PCA and t-SNE, and plot both. Which method preserves cluster structure better?
Exercise 18: Code (Challenge)
Reproduce the polysemy experiment: collect 10 sentences each for two senses of 'bank' (or 'bass', 'spring'). Extract BERT embeddings for the target word in every sentence. Cluster with k-means (k=2) and report cluster purity. Then average all 20 vectors and find its nearest static GloVe neighbour — which sense wins?

Further reading: “Efficient Estimation of Word Representations in Vector Space” (Mikolov et al., 2013) and “Distributed Representations of Words and Phrases” (Mikolov et al., 2013b) — the original Word2Vec pair. “GloVe: Global Vectors for Word Representation” (Pennington et al., 2014). “Enriching Word Vectors with Subword Information” (Bojanowski et al., 2016) for fastText. “Neural Word Embedding as Implicit Matrix Factorization” (Levy & Goldberg, 2014) for the unifying theory. Jurafsky & Martin, “Speech and Language Processing” (3rd ed.) Chapter 6 for a textbook treatment.


Next → Chapter 9: Sequence Models: RNNs & LSTMs

Static embeddings give each word a fixed point in space, ignoring order entirely — “dog bites man” and “man bites dog” have identical bags of vectors. Chapter 9 introduces recurrent networks that read sequences token by token, maintaining a hidden state that accumulates context. We will meet the vanishing gradient problem head-on, see how LSTM gating solves it, and build the seq2seq-with-attention architecture whose limitations directly motivated the Transformer.

18 Exercises in this chapter
Attempt each exercise before checking the worked solutions.
View Solutions →