Part II: Classical ML & Representations

Chapter 8

Embeddings & Representation Learning

Word2Vec, GloVe, and fastText — learning dense vector spaces where geometry encodes meaning.

18 Exercises

Learning Objectives

1.	Explain why one-hot encodings fail and how dense distributed representations fix them.
2.	State the distributional hypothesis and connect it to every embedding method in this chapter.
3.	Derive the Skip-gram and CBOW objectives of Word2Vec and implement both from scratch.
4.	Understand negative sampling: why the full softmax is intractable and how NCE approximates it.
5.	Derive the GloVe objective from co-occurrence statistics and compare it to Word2Vec.
6.	Explain how fastText's subword n-grams handle rare and out-of-vocabulary words.
7.	Navigate embedding space geometry: cosine similarity, analogies, and their limits.
8.	Evaluate embeddings intrinsically (similarity benchmarks) and extrinsically (downstream tasks).

How should a computer represent the word “cat”? The obvious answer — a one-hot vector with a 1 in position 8,432 of a 50,000-dimensional vector — turns out to be nearly useless. This section explains why, and motivates the dense representations that power all of modern NLP.

The Problem with One-Hot Vectors

One-hot encoding	Dense embedding
Dimension = vocabulary size (50k–200k)	Dimension = 50–1024 (chosen freely)
Every pair of words is orthogonal	Similar words are close in space
cos(cat, kitten) = 0 = cos(cat, carburettor)	cos(cat, kitten) ≈ 0.8 >> cos(cat, carburettor)
Sparse: 1 non-zero entry	Dense: every dimension carries information
No generalisation between words	Statistical strength shared across similar words
Parameters grow with vocabulary	Parameters fixed by embedding dim

The fatal flaw of one-hot vectors: every pair of distinct words has dot product zero. The encoding contains no information about meaning whatsoever. A model trained on “the cat sat on the mat” learns nothing that transfers to “the kitten sat on the rug”, even though the sentences are nearly synonymous.

Python•One-hot vs. dense — the generalisation gap
import numpy as np

# One-hot: every word pair is orthogonal
V = 50000                       # vocabulary size
cat     = np.zeros(V); cat[8432]     = 1
kitten  = np.zeros(V); kitten[8433]  = 1
carb    = np.zeros(V); carb[31007]   = 1

print(cat @ kitten)   # 0.0 -- 'cat' and 'kitten' unrelated?!
print(cat @ carb)     # 0.0 -- same similarity as 'carburettor'

# Dense embeddings (here: pre-trained GloVe-style vectors, illustrative)
emb = {
  'cat':    np.array([0.61, 0.21, -0.43, 0.18]),
  'kitten': np.array([0.58, 0.25, -0.39, 0.22]),
  'carb':   np.array([-0.31, 0.74, 0.12, -0.55]),
}

def cos(a, b): return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"cos(cat, kitten) = {cos(emb['cat'], emb['kitten']):.3f}")
print(f"cos(cat, carb)   = {cos(emb['cat'], emb['carb']):.3f}")
# cos(cat, kitten) = 0.995   <- semantically similar = geometrically close
# cos(cat, carb)   = -0.468  <- unrelated = far apart

✧

Distributed Representations

In a one-hot encoding, each word owns one dimension exclusively — a 'local' representation. In a dense embedding, meaning is distributed across all dimensions: no single dimension means anything alone, but together they place each word at a meaningful point in space.

This is the same principle behind every hidden layer in a neural network: concepts are encoded as directions and regions in a continuous space, not as individual units. Geoff Hinton called this the most important idea in deep learning.

Every embedding method in this chapter — and every language model since — rests on a single linguistic insight: words that occur in similar contexts have similar meanings. “Cat” and “kitten” both appear near “purr”, “fur”, “meow”, and “vet”; that contextual overlap is what the embedding captures.

✧

History: From Firth to GPT

J.R. Firth articulated the distributional hypothesis in 1957. Zellig Harris formalised it in 1954. It lay mostly dormant in NLP until Latent Semantic Analysis (1990) operationalised it with SVD on term-document matrices. Word2Vec (2013) made it scale. GPT (2018–present) is its ultimate realisation: predicting words from context at internet scale.

Every step of this lineage uses the same supervision signal — free, unlimited, self-generated from raw text. No human labels required. This is why self-supervised learning won.

Co-occurrence Counts: The Raw Signal

The simplest way to operationalise the hypothesis: build a co-occurrence matrix X where X[i][j] counts how many times word j appears within a window of w words around word i, summed over a corpus.

Python•Building a co-occurrence matrix
import numpy as np
from collections import defaultdict, Counter

corpus = """the cat sat on the mat . the dog sat on the rug .
            the cat chased the dog . the kitten purred softly .""".split()

def build_cooccurrence(tokens, window=2):
    vocab  = sorted(set(tokens))
    w2i    = {w: i for i, w in enumerate(vocab)}
    V      = len(vocab)
    X      = np.zeros((V, V))
    for i, word in enumerate(tokens):
        for j in range(max(0, i-window), min(len(tokens), i+window+1)):
            if i != j:
                X[w2i[word], w2i[tokens[j]]] += 1
    return X, vocab, w2i

X, vocab, w2i = build_cooccurrence(corpus, window=2)

# 'cat' and 'dog' co-occur with similar words -> similar rows
def cos(a, b): return a@b / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-9)
print(f"cos(cat, dog)    = {cos(X[w2i['cat']], X[w2i['dog']]):.3f}")
print(f"cos(cat, softly) = {cos(X[w2i['cat']], X[w2i['softly']]):.3f}")
# cos(cat, dog)    = 0.667  -- both are 'sat on / chased' animals
# cos(cat, softly) = 0.000  -- no shared context in this tiny corpus

From Counts to Embeddings: Three Strategies

Strategy	Method	Representative
Count + factorise	Build co-occurrence matrix, reduce with SVD	LSA (1990), HAL (1996)
Predict context	Train a classifier to predict context words	Word2Vec (2013), fastText (2016)
Count + weighted fit	Fit log co-occurrence ratios directly	GloVe (2014)

The remarkable result (Levy & Goldberg, 2014): these three families are mathematically connected. Skip-gram with negative sampling implicitly factorises a shifted PMI (pointwise mutual information) matrix. Counting and predicting converge on the same geometry.

Word2Vec (Mikolov et al., 2013) reframes embedding learning as a prediction task: train a shallow network to predict context words from a target word (Skip-gram) or a target word from its context (CBOW). The embeddings are the learned weights — the prediction task itself is discarded after training.

Two Architectures

Skip-gram: predict context from word	CBOW: predict word from context
Input: centre word w	Input: average of context words
Output: each context word c in window	Output: centre word w
One training pair per (w, c) combination	One training pair per centre position
Better for rare words (more updates each)	Faster training (fewer updates)
Better on semantic analogy tasks	Better on syntactic tasks, small data
Default choice in practice	Used when speed matters most

The Skip-gram Objective

Given a corpus of T tokens, maximise the log-probability of context words within a window of size m around each position:

J(θ) = (1/T) Σₜ Σ_{-m≤j≤m, j≠0} log P(w_{t+j} | w_t)

P(c | w) = exp(u_c · v_w) / Σ_{c'∈V} exp(u_{c'} · v_w) ← softmax over the whole vocabulary!

Note the two embedding matrices: v (input/centre embeddings) and u (output/context embeddings). Each word has two vectors during training; typically only v is kept (or v and u are averaged).

Python•Skip-gram with full softmax — from scratch (small vocab)
import numpy as np

class SkipGram:
    def __init__(self, V, dim=50, lr=0.05, seed=0):
        rng    = np.random.default_rng(seed)
        self.V_in  = rng.normal(0, 0.1, (V, dim))  # centre embeddings v_w
        self.V_out = rng.normal(0, 0.1, (V, dim))  # context embeddings u_c
        self.lr = lr

    def step(self, centre, context):  # one (w, c) training pair
        v = self.V_in[centre]              # (dim,)
        scores = self.V_out @ v            # (V,) logits over vocab
        scores -= scores.max()             # stable softmax
        p = np.exp(scores); p /= p.sum()   # (V,) probabilities

        # Gradient of cross-entropy: p - onehot(context)
        grad = p.copy(); grad[context] -= 1.0

        # Backprop into both embedding matrices
        dV_in  = self.V_out.T @ grad       # (dim,)
        dV_out = np.outer(grad, v)         # (V, dim)
        self.V_in[centre] -= self.lr * dV_in
        self.V_out        -= self.lr * dV_out
        return -np.log(p[context] + 1e-12)  # loss for monitoring

# Generate training pairs from corpus
def make_pairs(tokens, w2i, window=2):
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i-window), min(len(tokens), i+window+1)):
            if i != j: pairs.append((w2i[w], w2i[tokens[j]]))
    return pairs

pairs = make_pairs(corpus, w2i)
sg    = SkipGram(V=len(vocab), dim=20)
for epoch in range(200):
    loss = np.mean([sg.step(w, c) for w, c in pairs])
    if epoch % 50 == 0: print(f"epoch {epoch}: loss {loss:.3f}")
# epoch 0:   loss 2.973
# epoch 150: loss 1.213   -- 'cat' and 'dog' now have cos ≈ 0.7

⚠️

Pitfall: The Softmax Bottleneck

The denominator of the softmax sums over the entire vocabulary — every training pair requires O(V·dim) work. With V = 1 million and billions of training pairs, full-softmax Word2Vec would take months on 2013 hardware.

This is the single bottleneck that negative sampling (next section) eliminates, reducing per-pair cost from O(V) to O(k) where k ≈ 5–20 negative samples.

Negative sampling (Mikolov et al., 2013b) sidesteps the softmax entirely. Instead of asking “which word in V is the context?”, it asks a binary question: “is this (word, context) pair real or random?” For each true pair, draw k random “negative” contexts and train a logistic classifier to distinguish them.

The SGNS Objective

J = log σ(u_c · v_w) + Σᵢ₌₁ᵏ E_{cᵢ~Pₙ} [ log σ(−u_{cᵢ} · v_w) ]

The first term pushes the true pair's dot product up; the k negative terms push random pairs' dot products down. The noise distribution Pₙ(w) ∝ count(w)^{3/4} — the 3/4 power down-weights very frequent words like “the” while still sampling them more than rare words.

text•Skip-gram with Negative Sampling (SGNS) (Pseudocode)
for each (centre w, context c) pair in corpus:

  # Positive update: this pair is real
  g  ←  σ(u_c · v_w) − 1        # gradient of −log σ(score)
  u_c ← u_c − lr · g · v_w
  acc ← g · u_c                # accumulate gradient for v_w

  # Negative updates: k random fake pairs
  for i = 1…k:
    n  ←  sample from Pₙ(w) ∝ count(w)^0.75
    g  ←  σ(u_n · v_w)           # want score → −∞, so σ → 0
    u_n ← u_n − lr · g · v_w
    acc ← acc + g · u_n

  v_w ← v_w − lr · acc          # single update for the centre word

Python•SGNS from scratch — the production Word2Vec algorithm
import numpy as np
from collections import Counter

class SGNS:
    def __init__(self, tokens, dim=100, k=5, lr=0.025, seed=0):
        self.vocab = sorted(set(tokens))
        self.w2i   = {w:i for i,w in enumerate(self.vocab)}
        V          = len(self.vocab)
        rng        = np.random.default_rng(seed)
        self.W_in  = rng.uniform(-0.5/dim, 0.5/dim, (V, dim))
        self.W_out = np.zeros((V, dim))  # standard word2vec init
        self.k, self.lr = k, lr

        # Unigram^0.75 noise distribution
        counts = Counter(tokens)
        freqs  = np.array([counts[w] for w in self.vocab], dtype=float)
        self.noise = freqs**0.75; self.noise /= self.noise.sum()
        self.rng = rng

    def _sigmoid(self, x): return 1 / (1 + np.exp(-np.clip(x, -20, 20)))

    def step(self, centre, context):
        v   = self.W_in[centre]
        acc = np.zeros_like(v)

        # --- Positive pair ---
        u = self.W_out[context]
        g = self._sigmoid(u @ v) - 1.0       # ∂/∂score of -log σ(score)
        acc += g * u
        self.W_out[context] -= self.lr * g * v

        # --- k negative samples ---
        negs = self.rng.choice(len(self.vocab), self.k, p=self.noise)
        for n in negs:
            if n == context: continue
            u = self.W_out[n]
            g = self._sigmoid(u @ v)           # want σ → 0 for fakes
            acc += g * u
            self.W_out[n] -= self.lr * g * v

        self.W_in[centre] -= self.lr * acc

    def most_similar(self, word, topn=5):
        v = self.W_in[self.w2i[word]]
        sims = self.W_in @ v / (np.linalg.norm(self.W_in, axis=1) * np.linalg.norm(v) + 1e-9)
        best = np.argsort(-sims)[1:topn+1]  # skip self
        return [(self.vocab[i], sims[i]) for i in best]

# Cost per pair: O(k·dim) instead of O(V·dim).
# For V=1M, k=5: a 200,000x speedup. This is what made Word2Vec feasible.

Frequent Word Subsampling

A second crucial trick: very frequent words (“the”, “of”, “and”) provide little signal but dominate the training pairs. Word2Vec discards each occurrence of word w with probability:

P(discard w) = 1 − √(t / f(w)) with t ≈ 10⁻⁵ and f(w) the word frequency

This both speeds up training and improves embedding quality — rare-word vectors get relatively more updates, and context windows effectively widen (discarded words don't block more distant content words).

▶

ML Connection: Negative Sampling Everywhere

The negative-sampling pattern — contrast a true pair against random fakes — reappears throughout modern ML: InfoNCE in contrastive learning (Chapter 4), CLIP's image-text matching, two-tower retrieval models, and the candidate sampling used to train recommendation systems.

Even RLHF's reward model training is a cousin: it contrasts a preferred response against a rejected one using the same logistic objective.

Word2Vec streams through the corpus pair by pair, never aggregating global statistics. GloVe (Pennington, Socher & Manning, 2014) takes the opposite approach: first build the full co-occurrence matrix X, then fit embeddings so that dot products reproduce the log co-occurrence counts.

The Key Insight: Ratios of Co-occurrence Probabilities

Consider words i = “ice” and j = “steam”. The ratio P(k|ice)/P(k|steam) discriminates meaning: for k = “solid” the ratio is large; for k = “gas” it is small; for k = “water” (related to both) or k = “fashion” (related to neither), the ratio is near 1. GloVe constructs embeddings whose differences encode these ratios.

The GloVe Objective

J = Σᵢⱼ f(Xᵢⱼ) · ( wᵢ · w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ )²

Each term is a weighted squared error: the dot product of word vector wᵢ and context vector w̃ⱼ (plus biases) should equal the log co-occurrence count. The weighting function f caps the influence of extremely frequent pairs:

f(x) = (x / x_max)^α if x < x_max, else 1 (typically α = 0.75, x_max = 100)

Python•GloVe from scratch
import numpy as np

class GloVe:
    def __init__(self, X, dim=50, x_max=100, alpha=0.75, lr=0.05, seed=0):
        V = X.shape[0]
        rng = np.random.default_rng(seed)
        # Two embedding matrices + two bias vectors
        self.W  = rng.normal(0, 0.1, (V, dim))   # word vectors
        self.Wc = rng.normal(0, 0.1, (V, dim))   # context vectors
        self.b  = np.zeros(V); self.bc = np.zeros(V)
        self.X, self.x_max, self.alpha, self.lr = X, x_max, alpha, lr
        # Pre-compute the non-zero co-occurrence entries
        self.nz = np.argwhere(X > 0)

    def _weight(self, x):
        return np.minimum((x / self.x_max)**self.alpha, 1.0)

    def train_epoch(self):
        total = 0.0
        np.random.shuffle(self.nz)
        for i, j in self.nz:
            xij  = self.X[i, j]
            # Prediction error: w_i·w̃_j + b_i + b̃_j − log X_ij
            diff = self.W[i] @ self.Wc[j] + self.b[i] + self.bc[j] - np.log(xij)
            fw   = self._weight(xij)
            total += fw * diff**2

            # Gradients (weighted squared loss)
            g = 2 * fw * diff
            gW, gWc = g * self.Wc[j], g * self.W[i]
            self.W[i]  -= self.lr * gW
            self.Wc[j] -= self.lr * gWc
            self.b[i]  -= self.lr * g
            self.bc[j] -= self.lr * g
        return total / len(self.nz)

    @property
    def embeddings(self):
        return self.W + self.Wc   # GloVe paper: sum the two matrices

glove = GloVe(X, dim=20)
for epoch in range(100):
    loss = glove.train_epoch()
    if epoch % 25 == 0: print(f"epoch {epoch}: weighted loss {loss:.4f}")
# epoch 0:  weighted loss 1.8042
# epoch 75: weighted loss 0.0031   -- dot products now ≈ log co-occurrences

Word2Vec vs. GloVe

Word2Vec (SGNS)	GloVe
Streaming: one pair at a time	Batch: pre-computed co-occurrence matrix
Implicit matrix factorisation (shifted PMI)	Explicit weighted factorisation of log X
Memory: O(V·dim) — corpus streamed	Memory: O(nnz(X)) — matrix must fit
Local context only	Global corpus statistics
Stochastic; sensitive to pair order	Deterministic given X (up to init)
Slightly better on analogy tasks	Slightly better on similarity tasks

In practice, both produce embeddings of comparable quality on most benchmarks. The Levy & Goldberg (2015) systematic study found that hyperparameters (window size, subsampling, vector dimension) matter more than the choice between SGNS and GloVe.

Word2Vec and GloVe share a blind spot: each word is an atomic unit. “Run”, “running”, and “runner” get independent vectors that share no parameters — the model cannot exploit their obvious morphological relationship. Worse, any word not seen in training has no vector at all (the OOV problem).

fastText (Bojanowski et al., 2016) fixes both problems with one idea: represent each word as the sum of its character n-gram embeddings.

The Subword Decomposition

The word “where” with n-grams of length 3–6, including boundary markers < and >:

Python•Character n-gram decomposition
def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style subword decomposition with boundary markers."""
    w = '<' + word + '>'
    grams = []
    for n in range(n_min, min(n_max, len(w)) + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i+n])
    grams.append(w)        # the full word is also a 'gram'
    return grams

print(char_ngrams('where', 3, 4))
# ['<wh', 'whe', 'her', 'ere', 're>',          <- 3-grams
#  '<whe', 'wher', 'here', 'ere>',             <- 4-grams
#  '<where>']                                   <- full word

# Word vector = sum of n-gram vectors:
#   v(where) = z(<wh) + z(whe) + z(her) + ... + z(<where>)
# 'her' inside 'where' shares the SAME z(her) used in 'her', 'here', 'there'

Why Subwords Win

Problem	Word2Vec / GloVe	fastText
OOV words	No vector — fail or use <UNK>	Sum the n-gram vectors → usable embedding
Rare words	Poor vectors (few updates)	Share n-grams with frequent words
Morphology (run/running)	Independent vectors	Shared subwords link related forms
Typos (recieve)	No vector	Mostly-overlapping n-grams → close to 'receive'
Agglutinative languages	Vocabulary explosion	Subwords keep vocab manageable

Python•Using fastText — OOV handling demo
# pip install fasttext  (or gensim's FastText)
from gensim.models import FastText

# Train on a small corpus (in practice: Wikipedia, Common Crawl)
sentences = [s.split() for s in [
    'the cat sat on the mat',
    'the kitten purred softly',
    'dogs and cats are running in the park',
]]

ft = FastText(sentences, vector_size=50, window=3, min_count=1,
              min_n=3, max_n=6, epochs=100)

# OOV word: 'catlike' was NEVER in training data
v_oov = ft.wv['catlike']  # works! built from <ca, cat, atl, ... n-grams
print(ft.wv.similarity('catlike', 'cat'))
# 0.71  -- shares 'cat' n-grams, lands near 'cat' in embedding space

# Typo robustness
print(ft.wv.similarity('runing', 'running'))
# 0.89  -- overlapping n-grams make typos land near correct spellings

▶

ML Connection: From fastText to BPE Tokenization

fastText's subword idea evolved into the tokenization schemes of modern LLMs. Byte-Pair Encoding (BPE, used by GPT) and WordPiece (BERT) learn a vocabulary of subword units from data, so that rare words decompose into known pieces: 'tokenization' → ['token', 'ization'].

The key difference: fastText sums n-gram vectors into one word vector; BPE tokenizers feed each subword as a separate token to the Transformer, which composes them contextually. Chapter 14 covers tokenization in depth.

The most celebrated property of word embeddings: vector arithmetic captures semantic relationships. The canonical example — v(king) − v(man) + v(woman) ≈ v(queen) — suggests that the difference vector v(king) − v(man) encodes a reusable “royalty minus maleness” direction.

The Analogy Mechanism

answer = argmax_{w ∉ {a,b,c}} cos( v_w , v_b − v_a + v_c ) a:b :: c:?

Python•Analogies with pre-trained GloVe vectors
import numpy as np
import gensim.downloader as api

# Load 100-d GloVe trained on Wikipedia + Gigaword (~130k vocab)
glv = api.load('glove-wiki-gigaword-100')

# --- The classic ---
print(glv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
# [('queen', 0.770), ('monarch', 0.684), ('throne', 0.683)]

# --- Syntactic analogies work too ---
print(glv.most_similar(positive=['walking', 'swam'], negative=['walked'], topn=1))
# [('swimming', 0.789)]   walked:walking :: swam:swimming

# --- Geographic ---
print(glv.most_similar(positive=['paris', 'italy'], negative=['france'], topn=1))
# [('rome', 0.842)]   france:paris :: italy:rome

# --- Nearest neighbours reveal local structure ---
print(glv.most_similar('transformer', topn=5))
# [('transformers',...), ('amplifier',...), ('voltage',...)]
# 2014 vectors: 'transformer' is electrical, not neural. Embeddings freeze
# the corpus's snapshot of meaning -- a key limitation.

⚠️

The Analogy Caveat

The standard analogy evaluation excludes the input words a, b, c from candidates. Without that exclusion, the nearest vector to v(king) − v(man) + v(woman) is usually v(king) itself! The celebrated parallelogram structure is real but weaker than the demos suggest (Linzen 2016; Nissim et al. 2020).

Embedding arithmetic also surfaces corpus biases: v(doctor) − v(man) + v(woman) famously returned 'nurse' in early embeddings — the geometry faithfully encodes the statistics of the training text, including its stereotypes. Debiasing methods exist but only mask the measured direction (Gonen & Goldberg, 2019).

Anisotropy: The Narrow Cone Problem

Trained embeddings are not spread uniformly over the sphere — they concentrate in a narrow cone, making all cosine similarities positive and inflated. Post-processing helps: mean-centering plus removing the top principal components (the “all-but-the-top” method, Mu & Viswanath 2018) measurably improves similarity benchmarks.

Python•Measuring and fixing anisotropy
import numpy as np

E = glv.vectors                       # (V, 100) all embeddings

# Average pairwise cosine of random word pairs (should be ~0 if isotropic)
rng   = np.random.default_rng(0)
idx   = rng.choice(len(E), (2000, 2))
def cos_rows(A, B):
    return np.sum(A*B, axis=1) / (np.linalg.norm(A,axis=1)*np.linalg.norm(B,axis=1) + 1e-9)
avg_cos = cos_rows(E[idx[:,0]], E[idx[:,1]]).mean()
print(f"Mean random-pair cosine (raw):       {avg_cos:.3f}")
# 0.318  -- far from 0: vectors live in a narrow cone

# All-but-the-top: centre, then remove top-k principal components
E_c   = E - E.mean(axis=0)
U, S, Vt = np.linalg.svd(E_c, full_matrices=False)
k     = 3                                  # remove top 3 PCs
E_iso = E_c - (E_c @ Vt[:k].T) @ Vt[:k]
avg_cos_iso = cos_rows(E_iso[idx[:,0]], E_iso[idx[:,1]]).mean()
print(f"Mean random-pair cosine (corrected): {avg_cos_iso:.3f}")
# 0.011  -- near-isotropic; similarity scores are now meaningful

How do you know if one set of embeddings is better than another? Two evaluation families exist, and they frequently disagree — a fact with important practical consequences.

Type	Benchmark	What it measures
Intrinsic	WordSim-353, SimLex-999	Correlation of cos(w₁,w₂) with human similarity ratings
Intrinsic	Google analogy set (19,544 questions)	Accuracy of a:b :: c:? vector arithmetic
Intrinsic	MEN, RareWord (RW)	Similarity on frequent / rare word pairs
Extrinsic	Text classification (e.g., SST-2)	Accuracy when embeddings feed a downstream classifier
Extrinsic	Named-entity recognition (CoNLL)	F1 with embeddings as input features
Extrinsic	Machine translation (BLEU)	Quality when used to initialise encoder embeddings

Python•Intrinsic evaluation: word similarity correlation
import numpy as np
from scipy.stats import spearmanr

# SimLex-999 style: (word1, word2, human_score 0-10)
pairs = [
    ('old', 'new', 1.58),    # antonyms: related but NOT similar
    ('smart', 'intelligent', 9.20),
    ('hard', 'difficult', 8.77),
    ('happy', 'cheerful', 9.55),
    ('coast', 'shore', 9.00),
    ('movie', 'popcorn', 2.50),  # associated, not similar
]

model_sims = [glv.similarity(a, b) for a, b, _ in pairs]
human_sims = [h for _, _, h in pairs]

rho, p = spearmanr(model_sims, human_sims)
print(f"Spearman ρ = {rho:.3f}  (p={p:.3f})")
# Spearman ρ = 0.829
# Note: GloVe scores 'old'/'new' as quite similar (they share contexts!)
# but humans rate them dissimilar. Distributional similarity conflates
# similarity with relatedness -- a fundamental limitation.

⚠️

Pitfall: Intrinsic ≠ Extrinsic

Embeddings that win on analogy benchmarks do not reliably win on downstream tasks (Faruqui et al., 2016). Window size illustrates why: small windows (2) capture syntactic/functional similarity; large windows (10+) capture topical relatedness. Which is 'better' depends entirely on the downstream task.

Rule of practice: choose embeddings by validating on YOUR task, not by leaderboard position on intrinsic benchmarks.

Every method in this chapter assigns each word exactly one vector. But “bank” means a financial institution, a river edge, a turning manoeuvre, and a pool shot. A single static vector must average these senses into one point — a point that may be close to none of them.

The Superposition of Senses

Arora et al. (2018) showed that a polysemous word's static vector is approximately a frequency-weighted linear combination of its sense vectors:

v(bank) ≈ α₁·v(bank_finance) + α₂·v(bank_river) + … with αᵢ ∝ sense frequency

Remarkably, the individual senses can be partially recovered from the combined vector using sparse coding — evidence that the information is superimposed, not destroyed. But for any specific sentence, the static vector still cannot tell you which sense is active.

Python•Demonstrating the polysemy problem
# Static embeddings: one vector regardless of context
print(glv.most_similar('bank', topn=6))
# [('banks',.79), ('banking',.74), ('credit',.69),
#  ('financial',.68), ('lending',.67), ('lender',.66)]
# The financial sense dominates -- river banks are invisible.

# Contextual embeddings (BERT): different vector per occurrence
import torch
from transformers import AutoTokenizer, AutoModel

tok   = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased').eval()

def word_vec_in_context(sentence, word):
    """Return BERT's contextual embedding of `word` inside `sentence`."""
    enc = tok(sentence, return_tensors='pt')
    with torch.no_grad():
        h = model(**enc).last_hidden_state[0]  # (T, 768)
    w_id = tok.convert_tokens_to_ids(word)
    pos  = (enc.input_ids[0] == w_id).nonzero()[0, 0]
    return h[pos]

v_fin   = word_vec_in_context('I deposited cash at the bank', 'bank')
v_river = word_vec_in_context('We fished from the river bank', 'bank')
v_fin2  = word_vec_in_context('The bank approved my loan', 'bank')

cos = lambda a, b: torch.cosine_similarity(a, b, dim=0).item()
print(f"finance vs finance: {cos(v_fin, v_fin2):.3f}")
print(f"finance vs river:   {cos(v_fin, v_river):.3f}")
# finance vs finance: 0.85   <- same sense, similar vectors
# finance vs river:   0.46   <- different senses, BERT separates them

The Conceptual Bridge to Part III

Contextual embeddings are not a different idea — they are the same distributional hypothesis, applied dynamically. A Transformer computes a fresh embedding for every token occurrence, conditioned on its entire surrounding sentence, by repeatedly mixing each token's vector with its context through attention. The static embedding table of Word2Vec becomes merely the input layer; the real representation emerges through the network.

Era	Method	Representation
1990	LSA	One vector per word; SVD of term-document counts
2013	Word2Vec	One vector per word; learned by context prediction
2016	fastText	One vector per word; composed from subword n-grams
2018	ELMo	One vector per occurrence; bidirectional LSTM states
2018	BERT / GPT	One vector per occurrence per layer; Transformer attention

Method Quick-Reference

Method	Objective	Strength	Weakness
Word2Vec SGNS	Binary: real pair vs k noise pairs	Fast, scalable, strong analogies	OOV; one vector per word
GloVe	Weighted fit of wᵢ·w̃ⱼ ≈ log Xᵢⱼ	Global statistics; deterministic	Matrix memory; OOV
fastText	SGNS over char n-gram sums	OOV + morphology + typos	Slightly blurrier vectors
LSA	Truncated SVD of counts	Simple; interpretable factors	Linear; weaker quality
BERT (preview)	Masked LM over Transformers	Contextual; sense-aware	Heavy; needs Part III

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

✎

Exercise 1: Pen & Paper

Show that the cosine similarity between any two distinct one-hot vectors is exactly 0, and that their Euclidean distance is always √2 regardless of which words they encode.

✎

Exercise 2: Pen & Paper

The Skip-gram softmax requires O(V·d) per training pair. With V = 10⁶, d = 300, k = 5 negatives, compute the exact speedup factor of SGNS over the full softmax.

✎

Exercise 3: Pen & Paper

Derive the gradient of the SGNS objective J = log σ(u_c·v_w) + Σ log σ(−u_n·v_w) with respect to v_w, u_c, and each u_n.

✎

Exercise 4: Pen & Paper

Pointwise mutual information is PMI(w,c) = log [P(w,c)/(P(w)P(c))]. Show that SGNS implicitly factorises the matrix M = PMI − log k (Levy & Goldberg, 2014). What role does k play in the shift?

✎

Exercise 5: Pen & Paper

The noise distribution uses count^0.75. Compare the sampling probability of a word with count 10⁶ vs. count 10² under exponents 1.0, 0.75, and 0. Which exponent treats them most equally?

✎

Exercise 6: Pen & Paper

In the GloVe objective, what happens to a word pair with Xᵢⱼ = 0? How does the loss handle it, and why is this a feature rather than a bug?

✎

Exercise 7: Pen & Paper

How many character 3-to-6-grams (including boundaries and the full token) does fastText extract from the word 'learning'? List the 3-grams explicitly.

✎

Exercise 8: Pen & Paper

If v(king) − v(man) + v(woman) is closest to v(king) itself, what does the standard analogy evaluation do to still report 'queen'? Discuss what this implies about the parallelogram structure.

✎

Exercise 9: Pen & Paper

Explain why a small context window (±2) produces embeddings where 'Hogwarts' is near 'Sunnydale' (other fictional schools), while a large window (±10) places it near 'Dumbledore' and 'wizard'. Which is similarity and which is relatedness?

✎

Exercise 10: Pen & Paper

A word has two senses with corpus frequencies 90% and 10%. Using the superposition model v = 0.9·v₁ + 0.1·v₂ with ‖v₁‖=‖v₂‖=1 and v₁ ⊥ v₂, compute cos(v, v₂). What does this say about retrieving rare senses?

✎

Exercise 11: Code

Build a co-occurrence matrix from any public-domain book (e.g., Project Gutenberg). Apply truncated SVD (LSA) with k = 50. Report the 5 nearest neighbours of 3 content words of your choice.

✎

Exercise 12: Code

Implement Skip-gram with full softmax on a toy corpus (vocab ≤ 200). Verify the gradient with the finite-difference checker from Chapter 5.

✎

Exercise 13: Code

Extend exercise 12 with negative sampling (k=5). Plot wall-clock time per epoch for both versions as vocabulary grows from 100 to 5,000 words.

✎

Exercise 14: Code Lab

Implement GloVe from scratch on text8 (first 10MB). Train 50-d vectors for 25 epochs. Evaluate on 20 hand-picked analogy questions and report accuracy.

✎

Exercise 15: Code

Use gensim FastText to demonstrate OOV handling: train on a small corpus, then query 10 words NOT in the corpus (including 2 deliberate typos). Report nearest neighbours for each.

✎

Exercise 16: Code Lab

Download GloVe-100d via gensim. Measure anisotropy (mean random-pair cosine), apply all-but-the-top with k ∈ {1, 3, 5, 10}, and report Spearman ρ on WordSim-353 for each k. Which k is best?

✎

Exercise 17: Code

Visualise embedding geometry: select 60 words from 6 categories (animals, countries, verbs, colours, professions, numbers), reduce GloVe vectors to 2-D with PCA and t-SNE, and plot both. Which method preserves cluster structure better?

✎

Exercise 18: Code (Challenge)

Reproduce the polysemy experiment: collect 10 sentences each for two senses of 'bank' (or 'bass', 'spring'). Extract BERT embeddings for the target word in every sentence. Cluster with k-means (k=2) and report cluster purity. Then average all 20 vectors and find its nearest static GloVe neighbour — which sense wins?

Further reading: “Efficient Estimation of Word Representations in Vector Space” (Mikolov et al., 2013) and “Distributed Representations of Words and Phrases” (Mikolov et al., 2013b) — the original Word2Vec pair. “GloVe: Global Vectors for Word Representation” (Pennington et al., 2014). “Enriching Word Vectors with Subword Information” (Bojanowski et al., 2016) for fastText. “Neural Word Embedding as Implicit Matrix Factorization” (Levy & Goldberg, 2014) for the unifying theory. Jurafsky & Martin, “Speech and Language Processing” (3rd ed.) Chapter 6 for a textbook treatment.

Next → Chapter 9: Sequence Models: RNNs & LSTMs

Static embeddings give each word a fixed point in space, ignoring order entirely — “dog bites man” and “man bites dog” have identical bags of vectors. Chapter 9 introduces recurrent networks that read sequences token by token, maintaining a hidden state that accumulates context. We will meet the vanishing gradient problem head-on, see how LSTM gating solves it, and build the seq2seq-with-attention architecture whose limitations directly motivated the Transformer.

✎ 18 Exercises in this chapter

Attempt each exercise before checking the worked solutions.

View Solutions →

←

PreviousCh 7. Generative Models (Pre-deep learning)

NextCh 9. Sequence Models: RNNs & LSTMs

→

Embeddings & Representation Learning

Learning Objectives

From One-Hot to Dense Representations

The Problem with One-Hot Vectors

The Distributional Hypothesis

Co-occurrence Counts: The Raw Signal

From Counts to Embeddings: Three Strategies

Word2Vec: Skip-gram and CBOW

Two Architectures

The Skip-gram Objective

Negative Sampling

The SGNS Objective

Frequent Word Subsampling

GloVe: Global Vectors

The Key Insight: Ratios of Co-occurrence Probabilities

The GloVe Objective

Word2Vec vs. GloVe

fastText: Subword Embeddings

The Subword Decomposition

Why Subwords Win

The Geometry of Embedding Space

The Analogy Mechanism

Anisotropy: The Narrow Cone Problem

Evaluating Embeddings

The Polysemy Problem and the Road to Contextual Embeddings

The Superposition of Senses

The Conceptual Bridge to Part III

Chapter Summary & Exercises

Method Quick-Reference

Exercises