Part I: Mathematical Foundations
Chapter 1

Linear Algebra for ML

Vectors, matrices, tensors, eigendecomposition, and SVD
20 Exercises
1.1

When most people first learn NumPy, they think of a vector as simply “a list of numbers.” That view works for code but breaks down the moment you try to understand why cosine similarity measures semantic closeness, or why gradient descent moves perpendicular to loss contours. This section re-trains that intuition.

What is a Scalar?

A scalar is a single real number: temperature, learning rate, loss value. In Python:

PythonScalars
# A scalar is just a Python float or a 0-dimensional tensor

import numpy as np
import torch

learning_rate  = 3e-4
loss_value     = np.float32(2.718)
torch_scalar   = torch.tensor(42.0)   # shape: torch.Size([])

print(torch_scalar.shape)   # torch.Size([])
print(torch_scalar.ndim)    # 0

What is a Vector?

A vector is an ordered list of scalars, but geometrically it is an arrow in n-dimensional space — it has both magnitude (length) and direction. The vector [3, 4] points 3 units right and 4 units up in 2D space.

Vector
An ordered list of n real numbers, written as a column: v ∈ ℝⁿ. Geometrically, an arrow from the origin to point (v₁, v₂, …, vₙ).
PythonVectors in NumPy and PyTorch
import numpy as np
import torch

# 1-D NumPy array  →  vector in ℝ³
v = np.array([3.0, 4.0, 0.0])
print(v.shape)    # (3,)

# PyTorch equivalent
v_t = torch.tensor([3.0, 4.0, 0.0])

# Addition: tip-to-tail rule geometrically
a = np.array([1, 2])
b = np.array([3, -1])
print(a + b)      # [4, 1]  — walk a then b

# Scalar multiplication: stretches or flips the arrow
print(-2 * a)    # [-2, -4]  — reversed & doubled

Vector Norms — Measuring Length

The norm of a vector formalizes its “length.” Different norms penalize large values differently, which is why they produce different regularization behaviours in neural networks.

L1 norm: ||v||₁ = Σ|vᵢ| L2 norm: ||v||₂ = √(Σ vᵢ²) L∞ norm: ||v||∞ = max|vᵢ|
PythonNorms from scratch vs. NumPy
import numpy as np

v = np.array([3.0, -4.0, 0.0])

# ---- Implement from scratch ----
l1_manual = np.sum(np.abs(v))                # 3+4+0 = 7
l2_manual = np.sqrt(np.sum(v**2))         # √(9+16) = 5.0
linf_manual= np.max(np.abs(v))               # 4.0

# ---- NumPy built-ins (same results) ----
l1   = np.linalg.norm(v, ord=1)   # 7.0
l2   = np.linalg.norm(v)           # 5.0  (default)
linf = np.linalg.norm(v, ord=np.inf)  # 4.0

# ---- Unit vector (normalize to length 1) ----
unit_v = v / np.linalg.norm(v)       # [0.6, -0.8, 0.0]
print(np.linalg.norm(unit_v))       # 1.0
ML Connection: Gradient Clipping & Regularization
L2 norm of the gradient vector is computed before clipping: if ||g||_2 > threshold, g ← g × (threshold / ||g||_2). L1 regularization adds λ||w||_1 to the loss, promoting sparsity. L2 regularization adds λ||w||_2², shrinking weights uniformly (weight decay in AdamW).

Cosine Similarity — Direction Without Magnitude

Two vectors can point in the same direction but have very different lengths. Cosine similarity strips out magnitude and measures only directional alignment. It is the workhorse of semantic search.

cos(θ) = (a · b) / (||a||_2 × ||b||_2) ∈ [-1, 1]
PythonCosine similarity — the core of semantic search
import numpy as np

# Imagine these are word embeddings
king  = np.array([0.9, 0.1, 0.8])
queen = np.array([0.85, 0.15, 0.75])
random= np.array([-0.3, 0.9, -0.1])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king–queen: {cosine_similarity(king, queen):.3f}")  # ~0.999 (close!)
print(f"king–random:{cosine_similarity(king, random):.3f}")  # ~-0.1  (unrelated)

# Note: L2 norms can differ wildly but similarity is pure angle
king_scaled = king * 100    # much larger magnitude
print(cosine_similarity(king, king_scaled))  # still 1.0 — same direction
Geometric Intuition: Why Cosine?
When you ask "are these two texts semantically similar?", you are really asking: do the model’s embedding vectors point in roughly the same direction in the high-dimensional space?
The magnitude of an embedding often encodes frequency or confidence, not meaning. Cosine removes that signal and keeps only the angle. A rarely-occurring technical term and a common synonym may have very different norms, but if the model has learned well, they will point in the same direction.
Code Lab 1.1 — Norm Profiling
1. Generate a random matrix M of shape (10000, 768) — simulating a batch of BERT embeddings.
2. Compute the L2 norm of every row using: (a) a Python loop, (b) np.linalg.norm with axis=1, (c) torch with .norm(dim=1).
3. Time each method with %timeit. You should see a ~100x speed difference between the loop and the vectorized versions.
4. Challenge: implement L2 normalization (so every row has norm 1) in a single NumPy expression.
1.2

The dot product is the engine behind matrix multiplication, attention scores, logits, and cosine similarity. Understanding it geometrically is not optional — it is the lens through which all of attention makes sense.

Algebraic Definition

a · b = aᵀb = Σ(i) a_i × b_i (element-wise multiply, then sum)
PythonDot product — three equivalent ways
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Method 1: explicit loop (slowest)
dot_loop = sum(ai*bi for ai, bi in zip(a,b))   # 32.0

# Method 2: NumPy — vectorized, fast
dot_np   = np.dot(a, b)                  # 32.0
dot_np2  = (a * b).sum()                 # 32.0
dot_np3  = a @ b                         # 32.0  (‘@’ is matmul)

# Method 3: PyTorch (runs on GPU)
at = torch.tensor(a)
bt = torch.tensor(b)
dot_torch= torch.dot(at, bt)             # tensor(32.)

Geometric Definition — Measuring Alignment

The same computation has a beautiful geometric meaning:

a · b = ||a||_2 × ||b||_2 × cos(θ)

This means the dot product is large and positive when a and b point in the same direction (θ ≈ 0°, cos≈1), zero when they are perpendicular (θ = 90°, cos=0), and large and negative when they point in opposite directions (θ = 180°, cos=-1).

PythonVerify the geometric identity
import numpy as np

a = np.array([1.0, 0.0])  # points right
b = np.array([0.0, 1.0])  # points up — perpendicular
c = np.array([1.0, 1.0])  # diagonal, 45°

print(np.dot(a, b))  # 0.0  (perpendicular)
print(np.dot(a, c))  # 1.0  (partial alignment)

# Verify algebraic == geometric
theta_rad = np.arccos(np.dot(a,c) / (np.linalg.norm(a)*np.linalg.norm(c)))
theta_deg = np.degrees(theta_rad)
print(f"angle = {theta_deg:.1f}°")  # 45.0°

Projections — Decomposing One Vector Along Another

Projection is how you decompose a vector into a component “along” a direction and a component perpendicular to it. This is the exact operation performed inside an attention head when computing how much each query aligns with each key.

proj_b(a) = (a · b / ||b||^2) × b — scalar coefficient times direction
PythonProjection onto a direction
import numpy as np

def project(a, b):
    """Project vector a onto the direction of b."""
    b_hat = b / np.linalg.norm(b)          # unit vector in b’s direction
    scalar= np.dot(a, b_hat)               # signed length of projection
    return scalar * b_hat                  # vector along b

a = np.array([3.0, 4.0])   # the vector we’re decomposing
b = np.array([1.0, 0.0])   # x-axis direction

a_along_b = project(a, b)            # [3. 0.]  — the x-component
a_perp_b  = a - a_along_b            # [0. 4.]  — the y-component

# Sanity check: perpendicular component ⊥ b
print(np.dot(a_perp_b, b))        # ≈0.0 (floating point)
print(a_along_b + a_perp_b)       # [3. 4.]  — reconstructs a
ML Connection: Attention Scores as Dot Products
In the Transformer, attention scores are computed as: score(Q_i, K_j) = (Q_i · K_j) / √dim
Each query Q_i is projected onto every key K_j. A high dot product means the query and key vectors are well-aligned — the model “chooses” to attend to that position. The √dim denominator prevents the dot products from growing so large that softmax saturates.
1.3

Every linear layer in a neural network is a matrix. Understanding what a matrix does geometrically — how it stretches, rotates, and projects the space of inputs — is essential for understanding why depth, width, and rank matter.

Matrix-Vector Multiplication

When you multiply a matrix W by a vector x, the result Wx is a new vector in a transformed space. Each row of W is a set of weights that “measures” one dimension of the output.

y = Wx y_i = Σ_j W_{ij} × x_j (dot product of row i with x)
PythonMatrix-vector multiply — a linear layer
import numpy as np
import torch
import torch.nn as nn

# W: (3, 4) — maps ℝ⁴ to ℝ³
W = np.array([
    [1, 0, 0, 0],   # output dim 0 reads input dim 0 only
    [0, 1, 1, 0],   # output dim 1 adds input dims 1&2
    [0, 0, 0, 1],   # output dim 2 reads input dim 3 only
Python
], dtype=float)

x = np.array([2.0, 3.0, 4.0, 5.0])
y = W @ x   # [2., 7., 5.]

# Equivalent to three separate dot products:
print(np.dot(W[0], x))  # 2.0
print(np.dot(W[1], x))  # 7.0  (3+4)
print(np.dot(W[2], x))  # 5.0

# In PyTorch, nn.Linear(4, 3) does exactly this + bias
layer = nn.Linear(4, 3, bias=False)
layer.weight.data = torch.FloatTensor(W)
print(layer(torch.FloatTensor(x)))  # tensor([2., 7., 5.])

Matrix-Matrix Multiplication

Composing two linear transformations — applying A then B — is equivalent to a single matrix multiplication BA. This is why deep networks can be understood as sequences of composed transformations.

PythonMatrix multiply — composing transformations
import numpy as np

# A: ℝ³ → ℝ⁴,   B: ℝ⁴ → ℝ²
A = np.random.randn(4, 3)
B = np.random.randn(2, 4)

BA = B @ A         # (2,4) @ (4,3) = (2,3)  — ℝ³ → ℝ² directly
print(BA.shape)      # (2, 3)

# Verify: applying A then B equals BA in one step
x = np.random.randn(3)
two_step = B @ (A @ x)   # apply A, then B
one_step = BA @ x         # apply BA directly
print(np.allclose(two_step, one_step))  # True

# Key rule: inner dimensions must match
# (m,k) @ (k,n) = (m,n)   — k must agree

Transpose, Symmetric Matrices & Identity

PythonTranspose and special matrices
import numpy as np

A = np.array([[1,2,3], [4,5,6]])
print(A.shape)     # (2, 3)
print(A.T.shape)   # (3, 2) — rows become columns

# Gram matrix: AᵀA is always square and symmetric
G = A.T @ A    # (3,2)@(2,3) = (3,3)
print(np.allclose(G, G.T))  # True (symmetric)

# Identity matrix: I @ x = x (no transformation)
I = np.eye(3)
x = np.array([7.0, 8.0, 9.0])
print(I @ x)  # [7. 8. 9.] — unchanged
ML Connection: Why WᵀW Appears Everywhere
The matrix product WᵀW (shape: d×d) appears in: (1) computing the covariance of layer activations for normalization, (2) the Gram matrix used in neural style transfer, (3) the Fisher information matrix in second-order optimizers. Because WᵀW is symmetric, its eigenvectors are orthogonal — which gives it very nice properties for analysis.

Rank — How Much Information Does a Matrix Carry?

The rank of a matrix is the number of linearly independent rows (or columns) — equivalently, the dimensionality of its output space. A matrix of rank r can only produce outputs in an r-dimensional subspace, no matter how many inputs you feed it.

PythonComputing and understanding rank
import numpy as np

# Full-rank matrix: all rows independent
A = np.array([[1,0],[0,1]])
print(np.linalg.matrix_rank(A))  # 2 (full rank)

# Rank-deficient matrix: row 2 = 2 × row 1
B = np.array([[1,2],[2,4]])
print(np.linalg.matrix_rank(B))  # 1 (rank-deficient!)

# Low-rank product: the key idea behind LoRA
# A (768, 768) weight update as a low-rank product
r = 8                   # LoRA rank
A_lora = np.random.randn(768, r)
B_lora = np.random.randn(r, 768)  # r=8 << 768
delta_W = A_lora @ B_lora           # (768,768) but rank 8
print(np.linalg.matrix_rank(delta_W)) # 8 (not 768!)
# Parameters: 768*8 + 8*768 = 12,288  vs  768*768 = 589,824
Why Low-Rank Adapters (LoRA) Work
The hypothesis behind LoRA is that the weight updates ΔW needed for fine-tuning a pretrained LLM lie in a low-dimensional subspace. Instead of learning a full (d×d) update matrix, LoRA learns two small matrices A (d×r) and B (r×d) where r << d.
Training parameters: 2 × d × r instead of d². For d=768, r=8: 12,288 instead of 589,824 — a 48× reduction, while often matching full fine-tuning quality.
Code Lab 1.2 — The Speed of Matrix Multiply
1. Build a matrix multiply from scratch using three nested for-loops. Time it on a (100,100) @ (100,100) problem.
2. Compare to np.dot, np.matmul, and A @ B on the same problem.
3. Move to GPU: repeat with torch.mm on CUDA. Try sizes 256, 1024, 4096.
4. Plot a bar chart of GFLOP/s achieved. Discuss: why is the GPU only faster at large sizes?
1.4

Beyond standard matrix multiplication, several other operations appear repeatedly in deep learning code. Confusing them — especially Hadamard vs. matmul — is a common source of silent bugs.

Element-wise (Hadamard) Product

The Hadamard product multiplies corresponding elements. Both tensors must have the same shape. This appears in LSTM gates, attention masks, and dropout.

(A ⊙ B)_{ij} = A_{ij} × B_{ij} (same shape as A and B)
PythonHadamard product vs. matrix multiply
import numpy as np

A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])

hadamard = A * B    # [[ 5,12],[21,32]] — element-wise
matmul   = A @ B    # [[19,22],[43,50]] — row-col dot products

# LSTM forget gate: f = sigmoid(Wf @ [h,x] + bf)
# Gate applied with Hadamard: cell = f * cell_prev
f        = np.array([0.8, 0.1, 0.95]) # forget gate values
cell_prev= np.array([2.0, 5.0, 1.0]) # previous cell state
cell_new = f * cell_prev              # [1.6, 0.5, 0.95]

Outer Product — From Two Vectors to a Matrix

The outer product of two vectors creates a rank-1 matrix where every entry (i,j) equals a_i × b_j. Every attention pattern can be viewed as a weighted sum of outer products.

(a ⊗ b)_{ij} = a_i × b_j shape: (len(a), len(b))
PythonOuter product — attention patterns as outer products
import numpy as np

a = np.array([1.0, 2.0, 3.0])  # query-like vector (3,)
b = np.array([4.0, 5.0])         # key-like vector (2,)

outer = np.outer(a, b)     # shape (3, 2)
# [[ 4.  5.]
#  [ 8. 10.]
#  [12. 15.]]

# Equivalent: reshape + broadcast
outer2 = a[:, np.newaxis] * b[np.newaxis, :]  # same result
print(np.allclose(outer, outer2))         # True

# Rank check: outer product is always rank 1
print(np.linalg.matrix_rank(outer))       # 1

Broadcasting — NumPy’s Shape Extension Rules

Broadcasting is how NumPy (and PyTorch) handle operations between tensors of different shapes. It is enormously convenient but a major source of bugs when misunderstood. The rule: dimensions are compared right-to-left; a dimension of size 1 is “stretched” to match the other.

PythonBroadcasting rules with examples
import numpy as np

# ---- Example 1: Add bias to every row ----
activations = np.random.randn(32, 768) # (batch, hidden)
bias        = np.random.randn(768)        # (hidden,) only
result      = activations + bias          # (32,768): bias broadcast over batch

# ---- Example 2: Normalise each sample independently ----
norms = np.linalg.norm(activations, axis=1, keepdims=True) # (32,1)
normalised = activations / norms     # (32,768) / (32,1) — broadcast!

# ---- Common bug: shape mismatch ----
wrong_norms = np.linalg.norm(activations, axis=1)    # (32,) NO keepdims
# activations / wrong_norms  → ValueError: could not broadcast (32,768) / (32,)
⚠️
Broadcasting Bug
The silent version: if your dimensions accidentally match, NumPy will happily broadcast in the wrong direction. Always print tensor shapes after broadcasting operations during development.
Golden rule: when dividing by norms, always use keepdims=True or reshape explicitly.
1.5

Eigenvectors are the special directions that a matrix leaves unchanged (only stretched or flipped). Eigenvalues measure by how much. These ideas underlie PCA, spectral norm regularization, and understanding what a weight matrix “prefers” to amplify.

Definition and Geometric Meaning

Eigenvector
A non-zero vector v such that Av = λv for some scalar λ. The matrix only scales v — it does not rotate it.
Eigenvalue
The scalar λ in Av = λv. It tells us how much the matrix stretches (λ > 1), shrinks (0 < λ < 1), flips (λ < 0), or zeroes out (λ = 0) that direction.
PythonComputing eigenvalues and eigenvectors
import numpy as np

# Symmetric matrix (appears in attention, covariance, etc.)
A = np.array([[3, 1], [1, 3]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [4. 2.]
print(eigenvectors)   # columns are eigenvectors
# [[ 0.707  -0.707]
#  [ 0.707   0.707]]

# Verify: A @ v = λ @ v  for each eigenpair
for i in range(len(eigenvalues)):
    lam = eigenvalues[i]
    vec = eigenvectors[:, i]  # i-th column
    print(np.allclose(A @ vec, lam * vec))  # True, True

PCA via Eigendecomposition

Principal Component Analysis (PCA) finds the directions of maximum variance in a dataset. It is exactly the eigendecomposition of the data covariance matrix. The first eigenvector (largest eigenvalue) is the “most important” direction.

PythonPCA from scratch
import numpy as np

np.random.seed(42)
# Simulate 200 data points in ℝ² with a clear principal direction
X = np.random.randn(200, 2)
X[:, 1] = X[:, 0] * 2 + 0.3 * X[:, 1]   # strong correlation

# Step 1: center the data
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix (XᵀX / n)
cov = X_centered.T @ X_centered / len(X)  # (2,2)

# Step 3: eigendecomposition
vals, vecs = np.linalg.eigh(cov)  # eigh for symmetric matrices
idx         = np.argsort(vals)[::-1]  # sort descending
vals, vecs  = vals[idx], vecs[:, idx]

print(f"PC1 explains {vals[0]/vals.sum()*100:.1f}% of variance")  # ~97%
print(f"PC1 direction: {vecs[:,0]}")              # ~[0.45, 0.89]

# Step 4: project data onto top-1 component
X_pca = X_centered @ vecs[:, :1]   # (200,1) compressed!
ML Connection: PCA in LLM Research
PCA (or SVD) of weight matrices reveals structure in large models. Researchers have found that fine-tuned models differ from pretrained ones mostly in a low-dimensional principal subspace — the empirical justification for LoRA.
PCA is also used to visualize token embeddings (reducing 768-d to 2-d for plotting), to diagnose training instabilities (do activation norms collapse?), and to analyse attention heads (do different heads attend to different principal directions?).
Code Lab 1.3 — PCA on Embeddings
1. Load GPT-2 tokenizer embeddings from transformers (shape: 50257 × 768).
2. Run PCA (using np.linalg.svd) to reduce to 2D. Plot the 100 most common English words.
3. How much variance is explained by the first 10 components? First 100?
4. Challenge: implement PCA using the power iteration method (no linalg.eigh).
1.6

SVD generalizes eigendecomposition to rectangular matrices. It is the mathematical foundation for low-rank approximation, and understanding it deeply unlocks LoRA, weight pruning, matrix factorization, and much of the modern interpretability literature.

The Decomposition: A = UΣVᵀ

SVD
Any matrix A ∈ ℝ^(m×n) can be written as A = UΣVᵀ where U (m×m) and V (n×n) are orthogonal matrices (rotations/reflections) and Σ (m×n) is diagonal with non-negative entries σ₁ ≥ σ₂ ≥ … ≥ 0.

Geometric story: to apply A to a vector x, you: (1) rotate x by Vᵀ into a canonical coordinate system, (2) scale each axis by the corresponding singular value σᵢ, and (3) rotate the result by U into the output space.

PythonSVD from scratch vs. NumPy
import numpy as np

# Example: a 3×2 matrix
A = np.array([[1, 2],[3, 4],[5, 6]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # economy SVD
print(U.shape)   # (3,2) — left singular vectors
print(s.shape)   # (2,)  — singular values [9.52, 0.51]
print(Vt.shape)  # (2,2) — right singular vectors (transposed)

# Reconstruct A from U, Σ, Vᵀ
A_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, A_reconstructed))  # True

# Singular values encode importance
print(f"Singular values: {s}")  # [9.52, 0.51]
print(f"σ₁ explains {s[0]**2/np.sum(s**2)*100:.1f}% of variance")  # 99.7%

Truncated SVD — The Best Low-Rank Approximation

Keeping only the top k singular values and their vectors gives the best possible rank-k approximation of A (in the Frobenius norm sense). This is the Eckart–Young theorem — a fundamental result.

A_k = Σ_{i=1}^{k} σ_i × u_i × v_iᵀ (sum of k rank-1 matrices)
PythonImage compression via truncated SVD
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Load a grayscale image as a 2D matrix
img   = np.array(Image.open("photo.jpg").convert("L"), dtype=float)
U, s, Vt = np.linalg.svd(img, full_matrices=False)

def reconstruct(U, s, Vt, k):
    """Best rank-k approximation."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for k in [5, 20, 50, 100]:
    approx = reconstruct(U, s, Vt, k)
    error  = np.linalg.norm(img - approx, 'fro')   # Frobenius norm
    pct_sv = s[:k].sum() / s.sum() * 100
    print(f"k={k:3d}: error={error:.1f}, singular value energy={pct_sv:.1f}%")
# k=  5: error=..., singular value energy=72.1%
# k= 50: error=..., singular value energy=97.4%
# k=100: error=..., singular value energy=99.1%

SVD and Weight Matrix Analysis

SVD applied to neural network weight matrices reveals their effective rank — how many truly independent directions they use. Overparameterized models often have weight matrices with rapidly decaying singular values, meaning a low-rank approximation captures most of the information.

PythonAnalysing effective rank of a weight matrix
import numpy as np
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained('gpt2')
W     = model.h[0].attn.c_attn.weight.detach().numpy()  # (2304, 768)

_, s, _ = np.linalg.svd(W, full_matrices=False)

# Stable rank: softer measure of effective rank
stable_rank = (s**2).sum() / s[0]**2
print(f"Stable rank: {stable_rank:.1f} out of {len(s)} singular values")

# Cumulative energy: how many SVs capture 90% of energy?
energy   = np.cumsum(s**2) / np.sum(s**2)
k_90     = np.searchsorted(energy, 0.9) + 1
print(f"{k_90} singular values capture 90% of energy")
ML Connection: LoRA Through the SVD Lens
LoRA replaces a weight update ΔW with two small matrices A (d×r) and B (r×d). If you run SVD on ΔW after full fine-tuning, you typically find that the top-r singular values capture >95% of the update energy — this is the empirical evidence that weight updates are intrinsically low-rank.
Spectral norm regularization (used in some GAN training) constrains the largest singular value σ₁ ≤ 1, preventing the weight matrix from amplifying any direction too strongly.
Code Lab 1.4 — SVD Image Compression
1. Download any grayscale image. Compute its full SVD.
2. Plot reconstruction error (Frobenius norm) vs. k for k = 1, 2, 5, 10, 20, 50, 100.
3. Plot singular value spectrum on a log scale. What shape does it follow?
4. Compute the compression ratio: original bytes = m×n, compressed bytes = k×(m+n+1). At what k do you break even?
5. Challenge: compare the image quality (visually) at the k where compression ratio = 1.0.
1.7

Every piece of data in a deep learning system lives in a tensor. Understanding tensor shapes, indexing, and the operations that transform them is as important as understanding the math. Shape bugs are the #1 source of wasted hours in ML engineering.

The Dimension Hierarchy

PythonScalar → Vector → Matrix → Tensor
import torch

scalar = torch.tensor(3.14)                     # shape: []
vector = torch.tensor([1.0, 2.0, 3.0])         # shape: [3]
matrix = torch.zeros(4, 5)                  # shape: [4, 5]
tensor3= torch.zeros(2, 4, 5)               # shape: [2, 4, 5]
tensor4= torch.zeros(8, 12, 128, 64)        # shape: [8, 12, 128, 64]
                                            # (B,  H,   T,   D/H) in MHA

# ndim tells you the rank (number of axes)
print(tensor4.ndim)   # 4
print(tensor4.shape)  # torch.Size([8, 12, 128, 64])
print(tensor4.numel()) # 8*12*128*64 = 786_432 elements

The (B, T, D) Convention

Almost all sequence models use the convention (Batch, Time/Sequence, Dimension). Learning to read and manipulate tensors in this layout is essential before studying the Transformer.

PythonStandard Transformer tensor shapes
import torch

B = 8    # batch size: 8 sequences processed in parallel
T = 512  # sequence length: 512 tokens
D = 768  # model dimension (d_model in GPT-2 small)
H = 12   # number of attention heads
Dh= D // H   # 64: per-head dimension

token_ids  = torch.randint(0, 50257, (B, T))   # input token IDs
embeddings = torch.randn(B, T, D)              # (8, 512, 768)
attn_scores= torch.randn(B, H, T, T)           # (8, 12, 512, 512)
qkv_heads  = torch.randn(B, H, T, Dh)          # (8, 12, 512, 64)

Reshaping, Permuting, Squeezing

These are the bread-and-butter operations for reshaping tensor layouts. Understanding them prevents the shape errors that plague beginners.

PythonReshape, permute, squeeze — with mental model
import torch

x = torch.randn(8, 512, 768)  # (B, T, D)

# --- reshape: reinterpret elements in new shape ---
# Split D into H heads of size Dh each
x_heads = x.reshape(8, 512, 12, 64)   # (B, T, H, Dh)

# --- permute: reorder axes (like np.transpose) ---
# Move H before T for attention computation
x_perm  = x_heads.permute(0, 2, 1, 3) # (B, H, T, Dh)
print(x_perm.shape)  # torch.Size([8, 12, 512, 64])

# --- squeeze / unsqueeze: add or remove size-1 dims ---
y = torch.randn(8, 1, 512)   # has a spurious dim=1
y_sq = y.squeeze(1)              # (8, 512)
y_us = y_sq.unsqueeze(0)          # (1, 8, 512) — add batch-of-batches dim

# CAUTION: reshape is contiguous-aware
# After permute, tensor may not be contiguous
x_cont = x_perm.contiguous()  # force contiguous before reshape

Einstein Summation — A Unified Notation

Einsum is a compact notation from Einstein notation that expresses any tensor contraction (dot product, matrix multiply, outer product, batched operations, etc.) in a single readable string. It is the lingua franca of efficient deep learning code.

PythonEinsum: one notation to rule them all
import torch

# ---- Dot product: 'i,i->' ----
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(torch.einsum('i,i->', a, b))  # 32.0

# ---- Matrix multiply: 'ij,jk->ik' ----
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.einsum('ij,jk->ik', A, B)  # (3,5)

# ---- Outer product: 'i,j->ij' ----
outer = torch.einsum('i,j->ij', a, b)  # (3,3)

# ---- Batched attention: 'bhid,bhjd->bhij' ----
# Q, K: (batch, heads, seq_len, head_dim)
Q = torch.randn(8, 12, 128, 64)
K = torch.randn(8, 12, 128, 64)
scores = torch.einsum('bhid,bhjd->bhij', Q, K)  # (8,12,128,128)
# This is EXACTLY the attention score computation!
ML Connection: Einsum in Attention
The entire multi-head self-attention computation can be written in 4 einsum operations: (1) project Q, K, V from embeddings, (2) compute attention scores bhid,bhjd→bhij, (3) apply softmax, (4) aggregate values bhij,bhjd→bhid.
Libraries like opt_einsum automatically choose the most efficient contraction order (order matters — wrong order can be 100× slower).
1.8

This chapter covered the core linear algebra operations that appear throughout deep learning. Below is a concise reference card, followed by the full exercise set.

Operations Reference

Shape Convention Cheat Sheet

Exercises

Complete all 20 exercises before moving to Chapter 2. The first 12 are pen-and-paper; the remaining 8 require code.

Exercise 1: Geometry
Draw vectors a=[1,2] and b=[3,0] on a grid. Compute their dot product and find the angle between them.
Exercise 2: Norms
Compute L1, L2, and L∞ norms of v=[3,-4,0,5,-2]. Which is largest? Which smallest?
Exercise 3: Unit Vectors
Normalize v=[1,1,1,1] to a unit vector. Verify its L2 norm is 1.
Exercise 4: Cosine
Compute cosine similarity between a=[1,0,1] and b=[0,1,1]. Interpret geometrically.
Exercise 5: Projection
Project u=[4,3] onto the x-axis direction [1,0]. What is the perpendicular component?
Exercise 6: Matrix Multiply
By hand, compute AB where A=[[1,2],[3,4]] and B=[[5,6],[7,8]].
Exercise 7: Rank
What is the rank of [[1,2,3],[2,4,6],[3,6,9]]? Why?
Exercise 8: Transpose
Show that (AB)ᵀ = BᵀAᵀ for 2×2 matrices A and B.
Exercise 9: Symmetric
Prove that WᵀW is always symmetric.
Exercise 10: Hadamard vs Matmul
For A=[[1,2],[3,4]] and B=[[5,0],[0,5]], compute both A⊙B and AB. What do you notice about AB?
Exercise 11: Eigenvalues
Find eigenvalues of [[2,1],[1,2]] by solving det(A-λI)=0.
Exercise 12: SVD by Hand
SVD-decompose the rank-1 matrix [[2,4],[1,2]] (hint: it has only one nonzero singular value).
Exercise 13: Code: Broadcasting
Create a (100,4) matrix of random activations. Subtract the mean of each of the 4 columns in a single line of NumPy.
Exercise 14: Code: Cosine Search
Embed 10 random “words” as 8-dimensional vectors. Build a cosine similarity matrix. Find the most similar pair.
Exercise 15: Code: PCA
Generate 500 points with covariance [[4,2],[2,1]]. Run PCA. Verify the first principal component aligns with [0.894, 0.447].
Exercise 16: Code: SVD compression
Compress a 256×256 random matrix with k=10,50. Report error and compression ratio.
Exercise 17: Code: Einsum
Implement batched matrix multiply (B,N,M) × (B,M,K) → (B,N,K) using einsum. Verify against torch.bmm.
Exercise 18: Code: Gradient norm
Create a random gradient tensor of shape (768, 3072). Compute its L2 norm. Clip it to max_norm=1.0 in one line.
Exercise 19: Code: LoRA rank analysis
Load any pretrained nn.Linear layer. Compute the SVD of its weight. Plot singular values. How many are needed to capture 90% of energy?
Exercise 20: Challenge: Power Iteration
Implement the power iteration algorithm to find the largest eigenvalue of a 100×100 random symmetric matrix. Compare to np.linalg.eigvalsh.

Further reading: Gilbert Strang, Introduction to Linear Algebra (5th ed.) — Chapters 1–6. Andrej Karpathy’s YouTube lecture “The spelled-out intro to neural networks” (builds these concepts interactively). 3Blue1Brown “Essence of Linear Algebra” playlist for visual intuition.

Next: Chapter 2 — Calculus & Optimization. We apply these tools to understand how gradients flow through computational graphs.

20 Exercises in this chapter
Attempt each exercise before checking the worked solutions.
View Solutions →