Linear Algebra for ML
When most people first learn NumPy, they think of a vector as simply “a list of numbers.” That view works for code but breaks down the moment you try to understand why cosine similarity measures semantic closeness, or why gradient descent moves perpendicular to loss contours. This section re-trains that intuition.
What is a Scalar?
A scalar is a single real number: temperature, learning rate, loss value. In Python:
# A scalar is just a Python float or a 0-dimensional tensor
import numpy as np
import torch
learning_rate = 3e-4
loss_value = np.float32(2.718)
torch_scalar = torch.tensor(42.0) # shape: torch.Size([])
print(torch_scalar.shape) # torch.Size([])
print(torch_scalar.ndim) # 0What is a Vector?
A vector is an ordered list of scalars, but geometrically it is an arrow in n-dimensional space — it has both magnitude (length) and direction. The vector [3, 4] points 3 units right and 4 units up in 2D space.
import numpy as np
import torch
# 1-D NumPy array → vector in ℝ³
v = np.array([3.0, 4.0, 0.0])
print(v.shape) # (3,)
# PyTorch equivalent
v_t = torch.tensor([3.0, 4.0, 0.0])
# Addition: tip-to-tail rule geometrically
a = np.array([1, 2])
b = np.array([3, -1])
print(a + b) # [4, 1] — walk a then b
# Scalar multiplication: stretches or flips the arrow
print(-2 * a) # [-2, -4] — reversed & doubledVector Norms — Measuring Length
The norm of a vector formalizes its “length.” Different norms penalize large values differently, which is why they produce different regularization behaviours in neural networks.
import numpy as np
v = np.array([3.0, -4.0, 0.0])
# ---- Implement from scratch ----
l1_manual = np.sum(np.abs(v)) # 3+4+0 = 7
l2_manual = np.sqrt(np.sum(v**2)) # √(9+16) = 5.0
linf_manual= np.max(np.abs(v)) # 4.0
# ---- NumPy built-ins (same results) ----
l1 = np.linalg.norm(v, ord=1) # 7.0
l2 = np.linalg.norm(v) # 5.0 (default)
linf = np.linalg.norm(v, ord=np.inf) # 4.0
# ---- Unit vector (normalize to length 1) ----
unit_v = v / np.linalg.norm(v) # [0.6, -0.8, 0.0]
print(np.linalg.norm(unit_v)) # 1.0Cosine Similarity — Direction Without Magnitude
Two vectors can point in the same direction but have very different lengths. Cosine similarity strips out magnitude and measures only directional alignment. It is the workhorse of semantic search.
import numpy as np
# Imagine these are word embeddings
king = np.array([0.9, 0.1, 0.8])
queen = np.array([0.85, 0.15, 0.75])
random= np.array([-0.3, 0.9, -0.1])
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"king–queen: {cosine_similarity(king, queen):.3f}") # ~0.999 (close!)
print(f"king–random:{cosine_similarity(king, random):.3f}") # ~-0.1 (unrelated)
# Note: L2 norms can differ wildly but similarity is pure angle
king_scaled = king * 100 # much larger magnitude
print(cosine_similarity(king, king_scaled)) # still 1.0 — same directionThe dot product is the engine behind matrix multiplication, attention scores, logits, and cosine similarity. Understanding it geometrically is not optional — it is the lens through which all of attention makes sense.
Algebraic Definition
import numpy as np
import torch
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
# Method 1: explicit loop (slowest)
dot_loop = sum(ai*bi for ai, bi in zip(a,b)) # 32.0
# Method 2: NumPy — vectorized, fast
dot_np = np.dot(a, b) # 32.0
dot_np2 = (a * b).sum() # 32.0
dot_np3 = a @ b # 32.0 (‘@’ is matmul)
# Method 3: PyTorch (runs on GPU)
at = torch.tensor(a)
bt = torch.tensor(b)
dot_torch= torch.dot(at, bt) # tensor(32.)Geometric Definition — Measuring Alignment
The same computation has a beautiful geometric meaning:
This means the dot product is large and positive when a and b point in the same direction (θ ≈ 0°, cos≈1), zero when they are perpendicular (θ = 90°, cos=0), and large and negative when they point in opposite directions (θ = 180°, cos=-1).
import numpy as np
a = np.array([1.0, 0.0]) # points right
b = np.array([0.0, 1.0]) # points up — perpendicular
c = np.array([1.0, 1.0]) # diagonal, 45°
print(np.dot(a, b)) # 0.0 (perpendicular)
print(np.dot(a, c)) # 1.0 (partial alignment)
# Verify algebraic == geometric
theta_rad = np.arccos(np.dot(a,c) / (np.linalg.norm(a)*np.linalg.norm(c)))
theta_deg = np.degrees(theta_rad)
print(f"angle = {theta_deg:.1f}°") # 45.0°Projections — Decomposing One Vector Along Another
Projection is how you decompose a vector into a component “along” a direction and a component perpendicular to it. This is the exact operation performed inside an attention head when computing how much each query aligns with each key.
import numpy as np
def project(a, b):
"""Project vector a onto the direction of b."""
b_hat = b / np.linalg.norm(b) # unit vector in b’s direction
scalar= np.dot(a, b_hat) # signed length of projection
return scalar * b_hat # vector along b
a = np.array([3.0, 4.0]) # the vector we’re decomposing
b = np.array([1.0, 0.0]) # x-axis direction
a_along_b = project(a, b) # [3. 0.] — the x-component
a_perp_b = a - a_along_b # [0. 4.] — the y-component
# Sanity check: perpendicular component ⊥ b
print(np.dot(a_perp_b, b)) # ≈0.0 (floating point)
print(a_along_b + a_perp_b) # [3. 4.] — reconstructs aEvery linear layer in a neural network is a matrix. Understanding what a matrix does geometrically — how it stretches, rotates, and projects the space of inputs — is essential for understanding why depth, width, and rank matter.
Matrix-Vector Multiplication
When you multiply a matrix W by a vector x, the result Wx is a new vector in a transformed space. Each row of W is a set of weights that “measures” one dimension of the output.
import numpy as np
import torch
import torch.nn as nn
# W: (3, 4) — maps ℝ⁴ to ℝ³
W = np.array([
[1, 0, 0, 0], # output dim 0 reads input dim 0 only
[0, 1, 1, 0], # output dim 1 adds input dims 1&2
[0, 0, 0, 1], # output dim 2 reads input dim 3 only], dtype=float)
x = np.array([2.0, 3.0, 4.0, 5.0])
y = W @ x # [2., 7., 5.]
# Equivalent to three separate dot products:
print(np.dot(W[0], x)) # 2.0
print(np.dot(W[1], x)) # 7.0 (3+4)
print(np.dot(W[2], x)) # 5.0
# In PyTorch, nn.Linear(4, 3) does exactly this + bias
layer = nn.Linear(4, 3, bias=False)
layer.weight.data = torch.FloatTensor(W)
print(layer(torch.FloatTensor(x))) # tensor([2., 7., 5.])Matrix-Matrix Multiplication
Composing two linear transformations — applying A then B — is equivalent to a single matrix multiplication BA. This is why deep networks can be understood as sequences of composed transformations.
import numpy as np
# A: ℝ³ → ℝ⁴, B: ℝ⁴ → ℝ²
A = np.random.randn(4, 3)
B = np.random.randn(2, 4)
BA = B @ A # (2,4) @ (4,3) = (2,3) — ℝ³ → ℝ² directly
print(BA.shape) # (2, 3)
# Verify: applying A then B equals BA in one step
x = np.random.randn(3)
two_step = B @ (A @ x) # apply A, then B
one_step = BA @ x # apply BA directly
print(np.allclose(two_step, one_step)) # True
# Key rule: inner dimensions must match
# (m,k) @ (k,n) = (m,n) — k must agreeTranspose, Symmetric Matrices & Identity
import numpy as np
A = np.array([[1,2,3], [4,5,6]])
print(A.shape) # (2, 3)
print(A.T.shape) # (3, 2) — rows become columns
# Gram matrix: AᵀA is always square and symmetric
G = A.T @ A # (3,2)@(2,3) = (3,3)
print(np.allclose(G, G.T)) # True (symmetric)
# Identity matrix: I @ x = x (no transformation)
I = np.eye(3)
x = np.array([7.0, 8.0, 9.0])
print(I @ x) # [7. 8. 9.] — unchangedRank — How Much Information Does a Matrix Carry?
The rank of a matrix is the number of linearly independent rows (or columns) — equivalently, the dimensionality of its output space. A matrix of rank r can only produce outputs in an r-dimensional subspace, no matter how many inputs you feed it.
import numpy as np
# Full-rank matrix: all rows independent
A = np.array([[1,0],[0,1]])
print(np.linalg.matrix_rank(A)) # 2 (full rank)
# Rank-deficient matrix: row 2 = 2 × row 1
B = np.array([[1,2],[2,4]])
print(np.linalg.matrix_rank(B)) # 1 (rank-deficient!)
# Low-rank product: the key idea behind LoRA
# A (768, 768) weight update as a low-rank product
r = 8 # LoRA rank
A_lora = np.random.randn(768, r)
B_lora = np.random.randn(r, 768) # r=8 << 768
delta_W = A_lora @ B_lora # (768,768) but rank 8
print(np.linalg.matrix_rank(delta_W)) # 8 (not 768!)
# Parameters: 768*8 + 8*768 = 12,288 vs 768*768 = 589,824Beyond standard matrix multiplication, several other operations appear repeatedly in deep learning code. Confusing them — especially Hadamard vs. matmul — is a common source of silent bugs.
Element-wise (Hadamard) Product
The Hadamard product multiplies corresponding elements. Both tensors must have the same shape. This appears in LSTM gates, attention masks, and dropout.
import numpy as np
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])
hadamard = A * B # [[ 5,12],[21,32]] — element-wise
matmul = A @ B # [[19,22],[43,50]] — row-col dot products
# LSTM forget gate: f = sigmoid(Wf @ [h,x] + bf)
# Gate applied with Hadamard: cell = f * cell_prev
f = np.array([0.8, 0.1, 0.95]) # forget gate values
cell_prev= np.array([2.0, 5.0, 1.0]) # previous cell state
cell_new = f * cell_prev # [1.6, 0.5, 0.95]Outer Product — From Two Vectors to a Matrix
The outer product of two vectors creates a rank-1 matrix where every entry (i,j) equals a_i × b_j. Every attention pattern can be viewed as a weighted sum of outer products.
import numpy as np
a = np.array([1.0, 2.0, 3.0]) # query-like vector (3,)
b = np.array([4.0, 5.0]) # key-like vector (2,)
outer = np.outer(a, b) # shape (3, 2)
# [[ 4. 5.]
# [ 8. 10.]
# [12. 15.]]
# Equivalent: reshape + broadcast
outer2 = a[:, np.newaxis] * b[np.newaxis, :] # same result
print(np.allclose(outer, outer2)) # True
# Rank check: outer product is always rank 1
print(np.linalg.matrix_rank(outer)) # 1Broadcasting — NumPy’s Shape Extension Rules
Broadcasting is how NumPy (and PyTorch) handle operations between tensors of different shapes. It is enormously convenient but a major source of bugs when misunderstood. The rule: dimensions are compared right-to-left; a dimension of size 1 is “stretched” to match the other.
import numpy as np
# ---- Example 1: Add bias to every row ----
activations = np.random.randn(32, 768) # (batch, hidden)
bias = np.random.randn(768) # (hidden,) only
result = activations + bias # (32,768): bias broadcast over batch
# ---- Example 2: Normalise each sample independently ----
norms = np.linalg.norm(activations, axis=1, keepdims=True) # (32,1)
normalised = activations / norms # (32,768) / (32,1) — broadcast!
# ---- Common bug: shape mismatch ----
wrong_norms = np.linalg.norm(activations, axis=1) # (32,) NO keepdims
# activations / wrong_norms → ValueError: could not broadcast (32,768) / (32,)Eigenvectors are the special directions that a matrix leaves unchanged (only stretched or flipped). Eigenvalues measure by how much. These ideas underlie PCA, spectral norm regularization, and understanding what a weight matrix “prefers” to amplify.
Definition and Geometric Meaning
import numpy as np
# Symmetric matrix (appears in attention, covariance, etc.)
A = np.array([[3, 1], [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues) # [4. 2.]
print(eigenvectors) # columns are eigenvectors
# [[ 0.707 -0.707]
# [ 0.707 0.707]]
# Verify: A @ v = λ @ v for each eigenpair
for i in range(len(eigenvalues)):
lam = eigenvalues[i]
vec = eigenvectors[:, i] # i-th column
print(np.allclose(A @ vec, lam * vec)) # True, TruePCA via Eigendecomposition
Principal Component Analysis (PCA) finds the directions of maximum variance in a dataset. It is exactly the eigendecomposition of the data covariance matrix. The first eigenvector (largest eigenvalue) is the “most important” direction.
import numpy as np
np.random.seed(42)
# Simulate 200 data points in ℝ² with a clear principal direction
X = np.random.randn(200, 2)
X[:, 1] = X[:, 0] * 2 + 0.3 * X[:, 1] # strong correlation
# Step 1: center the data
X_centered = X - X.mean(axis=0)
# Step 2: covariance matrix (XᵀX / n)
cov = X_centered.T @ X_centered / len(X) # (2,2)
# Step 3: eigendecomposition
vals, vecs = np.linalg.eigh(cov) # eigh for symmetric matrices
idx = np.argsort(vals)[::-1] # sort descending
vals, vecs = vals[idx], vecs[:, idx]
print(f"PC1 explains {vals[0]/vals.sum()*100:.1f}% of variance") # ~97%
print(f"PC1 direction: {vecs[:,0]}") # ~[0.45, 0.89]
# Step 4: project data onto top-1 component
X_pca = X_centered @ vecs[:, :1] # (200,1) compressed!SVD generalizes eigendecomposition to rectangular matrices. It is the mathematical foundation for low-rank approximation, and understanding it deeply unlocks LoRA, weight pruning, matrix factorization, and much of the modern interpretability literature.
The Decomposition: A = UΣVᵀ
Geometric story: to apply A to a vector x, you: (1) rotate x by Vᵀ into a canonical coordinate system, (2) scale each axis by the corresponding singular value σᵢ, and (3) rotate the result by U into the output space.
import numpy as np
# Example: a 3×2 matrix
A = np.array([[1, 2],[3, 4],[5, 6]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False) # economy SVD
print(U.shape) # (3,2) — left singular vectors
print(s.shape) # (2,) — singular values [9.52, 0.51]
print(Vt.shape) # (2,2) — right singular vectors (transposed)
# Reconstruct A from U, Σ, Vᵀ
A_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, A_reconstructed)) # True
# Singular values encode importance
print(f"Singular values: {s}") # [9.52, 0.51]
print(f"σ₁ explains {s[0]**2/np.sum(s**2)*100:.1f}% of variance") # 99.7%Truncated SVD — The Best Low-Rank Approximation
Keeping only the top k singular values and their vectors gives the best possible rank-k approximation of A (in the Frobenius norm sense). This is the Eckart–Young theorem — a fundamental result.
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
# Load a grayscale image as a 2D matrix
img = np.array(Image.open("photo.jpg").convert("L"), dtype=float)
U, s, Vt = np.linalg.svd(img, full_matrices=False)
def reconstruct(U, s, Vt, k):
"""Best rank-k approximation."""
return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
for k in [5, 20, 50, 100]:
approx = reconstruct(U, s, Vt, k)
error = np.linalg.norm(img - approx, 'fro') # Frobenius norm
pct_sv = s[:k].sum() / s.sum() * 100
print(f"k={k:3d}: error={error:.1f}, singular value energy={pct_sv:.1f}%")
# k= 5: error=..., singular value energy=72.1%
# k= 50: error=..., singular value energy=97.4%
# k=100: error=..., singular value energy=99.1%SVD and Weight Matrix Analysis
SVD applied to neural network weight matrices reveals their effective rank — how many truly independent directions they use. Overparameterized models often have weight matrices with rapidly decaying singular values, meaning a low-rank approximation captures most of the information.
import numpy as np
import torch
from transformers import GPT2Model
model = GPT2Model.from_pretrained('gpt2')
W = model.h[0].attn.c_attn.weight.detach().numpy() # (2304, 768)
_, s, _ = np.linalg.svd(W, full_matrices=False)
# Stable rank: softer measure of effective rank
stable_rank = (s**2).sum() / s[0]**2
print(f"Stable rank: {stable_rank:.1f} out of {len(s)} singular values")
# Cumulative energy: how many SVs capture 90% of energy?
energy = np.cumsum(s**2) / np.sum(s**2)
k_90 = np.searchsorted(energy, 0.9) + 1
print(f"{k_90} singular values capture 90% of energy")Every piece of data in a deep learning system lives in a tensor. Understanding tensor shapes, indexing, and the operations that transform them is as important as understanding the math. Shape bugs are the #1 source of wasted hours in ML engineering.
The Dimension Hierarchy
import torch
scalar = torch.tensor(3.14) # shape: []
vector = torch.tensor([1.0, 2.0, 3.0]) # shape: [3]
matrix = torch.zeros(4, 5) # shape: [4, 5]
tensor3= torch.zeros(2, 4, 5) # shape: [2, 4, 5]
tensor4= torch.zeros(8, 12, 128, 64) # shape: [8, 12, 128, 64]
# (B, H, T, D/H) in MHA
# ndim tells you the rank (number of axes)
print(tensor4.ndim) # 4
print(tensor4.shape) # torch.Size([8, 12, 128, 64])
print(tensor4.numel()) # 8*12*128*64 = 786_432 elementsThe (B, T, D) Convention
Almost all sequence models use the convention (Batch, Time/Sequence, Dimension). Learning to read and manipulate tensors in this layout is essential before studying the Transformer.
import torch
B = 8 # batch size: 8 sequences processed in parallel
T = 512 # sequence length: 512 tokens
D = 768 # model dimension (d_model in GPT-2 small)
H = 12 # number of attention heads
Dh= D // H # 64: per-head dimension
token_ids = torch.randint(0, 50257, (B, T)) # input token IDs
embeddings = torch.randn(B, T, D) # (8, 512, 768)
attn_scores= torch.randn(B, H, T, T) # (8, 12, 512, 512)
qkv_heads = torch.randn(B, H, T, Dh) # (8, 12, 512, 64)Reshaping, Permuting, Squeezing
These are the bread-and-butter operations for reshaping tensor layouts. Understanding them prevents the shape errors that plague beginners.
import torch
x = torch.randn(8, 512, 768) # (B, T, D)
# --- reshape: reinterpret elements in new shape ---
# Split D into H heads of size Dh each
x_heads = x.reshape(8, 512, 12, 64) # (B, T, H, Dh)
# --- permute: reorder axes (like np.transpose) ---
# Move H before T for attention computation
x_perm = x_heads.permute(0, 2, 1, 3) # (B, H, T, Dh)
print(x_perm.shape) # torch.Size([8, 12, 512, 64])
# --- squeeze / unsqueeze: add or remove size-1 dims ---
y = torch.randn(8, 1, 512) # has a spurious dim=1
y_sq = y.squeeze(1) # (8, 512)
y_us = y_sq.unsqueeze(0) # (1, 8, 512) — add batch-of-batches dim
# CAUTION: reshape is contiguous-aware
# After permute, tensor may not be contiguous
x_cont = x_perm.contiguous() # force contiguous before reshapeEinstein Summation — A Unified Notation
Einsum is a compact notation from Einstein notation that expresses any tensor contraction (dot product, matrix multiply, outer product, batched operations, etc.) in a single readable string. It is the lingua franca of efficient deep learning code.
import torch
# ---- Dot product: 'i,i->' ----
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(torch.einsum('i,i->', a, b)) # 32.0
# ---- Matrix multiply: 'ij,jk->ik' ----
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.einsum('ij,jk->ik', A, B) # (3,5)
# ---- Outer product: 'i,j->ij' ----
outer = torch.einsum('i,j->ij', a, b) # (3,3)
# ---- Batched attention: 'bhid,bhjd->bhij' ----
# Q, K: (batch, heads, seq_len, head_dim)
Q = torch.randn(8, 12, 128, 64)
K = torch.randn(8, 12, 128, 64)
scores = torch.einsum('bhid,bhjd->bhij', Q, K) # (8,12,128,128)
# This is EXACTLY the attention score computation!This chapter covered the core linear algebra operations that appear throughout deep learning. Below is a concise reference card, followed by the full exercise set.
Operations Reference
Shape Convention Cheat Sheet
Exercises
Complete all 20 exercises before moving to Chapter 2. The first 12 are pen-and-paper; the remaining 8 require code.
Further reading: Gilbert Strang, Introduction to Linear Algebra (5th ed.) — Chapters 1–6. Andrej Karpathy’s YouTube lecture “The spelled-out intro to neural networks” (builds these concepts interactively). 3Blue1Brown “Essence of Linear Algebra” playlist for visual intuition.
Next: Chapter 2 — Calculus & Optimization. We apply these tools to understand how gradients flow through computational graphs.