Tokenization
Every neural network, including the largest Transformers, is built from a single repeated unit: the artificial neuron. Its simplest form, the perceptron (Rosenblatt, 1958), computes a weighted sum of its inputs, adds a bias, and applies a step function. Understanding its capabilities — and its famous failure — is the foundation for everything that follows.
The Perceptron Model
The perceptron is a linear classifier: w · x + b = 0 defines a hyperplane, and the step function assigns each side to a class. It has the same functional form as the logistic regression of Chapter 6, with a hard threshold in place of the sigmoid; what differs is the training rule, since a step function has no useful gradient.
The XOR Problem
XOR (exclusive or) outputs 1 when exactly one input is 1. The four points (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 cannot be separated by any single straight line — they are not linearly separable. A single perceptron fails. This is the limitation that motivates depth.
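The impossibility claim can be checked by hand. The following is a short derivation, assuming the convention used in this chapter's code that the perceptron outputs 1 when w · x + b ≥ 0 and 0 otherwise; the symbols w₁, w₂ are the two components of w.

```latex
\begin{aligned}
(0,0)\mapsto 0 &: \quad b < 0 \\
(0,1)\mapsto 1 &: \quad w_2 + b \ge 0 \\
(1,0)\mapsto 1 &: \quad w_1 + b \ge 0 \\
(1,1)\mapsto 0 &: \quad w_1 + w_2 + b < 0 \\[4pt]
\text{Adding the two middle lines:}\quad & w_1 + w_2 + 2b \ge 0
\;\Longrightarrow\; w_1 + w_2 + b \ge -b > 0
\end{aligned}
```

The last line contradicts the constraint for (1,1), so no choice of w₁, w₂, b satisfies all four points at once.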
import numpy as np
class Perceptron:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim); self.b = 0.0; self.lr = lr
    def predict(self, x): return int(x @ self.w + self.b >= 0)
    def fit(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.predict(xi)
                # Perceptron learning rule: update on mistakes
                # (yi - pred is 0 for a correct prediction, so nothing changes)
                self.w += self.lr * (yi - pred) * xi
                self.b += self.lr * (yi - pred)
# Linearly separable AND gate: works perfectly
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y_and = np.array([0,0,0,1])
p = Perceptron(2); p.fit(X, y_and)
print("AND:", [p.predict(x) for x in X]) # [0, 0, 0, 1] correct
# XOR: never converges -- not linearly separable
y_xor = np.array([0,1,1,0])
p2 = Perceptron(2); p2.fit(X, y_xor, epochs=1000)
print("XOR:", [p2.predict(x) for x in X]) # wrong, no matter how long
# A single perceptron CANNOT represent XOR. We need a hidden layer.
The fix for XOR, and for every problem that is not linearly separable, is to stack layers of neurons separated by nonlinear activation functions. A multi-layer perceptron (MLP) transforms its input through a sequence of linear projections and nonlinearities, learning a representation in which the problem becomes separable.
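To make "a representation in which the problem becomes separable" concrete, here is a minimal sketch of a 2 → 2 → 1 ReLU network that computes XOR exactly. The weights are hand-picked for illustration (they are an assumption of this example, not learned and not taken from the text).

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked weights (illustrative assumption):
# hidden unit 0 computes ReLU(x1 + x2), hidden unit 1 computes ReLU(x1 + x2 - 1)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])   # output = h0 - 2 * h1
b2 = 0.0

h = np.maximum(0, X @ W1 + b1)   # hidden representation of the four points
y = h @ W2 + b2

print(h)   # rows: [0 0], [1 0], [1 0], [2 1]
print(y)   # [0. 1. 1. 0.]
```

In the hidden space the two "1" points collapse onto (1, 0), and a single linear output unit is enough to separate them from (0, 0) and (2, 1).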
The MLP Forward Pass
h = σ(W₁ x + b₁) # hidden layer (D → H)
y = W₂ h + b₂ # output layer (H → K)
σ = nonlinear activation (ReLU, GELU, …)
The nonlinearity σ is essential. Without it, stacked linear layers collapse to a single linear layer: W₂(W₁ x + b₁) + b₂ = (W₂ W₁) x + (W₂ b₁ + b₂). The nonlinearity is what gives depth its power: each layer can warp the space so that the next layer's linear boundary becomes useful.
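The collapse of stacked linear layers is easy to confirm numerically. This is a small sketch with made-up matrices (all values are illustrative assumptions), using the column-vector convention W x + b of the equations above.

```python
import numpy as np

# Made-up example values (assumptions for illustration only)
x = np.array([1.0, 2.0])
W1 = np.array([[1.0, 0.0],
               [0.0, -1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, 0.0, -1.0])
W2 = np.array([[1.0, 1.0, 1.0]])
b2 = np.array([0.5])

# Two linear layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The single equivalent layer: W = W2 W1, b = W2 b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
collapses = np.allclose(two_layers, one_layer)

# With a ReLU between the layers, the single-layer shortcut no longer matches
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
relu_differs = not np.allclose(with_relu, one_layer)

print(two_layers, one_layer, with_relu)   # [1.5] [1.5] [3.5]
```

The ReLU zeroes the second hidden unit (its pre-activation is -2), which is exactly the information a single linear map cannot reproduce.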
The Universal Approximation Theorem
Cybenko (1989) and Hornik (1991) proved that an MLP with a single hidden layer and a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy. This is an existence guarantee: it shows that neural networks are expressive enough in principle, but it does not say how many neurons are needed or whether gradient descent will find the right weights.
import numpy as np
np.random.seed(0)
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([[0],[1],[1],[0]], dtype=float)
# 2 -> 4 -> 1 MLP with ReLU hidden and sigmoid output
W1 = np.random.randn(2, 4) * 0.5; b1 = np.zeros(4)
W2 = np.random.randn(4, 1) * 0.5; b2 = np.zeros(1)
def sigmoid(z): return 1 / (1 + np.exp(-z))
for epoch in range(5000):
    # Forward
    h = np.maximum(0, X @ W1 + b1)   # ReLU hidden
    out = sigmoid(h @ W2 + b2)       # sigmoid output
    # Backward (binary cross-entropy)
    dout = (out - y) / len(X)        # dL/dz: gradient at the pre-sigmoid logit
    dW2 = h.T @ dout; db2 = dout.sum(0)
    dh = (dout @ W2.T) * (h > 0)     # ReLU gradient
    dW1 = X.T @ dh; db1 = dh.sum(0)
    # Update (in place, so W1, b1, W2, b2 themselves are modified)
    for p, g in [(W1,dW1),(b1,db1),(W2,dW2),(b2,db2)]: p -= 0.5 * g
preds = (sigmoid(np.maximum(0, X@W1+b1)@W2+b2) > 0.5).astype(int)
print("XOR solved:", preds.ravel()) # [0 1 1 0] when training converges
# The hidden layer made XOR linearly separable for the output neuron.
The activation function is one of the most consequential architectural choices after depth. It shapes gradient flow, training stability, and representational capacity. The history of deep learning is partly a history of better activation functions.
| Activation | Formula | Range | Used in |
|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0,1) | LSTM/GRU gates, binary classification head |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1,1) | Classic RNNs, LSTM candidate and output squashing |
| ReLU | max(0, x) | [0,∞) | CNNs, default hidden activation since 2012 |
| Leaky ReLU | max(0.01x, x) | (-∞,∞) | Avoids dead neurons |
| GELU | x·Φ(x) | (-0.17,∞) | BERT, GPT-2/3, most Transformers |
| SiLU/Swish | x·σ(x) | (-0.28,∞) | EfficientNet, some LLMs |
| SwiGLU | Swish(xW)⊙(xV) | (-∞,∞) | LLaMA, PaLM, modern LLM FFN |
The ReLU Revolution
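ReLU (Nair & Hinton, 2010; popularised by AlexNet, 2012) replaced sigmoid/tanh in hidden layers and made much deeper networks practical to train. Its derivative is exactly 1 for positive inputs, so gradients pass through active units unattenuated, which greatly reduces the vanishing gradient problem of Chapter 9. The price is that a unit whose input stays negative receives zero gradient (a "dead" ReLU), the failure that Leaky ReLU targets. ReLU is also trivially cheap to compute.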
ReLU(x) = max(0, x)
ReLU'(x) = 1 if x > 0, else 0 # no attenuation for active units
GELU and SwiGLU: The Transformer Activations
Modern Transformers use smooth activations. GELU (Gaussian Error Linear Unit) multiplies each input x by Φ(x), the probability that a standard normal variable falls below x, which gives a smooth alternative to ReLU that is differentiable everywhere. SwiGLU, a gated variant, powers the feed-forward networks of LLaMA and PaLM.
GELU(x) = x · Φ(x) ≈ 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])
SwiGLU(x) = Swish(xW + b) ⊙ (xV + c) # gated; W and V plus the FFN output projection make 3 matrices
Swish(x) = x · σ(βx)
import numpy as np
def sigmoid(x): s = 1/(1+np.exp(-x)); return s, s*(1-s)
def tanh(x): t = np.tanh(x); return t, 1-t**2
def relu(x): return np.maximum(0,x), (x>0).astype(float)
def leaky_relu(x, a=0.01): return np.where(x>0,x,a*x), np.where(x>0,1,a)
def gelu(x):  # tanh approximation (used in GPT-2)
    c = np.sqrt(2/np.pi)
    inner = c*(x + 0.044715*x**3)
    val = 0.5*x*(1+np.tanh(inner))
    # derivative omitted for brevity; use autograd in practice
    return val
def swish(x, beta=1.0): return x*sigmoid(beta*x)[0]
# Key property: max gradient of each activation
x = np.linspace(-5, 5, 1000)
print(f"sigmoid max grad: {sigmoid(x)[1].max():.3f}") # 0.25 -> vanishing
print(f"tanh max grad: {tanh(x)[1].max():.3f}") # 1.00 -> better
print(f"relu max grad: {relu(x)[1].max():.3f}") # 1.00 -> no attenuation
Before training begins, the weights must be set to something. This choice is not a minor detail: initialize too large and activations explode; too small and they vanish; both make the network untrainable. Proper initialization keeps the variance of activations and gradients stable across all layers.
The Variance Propagation Argument
Consider a linear layer y = Wx with fan-in n. If the weights and inputs are independent and zero-mean, with each weight having variance Var(W) and each input variance Var(x), then each output has variance n·Var(W)·Var(x). To keep Var(y) = Var(x), we need Var(W) = 1/n. This is the core insight behind the standard initialization schemes.
Var(y_i) = n · Var(W) · Var(x) # forward pass variance
Xavier/Glorot: Var(W) = 2/(n_in + n_out) # balances fwd & bwd, for tanh
He/Kaiming: Var(W) = 2/n_in # accounts for ReLU zeroing half
Xavier initialization (Glorot & Bengio, 2010) balances forward and backward variance for symmetric activations like tanh. He initialization (He et al., 2015) adds a factor of 2 to compensate for ReLU setting half the activations to zero, halving the variance.
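The factor of 2 follows in a few lines. This is a sketch under the same assumptions as the variance argument above (independent, zero-mean weights), plus the added assumption that each pre-activation z is distributed symmetrically about zero; the layer index l is notation introduced here for the derivation.

```latex
\begin{aligned}
\operatorname{Var}\!\big(z^{(l)}_i\big)
  &= n \,\operatorname{Var}(W)\,\mathbb{E}\!\big[(x^{(l-1)}_j)^2\big],
  \qquad x^{(l-1)} = \operatorname{ReLU}\!\big(z^{(l-1)}\big) \\
\mathbb{E}\!\big[\operatorname{ReLU}(z)^2\big]
  &= \tfrac{1}{2}\,\mathbb{E}\big[z^2\big] = \tfrac{1}{2}\operatorname{Var}(z)
  \qquad \text{(symmetry of } z \text{ about } 0\text{)} \\
\operatorname{Var}\!\big(z^{(l)}\big)
  &= \tfrac{n}{2}\,\operatorname{Var}(W)\,\operatorname{Var}\!\big(z^{(l-1)}\big)
  \;\Longrightarrow\; \operatorname{Var}(W) = \frac{2}{n}
\end{aligned}
```

The first line uses the second moment E[x²] rather than Var(x) because ReLU outputs are not zero-mean; that is the only change from the linear case.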
import numpy as np
def test_init(init_fn, depth=50, width=256):
    """Pass a signal through `depth` ReLU layers; track activation std."""
    x = np.random.randn(1000, width)   # batch of 1000
    stds = [x.std()]
    for _ in range(depth):
        W = init_fn(width, width)
        x = np.maximum(0, x @ W)       # ReLU layer
        stds.append(x.std())
    return stds
# Too small: signal vanishes
tiny = lambda i,o: np.random.randn(i,o) * 0.01
# Too large: signal explodes
huge = lambda i,o: np.random.randn(i,o) * 1.0
# He init: signal preserved
he = lambda i,o: np.random.randn(i,o) * np.sqrt(2.0/i)
# Per-layer scale factor is about sqrt(width/2) * std(W):
# roughly 0.11 for tiny, 11.3 for huge, and 1 for He, compounded over 50 layers.
print(f"tiny: layer 50 std = {test_init(tiny)[-1]:.2e}") # ~1e-48 vanished
print(f"huge: layer 50 std = {test_init(huge)[-1]:.2e}") # ~1e+52 exploded
print(f"He: layer 50 std = {test_init(he)[-1]:.2e}") # order 1, stable
Even with good initialization, the distribution of activations shifts during training as weights update, a phenomenon called internal covariate shift. Normalization layers re-center and re-scale activations on the fly, dramatically accelerating and stabilising training. The choice of normalization is one of the defining differences between CNN-era and Transformer-era architectures.
Batch Normalization
BatchNorm (Ioffe & Szegedy, 2015) normalizes each feature across the batch dimension, then applies a learned scale γ and shift β. It revolutionised CNN training but has a critical weakness for sequence models: it couples examples in a batch and behaves differently at train vs. inference time.
μ_B = (1/m) Σᵢ xᵢ # batch mean
σ²_B = (1/m) Σᵢ (xᵢ - μ_B)² # batch variance
x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε) # normalize
yᵢ = γ x̂ᵢ + β # learned scale & shift
Layer Normalization
LayerNorm (Ba et al., 2016) normalizes across the feature dimension for each example independently. This makes it independent of batch size and identical at train and inference time, which is exactly what sequence models need. Nearly every Transformer uses LayerNorm or its RMS variant.
| BatchNorm | LayerNorm |
|---|---|
| Normalizes across the batch dimension | Normalizes across the feature dimension |
| Couples examples in a batch | Each example normalized independently |
| Different behaviour train vs. inference | Identical at train and inference |
| Breaks with batch size 1 or variable length | Works with any batch size or length |
| Dominant in CNNs (ResNet, etc.) | Dominant in Transformers |
| Needs running statistics for inference | No running statistics needed |
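Three rows of the comparison above (batch coupling, batch size 1, running statistics) can be demonstrated in a few lines. This is a minimal sketch: the function names, the momentum value of 0.1, and the omission of γ and β are simplifying assumptions made for the example.

```python
import numpy as np

def batch_norm_train(x, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Normalize with batch statistics and return updated running statistics."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    new_mean = (1 - momentum) * running_mean + momentum * mu
    new_var = (1 - momentum) * running_var + momentum * var
    return (x - mu) / np.sqrt(var + eps), new_mean, new_var

def batch_norm_eval(x, running_mean, running_var, eps=1e-5):
    """Inference mode: use the stored running statistics, not the batch."""
    return (x - running_mean) / np.sqrt(running_var + eps)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(3.0, 2.0, size=(8, 4))
other = batch.copy()
other[1:] = rng.normal(-3.0, 5.0, size=(7, 4))   # same first example, new companions

out_a, run_mean, run_var = batch_norm_train(batch, np.zeros(4), np.ones(4))
out_b, _, _ = batch_norm_train(other, np.zeros(4), np.ones(4))

# BatchNorm: the output for example 0 depends on the rest of the batch
bn_coupled = not np.allclose(out_a[0], out_b[0])
# LayerNorm: the output for example 0 is unaffected by its companions
ln_independent = np.allclose(layer_norm(batch)[0], layer_norm(other)[0])

# Batch of one: training-mode BatchNorm degenerates to all zeros,
# while inference mode with running statistics still gives a usable output
single = batch[:1]
single_train, _, _ = batch_norm_train(single, run_mean, run_var)
single_eval = batch_norm_eval(single, run_mean, run_var)

print(bn_coupled, ln_independent)   # True True
print(single_train)                 # all zeros
```

The running statistics here have seen only one update, so `single_eval` is not well normalized yet; in practice they are accumulated over many training batches.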
RMSNorm: The Modern Simplification
RMSNorm (Zhang & Sennrich, 2019) drops the mean-centering of LayerNorm, normalizing only by the root-mean-square of the activations. It is cheaper, and works as well or better in practice. LLaMA, Gemma, and most recent LLMs use RMSNorm.
LayerNorm: y = γ · (x - μ) / √(σ² + ε) + β # center AND scale
RMSNorm: y = γ · x / √(mean(x²) + ε) # scale only, no centering
import numpy as np
def batch_norm(x, gamma, beta, eps=1e-5):   # x: (batch, features)
    mu = x.mean(axis=0, keepdims=True)      # across BATCH
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
def layer_norm(x, gamma, beta, eps=1e-5):   # x: (batch, features)
    mu = x.mean(axis=1, keepdims=True)      # across FEATURES
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
def rms_norm(x, gamma, eps=1e-5):           # no mean-centering, no beta
    rms = np.sqrt((x**2).mean(axis=1, keepdims=True) + eps)
    return gamma * x / rms
x = np.random.randn(4, 768) * 10 + 5 # large-mean activations
g = np.ones(768); b = np.zeros(768)
ln = layer_norm(x, g, b)
print(f"LayerNorm: per-row mean={ln.mean(1).mean():.4f}, std={ln.std(1).mean():.4f}")
# LayerNorm: per-row mean=0.0000, std=1.0000
rn = rms_norm(x, g)
print(f"RMSNorm: per-row RMS={np.sqrt((rn**2).mean(1)).mean():.4f}")
# RMSNorm: per-row RMS=1.0000 (but mean is NOT zeroed)
Dropout (Srivastava et al., 2014) is a regularization technique that randomly sets a fraction p of activations to zero during each training step. This prevents neurons from co-adapting (relying on specific other neurons) and forces redundant, robust representations. At inference, dropout is disabled and activations are used in full.
Inverted Dropout
The standard implementation is inverted dropout: during training, surviving activations are scaled up by 1/(1-p) so that the expected value is preserved. This lets inference run with no scaling at all. The most common dropout bug is forgetting the 1/(1-p) factor, which makes activations systematically smaller in training than at inference.
mask ~ Bernoulli(1 - p) # 1 with prob (1-p), else 0
y = (x ⊙ mask) / (1 - p) # zero some, scale up the rest
# At inference: y = x (no mask, no scaling)
import numpy as np
def dropout(x, p=0.5, training=True):
    """Inverted dropout. p = fraction to drop."""
    if not training or p == 0:
        return x                              # inference: identity
    mask = (np.random.rand(*x.shape) > p)     # keep with prob (1-p)
    return x * mask / (1 - p)                 # zero & scale up survivors
# Expected value is preserved
x = np.ones(10000)
y = dropout(x, p=0.5, training=True)
print(f"Input mean: {x.mean():.3f}") # 1.000
print(f"Dropout mean: {y.mean():.3f}") # ~1.000 (preserved by 1/(1-p) scaling)
print(f"Fraction zeroed: {(y==0).mean():.3f}") # ~0.500
The loss function defines what the network optimizes. Chapter 4 derived cross-entropy from information theory; here we connect it to the network's output layer and survey the practical loss functions you will use.
| Task | Output layer | Loss function |
|---|---|---|
| Binary classification | 1 unit + sigmoid | Binary cross-entropy |
| Multiclass classification | K units + softmax | Categorical cross-entropy |
| Language modeling | V units + softmax | Cross-entropy over vocabulary |
| Regression | 1+ linear units | Mean squared error (MSE) |
| Robust regression | 1+ linear units | Huber / mean absolute error |
| Multi-label | K units + sigmoid | Sum of binary cross-entropies |
| Contrastive / retrieval | Normalized embeddings | InfoNCE / triplet loss |
The pairing of softmax output with cross-entropy loss is special: as shown in Chapter 4, their combined gradient with respect to the logits simplifies to (prediction − target). This is why nearly every classifier and language model uses this exact pairing: the gradient is cheap to compute, and when the two are fused (log-sum-exp with the maximum logit subtracted) the computation is numerically stable.
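The (prediction − target) result is easy to verify numerically. Here is a minimal sketch that compares the analytic gradient against central finite differences for one example; the logits, target index, and step size are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max logit for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, target):
    return -np.log(softmax(z)[target])

z = np.array([2.0, -1.0, 0.5])   # arbitrary logits (illustrative)
target = 0

# Analytic gradient: softmax(z) - one_hot(target)
analytic = softmax(z)
analytic[target] -= 1.0

# Central finite differences on each logit
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (cross_entropy(zp, target) - cross_entropy(zm, target)) / (2 * eps)

max_err = np.abs(analytic - numeric).max()
print(max_err)   # tiny: the two gradients agree to numerical precision
```

The same finite-difference pattern is a useful sanity check for any hand-derived backward pass.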
We now assemble everything — linear layers, activations, initialization, normalization, dropout, and loss — into a complete, trainable MLP. This is a miniature version of the architecture you will scale up to the Transformer.
import numpy as np
class MLP:
    """Multi-layer perceptron with He init, ReLU, dropout, and softmax-CE."""
    def __init__(self, sizes, dropout=0.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W, self.b = [], []
        for nin, nout in zip(sizes[:-1], sizes[1:]):
            self.W.append(rng.normal(0, np.sqrt(2/nin), (nin, nout)))  # He init
            self.b.append(np.zeros(nout))
        self.dropout = dropout
    def forward(self, x, training=True):
        self.cache = [x]
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = x @ W + b
            if i < len(self.W) - 1:                  # hidden layers
                x = np.maximum(0, z)                 # ReLU
                if training and self.dropout > 0:
                    mask = (np.random.rand(*x.shape) > self.dropout) / (1 - self.dropout)
                    x = x * mask
            else:                                    # output layer: stable softmax
                z -= z.max(axis=1, keepdims=True)
                x = np.exp(z); x /= x.sum(axis=1, keepdims=True)
            self.cache.append(x)
        return x
    def loss_and_grad(self, probs, y):               # y: integer labels
        N = len(y)
        loss = -np.log(probs[np.arange(N), y] + 1e-12).mean()
        grad = probs.copy(); grad[np.arange(N), y] -= 1; grad /= N  # softmax-CE grad
        return loss, grad
# This MLP -- linear layers + ReLU + dropout + softmax-CE -- contains
# every concept needed to understand a Transformer's feed-forward network.
Shape Trace Through the MLP
Tracking tensor shapes is one of the most useful debugging skills. Here is the shape of data flowing through a 784→256→10 MLP (MNIST classifier) with a batch of 32:
Shape Trace: MLP forward pass (batch=32)
| Operation | Shape | Note |
|---|---|---|
| input x | (32, 784) | flattened 28×28 image |
| x @ W1 + b1 | (32, 256) | linear: 784→256 |
| ReLU | (32, 256) | elementwise, shape unchanged |
| dropout | (32, 256) | mask + scale, shape unchanged |
| x @ W2 + b2 | (32, 10) | linear: 256→10 (logits) |
| softmax | (32, 10) | row-wise probabilities |
| cross-entropy | (scalar) | mean over batch |
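The trace above can be checked mechanically by recording `.shape` after every step. This is a minimal sketch on random data standing in for MNIST; the random inputs and labels, the dropout rate of 0.2, and the `shapes` dictionary are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 784))                        # stand-in for 32 flattened images
y = rng.integers(0, 10, size=32)                 # stand-in integer labels
W1 = rng.normal(0, np.sqrt(2 / 784), (784, 256)); b1 = np.zeros(256)
W2 = rng.normal(0, np.sqrt(2 / 256), (256, 10));  b2 = np.zeros(10)

shapes = {"input": x.shape}
z1 = x @ W1 + b1;                 shapes["linear1"] = z1.shape
h = np.maximum(0, z1);            shapes["relu"] = h.shape
keep = rng.random(h.shape) > 0.2                 # inverted dropout, p = 0.2
h = h * keep / 0.8;               shapes["dropout"] = h.shape
logits = h @ W2 + b2;             shapes["linear2"] = logits.shape
logits = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True);       shapes["softmax"] = probs.shape
loss = -np.log(probs[np.arange(32), y] + 1e-12).mean()
shapes["cross-entropy"] = np.shape(loss)         # () means scalar

for name, shape in shapes.items():
    print(f"{name:14s} {shape}")
```

Turning each expected shape into an `assert` inside a real forward pass is a cheap way to catch transposed weights or a missing batch dimension early.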
Training a neural network is an empirical science. Loss curves, gradient norms, and activation statistics tell you what is happening inside the network. Learning to read these signals is what separates practitioners who can debug from those who can only guess.
Reading Loss Curves
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss is NaN immediately | Exploding activations / log(0) | Lower LR, gradient clip, check init |
| Loss flat, never decreases | LR too low, or dead ReLUs | Raise LR, check init, use LeakyReLU |
| Loss decreases then explodes | LR too high | Lower LR, add warmup, clip gradients |
| Train loss ↓, val loss ↑ | Overfitting | More data, dropout, weight decay, early stop |
| Both losses plateau high | Underfitting / too small | More capacity, train longer, better features |
| Loss oscillates wildly | Batch too small / LR too high | Larger batch, lower LR, more momentum |
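Diagnosing a run requires a complete training step, and the MLP class listing earlier in this chapter defines only `forward` and `loss_and_grad`. The following is a minimal sketch of one way a backward pass with an SGD update could look. The class name `TinyMLP`, the attributes `inputs` and `masks`, and the toy dataset are illustrative assumptions, not the author's implementation; `last_grads` mirrors the attribute name read by the chapter's training loop.

```python
import numpy as np

class TinyMLP:
    """Hypothetical stand-in for the chapter's MLP, with a backward pass added."""
    def __init__(self, sizes, dropout=0.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = [self.rng.normal(0, np.sqrt(2 / nin), (nin, nout))   # He init
                  for nin, nout in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(nout) for nout in sizes[1:]]
        self.dropout = dropout

    def forward(self, x, training=True):
        self.inputs, self.masks = [], []      # saved for the backward pass
        last = len(self.W) - 1
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            self.inputs.append(x)             # input to layer i
            z = x @ W + b
            if i < last:
                mask = (z > 0).astype(float)  # ReLU derivative
                if training and self.dropout > 0:
                    keep = self.rng.random(z.shape) > self.dropout
                    mask = mask * keep / (1 - self.dropout)   # inverted dropout
                x = z * mask                  # ReLU and dropout in one multiply
                self.masks.append(mask)
            else:
                z = z - z.max(axis=1, keepdims=True)          # stable softmax
                x = np.exp(z)
                x /= x.sum(axis=1, keepdims=True)
        return x

    def loss_and_grad(self, probs, y):
        n = len(y)
        loss = -np.log(probs[np.arange(n), y] + 1e-12).mean()
        grad = probs.copy()
        grad[np.arange(n), y] -= 1
        return loss, grad / n                 # dL/dlogits

    def backward(self, grad, lr):
        """Backpropagate dL/dlogits and apply one plain SGD step."""
        self.last_grads = []
        for i in reversed(range(len(self.W))):
            dW = self.inputs[i].T @ grad
            db = grad.sum(axis=0)
            if i > 0:
                # Use the pre-update weights, then pass through ReLU/dropout
                grad = (grad @ self.W[i].T) * self.masks[i - 1]
            self.W[i] -= lr * dW
            self.b[i] -= lr * db
            self.last_grads += [dW, db]

# Toy usage: a two-class "same sign" problem (illustrative data)
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
net = TinyMLP([2, 16, 2])
losses = []
for step in range(300):
    probs = net.forward(X, training=True)
    loss, grad = net.loss_and_grad(probs, y)
    net.backward(grad, lr=0.1)
    losses.append(loss)
grad_norm = np.sqrt(sum((g ** 2).sum() for g in net.last_grads))
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, |grad| {grad_norm:.3f}")
```

Saving the combined ReLU/dropout mask in the forward pass keeps the backward pass to one multiply per hidden layer; Chapter 11 replaces this kind of hand-derived code with automatic differentiation.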
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)
X = X / 255.0; y = y.astype(int) # normalize to [0,1]
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
net = MLP([784, 256, 128, 10], dropout=0.2)
lr, batch = 0.1, 128
for epoch in range(10):
    perm = np.random.permutation(len(Xtr))
    for i in range(0, len(Xtr), batch):
        idx = perm[i:i+batch]
        probs = net.forward(Xtr[idx], training=True)
        loss, grad = net.loss_and_grad(probs, ytr[idx])
        # NOTE: backward() and last_grads are assumed here; the MLP class
        # listing above defines only forward() and loss_and_grad().
        net.backward(grad, lr)   # (backward + SGD update)
    # Diagnostics each epoch
    test_acc = (net.forward(Xte, training=False).argmax(1) == yte).mean()
    grad_norm = np.sqrt(sum((g**2).sum() for g in net.last_grads))
    print(f"epoch {epoch}: loss={loss:.3f} test_acc={test_acc:.3f} |grad|={grad_norm:.2f}")
# Example output (exact values vary from run to run):
# epoch 0: loss=0.412 test_acc=0.928 |grad|=1.84
# epoch 9: loss=0.087 test_acc=0.978 |grad|=0.43 -- healthy convergence
Everything in this chapter is a literal component of the Transformer you will build in Chapter 13. The Transformer is not a fundamentally new kind of network: it is a particular arrangement of the same primitives, namely linear layers, activations, normalization, dropout, and residual connections.
| This chapter | Role in the Transformer block |
|---|---|
| Linear layer (W x + b) | Q/K/V projections; FFN layers; output projection |
| Activation (GELU/SwiGLU) | The nonlinearity inside the feed-forward network |
| LayerNorm / RMSNorm | Applied before attention and FFN (Pre-LN) |
| Dropout | After attention softmax and after each sublayer |
| He / scaled init | Initializes all projection matrices |
| Residual connection | x + sublayer(x) — the gradient highway from Ch. 9 |
| Softmax + cross-entropy | Softmax produces the attention weights; softmax + cross-entropy is the final LM loss |
The Transformer Block Preview
Here is the Pre-LN Transformer block you will build in Chapter 13, drawn as a stack. Notice that every box is a component from this chapter — only 'Multi-Head Attention' (Chapter 12) is new:
Arch Stack: Pre-LN Transformer Block (read bottom to top)
| Layer | Detail |
|---|---|
| + residual add | x + FFN_out |
| Dropout | p = 0.1 |
| Feed-Forward (SwiGLU) | d → 4d → d |
| LayerNorm / RMSNorm | normalize |
| + residual add | x + Attn_out |
| Dropout | p = 0.1 |
| Multi-Head Attention | (Chapter 12) |
| LayerNorm / RMSNorm | normalize |
| input x | (B, T, d) |
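The upper half of this stack (normalize, feed-forward, dropout, residual add) uses only components from this chapter, so it can already be sketched. The following is a minimal NumPy version under stated assumptions: the function and parameter names (`ffn_sublayer`, `W`, `V`, `W_out`, `gamma`), the bias-free SwiGLU, and the 1/√fan-in weight scale are illustrative choices, not the Chapter 13 implementation.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def silu(x):
    return x / (1.0 + np.exp(-x))          # x * sigmoid(x), Swish with beta = 1

def swiglu_ffn(x, W, V, W_out):
    # Gate (Swish branch) times value branch, then project back: d -> 4d -> d
    return (silu(x @ W) * (x @ V)) @ W_out

def dropout(x, p, training, rng):
    if not training or p == 0:
        return x
    keep = rng.random(x.shape) > p
    return x * keep / (1 - p)              # inverted dropout

def ffn_sublayer(x, params, p=0.1, training=True, rng=None):
    """Pre-LN sublayer: x + Dropout(FFN(Norm(x)))."""
    if rng is None:
        rng = np.random.default_rng()
    h = rms_norm(x, params["gamma"])
    h = swiglu_ffn(h, params["W"], params["V"], params["W_out"])
    h = dropout(h, p, training, rng)
    return x + h                           # residual add

rng = np.random.default_rng(0)
B, T, d = 2, 5, 8
hidden = 4 * d
params = {
    "gamma": np.ones(d),
    "W": rng.normal(0, np.sqrt(1 / d), (d, hidden)),
    "V": rng.normal(0, np.sqrt(1 / d), (d, hidden)),
    "W_out": rng.normal(0, np.sqrt(1 / hidden), (hidden, d)),
}
x = rng.normal(size=(B, T, d))
out = ffn_sublayer(x, params, training=False)
print(out.shape)   # (2, 5, 8)

# With a zero output projection the sublayer is the identity: the residual
# path carries x through unchanged.
zero_params = dict(params, W_out=np.zeros((hidden, d)))
identity_out = ffn_sublayer(x, zero_params, training=False)
```

Because the normalization sits inside the residual branch, the skip path from input to output stays a pure addition, which is the property the Pre-LN arrangement is chosen for.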
Component Quick-Reference
| Component | Purpose | Default choice (2024) |
|---|---|---|
| Hidden activation | Nonlinearity | GELU or SwiGLU |
| Output activation | Map to task | Softmax (classification), linear (regression) |
| Initialization | Stable signal propagation | He (ReLU), scaled-by-depth (Transformer) |
| Normalization | Stabilize activations | RMSNorm (LLMs), LayerNorm (general) |
| Regularization | Prevent overfitting | Dropout 0.1, weight decay 0.1 |
| Loss | Define objective | Cross-entropy (classification/LM) |
| Optimizer | Update weights | AdamW (from Chapter 2) |
Exercises
Exercises 1–10 are pen-and-paper; 11–20 require code.
Further reading: “Deep Learning” (Goodfellow, Bengio, Courville, 2016) Chapters 6–8 — the canonical reference for MLPs, regularization, and optimization. “Delving Deep into Rectifiers” (He et al., 2015) for He initialization. “Layer Normalization” (Ba et al., 2016) and “Root Mean Square Layer Normalization” (Zhang & Sennrich, 2019). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” (Srivastava et al., 2014). Michael Nielsen's free online book “Neural Networks and Deep Learning” for unmatched intuition.
Next → Chapter 11: Backpropagation in Depth
You have built MLPs and updated their weights, but the backward passes so far have been hand-derived for each specific network. Chapter 11 generalizes this: we build a complete automatic differentiation engine that computes gradients for any computational graph. You will understand exactly how PyTorch and JAX work under the hood, build a working autograd system, and gain the skills to debug gradients through the deep stacks of a Transformer.