Part II: Classical ML & Representations

Chapter 10

Neural Network Fundamentals

From the perceptron to the multi-layer network: activation functions, weight initialization, normalization, and dropout — the components every Transformer is built from.

20 Exercises

Learning Objectives

1.	Trace the line from the perceptron to the multi-layer perceptron and understand why depth matters.
2.	Prove the universal approximation theorem informally and understand its practical limits.
3.	Compare activation functions (sigmoid, tanh, ReLU, GELU, SwiGLU) and know which to use where.
4.	Derive Xavier and He initialization and explain why initialization scale controls trainability.
5.	Understand batch, layer, and RMS normalization and why Transformers chose LayerNorm.
6.	Explain dropout as an implicit ensemble and implement inverted dropout correctly.
7.	Implement a complete MLP from scratch with all components and train it on real data.
8.	Connect every component in this chapter to its exact role inside a Transformer block.

Every neural network, including the largest Transformers, is built from a single repeated unit: the artificial neuron. Its simplest form, the perceptron (Rosenblatt, 1958), computes a weighted sum of its inputs, adds a bias, and applies a step function. Understanding its capabilities — and its famous failure — is the foundation for everything that follows.

The Perceptron Model

y = step(w · x + b) where step(z) = 1 if z ≥ 0, else 0

The perceptron is a linear classifier: w · x + b = 0 defines a hyperplane, and the step function assigns each side to a class. It has the same functional form as the logistic regression of Chapter 6, with a hard threshold replacing the sigmoid; only the output nonlinearity and the training rule differ.

✧

History: The Perceptron Hype and the AI Winter

Rosenblatt's perceptron generated enormous excitement — the New York Times reported in 1958 that the Navy expected it to 'walk, talk, see, write, reproduce itself and be conscious of its existence.'

In 1969, Minsky and Papert proved that a single-layer perceptron cannot represent the XOR function. This result, widely (mis)interpreted as a fatal limitation of neural networks, contributed to the first 'AI winter', a decade of reduced funding. The irony: multi-layer perceptrons solve XOR easily, but no practical way to train them was widely known until backpropagation was popularised in 1986.

The XOR Problem

XOR (exclusive or) outputs 1 when exactly one input is 1. The four points (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 cannot be separated by any single straight line — they are not linearly separable. A single perceptron fails. This is the limitation that motivates depth.

Python•The perceptron and its XOR failure
import numpy as np

class Perceptron:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim); self.b = 0.0; self.lr = lr

    def predict(self, x): return int(x @ self.w + self.b >= 0)

    def fit(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.predict(xi)
                # Perceptron learning rule: update on mistakes
                self.w += self.lr * (yi - pred) * xi
                self.b += self.lr * (yi - pred)

# Linearly separable AND gate: works perfectly
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y_and = np.array([0,0,0,1])
p = Perceptron(2); p.fit(X, y_and)
print("AND:", [p.predict(x) for x in X])  # [0, 0, 0, 1] correct

# XOR: never converges -- not linearly separable
y_xor = np.array([0,1,1,0])
p2 = Perceptron(2); p2.fit(X, y_xor, epochs=1000)
print("XOR:", [p2.predict(x) for x in X])  # wrong, no matter how long
# A single perceptron CANNOT represent XOR. We need a hidden layer.

The fix for XOR — and for every non-linearly-separable problem — is to stack layers of neurons, separated by nonlinear activation functions. A multi-layer perceptron (MLP) transforms its input through a sequence of linear projections and nonlinearities, learning a representation in which the problem becomes separable.

The MLP Forward Pass

text•Two-layer MLP
h  =  σ(W₁ x + b₁)        # hidden layer  (D → H)
y  =  W₂ h + b₂           # output layer  (H → K)

σ = nonlinear activation (ReLU, GELU, …)

The nonlinearity σ is essential. Without it, stacking linear layers collapses to a single linear layer: W₂(W₁ x + b₁) + b₂ = (W₂ W₁) x + (W₂ b₁ + b₂), which is just another affine map. The nonlinearity is what gives depth its power: each layer can warp the space so the next layer's linear boundary becomes useful.
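The collapse is easy to confirm numerically. The following is a minimal sketch; the layer sizes and random values are arbitrary choices made for illustration, and the names `W_eff` and `b_eff` are not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 3, 5, 2                          # illustrative sizes (assumed)
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = rng.normal(size=(K, H)), rng.normal(size=K)
x = rng.normal(size=D)

# Two stacked linear layers with NO activation in between
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W_eff, b_eff = W2 @ W1, W2 @ b1 + b2
one_layer = W_eff @ x + b_eff
print(np.allclose(two_layer, one_layer))   # True: the extra layer added nothing

# With a ReLU in between, the output generally no longer equals the collapsed map
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.abs(with_relu - one_layer).max())
```

The second number is the gap the nonlinearity opens up; it is zero only in the unlikely case that every hidden pre-activation happens to be positive.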

✧

Intuition: Why Depth Solves XOR

The hidden layer transforms the four XOR points into a new space. One hidden neuron can learn 'OR' and another can learn 'AND'; the output neuron then computes 'OR AND NOT-AND' = XOR. The hidden representation makes the problem linearly separable.

This is the master pattern of all deep learning: early layers learn a representation in which the final layer's simple (often linear) decision becomes easy. A 100-layer Transformer is this same idea, scaled up enormously.
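The OR/AND construction can be written out with hand-picked weights. This is a sketch that uses step-activated perceptron units, as in the perceptron model above; the specific weights and thresholds are one possible choice among many, not a learned solution.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two perceptron units with hand-picked weights
h_or  = step(X @ np.array([1, 1]) - 0.5)   # fires if at least one input is 1
h_and = step(X @ np.array([1, 1]) - 1.5)   # fires only if both inputs are 1

# Output unit: "OR AND NOT-AND", i.e. fire when h_or = 1 and h_and = 0
y = step(h_or - h_and - 0.5)

print("OR :", h_or)    # [0 1 1 1]
print("AND:", h_and)   # [0 0 0 1]
print("XOR:", y)       # [0 1 1 0]
```

In the hidden space the four inputs land on (0,0), (1,0), (1,0), and (1,1), and a single line separates the middle point from the other two, which is exactly what the output unit exploits.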

The Universal Approximation Theorem

Cybenko (1989) and Hornik (1991) proved that an MLP with a single hidden layer, a suitable nonlinear activation (such as the sigmoid), and a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy. This is the theoretical guarantee that neural networks are expressive enough, in principle, to represent the input-output mapping of any such task.

⚠️

The Universal Approximation Caveat

Universal approximation says a wide-enough shallow network CAN represent any function — it does not say such a network is easy to find by gradient descent, nor that the required width is practical. A shallow network might need exponentially many neurons where a deep network needs only polynomially many (Telgarsky, 2016).

This is the formal justification for depth: deep networks represent certain functions exponentially more efficiently than shallow ones. Depth is not just expressiveness — it is efficiency.
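To make the width-versus-accuracy trade-off concrete, here is a small constructive sketch (not the proof technique of Cybenko or Hornik): a one-hidden-layer ReLU network whose weights are set by hand so that it linearly interpolates a target function at evenly spaced knots. The helper name `relu_interpolant` and the choice of sin on [0, 2π] are assumptions made for illustration.

```python
import numpy as np

def relu_interpolant(f, lo, hi, n_hidden):
    """One-hidden-layer ReLU net that linearly interpolates f at n_hidden + 1 knots."""
    knots = np.linspace(lo, hi, n_hidden + 1)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)                   # slope on each segment
    # Output weights: the first slope, then the change in slope at each interior knot
    out_w = np.concatenate([[slopes[0]], np.diff(slopes)])
    hidden_b = -knots[:-1]                                    # unit k computes ReLU(x - knot_k)

    def net(x):
        h = np.maximum(0, x[:, None] + hidden_b[None, :])     # (N, n_hidden)
        return vals[0] + h @ out_w
    return net

x = np.linspace(0, 2 * np.pi, 2001)
errs = {}
for H in (4, 16, 64, 256):
    net = relu_interpolant(np.sin, 0, 2 * np.pi, H)
    errs[H] = np.abs(net(x) - np.sin(x)).max()
    print(f"hidden units = {H:3d}   max error = {errs[H]:.2e}")
```

The error falls steadily as the hidden layer widens, which is the theorem's promise in miniature. The caveat also shows up: accuracy here is bought purely with width, and for functions of many variables the number of units needed by this kind of construction grows very quickly.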

Python•MLP solving XOR from scratch
import numpy as np

np.random.seed(0)
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([[0],[1],[1],[0]], dtype=float)

# 2 -> 4 -> 1 MLP with ReLU hidden and sigmoid output
W1 = np.random.randn(2, 4) * 0.5; b1 = np.zeros(4)
W2 = np.random.randn(4, 1) * 0.5; b2 = np.zeros(1)

def sigmoid(z): return 1 / (1 + np.exp(-z))

for epoch in range(5000):
    # Forward
    h    = np.maximum(0, X @ W1 + b1)       # ReLU hidden
    out  = sigmoid(h @ W2 + b2)              # sigmoid output
    # Backward (binary cross-entropy)
    dout = (out - y) / len(X)                # dL/dz, where z is the pre-sigmoid logit
    dW2  = h.T @ dout;  db2 = dout.sum(0)
    dh   = (dout @ W2.T) * (h > 0)         # ReLU gradient
    dW1  = X.T @ dh;   db1 = dh.sum(0)
    # Update
    for p, g in [(W1,dW1),(b1,db1),(W2,dW2),(b2,db2)]: p -= 0.5 * g

preds = (sigmoid(np.maximum(0, X@W1+b1)@W2+b2) > 0.5).astype(int)
print("XOR solved:", preds.ravel())  # [0 1 1 0] -- correct!
# The hidden layer made XOR linearly separable for the output neuron.

The activation function is one of the most consequential architectural choices after depth. It shapes gradient flow, training stability, and representational capacity. The history of deep learning is partly a history of better activation functions.

Activation	Formula	Range	Used in
Sigmoid	1/(1+e⁻ˣ)	(0,1)	LSTM/GRU gates, binary classification head
Tanh	(eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)	(-1,1)	Classic RNNs, LSTM cell and hidden states
ReLU	max(0, x)	[0,∞)	CNNs, default hidden activation since 2012
Leaky ReLU	max(0.01x, x)	(-∞,∞)	Avoids dead neurons
GELU	x·Φ(x)	(-0.17,∞)	BERT, GPT-2/3, most Transformers
SiLU/Swish	x·σ(x)	(-0.28,∞)	EfficientNet, some LLMs
SwiGLU	Swish(xW)⊙(xV)	(-∞,∞)	LLaMA, PaLM, modern LLM FFN

The ReLU Revolution

ReLU (Nair & Hinton, 2010; popularised by AlexNet, 2012) replaced sigmoid/tanh in hidden layers and made deep networks far easier to train. Its derivative is 1 for positive inputs, so gradients pass through active units unattenuated, which greatly mitigates the vanishing gradient problem of Chapter 9. It is also trivially cheap to compute.

text•ReLU and its gradient
ReLU(x)  =  max(0, x)
ReLU'(x) =  1 if x > 0,  else 0      # no attenuation for active units

⚠️

Pitfall: The Dying ReLU Problem

A ReLU neuron whose pre-activation is always negative outputs 0 and has zero gradient — it can never recover and is effectively dead. With a bad initialization or too-high learning rate, a large fraction of neurons can die, permanently reducing model capacity.

Fixes: Leaky ReLU (small negative slope keeps gradient alive), careful initialization (He init), and lower learning rates. GELU also avoids the hard zero, smoothly approaching zero for negative inputs.
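A dead unit is easy to reproduce. The sketch below assumes a single neuron whose bias has been pushed far negative (the value -10 is an arbitrary stand-in for the effect of one bad update) and compares the gradient that ReLU and Leaky ReLU would pass back.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))          # inputs at roughly unit scale
w = rng.normal(size=8) * 0.1
b = -10.0                               # assumed: bias knocked far negative

z = x @ w + b                           # pre-activation, negative for every example

relu_grad  = (z > 0).astype(float)      # dReLU/dz
leaky_grad = np.where(z > 0, 1.0, 0.01) # dLeakyReLU/dz with slope 0.01

print("fraction of examples with z > 0:", (z > 0).mean())   # 0.0
print("mean ReLU gradient      :", relu_grad.mean())         # 0.0
print("mean Leaky ReLU gradient:", leaky_grad.mean())        # 0.01
```

With ReLU, no example sends any gradient to `w` or `b`, so nothing can move the unit back into its active region. The small negative slope of Leaky ReLU keeps a weak signal flowing, which is why it can recover.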

GELU and SwiGLU: The Transformer Activations

Modern Transformers use smooth activations. GELU (Gaussian Error Linear Unit) multiplies each input x by Φ(x), the probability that a standard normal variable falls below x, giving a smooth, differentiable-everywhere alternative to ReLU. SwiGLU, a gated variant, powers the feed-forward networks of LLaMA and PaLM.

text•GELU and SwiGLU
GELU(x)   =  x · Φ(x)  ≈  0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])

SwiGLU(x) =  Swish(xW + b) ⊙ (xV + c)     # gated; W and V plus the FFN output projection give 3 matrices
Swish(x)  =  x · σ(βx)
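The tanh form is an approximation to x·Φ(x), and it is worth checking how close it is. This sketch computes the exact GELU from the error function in Python's standard library; the function names `gelu_exact` and `gelu_tanh` are illustrative.

```python
import math
import numpy as np

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))
    return x * phi

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x**3)))

x = np.linspace(-5, 5, 2001)
gap = np.abs(gelu_exact(x) - gelu_tanh(x)).max()
print(f"max |exact - tanh approx| on [-5, 5]: {gap:.1e}")
print(f"minimum of GELU: {gelu_exact(x).min():.3f}")   # -0.170
```

The two curves agree closely across the whole range, which is why the cheaper tanh form was an acceptable substitute in GPT-2. The minimum near -0.17 matches the lower end of the range listed in the activation table.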

Python•All activation functions and their gradients
import numpy as np

def sigmoid(x):  s = 1/(1+np.exp(-x)); return s, s*(1-s)
def tanh(x):     t = np.tanh(x);       return t, 1-t**2
def relu(x):                          return np.maximum(0,x), (x>0).astype(float)
def leaky_relu(x, a=0.01):       return np.where(x>0,x,a*x), np.where(x>0,1,a)

def gelu(x):  # tanh approximation (used in GPT-2)
    c = np.sqrt(2/np.pi)
    inner = c*(x + 0.044715*x**3)
    val = 0.5*x*(1+np.tanh(inner))
    # derivative omitted for brevity; use autograd in practice
    return val

def swish(x, beta=1.0):  return x*sigmoid(beta*x)[0]

# Key property: max gradient of each activation
x = np.linspace(-5, 5, 1000)
print(f"sigmoid max grad: {sigmoid(x)[1].max():.3f}")  # 0.25 -> vanishing
print(f"tanh    max grad: {tanh(x)[1].max():.3f}")     # 1.00 -> better
print(f"relu    max grad: {relu(x)[1].max():.3f}")     # 1.00 -> no attenuation

▶

ML Connection: Why LLaMA Uses SwiGLU

The feed-forward network in each Transformer block is where most parameters live (often 2/3 of the model). LLaMA and PaLM replaced the standard ReLU/GELU FFN with SwiGLU, which empirically improves perplexity at equal parameter count.

SwiGLU uses three weight matrices instead of two; to keep parameter count constant the hidden dimension is reduced from 4d to (8/3)d. You will implement this exact FFN in Chapter 13.
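The parameter-matching arithmetic can be checked directly: two matrices of size d×4d cost 8d² parameters, and three matrices of size d×(8/3)d also cost 8d². The sketch below assumes a bias-free FFN (as in LLaMA-style models) and uses the illustrative names `W_gate`, `W_up`, and `W_down` for the three SwiGLU matrices; it is not the Chapter 13 implementation.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))              # Swish with beta = 1

def ffn_standard_params(d):
    return d * 4 * d + 4 * d * d               # W_in (d x 4d) + W_out (4d x d)

def ffn_swiglu_params(d, hidden):
    return 3 * d * hidden                      # W_gate, W_up (d x h) + W_down (h x d)

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Bias-free SwiGLU feed-forward block: gate * value, then project back to d."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

d = 768
hidden = 8 * d // 3                            # (8/3) d = 2048 for d = 768
print(ffn_standard_params(d), ffn_swiglu_params(d, hidden))   # 4718592 4718592

rng = np.random.default_rng(0)
x = rng.normal(size=(4, d))
W_gate, W_up = rng.normal(size=(2, d, hidden)) * 0.02
W_down = rng.normal(size=(hidden, d)) * 0.02
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)              # (4, 768)
```

In practice (8/3)d is usually rounded to a hardware-friendly multiple, so real models match the parameter count only approximately.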

Before training begins, the weights must be set to something. This choice is not a minor detail: initialize too large and activations explode; too small and they vanish; both make the network untrainable. Proper initialization keeps the variance of activations and gradients stable across all layers.

The Variance Propagation Argument

Consider a linear layer y = Wx with fan-in n. If the weights and inputs are independent and zero-mean, each weight has variance Var(W), and each input has variance Var(x), then each output has variance n·Var(W)·Var(x). To keep Var(y) = Var(x), we need Var(W) = 1/n. This is the core insight behind all principled initialization schemes.

text•Variance-preserving initialization
Var(y_i) = n · Var(W) · Var(x)        # forward pass variance

Xavier/Glorot:  Var(W) = 2/(n_in + n_out)    # balances fwd & bwd, for tanh
He/Kaiming:     Var(W) = 2/n_in              # accounts for ReLU zeroing half

Xavier initialization (Glorot & Bengio, 2010) balances forward and backward variance for symmetric activations like tanh. He initialization (He et al., 2015) adds a factor of 2 because ReLU zeroes roughly half the pre-activations, which halves the second moment of the signal passed to the next layer; doubling Var(W) restores it.
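Both claims, the n·Var(W)·Var(x) rule and the factor of 2 for ReLU, can be checked on a single layer. This is a minimal sketch with an arbitrary width and batch size; the helper `layer_stat` is illustrative. After a ReLU it reports the second moment E[y²], since that is the quantity the next layer's variance depends on.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 512, 2000

def layer_stat(weight_var, relu):
    x = rng.normal(size=(batch, n))                           # zero mean, Var(x) = 1
    W = rng.normal(scale=np.sqrt(weight_var), size=(n, n))
    y = x @ W
    if relu:
        return (np.maximum(0, y) ** 2).mean()                 # second moment after ReLU
    return y.var()                                            # variance of the linear output

v_lin   = layer_stat(1.0 / n, relu=False)   # expect about 1.0: variance preserved
m_naive = layer_stat(1.0 / n, relu=True)    # expect about 0.5: ReLU halves the signal
m_he    = layer_stat(2.0 / n, relu=True)    # expect about 1.0: He's factor of 2 restores it

print(f"linear, Var(W)=1/n: {v_lin:.3f}")
print(f"ReLU,   Var(W)=1/n: {m_naive:.3f}")
print(f"ReLU,   Var(W)=2/n: {m_he:.3f}")
```

A loss of one half per layer looks harmless, but compounded over 50 layers it shrinks the signal by a factor of 2⁵⁰, which is why the correction matters.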

Python•Initialization schemes and their effect on signal propagation
import numpy as np

def test_init(init_fn, depth=50, width=256):
    """Pass a signal through `depth` ReLU layers; track activation std."""
    x = np.random.randn(1000, width)  # batch of 1000
    stds = [x.std()]
    for _ in range(depth):
        W = init_fn(width, width)
        x = np.maximum(0, x @ W)       # ReLU layer
        stds.append(x.std())
    return stds

# Too small: signal vanishes
tiny  = lambda i,o: np.random.randn(i,o) * 0.01
# Too large: signal explodes
huge  = lambda i,o: np.random.randn(i,o) * 1.0
# He init: signal preserved
he    = lambda i,o: np.random.randn(i,o) * np.sqrt(2.0/i)

print(f"tiny: layer 50 std = {test_init(tiny)[-1]:.2e}")  # ~1e-48 to 1e-47: vanished
print(f"huge: layer 50 std = {test_init(huge)[-1]:.2e}")  # ~1e+52 to 1e+53: exploded
print(f"He:   layer 50 std = {test_init(he)[-1]:.2e}")    # order 1: stable!

✧

Train Note: Transformer Initialization in Practice

GPT-2 initializes weights from a normal distribution with mean 0 and standard deviation 0.02, and scales residual-projection weights by 1/√(2N) where N is the number of layers, preventing residual-stream variance from growing with depth.

Modern LLMs often use 'mu-parametrization' (μP) to make optimal hyperparameters transfer across model scales, and many use small fixed std (0.006–0.02). Initialization remains an active area: small changes measurably affect training stability at scale.
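The reason for the 1/√(2N) factor is that a Transformer with N layers adds 2N branch outputs (one attention and one FFN per layer) into the residual stream, and independent contributions add in variance. The toy model below is an assumption-heavy sketch, not GPT-2 itself: each branch is replaced by a random projection of fresh unit-variance noise, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers, batch = 256, 48, 512
n_branches = 2 * n_layers                         # attention + FFN in every layer

def final_stream_std(scale_residual):
    std = 0.02 / np.sqrt(n_branches) if scale_residual else 0.02
    x = rng.normal(size=(batch, d)) * 0.02        # toy residual stream
    for _ in range(n_branches):
        u = rng.normal(size=(batch, d))           # stand-in for a unit-variance branch activation
        W_proj = rng.normal(size=(d, d)) * std    # residual output projection
        x = x + u @ W_proj                        # residual add
    return x.std()

plain  = final_stream_std(scale_residual=False)
scaled = final_stream_std(scale_residual=True)
print(f"unscaled projections: final std = {plain:.2f}")
print(f"scaled by 1/sqrt(2N): final std = {scaled:.2f}")
print(f"ratio = {plain / scaled:.1f}   (sqrt(2N) = {np.sqrt(n_branches):.1f})")
```

Without the scaling, the stream's standard deviation grows like √(2N); with it, the total contribution of all branches stays at the scale of a single unscaled branch regardless of depth.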

Even with good initialization, the distribution of activations shifts during training as weights update, a phenomenon Ioffe & Szegedy (2015) named internal covariate shift. Normalization layers re-center and re-scale activations on the fly, which in practice accelerates and stabilises training considerably. The choice of normalization is one of the defining differences between CNN-era and Transformer-era architectures.

Batch Normalization

BatchNorm (Ioffe & Szegedy, 2015) normalizes each feature across the batch dimension, then applies a learned scale γ and shift β. It revolutionised CNN training but has a critical weakness for sequence models: it couples examples in a batch and behaves differently at train vs. inference time.

text•Batch Normalization (per feature, across batch)
μ_B = (1/m) Σᵢ xᵢ           # batch mean
σ²_B = (1/m) Σᵢ (xᵢ - μ_B)²   # batch variance
x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)  # normalize
yᵢ = γ x̂ᵢ + β              # learned scale & shift

Layer Normalization

LayerNorm (Ba et al., 2016) normalizes across the feature dimension for each example independently. This makes it batch-size independent and identical at train and inference time, which is exactly what sequence models need. Nearly every Transformer uses LayerNorm or its RMS variant.

BatchNorm	LayerNorm
Normalizes across the batch dimension	Normalizes across the feature dimension
Couples examples in a batch	Each example normalized independently
Different behaviour train vs. inference	Identical at train and inference
Breaks with batch size 1 or variable length	Works with any batch size or length
Dominant in CNNs (ResNet, etc.)	Dominant in Transformers
Needs running statistics for inference	No running statistics needed
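The "couples examples" row of the table can be demonstrated in a few lines. This sketch omits the learned γ and β and uses training-mode batch statistics; the function names carry a `_plain` suffix to signal that they are simplified stand-ins.

```python
import numpy as np

def batch_norm_plain(x, eps=1e-5):   # normalize each feature across the batch (axis 0)
    return (x - x.mean(0, keepdims=True)) / np.sqrt(x.var(0, keepdims=True) + eps)

def layer_norm_plain(x, eps=1e-5):   # normalize each example across its features (axis 1)
    return (x - x.mean(1, keepdims=True)) / np.sqrt(x.var(1, keepdims=True) + eps)

rng = np.random.default_rng(0)
example = rng.normal(size=(1, 16))
batch_a = np.vstack([example, rng.normal(size=(7, 16))])
batch_b = np.vstack([example, rng.normal(size=(7, 16)) * 5 + 3])   # different batch mates

bn_same = np.allclose(batch_norm_plain(batch_a)[0], batch_norm_plain(batch_b)[0])
ln_same = np.allclose(layer_norm_plain(batch_a)[0], layer_norm_plain(batch_b)[0])
print("BatchNorm output for the same example unchanged:", bn_same)   # False
print("LayerNorm output for the same example unchanged:", ln_same)   # True
```

The same input row produces different BatchNorm outputs depending on what else is in the batch, whereas LayerNorm's output depends only on the row itself.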

RMSNorm: The Modern Simplification

RMSNorm (Zhang & Sennrich, 2019) drops the mean-centering of LayerNorm, normalizing only by the root-mean-square of the activations. It is cheaper to compute and typically matches LayerNorm's quality in practice. LLaMA, Gemma, and many other recent LLMs use RMSNorm.

text•LayerNorm vs RMSNorm
LayerNorm:  y = γ · (x - μ) / √(σ² + ε) + β     # center AND scale
RMSNorm:    y = γ · x / √(mean(x²) + ε)          # scale only, no centering

Python•All three normalizations from scratch
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):  # x: (batch, features)
    mu  = x.mean(axis=0, keepdims=True)       # across BATCH
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma, beta, eps=1e-5):  # x: (batch, features)
    mu  = x.mean(axis=1, keepdims=True)       # across FEATURES
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):  # no mean-centering, no beta
    rms = np.sqrt((x**2).mean(axis=1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(4, 768) * 10 + 5  # large-mean activations
g = np.ones(768); b = np.zeros(768)

ln = layer_norm(x, g, b)
print(f"LayerNorm: per-row mean={ln.mean(1).mean():.4f}, std={ln.std(1).mean():.4f}")
# LayerNorm: per-row mean=0.0000, std=1.0000

rn = rms_norm(x, g)
print(f"RMSNorm:   per-row RMS={np.sqrt((rn**2).mean(1)).mean():.4f}")
# RMSNorm:   per-row RMS=1.0000 (but mean is NOT zeroed)

▶

ML Connection: Pre-LN vs Post-LN Transformers

Where you place LayerNorm matters enormously. The original Transformer used Post-LN (norm AFTER the residual add); modern LLMs use Pre-LN (norm BEFORE the sublayer). Pre-LN leaves the residual stream as an unnormalized identity path, which makes very deep Transformers much easier to train and far less dependent on careful learning-rate warmup.

You will see in Chapter 13 that the Pre-LN block computes: x + Attention(LN(x)) and x + FFN(LN(x)). This placement is a large part of why Transformers with 100+ layers train stably where Post-LN stacks tend to diverge.
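The difference between the two placements is one line of code. The sketch below uses a parameter-free normalization and a hypothetical sublayer that outputs zeros, purely to expose what happens to the skip path; it is not the full block from Chapter 13.

```python
import numpy as np

def norm(x, eps=1e-5):                       # LayerNorm without learned gamma/beta
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    return norm(x + sublayer(x))             # original Transformer: norm after the residual add

def pre_ln(x, sublayer):
    return x + sublayer(norm(x))             # modern LLMs: norm inside the branch only

zero_branch = lambda h: np.zeros_like(h)     # hypothetical sublayer that contributes nothing

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8)) * 10 + 5      # (B, T, d)

print(np.allclose(pre_ln(x, zero_branch), x))    # True: the skip path is a pure identity
print(np.allclose(post_ln(x, zero_branch), x))   # False: the stream itself is re-normalized
```

In Pre-LN the input reaches the output untouched, so gradients have a direct path through any number of stacked blocks. In Post-LN every block rescales the stream, and those rescalings compound with depth.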

Dropout (Srivastava et al., 2014) is a regularization technique that randomly sets a fraction p of activations to zero during each training step. This prevents neurons from co-adapting — relying on specific other neurons — and forces redundant, robust representations. At inference, dropout is disabled and activations are used in full.

Inverted Dropout

The standard implementation is inverted dropout: during training, surviving activations are scaled up by 1/(1-p) so the expected value is preserved. This lets inference run with no scaling at all. Forgetting the training-time scaling, or applying it a second time at inference, is a common source of dropout bugs.

text•Inverted dropout (training time)
mask ~ Bernoulli(1 - p)           # 1 with prob (1-p), else 0
y = (x ⊙ mask) / (1 - p)           # zero some, scale up the rest

# At inference: y = x  (no mask, no scaling)

Python•Inverted dropout from scratch
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout. p = fraction to drop."""
    if not training or p == 0:
        return x                          # inference: identity
    mask = (np.random.rand(*x.shape) > p)  # keep with prob (1-p)
    return x * mask / (1 - p)           # zero & scale up survivors

# Expected value is preserved
x = np.ones(10000)
y = dropout(x, p=0.5, training=True)
print(f"Input mean:  {x.mean():.3f}")   # 1.000
print(f"Dropout mean: {y.mean():.3f}")  # ~1.000 (preserved by 1/(1-p) scaling)
print(f"Fraction zeroed: {(y==0).mean():.3f}")  # ~0.500

✧

Intuition: Dropout as Ensemble Averaging

Each training step uses a different random subnetwork (different dropout mask). A network with n neurons has 2^n possible subnetworks, and dropout trains an exponentially large ensemble of them with shared weights.

At inference, using all neurons with no mask and no extra scaling (the 1/(1-p) factor was already applied during training) approximates averaging this entire ensemble. It is a cheap approximation to model averaging, which is one of the most reliable ways to improve generalization.
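The ensemble view can be tested on the simplest possible case, a single linear unit fed by dropped-out inputs. This is a Monte Carlo sketch with arbitrary sizes; for a linear unit the match is exact in expectation, while for a deep nonlinear network the full pass is only an approximation to the ensemble average.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = rng.normal(size=20)
w = rng.normal(size=20)

full = x @ w                                   # inference pass: all units, no mask

# Average the outputs of many randomly thinned subnetworks (training-mode passes)
n_masks = 100_000
masks = rng.random((n_masks, 20)) > p          # each row is one dropout mask
ensemble = ((x * masks / (1 - p)) @ w).mean()  # inverted-dropout scaling, then average

print(f"full network : {full:.3f}")
print(f"mask average : {ensemble:.3f}")
```

The two numbers agree up to sampling noise, even though 100,000 masks cover only a fraction of the 2²⁰ possible subnetworks.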

✧

Train Note: Dropout in Transformers

Transformers apply dropout in several places: after attention softmax (attention dropout), after each sublayer's output before the residual add (residual dropout), and on the embedding sum. Typical rates are 0.1 for large models, higher (0.3) for smaller models on smaller datasets.

The largest modern LLMs often use little or no dropout during pretraining — with internet-scale data, overfitting is less of a concern than underfitting, and the regularization comes from data diversity instead.

The loss function defines what the network optimizes. Chapter 4 derived cross-entropy from information theory; here we connect it to the network's output layer and survey the practical loss functions you will use.

Task	Output layer	Loss function
Binary classification	1 unit + sigmoid	Binary cross-entropy
Multiclass classification	K units + softmax	Categorical cross-entropy
Language modeling	V units + softmax	Cross-entropy over vocabulary
Regression	1+ linear units	Mean squared error (MSE)
Robust regression	1+ linear units	Huber / mean absolute error
Multi-label	K units + sigmoid	Sum of binary cross-entropies
Contrastive / retrieval	Normalized embeddings	InfoNCE / triplet loss

The pairing of softmax output with cross-entropy loss is special: as shown in Chapter 4, their combined gradient with respect to the logits simplifies to the elegant (prediction − target). This is why nearly every classifier and language model uses this exact pairing: the gradient is cheap to compute, and the fused log-softmax form is numerically stable.
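The (prediction − target) identity is easy to verify with finite differences. This sketch checks it for one example with five classes; the logits and the target index are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(z, target):
    return -np.log(softmax(z)[target])

rng = np.random.default_rng(0)
z = rng.normal(size=5)                     # logits for one example
target = 2

analytic = softmax(z).copy()
analytic[target] -= 1.0                    # prediction minus one-hot target

# Central finite differences on the loss, one logit at a time
h = 1e-5
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = h
    numeric[i] = (cross_entropy(z + dz, target) - cross_entropy(z - dz, target)) / (2 * h)

print("max |analytic - numeric| =", np.abs(analytic - numeric).max())
```

The same finite-difference pattern is a general-purpose gradient check for any hand-written backward pass.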

▶

ML Connection: The LM Loss Is Just Cross-Entropy

A language model's training loss is categorical cross-entropy applied at every token position: for each position, the softmax output over the V-token vocabulary is compared to the one-hot true next token. The total loss is the mean over all positions.

This single loss, next-token cross-entropy, is the pretraining objective of GPT, LLaMA, Claude, and every other autoregressive LLM. Much of the capability of these models emerges from minimizing this one simple objective at scale; later fine-tuning stages build on the pretrained model.
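In array terms, the loss described above is a gather over a (B, T, V) tensor of log-probabilities. The sketch below assumes the common layout in which the logits at position t are scored against the token at position t+1; the function name `lm_loss` and the sizes are illustrative.

```python
import numpy as np

def lm_loss(logits, tokens):
    """logits: (B, T, V) predictions; tokens: (B, T) ids. Position t predicts token t+1."""
    logits, targets = logits[:, :-1], tokens[:, 1:]            # drop last prediction, first token
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # stable log-softmax
    B, Tm1 = targets.shape
    picked = log_probs[np.arange(B)[:, None], np.arange(Tm1)[None, :], targets]
    return -picked.mean()                                      # mean over all positions

rng = np.random.default_rng(0)
B, T, V = 2, 6, 50
tokens = rng.integers(0, V, size=(B, T))
uniform_logits = np.zeros((B, T, V))                           # a model that knows nothing
print(lm_loss(uniform_logits, tokens), np.log(V))              # both equal log(V)
```

A model that spreads probability uniformly over the vocabulary scores exactly log V, which is a useful sanity value: an untrained language model's first reported loss should be close to it.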

We now assemble everything — linear layers, activations, initialization, normalization, dropout, and loss — into a complete, trainable MLP. This is a miniature version of the architecture you will scale up to the Transformer.

Python•A complete MLP class from scratch
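import numpy as np

class MLP:
    """Multi-layer perceptron with He init, ReLU, dropout, and softmax-CE."""
    def __init__(self, sizes, dropout=0.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W, self.b = [], []
        for nin, nout in zip(sizes[:-1], sizes[1:]):
            self.W.append(self.rng.normal(0, np.sqrt(2/nin), (nin, nout)))  # He init
            self.b.append(np.zeros(nout))
        self.dropout = dropout

    def forward(self, x, training=True):
        self.cache, self.masks = [x], []   # layer inputs; local gradients of hidden layers
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = x @ W + b
            if i < len(self.W) - 1:  # hidden layers
                mask = (z > 0).astype(float)            # ReLU derivative
                if training and self.dropout > 0:
                    keep = self.rng.random(z.shape) > self.dropout
                    mask *= keep / (1 - self.dropout)   # inverted dropout
                x = z * mask                            # ReLU and dropout in one multiply
                self.masks.append(mask)
            else:  # output layer: stable softmax
                z = z - z.max(axis=1, keepdims=True)
                x = np.exp(z); x /= x.sum(axis=1, keepdims=True)
            self.cache.append(x)
        return x

    def loss_and_grad(self, probs, y):  # y: integer labels
        N = len(y)
        loss = -np.log(probs[np.arange(N), y] + 1e-12).mean()
        grad = probs.copy(); grad[np.arange(N), y] -= 1; grad /= N  # softmax-CE grad (dL/dlogits)
        return loss, grad

    def backward(self, grad, lr):
        """Backpropagate dL/dlogits through every layer, then take one SGD step."""
        self.last_grads = []
        for i in reversed(range(len(self.W))):
            dW = self.cache[i].T @ grad                 # cache[i] is the input to layer i
            db = grad.sum(axis=0)
            if i > 0:  # push the gradient through the previous ReLU + dropout
                grad = (grad @ self.W[i].T) * self.masks[i - 1]
            self.W[i] -= lr * dW; self.b[i] -= lr * db
            self.last_grads += [dW, db]

# This MLP -- linear layers + ReLU + dropout + softmax-CE -- contains
# every concept needed to understand a Transformer's feed-forward network.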

Shape Trace Through the MLP

Tracking tensor shapes is one of the most valuable debugging skills. Here is the shape of data flowing through a 784→256→10 MLP (MNIST classifier) with a batch of 32:

Shape Trace: MLP forward pass (batch=32)

Operation	Shape	Note
input x	(32, 784)	flattened 28×28 image
x @ W1 + b1	(32, 256)	linear: 784→256
ReLU	(32, 256)	elementwise, shape unchanged
dropout	(32, 256)	mask + scale, shape unchanged
x @ W2 + b2	(32, 10)	linear: 256→10 (logits)
softmax	(32, 10)	row-wise probabilities
cross-entropy	(scalar)	mean over batch
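The trace above can be turned into executable checks. This sketch runs the same 784→256→10 forward pass on random stand-in data (not real MNIST) and asserts the shape after every operation; the dropout rate of 0.2 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 32
x = rng.random((B, 784))                        # stand-in for flattened 28x28 images
labels = rng.integers(0, 10, size=B)
W1, b1 = rng.normal(0, np.sqrt(2 / 784), (784, 256)), np.zeros(256)
W2, b2 = rng.normal(0, np.sqrt(2 / 256), (256, 10)), np.zeros(10)

z1 = x @ W1 + b1
assert z1.shape == (B, 256)                     # linear: 784 -> 256
h = np.maximum(0, z1)
assert h.shape == (B, 256)                      # ReLU: shape unchanged
keep = rng.random(h.shape) > 0.2
h = h * keep / 0.8
assert h.shape == (B, 256)                      # dropout: shape unchanged
logits = h @ W2 + b2
assert logits.shape == (B, 10)                  # linear: 256 -> 10
logits = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert probs.shape == (B, 10)                   # softmax: row-wise probabilities
loss = -np.log(probs[np.arange(B), labels]).mean()
assert np.ndim(loss) == 0                       # cross-entropy: scalar
print("all shapes match the trace; loss =", float(loss))
```

Leaving assertions like these in a model during development turns a silent broadcasting mistake into an immediate, readable failure.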

Training a neural network is an empirical science. Loss curves, gradient norms, and activation statistics tell you what is happening inside the network. Learning to read these signals is what separates practitioners who can debug from those who can only guess.

Reading Loss Curves

Symptom	Likely cause	Fix
Loss is NaN immediately	Exploding activations / log(0)	Lower LR, gradient clip, check init
Loss flat, never decreases	LR too low, or dead ReLUs	Raise LR, check init, use LeakyReLU
Loss decreases then explodes	LR too high	Lower LR, add warmup, clip gradients
Train loss ↓, val loss ↑	Overfitting	More data, dropout, weight decay, early stop
Both losses plateau high	Underfitting / too small	More capacity, train longer, better features
Loss oscillates wildly	Batch too small / LR too high	Larger batch, lower LR, more momentum

Python•Code Lab: training an MLP on MNIST with diagnostics
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)
X = X / 255.0; y = y.astype(int)  # normalize to [0,1]
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

net = MLP([784, 256, 128, 10], dropout=0.2)
lr, batch = 0.1, 128

for epoch in range(10):
    perm = np.random.permutation(len(Xtr))
    for i in range(0, len(Xtr), batch):
        idx = perm[i:i+batch]
        probs = net.forward(Xtr[idx], training=True)
        loss, grad = net.loss_and_grad(probs, ytr[idx])
        net.backward(grad, lr)            # (backward + SGD update)
    # Diagnostics each epoch
    test_acc = (net.forward(Xte, training=False).argmax(1) == yte).mean()
    grad_norm = np.sqrt(sum((g**2).sum() for g in net.last_grads))
    print(f"epoch {epoch}: loss={loss:.3f} test_acc={test_acc:.3f} |grad|={grad_norm:.2f}")
# Illustrative output (exact numbers vary with the random seed and library versions):
# epoch 0: loss=0.412 test_acc=0.928 |grad|=1.84
# epoch 9: loss=0.087 test_acc=0.978 |grad|=0.43  -- healthy convergence

✧

Train Note: The Overfit-One-Batch Test

Before launching a long training run, verify your model can overfit a single batch of ~10 examples to near-zero loss. If it cannot, you have a bug — in the model, the loss, the gradients, or the data pipeline.

This 60-second test catches the majority of implementation bugs before they waste hours of compute. It is the first thing experienced practitioners do with any new model.

Everything in this chapter is a literal component of the Transformer you will build in Chapter 13. The Transformer is not a fundamentally new kind of network — it is a particular arrangement of the same primitives: linear layers, activations, normalization, dropout, and residual connections.

This chapter	Role in the Transformer block
Linear layer (W x + b)	Q/K/V projections; FFN layers; output projection
Activation (GELU/SwiGLU)	The nonlinearity inside the feed-forward network
LayerNorm / RMSNorm	Applied before attention and FFN (Pre-LN)
Dropout	After attention softmax and after each sublayer
He / scaled init	Initializes all projection matrices
Residual connection	x + sublayer(x) — the gradient highway from Ch. 9
Softmax + cross-entropy	The attention weights AND the final LM loss

The Transformer Block Preview

Here is the Pre-LN Transformer block you will build in Chapter 13, drawn as a stack. Notice that every box is a component from this chapter — only 'Multi-Head Attention' (Chapter 12) is new:

Arch Stack: Pre-LN Transformer Block

+ residual add	x + FFN_out
Dropout	p = 0.1
Feed-Forward (SwiGLU)	d → (8/3)d → d
LayerNorm / RMSNorm	normalize
+ residual add	x + Attn_out
Dropout	p = 0.1
Multi-Head Attention	(Chapter 12)
LayerNorm / RMSNorm	normalize
input x	(B, T, d)

✧

You Already Understand 80% of a Transformer

Of the eight boxes in the Transformer block above, seven are components you have now built from scratch: linear layers, GELU/SwiGLU, LayerNorm, dropout, residual connections, and the softmax. Only multi-head attention remains.

Chapter 11 sharpens your backpropagation skills to handle these deep stacks; Chapter 12 builds the one missing piece, attention; and Chapter 13 assembles them all into the full Transformer. The hard conceptual work is largely behind you.

Component Quick-Reference

Component	Purpose	Default choice (2024)
Hidden activation	Nonlinearity	GELU or SwiGLU
Output activation	Map to task	Softmax (classification), linear (regression)
Initialization	Stable signal propagation	He (ReLU), scaled-by-depth (Transformer)
Normalization	Stabilize activations	RMSNorm (LLMs), LayerNorm (general)
Regularization	Prevent overfitting	Dropout 0.1, weight decay 0.1
Loss	Define objective	Cross-entropy (classification/LM)
Optimizer	Update weights	AdamW (from Chapter 2)

Exercises

Exercises 1–10 are pen-and-paper; 11–20 require code.

✎

Exercise 1: Pen & Paper

Prove that a single perceptron cannot represent XOR. Show that no weights w₁, w₂, b satisfy all four XOR constraints simultaneously.

✎

Exercise 2: Pen & Paper

Show that an MLP with linear (identity) activations is equivalent to a single linear layer, regardless of depth. Why does this make nonlinearities essential?

✎

Exercise 3: Pen & Paper

Construct explicit weights for a 2→2→1 MLP with ReLU that computes XOR exactly. (Hint: one hidden unit computes OR, the other AND.)

✎

Exercise 4: Pen & Paper

Derive the gradient of GELU(x) = x·Φ(x) using the product rule. Express Φ'(x) in terms of the standard normal density.

✎

Exercise 5: Derive

Derive the He initialization variance Var(W) = 2/n_in for a ReLU layer. Account for the fact that ReLU zeroes half the inputs, halving the variance.

✎

Exercise 6: Pen & Paper

Show that LayerNorm is invariant to positive scaling and constant shifting of its input: LN(αx + c) = LN(x) for any scalar α > 0 and scalar c (before the learned γ, β, and ignoring ε). Why is this property useful?

✎

Exercise 7: Pen & Paper

Inverted dropout scales survivors by 1/(1-p). Prove this preserves the expected activation E[y] = E[x], so no scaling is needed at inference.

✎

Exercise 8: Pen & Paper

Compare the parameter count of a standard ReLU FFN (d→4d→d) with a SwiGLU FFN. What hidden dimension makes SwiGLU parameter-matched to the standard FFN?

✎

Exercise 9: Pen & Paper

Explain why BatchNorm fails for a Transformer processing variable-length sequences with batch size 1, while LayerNorm works fine.

✎

Exercise 10: Pen & Paper

A 50-layer network with sigmoid activations (max gradient 0.25) is trained by backprop. Estimate the gradient magnitude reaching layer 1 relative to the output. Repeat for ReLU.

✎

Exercise 11: Code

Implement a perceptron and confirm it learns AND, OR, NAND but fails on XOR. Plot the decision boundary it converges to for each.

✎

Exercise 12: Code

Implement an MLP that solves XOR from scratch. Visualize the hidden-layer representation of the 4 input points — show they become linearly separable.

✎

Exercise 13: Code

Plot all 7 activation functions and their derivatives on [-5, 5]. Annotate the maximum gradient of each and the regions where each saturates.

✎

Exercise 14: Code Lab

Reproduce the signal-propagation experiment: pass a signal through 50 ReLU layers with tiny, huge, and He initialization. Plot activation std vs. depth for each. Confirm He preserves the signal.

✎

Exercise 15: Code

Implement BatchNorm, LayerNorm, and RMSNorm from scratch. Apply each to a (4, 768) tensor with large mean and verify the normalization properties of each.

✎

Exercise 16: Code

Implement inverted dropout. Empirically verify that the expected activation is preserved across 10,000 trials at p = 0.1, 0.3, 0.5.

✎

Exercise 17: Code Lab

Build and train the complete MLP class on MNIST. Achieve >97% test accuracy. Plot train and validation loss curves and identify the onset of overfitting.

✎

Exercise 18: Code

Ablation study: train the MNIST MLP with and without (a) He init, (b) dropout, (c) LayerNorm. Report the effect of each on final test accuracy and training stability.

✎

Exercise 19: Code

Implement the overfit-one-batch test: confirm your MLP can drive the loss on a 10-example batch to near zero. Then deliberately introduce a bug (e.g., wrong gradient sign) and show the test catches it.

✎

Exercise 20: Code (Challenge)

Build the Pre-LN block skeleton from Section 10.10 with a placeholder attention function (identity). Verify shapes flow correctly through LayerNorm, dropout, FFN, and residual adds for a (8, 64, 256) input. In Chapter 12 you will drop in real attention.

Further reading: “Deep Learning” (Goodfellow, Bengio, Courville, 2016) Chapters 6–8 — the canonical reference for MLPs, regularization, and optimization. “Delving Deep into Rectifiers” (He et al., 2015) for He initialization. “Layer Normalization” (Ba et al., 2016) and “Root Mean Square Layer Normalization” (Zhang & Sennrich, 2019). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” (Srivastava et al., 2014). Michael Nielsen's free online book “Neural Networks and Deep Learning” for unmatched intuition.

Next → Chapter 11: Backpropagation in Depth

You have built MLPs and updated their weights, but the backward passes so far have been hand-derived for each specific network. Chapter 11 generalizes this: we build a complete automatic differentiation engine that computes gradients for any computational graph. You will understand exactly how PyTorch and JAX work under the hood, build a working autograd system, and gain the skills to debug gradients through the deep stacks of a Transformer.

