Multilayer Perceptrons

Dive into Deep Learning · §4.1

one kink between affine layers · XOR untangled · any function, hinge by hinge · why depth beats width.

A linear model draws one straight boundary

Motivation

Softmax regression is a single affine map: monotonic, line-shaped decisions.

Body temperature → risk rises on both sides of 37°C.
Cat vs dog: pixel (13,17) means nothing without its neighbours.
XOR: a line provably cannot separate it.

The fix: learn the features, keep the linear predictor on top. A two-unit net computes XOR exactly, and depth multiplies what width merely adds.

Two classes that no single straight line can separate.

From Linear to Nonlinear

hidden layers, and why they need a kink

The idea: insert hidden layers

Architecture

Stack fully-connected layers. The middle ones are hidden: neither input nor output. Every unit sees every unit below it.

We read the first layers as a learned representation and the last as a linear predictor on top of it.

One hidden layer: 4 inputs, 5 hidden units, 3 outputs, all fully connected.

One hidden layer, written out

Architecture

For a minibatch \mathbf{X} \in \mathbb{R}^{n \times d}, hidden width h, and q outputs:

\mathbf{H} = \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \qquad \mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.

Two weight matrices, two biases. It looks like we have bought ourselves a more powerful model.
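To make the shapes concrete, here is a minimal sketch of the two affine maps with the 4-5-3 sizes from the figure. The batch size n = 2, the random seed, and the use of plain NumPy (rather than the MXNet `np` used elsewhere in this deck) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n, d, h, q = 2, 4, 5, 3         # n is an assumed batch size; d, h, q match the figure

X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, q)), np.zeros(q)

H = X @ W1 + b1   # hidden layer, shape (n, h); b1 broadcasts over the batch
O = H @ W2 + b2   # output layer, shape (n, q)

# Two weight matrices plus two biases.
num_params = W1.size + b1.size + W2.size + b2.size
print(H.shape, O.shape, num_params)
```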

But two affine maps collapse into one

The catch

Substitute \mathbf{H} into the output layer:

\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\,\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X}\,\underbrace{\mathbf{W}^{(1)}\mathbf{W}^{(2)}}_{=\,\mathbf{W}} + \underbrace{\mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}}_{=\,\mathbf{b}}.

An affine function of an affine function is still affine. The hidden layer added zero expressive power.

Stacking linear layers is wasted effort: we are back to plain softmax regression.
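The collapse is easy to check numerically. The sketch below uses random weights of assumed sizes in plain NumPy, and compares the stacked computation with the single merged map \mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}, \mathbf{b} = \mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n, d, h, q = 6, 4, 5, 3         # illustrative sizes, not from the slides

X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=h)
W2, b2 = rng.normal(size=(h, q)), rng.normal(size=q)

# Two stacked affine layers with no nonlinearity in between.
O_stacked = (X @ W1 + b1) @ W2 + b2

# The single affine map obtained by substituting H into the output layer.
W = W1 @ W2
b = b1 @ W2 + b2
O_merged = X @ W + b

max_gap = np.abs(O_stacked - O_merged).max()
print(max_gap)  # differs only by floating-point rounding
```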

The missing ingredient: a nonlinearity

The fix

Apply an elementwise nonlinearity \sigma after every hidden affine map:

\mathbf{H} = \sigma\!\left(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right),\qquad \mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.

Now the layers can no longer be merged: the network bends, folds, and curves its decision surface. These two ingredients, affine maps and elementwise nonlinearities, are the building blocks of every architecture in this book.
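One way to see that the ReLU version is no longer affine: an affine map always sends the midpoint of two inputs to the midpoint of their outputs, so a single violation is enough. The sketch below uses hand-picked, hypothetical weights (not from the slides) for which the network computes |x|.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(X, W1, b1, W2, b2):
    """One hidden layer: affine, ReLU, affine."""
    H = relu(X @ W1 + b1)
    return H @ W2 + b2

# Hypothetical weights: the net computes |x| = ReLU(x) + ReLU(-x).
W1, b1 = np.array([[1.0, -1.0]]), np.zeros(2)
W2, b2 = np.array([[1.0], [1.0]]), np.zeros(1)

a, b = np.array([[-1.0]]), np.array([[1.0]])
at_midpoint = mlp((a + b) / 2, W1, b1, W2, b2)                        # f(0)
mean_of_ends = (mlp(a, W1, b1, W2, b2) + mlp(b, W1, b1, W2, b2)) / 2  # (f(-1) + f(1)) / 2

# An affine map would make these two numbers equal; here they are 0 and 1.
print(at_midpoint.item(), mean_of_ends.item())
```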

A Concrete Win: XOR

one ReLU layer untangles the impossible case

XOR: impossible for a line, easy after a fold

Why nonlinearity matters

Label each corner of the unit square by whether its coordinates differ. The two classes sit on opposite diagonals (left), so no straight line works.

One hidden layer \mathbf{h} = \operatorname{ReLU}(\mathbf{x}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}) then folds the two label-1 corners onto the same point (right), and now a single line separates the two classes.

Left: XOR in the input space, not linearly separable. Right: after the ReLU hidden map the class-1 corners coincide and a line works.

First receipt: all four corners, exactly right

XOR · verified

With \mathbf{W}^{(1)} = \left(\begin{smallmatrix}1 & 1\\ 1 & 1\end{smallmatrix}\right), \mathbf{b}^{(1)} = (0,\,{-1}), \mathbf{w}^{(2)} = (1,\,{-2})^\top and a ReLU, pushing all four corners through by hand gives

x_1	x_2	\mathbf{h} = \operatorname{ReLU}(\mathbf{x}\mathbf{W}^{(1)} + \mathbf{b}^{(1)})	o = h_1 - 2h_2	XOR
0	0	(0,\ 0)	0	0 ✓
0	1	(1,\ 0)	1	1 ✓
1	0	(1,\ 0)	1	1 ✓
1	1	(2,\ 1)	0	0 ✓

We constructed these weights; the rest of the book is about having optimization discover such representations. Watch that happen live on the XOR and spiral datasets at the TensorFlow Playground (playground.tensorflow.org).
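The hand computation in the table can be replayed in a few lines. This sketch uses plain NumPy (an assumption; the rest of the deck uses MXNet) with exactly the weights stated above.

```python
import numpy as np

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

# All four corners of the unit square.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

H = np.maximum(X @ W1 + b1, 0.0)  # ReLU hidden layer
o = H @ w2                        # o = h_1 - 2 h_2
xor = np.logical_xor(X[:, 0], X[:, 1]).astype(float)

for x, h, out, target in zip(X, H, o, xor):
    print(x, h, out, target)
```

Both label-1 corners land on the same hidden point (1, 0), which is the fold shown in the figure.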

How far does this go? Universal approximation

Expressive power

Universal approximation theorem. A single hidden layer with enough units can approximate any continuous function on a compact (closed and bounded) domain to arbitrary accuracy, for any continuous non-polynomial \sigma, ReLU included (Cybenko 1989; Leshno et al. 1993).

“Enough units” can be exponentially many; the theorem says a fit exists, not that SGD finds it, nor that it generalizes.

This is why we reach for depth: a deep net often represents the same function far more compactly than a shallow one would, trading width for layers.

Why it is plausible: one hinge at a time

Expressive power

For a scalar input x, each ReLU unit contributes a hinge a_k\operatorname{ReLU}(x - t_k) with a single kink at t_k. With D units the output has at most D kinks, so it is piecewise linear with at most D+1 pieces. Approximating a curve is then just fitting a polyline: more joints, less error.

Three hinges (left) sum to a 4-piece polyline that tracks the smooth target (right); the shaded band is the error.
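The polyline picture can be sketched directly. The code below builds a sum of ReLU hinges that interpolates a smooth target at evenly spaced joints and reports the worst-case error as the number of hinges grows. The target (sin on [0, π]), the even spacing, and the helper names are assumptions for illustration; the first hinge sits at the left endpoint, so K hinges give K pieces inside the interval.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fit_hinges(f, left, right, num_hinges):
    """Coefficients of f(left) + sum_k a_k ReLU(x - t_k) interpolating f at the joints."""
    t = np.linspace(left, right, num_hinges + 1)  # joints, including both endpoints
    slopes = np.diff(f(t)) / np.diff(t)           # slope of each polyline segment
    a = np.diff(slopes, prepend=0.0)              # slope change contributed by each hinge
    return t[:-1], a, f(t[0])

def polyline(x, knots, a, offset):
    return offset + relu(x[:, None] - knots[None, :]) @ a

x = np.linspace(0.0, np.pi, 2001)
errors = {}
for num_hinges in (3, 6, 12, 24):
    knots, a, offset = fit_hinges(np.sin, 0.0, np.pi, num_hinges)
    errors[num_hinges] = np.abs(polyline(x, knots, a, offset) - np.sin(x)).max()
    print(num_hinges, errors[num_hinges])
```

Each doubling of the number of joints should cut the maximum error by roughly a factor of four, as expected for piecewise-linear interpolation of a smooth curve.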

Second receipt: depth multiplies pieces, width only adds

Expressive power · verified

Evaluate randomly initialized ReLU MLPs on a dense 1-D grid, detect where the slope jumps, and count the linear pieces (mean over 20 draws):

width D	2	4	8	16
bound D+1	3	5	9	17
depth 1	2.6	4.3	7.5	14.4
depth 2	3.5	7.0	13.9	27.4
depth 3	3.6	8.1	22.1	40.1

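One layer of width D: at most D+1 pieces, as promised. Each extra layer can fold the graph and so multiply the attainable count, whereas extra width only adds to it; the random draws above realize only part of that potential. This multiplicative-versus-additive gap is what makes depth pay.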
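A piece-counting experiment of this kind might look like the sketch below. It is not the code behind the table: the initialization (standard-normal weights and biases), the input interval, the grid size, and the tolerance are all assumptions, so the means it prints will not match the numbers above. Kinks closer together than a couple of grid cells are merged, so it can undercount slightly.

```python
import numpy as np

def random_relu_mlp(width, depth, rng):
    """Scalar-in, scalar-out ReLU MLP with `depth` hidden layers of size `width`.
    Standard-normal weights and biases are an assumed initialization."""
    sizes = [1] + [width] * depth + [1]
    return [(rng.normal(size=(m, k)), rng.normal(size=k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    h = x[:, None]
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # hidden layers: affine, then ReLU
    W, b = params[-1]
    return (h @ W + b)[:, 0]            # linear output layer

def count_pieces(params, lo=-5.0, hi=5.0, num=100_001, tol=1e-6):
    """Count linear pieces on [lo, hi] by detecting slope jumps on a dense grid."""
    x = np.linspace(lo, hi, num)
    slopes = np.diff(forward(params, x)) / np.diff(x)
    jump = np.abs(np.diff(slopes)) > tol
    # A kink inside one grid cell disturbs two consecutive slope differences,
    # so each run of consecutive jumps is counted as a single kink.
    starts = jump & ~np.concatenate(([False], jump[:-1]))
    return int(starts.sum()) + 1

rng = np.random.default_rng(0)
for depth in (1, 2, 3):
    counts = [count_pieces(random_relu_mlp(8, depth, rng)) for _ in range(20)]
    print(depth, np.mean(counts))
```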

Activation Functions

ReLU, sigmoid, tanh, and when to use each

ReLU: the modern default

Activations

\operatorname{ReLU}(x) = \max(0, x).

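from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()  # switch MXNet to NumPy-compatible semantics

x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()  # allocate storage for the gradient with respect to x
with autograd.record():
    y = npx.relu(x)
d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))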

Keep the positive part, zero the rest. Why it won:

No right-side saturation: gradient is exactly 1 for x>0.
Cheap: a single comparison, no exponential.
Sparse: about half the units output 0.

ReLU’s gradient: an on/off step

Activations

The derivative is a step, 0 on the left and 1 on the right (at x = 0 it is undefined; by convention we use 0):

\operatorname{ReLU}'(x) = \mathbb{1}[x > 0].

y.backward()
d2l.plot(x, x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

Dead ReLU: a unit pushed negative for every example gets zero gradient forever. LeakyReLU / PReLU, \max(0,x)+\alpha\min(0,x), leak a little signal to keep it alive.
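The leaky variant and its gradient can be sketched in plain NumPy as follows. The default \alpha = 0.01 is an assumed, commonly used value; in PReLU \alpha is a learned parameter rather than a constant.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """max(0, x) + alpha * min(0, x); alpha = 0 recovers plain ReLU."""
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for x > 0, alpha otherwise (the value at x = 0 is a convention)."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 2.0])
# Plain ReLU: the gradient is exactly zero for every non-positive input.
print(leaky_relu(x, alpha=0.0), leaky_relu_grad(x, alpha=0.0))
# Leaky ReLU: a gradient of alpha survives on the negative side.
print(leaky_relu(x, alpha=0.1), leaky_relu_grad(x, alpha=0.1))
```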

Sigmoid: squashing into (0, 1)

Activations

\operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x}}.

with autograd.record():
    y = npx.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))

A smooth, differentiable threshold, and the original neuron activation. Today it lives mostly at the edges of a net:

Binary output, read as a probability.
Gates in LSTM/GRU cells and other gating mechanisms.

Why sigmoid stalls deep networks

Activations · the catch

\operatorname{sigmoid}'(x) = \operatorname{sigmoid}(x)\,(1 - \operatorname{sigmoid}(x)).

y.backward()
d2l.plot(x, x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))

The gradient peaks at just 0.25 (at x = 0) and is nearly zero past |x|\gtrsim 5. Even in the best case, and ignoring the weight matrices, ten stacked sigmoid layers attenuate the backward signal by a factor of 0.25^{10} \approx 10^{-6}. This is the vanishing-gradient problem that ReLU largely avoids (the full story is in the numerical-stability section).
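The arithmetic is quick to confirm. The sketch below is plain NumPy, and it deliberately ignores the weight matrices, which also scale the backward signal in a real network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-8.0, 8.0, 1601)         # same range as the plots, on a finer grid
peak = sigmoid_grad(x).max()             # attained at x = 0
tail = sigmoid_grad(np.array([5.0]))[0]  # gradient well into the saturated region
best_case_10_layers = peak ** 10         # product of ten peak gradients

print(peak, tail, best_case_10_layers)
```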

Tanh: sigmoid’s zero-centered cousin

Activations

\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} = 2\,\operatorname{sigmoid}(2x) - 1.

with autograd.record():
    y = np.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))

Same S-shape, but range (-1,1) and zero-centered, which mildly eases optimization. The default inside RNN cells, where bounded activations help.

Still saturates at both tails, so its gradient vanishes there just like sigmoid’s.
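Both the identity \tanh(x) = 2\,\operatorname{sigmoid}(2x) - 1 and the saturation can be checked numerically. The sketch uses plain NumPy and the standard derivative \tanh'(x) = 1 - \tanh^2(x), which is not written out on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-8.0, 8.0, 161)

# Largest deviation between tanh(x) and 2 * sigmoid(2x) - 1 on the grid.
identity_gap = np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)).max()

tanh_grad = 1.0 - np.tanh(x) ** 2  # derivative of tanh
sig = sigmoid(x)
sigmoid_grad = sig * (1.0 - sig)   # derivative of sigmoid

print(identity_gap)                         # rounding error only
print(tanh_grad.max(), sigmoid_grad.max())  # peak gradients, both at x = 0
print(tanh_grad[-1], sigmoid_grad[-1])      # both close to zero at x = 8
```

The peak gradient of tanh is 1 rather than 0.25, which is part of why it is somewhat easier to optimize than sigmoid, but the tails are flat for both.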

Wrap-up

choosing an activation, plus what comes next

Activation cheat sheet

Reference

Activation	Range	Saturates?	Typical use
ReLU	[0, \infty)	left only (can die)	default hidden layer
LeakyReLU / PReLU	\mathbb{R}	no	when ReLU dies
GELU \,x\Phi(x)	[\approx -0.17, \infty)	barely	BERT, GPT-2-style Transformers
SiLU / SwiGLU	[\approx -0.28, \infty) for SiLU; \mathbb{R} for the gated SwiGLU block	barely	many recent language models
Sigmoid	(0, 1)	both ends	gates, binary output
Tanh	(-1, 1)	both ends	RNN cells
Softmax	simplex	one end	multiclass output

Use ReLU as a simple hidden-layer baseline. Transformer families differ: some use GELU, while many recent language models use gated SiLU/SwiGLU blocks. Use sigmoid / softmax at outputs when the model calls for probabilities.
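For reference, the two smooth ReLU relatives in the table can be written in a few lines. This is a sketch in plain NumPy with the exact (erf-based) GELU; library implementations often offer a tanh approximation as well. Both functions dip slightly below zero for moderately negative inputs and then flatten toward zero.

```python
import math
import numpy as np

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))
    return x * phi

def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

x = np.linspace(-8.0, 8.0, 16001)
# Smallest value each function takes on the grid.
print(gelu(x).min(), silu(x).min())
```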

Recap

Wrap-up

An MLP = affine layers with an elementwise nonlinearity between them.
The nonlinearity is essential; drop it and the stack collapses to one affine map.
XOR is the smallest proof: one ReLU layer re-represents the data so a line works.

One wide hidden layer is a universal approximator: one hinge per unit, \le D+1 pieces; depth multiplies pieces and makes that power parameter-efficient.
ReLU is the default; sigmoid and tanh survive in gates, outputs, and RNN cells.

Next (the MLP-implementation section): build one and train it on Fashion-MNIST, from scratch, then in a few high-level API lines.