%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf

A multilayer perceptron (MLP) is a stack of fully-connected layers separated by elementwise nonlinearities. It is the simplest deep network and the foundation everything else in this book builds on.
A linear classifier draws one hyperplane per class. That’s not enough for most things we want to model.
Stack linear layers with a nonlinearity between them. The linear layers mix features; the nonlinearity lets the composition curve, fold, and twist the decision surface.
That’s it. Two ingredients, deep architectures from there.
An MLP is a stack of fully-connected layers. The middle layers are hidden: neither input nor output.
Figure: an MLP with four inputs, one hidden layer of five units, and three outputs.
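Math for the one-hidden-layer case (minibatch \mathbf{X} \in \mathbb{R}^{n \times d}, hidden width h, q outputs, so \mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}, \mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}, \mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}, \mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}):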
\mathbf{H} = \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \qquad \mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.
Two layers. Two weight matrices. Two biases. So far it looks like genuine progress.
Plug \mathbf{H} from the first equation into the second:
\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\,\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X}\,\underbrace{\mathbf{W}^{(1)}\mathbf{W}^{(2)}}_{=\mathbf{W}} + \underbrace{\mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}}_{=\mathbf{b}}.
A composition of affine maps is just another affine map. The hidden layer adds zero expressive power — same model class as plain softmax regression.
You need a nonlinearity between the layers, or stacking is wasted.
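The collapse of two affine layers into one is easy to check numerically. The sketch below uses plain NumPy rather than the TensorFlow import at the top, purely so it runs without extra setup. The sizes n, d, h, q and the random weights are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 8, 4, 5, 3  # batch size, inputs, hidden units, outputs (illustrative)

X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=(h,))
W2, b2 = rng.normal(size=(h, q)), rng.normal(size=(q,))

# Two stacked affine layers with no activation in between.
O_stacked = (X @ W1 + b1) @ W2 + b2

# The equivalent single affine layer: W = W1 W2, b = b1 W2 + b2.
W = W1 @ W2          # shape (d, q)
b = b1 @ W2 + b2     # shape (q,)
O_single = X @ W + b

collapses = np.allclose(O_stacked, O_single)
print(collapses)
```

The two outputs agree up to floating-point error, so the hidden layer contributed nothing a single d-by-q weight matrix could not already express.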
Insert an elementwise nonlinearity \sigma after every hidden layer:
\mathbf{H} = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}),\qquad \mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.
Now the network represents a genuinely nonlinear function (piecewise linear when \sigma is ReLU), and stacking actually buys us something.
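A minimal sketch of this forward pass, assuming ReLU as the nonlinearity and using NumPy instead of TensorFlow to keep it dependency-light. The tiny weights are hand-picked for illustration (hypothetical, not from the text) so that the network computes the absolute value of x, a function no single affine layer can represent.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(X, W1, b1, W2, b2):
    """One-hidden-layer MLP: O = relu(X W1 + b1) W2 + b2."""
    H = relu(X @ W1 + b1)   # hidden activations, shape (n, h)
    return H @ W2 + b2      # outputs, shape (n, q)

# d = 1 input, h = 2 hidden units, q = 1 output.
W1 = np.array([[1.0, -1.0]])    # shape (1, 2)
b1 = np.zeros(2)
W2 = np.array([[1.0], [1.0]])   # shape (2, 1)
b2 = np.zeros(1)

X = np.array([[-2.0], [0.0], [2.0]])
O = mlp_forward(X, W1, b1, W2, b2)
print(O.ravel())  # relu(x) + relu(-x), i.e. |x| at each input
```

An affine map would have to put the output at x = 0 halfway between the outputs at x = -2 and x = 2. Here it does not, which is exactly the extra expressive power the nonlinearity provides.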
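Universal approximation theorem (Cybenko 1989): a single hidden layer with enough units and a suitable \sigma (sigmoidal in Cybenko’s proof; later results cover any non-polynomial activation) can approximate any continuous function on a compact domain arbitrarily well.
Caveat: “enough units” can be exponentially many. Depth can reach the same expressiveness with far fewer parameters than width alone, which is one of the main arguments for deep networks.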
\mathrm{ReLU}(x) = \max(0, x).
Three reasons it dominates:
The derivative is just the step function — 0 for negative inputs, 1 for positive:
\mathrm{ReLU}'(x) = \mathbb{1}[x > 0].
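ReLU and its derivative can be sketched in a few lines of NumPy (used here instead of TensorFlow for a dependency-light example). Setting the derivative to 0 at exactly x = 0 is a convention assumed for this sketch; the function is not differentiable there.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Indicator 1[x > 0]; the value at x == 0 is taken to be 0 by convention.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu(x)
dy = relu_grad(x)
print(y)
print(dy)
```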
A unit whose pre-activation is negative for every input gets zero gradient and never updates again: a permanently silent neuron, known as the dying ReLU problem.
The fix: LeakyReLU / PReLU, \max(0, x) + \alpha\min(0, x), with a small slope \alpha on the left to keep gradient flowing. LeakyReLU fixes \alpha as a hyperparameter; PReLU learns it.
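The difference shows up directly in the gradients. The sketch below (NumPy, with alpha = 0.01 as an illustrative choice rather than a value from the text) compares what flows back through a unit whose pre-activations are all negative.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(0, x) + alpha * min(0, x)
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    # Slope 1 on the right, alpha on the left (alpha also used at x == 0 here).
    return np.where(x > 0, 1.0, alpha)

# Hypothetical pre-activations of a "dead" unit: negative for every example.
z = np.array([-3.0, -1.0, -0.1])

relu_grads = (z > 0).astype(z.dtype)   # plain ReLU: all zeros, no update signal
leaky_grads = leaky_relu_grad(z)       # leaky: small but nonzero everywhere
leaky_out = leaky_relu(z)

print(relu_grads)
print(leaky_grads)
```

With plain ReLU the unit receives no gradient at all. With the leaky variant every example still contributes a gradient scaled by alpha, so the unit can recover.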
\sigma(x) = \frac{1}{1 + e^{-x}}.
The original neural net activation (1960s–2000s). Today it is mostly used for gates and for binary outputs.
For hidden layers it’s been replaced by ReLU: see why on the next slide.
\sigma'(x) = \sigma(x)(1 - \sigma(x)).
Maximum gradient is \sigma'(0) = 0.25. Worse, \sigma' is nearly zero for |x| \gtrsim 5 (\sigma'(5) \approx 0.0066).
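Both claims can be checked numerically. This is a small NumPy sketch; the grid range and spacing are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-8.0, 8.0, 1601)   # step 0.01, grid includes x = 0
grads = sigmoid_grad(xs)

peak = grads.max()                  # largest derivative on the grid
peak_x = xs[np.argmax(grads)]       # where it occurs
grad_at_5 = sigmoid_grad(5.0)       # deep in the saturated region

print(peak, peak_x, grad_at_5)
```

The peak sits at x = 0 with value 0.25, and by x = 5 the derivative has already dropped below 0.01.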
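In a 10-layer net with sigmoid activations, the backward pass multiplies the gradient by a factor of at most 0.25 per layer from the activation alone. Ignoring the weight matrices, gradients are scaled by at most 0.25^{10} = 4^{-10} \approx 10^{-6} before reaching the input layer. That’s the vanishing gradient problem that ReLU largely mitigates.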
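The arithmetic behind that bound, as a short sketch. It only tracks the best-case factor contributed by the sigmoid derivative and, as an assumption of this illustration, ignores the weight matrices entirely. The depths other than 10 are added for comparison.

```python
max_sigmoid_grad = 0.25  # sigma'(0), the largest value the derivative takes

for depth in (1, 5, 10, 20):
    bound = max_sigmoid_grad ** depth
    print(f"depth {depth:2d}: activation factor <= {bound:.3e}")

bound_10 = max_sigmoid_grad ** 10   # equals 4**-10 = 2**-20
```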
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1.
Range (-1, 1) — zero-centered, which mildly helps optimization. Default in RNNs (LSTM cell update, GRU candidate hidden state) where bounded activations are useful.
Still saturates at both tails, so it has the same vanishing-gradient issue as sigmoid:
\tanh'(x) = 1 - \tanh^2(x), which peaks at 1 for x = 0 and decays to 0 as |x| grows.
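The identity relating tanh to the sigmoid, and the saturation of its derivative, can be verified with a short NumPy sketch. It uses the standard derivative 1 - tanh(x)^2; the test points are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

xs = np.linspace(-6.0, 6.0, 121)

# tanh(x) = 2 * sigmoid(2x) - 1 on the whole grid
identity_holds = np.allclose(np.tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0)

g0 = tanh_grad(0.0)   # steepest point
g3 = tanh_grad(3.0)   # already in the saturated tail

print(identity_holds, g0, g3)
```

The peak derivative is 1 rather than sigmoid's 0.25, which helps near zero, but the tails still flatten out just as quickly.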
| Activation | Range | Saturates? | Use case |
|---|---|---|---|
| ReLU | [0, \infty) | only at x{<}0 (dead) | default for hidden |
| LeakyReLU / PReLU | \mathbb{R} | no | when ReLU dies |
| GELU (x\Phi(x)) | \approx [-0.17, \infty) | left tail only (smoothly) | Transformers, modern LLMs |
| Sigmoid | (0, 1) | both ends | gates, binary output |
| Tanh | (-1, 1) | both ends | RNN cells |
| Softmax | simplex | when one logit dominates | multiclass output |
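GELU is the least familiar entry in the table, so here is a sketch of its exact form x\Phi(x) using only the Python standard library. The grid used to locate the minimum is an arbitrary choice for illustration.

```python
import math

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

xs = [i / 1000.0 for i in range(-6000, 6001)]   # -6.0 to 6.0 in steps of 0.001
ys = [gelu(x) for x in xs]

y_min = min(ys)
x_at_min = xs[ys.index(y_min)]

print(y_min, x_at_min)
print(gelu(6.0), gelu(-6.0))
```

The curve dips slightly below zero for moderately negative inputs, returns to about zero for very negative ones, and tracks the identity for large positive x.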
Default: ReLU for hidden layers, GELU if you’re imitating modern Transformer models, sigmoid/softmax at outputs to turn logits into probabilities.
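To make the last step concrete, here is a minimal NumPy sketch of softmax turning a batch of logits into probabilities. The logits are made-up values, and subtracting the row maximum is a common stability trick assumed here rather than something the text prescribes.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row max avoids overflow and leaves the result unchanged.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)

print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

Each row is nonnegative and sums to 1, and equal logits map to a uniform distribution.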