Multilayer Perceptrons

MLPs add nonlinear hidden layers

A multilayer perceptron (MLP) is a stack of fully-connected layers separated by elementwise nonlinearities. The simplest deep network — and the foundation everything else in this book builds on.

A linear classifier draws one hyperplane per class. That’s not enough for most things we want to model:

  • Body temperature → health risk — U-shaped, not even monotonic.
  • Cat vs dog from pixels — the meaning of pixel (13, 17) depends on its neighbors.
  • XOR — the canonical small problem a linear model provably cannot solve.

The fix: alternate linear and nonlinear

Stack linear layers with a nonlinearity between them. The linear layers mix features; the nonlinearity lets the composition curve, fold, and twist the decision surface.

That’s it. Two ingredients, deep architectures from there.
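As a small illustration of the two ingredients, here is a sketch on XOR. It uses plain NumPy rather than the MXNet setup used elsewhere in these slides, and the weights `W1`, `b1`, `w2` are hand-picked for illustration, not learned. The least-squares linear fit can do no better than predicting 0.5 everywhere, while one ReLU hidden layer with two units reproduces XOR exactly.

```python
import numpy as np

def relu(z):
    # Elementwise max(0, z)
    return np.maximum(z, 0.0)

# All four XOR inputs and their targets
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Best linear (affine) fit in the least-squares sense: append a bias column
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ coef          # 0.5 for every input

# Hand-picked (illustrative) weights: both hidden units see x1 + x2,
# the second is shifted by -1 so it only fires when both inputs are on.
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

H = relu(X @ W1 + b1)           # hidden layer, shape (4, 2)
mlp_pred = H @ w2               # linear output layer, no bias needed here

print(linear_pred)
print(mlp_pred)                 # matches y: 0, 1, 1, 0
```

The second hidden unit subtracts twice the "both inputs on" case, which is the fold that no single hyperplane can produce.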

Architecture

An MLP is a stack of fully-connected layers. The middle layers are hidden — neither input nor output:

One hidden layer with five units, four inputs, three outputs.

Math for the one-hidden-layer case (minibatch \mathbf{X} \in \mathbb{R}^{n \times d}, hidden width h, q outputs):

\mathbf{H} = \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \qquad \mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.

Two layers. Two weight matrices. Two biases. So far it looks like genuine progress.

Why naïve stacking doesn’t help

Plug \mathbf{H} from the first equation into the second:

\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\,\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X}\,\underbrace{\mathbf{W}^{(1)}\mathbf{W}^{(2)}}_{=\mathbf{W}} + \underbrace{\mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}}_{=\mathbf{b}}.

A composition of affine maps is just another affine map. The hidden layer adds zero expressive power — same model class as plain softmax regression.

You need a nonlinearity between the layers, or stacking is wasted.
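The collapse can be checked numerically. This is a minimal NumPy sketch (NumPy is an assumption here; the slides use MXNet), with arbitrary illustrative sizes for n, d, h, and q and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 4, 3, 5, 2                 # batch size, inputs, hidden width, outputs

X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=h)
W2, b2 = rng.normal(size=(h, q)), rng.normal(size=q)

# Two stacked affine layers with no activation in between
O_stacked = (X @ W1 + b1) @ W2 + b2

# The single affine layer they collapse to
W = W1 @ W2                              # shape (d, q)
b = b1 @ W2 + b2                         # shape (q,)
O_single = X @ W + b

print(np.allclose(O_stacked, O_single))  # True
```

The hidden width h disappears entirely from the collapsed map: only the d-by-q product matrix and a length-q bias remain.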

Activation functions: the missing ingredient

Insert an elementwise nonlinearity \sigma after every hidden layer:

\mathbf{H} = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}),\qquad \mathbf{O} = \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}.

Now the network represents a genuinely nonlinear function (piecewise linear when \sigma is ReLU), and stacking actually buys us something.
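A tiny sketch of what the activation buys, written in plain NumPy (an assumption; the slides use MXNet). The helper name `mlp` and the hand-picked weights are illustrative only. With two ReLU hidden units the network computes |x|, which no affine map can represent, since an affine function evaluated at a midpoint always equals the average of its endpoint values.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(X, W1, b1, W2, b2):
    # H = sigma(X W1 + b1),  O = H W2 + b2, with sigma = ReLU
    H = relu(X @ W1 + b1)
    return H @ W2 + b2

# Hand-picked weights (d = 1, h = 2, q = 1): relu(x) + relu(-x) = |x|
W1 = np.array([[1.0, -1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0], [1.0]])
b2 = np.zeros(1)

X = np.array([[-1.0], [0.0], [1.0]])
O = mlp(X, W1, b1, W2, b2)
print(O.ravel())                      # values 1, 0, 1

# For an affine map, the midpoint output equals the mean of the endpoints.
midpoint_gap = (O[0, 0] + O[2, 0]) / 2 - O[1, 0]
print(midpoint_gap)                   # 1.0, so the map is not affine
```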

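Universal approximation theorem (Cybenko 1989): a single hidden layer with enough units, plus a suitable \sigma (sigmoidal in Cybenko's proof; later results cover any non-polynomial activation), can approximate any continuous function on a compact domain arbitrarily well.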

Caveat: “enough units” can be exponentially many. Depth trades width for parameter efficiency, which is a large part of why deep nets work in practice.
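The "enough units" part can be made tangible with a rough experiment. This sketch is not the construction used in the theorem: it assumes plain NumPy, draws random untrained hidden weights, and fits only the output layer by least squares. The target sin(2x), the interval, and the widths are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)[:, None]       # (200, 1) inputs
target = np.sin(2.0 * x[:, 0])                  # function to approximate

max_width = 128
W1 = rng.normal(size=(1, max_width))            # random, untrained hidden weights
b1 = rng.normal(size=max_width)
H_full = np.maximum(x @ W1 + b1, 0.0)           # ReLU hidden features, (200, 128)

def fit_error(width):
    # Keep the first `width` hidden units (plus a bias column) and
    # solve for the output layer in the least-squares sense.
    H = np.hstack([H_full[:, :width], np.ones((len(x), 1))])
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.sqrt(np.mean((H @ coef - target) ** 2))

errors = {w: fit_error(w) for w in (2, 8, 32, 128)}
for w, e in errors.items():
    print(f"width {w:4d}: RMSE {e:.4f}")
```

Because the narrower feature sets are subsets of the wider ones, the training error cannot increase with width (up to floating-point effects). How fast it falls depends on the target and on the random draw.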

Setup

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()

ReLU: the modern default

\mathrm{ReLU}(x) = \max(0, x).

x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()
with autograd.record():
    y = npx.relu(x)
d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))

Three reasons it dominates:

  • Doesn’t saturate on the right: the gradient is exactly 1 for any x > 0, so active units pass gradients through unattenuated.
  • Cheap: a single elementwise max (one comparison). No exponential.
  • Sparse activations: with roughly zero-centered pre-activations, about half the units output zero, which can act as implicit regularization.

ReLU’s derivative

The derivative is just the step function: 0 for negative inputs, 1 for positive. It is undefined at x = 0, where the convention is to use 0:

\mathrm{ReLU}'(x) = \mathbb{1}[x > 0].

y.backward()
d2l.plot(x, x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
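The same derivative can be written by hand and compared against a finite-difference estimate. This is a plain NumPy sketch (an assumption; the slide code above relies on MXNet's autograd), and the names `relu_grad` and `eps` are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Indicator 1[x > 0]; the value at x == 0 is set to 0 by convention.
    return (x > 0).astype(x.dtype)

# Test points chosen away from the kink at 0, where the derivative is undefined
x = np.array([-2.0, -0.5, 0.5, 3.0])
eps = 1e-6
numeric = (relu(x + eps) - relu(x - eps)) / (2 * eps)   # central difference

print(relu_grad(x))                          # 0, 0, 1, 1
print(np.allclose(numeric, relu_grad(x)))    # True
```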

Dead ReLU

A unit whose pre-activation is always negative gets zero gradient and never updates again — a permanently silent neuron.

The fix: LeakyReLU / PReLU, \max(0, x) + \alpha\min(0, x), with a small slope \alpha on the left to keep gradient flowing (\alpha is a fixed hyperparameter in LeakyReLU and a learned parameter in PReLU).
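A minimal NumPy sketch of the leaky variant and its gradient, following the formula above. NumPy is an assumption, and so are the function names and the default slope `alpha=0.01`; it is a commonly used value, but nothing in these slides fixes it.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(0, x) + alpha * min(0, x)
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    # Slope 1 on the right, alpha on the left (alpha also used at x == 0 here)
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -1.0, 2.0])
print(leaky_relu(x, alpha=0.1))        # approximately -0.3, -0.1, 2.0
print(leaky_relu_grad(x, alpha=0.1))   # 0.1, 0.1, 1.0
```

With `alpha > 0` the gradient never becomes exactly zero, so a unit stuck on the negative side can still recover. Setting `alpha = 0` gives back plain ReLU.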

Sigmoid — squashes to (0, 1)

\sigma(x) = \frac{1}{1 + e^{-x}}.

with autograd.record():
    y = npx.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))

The original neural net activation (1960s–2000s). Today mostly used for:

  • Output layers in binary classification (probability ∈ (0, 1)).
  • Gates in LSTM/GRU and in gated attention variants (values still ∈ (0, 1)).

For hidden layers it’s been replaced by ReLU: see why on the next slide.

Why sigmoid hurts deep nets

\sigma'(x) = \sigma(x)(1 - \sigma(x)).

y.backward()
d2l.plot(x, x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))

Maximum gradient is \sigma'(0) = 0.25. Worse, \sigma' vanishes for |x| \gtrsim 5.

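In a 10-layer net with sigmoid activations, the backward pass multiplies by an activation derivative of at most 0.25 at every layer, so the activations alone scale gradients by a factor of at most 4^{-10} \approx 10^{-6} before they reach the input layer (ignoring the weight matrices). That’s the vanishing gradient problem that ReLU largely alleviates.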
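The arithmetic behind this bound is easy to reproduce. The sketch below uses plain NumPy (an assumption; the slides use MXNet) and only accounts for the activation derivatives, not the weight matrices that also appear in the backward pass.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.arange(-8.0, 8.0, 0.1)
peak = sigmoid_grad(0.0)              # 0.25, the maximum of sigma'
tail = sigmoid_grad(5.0)              # about 0.0066, already close to zero

# Upper bound on the product of 10 activation derivatives
bound = 0.25 ** 10

print(peak, tail)
print(bound)                          # about 9.5e-07
```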

Tanh — sigmoid’s symmetric cousin

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1.

with autograd.record():
    y = np.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))

Range (-1, 1), zero-centered, which mildly helps optimization. Default in RNNs (LSTM cell update, GRU candidate hidden state) where bounded activations are useful.
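The identity \tanh(x) = 2\sigma(2x) - 1 and the range claim can be checked numerically. This is a quick NumPy sketch (NumPy is an assumption; the slides use MXNet) over the same grid as the plots:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.arange(-8.0, 8.0, 0.1)
lhs = np.tanh(x)
rhs = 2.0 * sigmoid(2.0 * x) - 1.0    # rescaled, recentered sigmoid

print(np.allclose(lhs, rhs))                  # True
print(lhs.min() > -1.0, lhs.max() < 1.0)      # True True on this grid
```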

Tanh’s derivative

\tanh'(x) = 1 - \tanh^2(x).

Still saturates at both tails, the same vanishing-gradient issue as sigmoid, although the peak gradient \tanh'(0) = 1 is four times larger:

y.backward()
d2l.plot(x, x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
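For reference, the closed form 1 - \tanh^2(x) can be compared against a finite-difference estimate. This is a plain NumPy sketch (an assumption; the plot above uses MXNet's autograd), with illustrative test points and step size `eps`:

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)   # central difference

print(np.allclose(numeric, tanh_grad(x), atol=1e-7))   # True
print(tanh_grad(np.array([0.0, 4.0])))                 # 1.0 at the center, about 0.0013 in the tail
```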

Cheat sheet

Activation         | Range                     | Saturates?                | Use case
ReLU               | [0, \infty)               | only at x < 0 (dead)      | default for hidden layers
LeakyReLU / PReLU  | \mathbb{R}                | no                        | when ReLU dies
GELU (x\Phi(x))    | \approx [-0.17, \infty)   | left tail only, smoothly  | Transformers, modern LLMs
Sigmoid            | (0, 1)                    | both ends                 | gates, binary output
Tanh               | (-1, 1)                   | both ends                 | RNN cells
Softmax            | probability simplex       | when one logit dominates  | multiclass output

Default: ReLU for hidden layers, GELU if you’re imitating modern Transformer models, sigmoid/softmax at outputs to turn logits into probabilities.
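A sketch of the output-side half of that default: turning logits into probabilities. It uses plain NumPy (an assumption; the slides use MXNet). The max-subtraction in `softmax` is a standard numerical-stability trick that these slides do not cover, and the example logits are arbitrary.

```python
import numpy as np

def sigmoid(z):
    # Binary case: one logit -> probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Multiclass case: subtract the row max before exponentiating so that
    # large logits do not overflow; the result is unchanged mathematically.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.0, -1.0],
                   [1000.0, 1000.0, 1000.0]])   # second row would overflow naively
probs = softmax(logits)

print(probs.sum(axis=1))     # each row sums to 1
print(probs[1])              # uniform: 1/3 each
print(sigmoid(0.0))          # 0.5
```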

Recap

  • An MLP = several affine layers, with an elementwise nonlinearity between them.
  • The nonlinearity is essential — without it the stack collapses to a single affine map.
  • One sufficiently wide hidden layer is a universal approximator. Depth makes the same expressiveness parameter-efficient.
  • ReLU is the modern default. Sigmoid and tanh persist in specific roles (output, gates, RNNs) where their bounded ranges are useful.
  • The rest of this chapter is about training MLPs: forward pass, backprop, init, regularization.