Numerical Stability and Initialization

Dive into Deep Learning · §4.4

Numerical stability & initialization
Why deep nets once refused to train: the problem, the diagnosis, and the cure.

Why does the starting point matter so much?

Motivation

A deep net composes many layers before any loss is seen. The initial weights decide whether a signal survives the trip, forward and back.

Three ideas made deep training routine:

Non-saturating activations (ReLU).
Variance-preserving init (Xavier, He).
Symmetry breaking (random, never constant).

Get init wrong and the gradient dies or blows up before learning starts. Three numbers to keep in mind: ten sigmoid layers shrink the gradient to about 10^{-6}; a product of one hundred random matrices reaches entries beyond 10^{23}; and the closing 50-layer plot shows 10^{80} (naive) vs 10^{-15} (Xavier) vs flat (He).

A 2-layer MLP: input \to affine + ReLU \to hidden \to affine \to logits.

Unstable Gradients

why the chain rule makes depth dangerous

The gradient is a product down the chain

Backprop multiplies one Jacobian per layer. For a weight in layer \ell,

\partial_{\mathbf{W}^{(\ell)}} \mathbf{o} = \underbrace{\mathbf{M}^{(L)} \cdots \mathbf{M}^{(\ell+1)}}_{L-\ell \text{ Jacobians}}\, \mathbf{v}^{(\ell)}, \qquad \mathbf{M}^{(k)} = \partial_{\mathbf{h}^{(k-1)}} \mathbf{h}^{(k)}, \qquad \mathbf{v}^{(\ell)} = \partial_{\mathbf{W}^{(\ell)}} \mathbf{h}^{(\ell)}.

Forward pass builds z, h, o, L; the backward sweep multiplies a gradient back through every node.

Two ways a long product misbehaves

Whether the product grows or shrinks is set by the Jacobians’ scale.

Factors with spectral radius < 1 \Rightarrow the product shrinks geometrically \Rightarrow vanishing gradient: bottom layers stop learning.

Factors with spectral radius > 1 \Rightarrow the product grows geometrically \Rightarrow exploding gradient: updates overshoot, loss goes to NaN.

A constant per-layer factor \rho compounds to \rho^{\,L-\ell}. Only \rho \approx 1 stays usable across depth.
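The compounding claim is easy to check with plain arithmetic. The following is a minimal sketch; the helper name `compounded_factor` and the sample values of \rho and depth are illustrative assumptions, not part of the original material.

```python
def compounded_factor(rho, depth):
    """Scale applied to a gradient after `depth` layers that each multiply by rho."""
    return rho ** depth


# Hypothetical per-layer factors slightly below, at, and slightly above 1.
for rho in (0.9, 1.0, 1.1):
    cells = ", ".join(
        f"depth {d}: {compounded_factor(rho, d):.3g}" for d in (10, 50, 100)
    )
    print(f"rho = {rho}: {cells}")
```

Even a 10% deviation from 1 per layer changes the gradient by several orders of magnitude at depth 100, which is why only \rho \approx 1 remains usable.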

Vanishing: the sigmoid saturates

Unstable Gradients · vanishing

The sigmoid’s derivative peaks at 0.25 (at input 0) and decays to zero in both tails.

Stack ten such layers and the activation derivatives alone contribute at most 0.25^{10} \approx 10^{-6}: the bottom layer sees roughly a millionth of the gradient.

ReLU’s derivative is exactly 1 wherever a unit is active, so it does not attenuate the signal. Hence ReLU is the modern default.
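The numbers on this slide can be reproduced in a few lines. This is a sketch using only the standard library; the function names are assumptions chosen for illustration, and the ten-layer figure is a best case that ignores the weight matrices entirely.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 where the unit is active, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


peak = sigmoid_grad(0.0)            # largest value the derivative can take
best_case_10_layers = peak ** 10    # activation factors only, every unit at its peak
relu_10_layers = relu_grad(3.0) ** 10

print(f"sigmoid'(0)          = {peak}")
print(f"sigmoid'(10)         = {sigmoid_grad(10.0):.2e}")
print(f"ten sigmoid layers   = {best_case_10_layers:.2e}")
print(f"ten active ReLU units = {relu_10_layers}")
```

In practice sigmoid units rarely sit exactly at 0, so the real attenuation is worse than this upper bound.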

Exploding: one hundred random matrices, entries past 10²³

Unstable Gradients · exploding

Multiply one hundred \mathcal{N}(0,1) matrices (for i in range(100): M = M @ randn(4, 4)), exactly what a deep linear stack does to a gradient. Each factor is a little too large, and the product compounds:

a single matrix 
 tensor([[-0.9753, -2.4309, -0.6217, -0.9488],
        [ 0.5474, -0.3909,  1.7120,  0.4327],
        [ 1.6211, -0.4342,  1.1437, -0.6168],
        [-0.5642, -0.6920,  0.6003,  0.9131]])
after multiplying 100 matrices
 tensor([[ 1.1962e+23,  1.0111e+23,  1.3713e+23, -2.7019e+23],
        [ 7.4143e+21,  6.2674e+21,  8.4994e+21, -1.6747e+22],
        [-1.2204e+23, -1.0316e+23, -1.3990e+23,  2.7565e+23],
        [ 1.0273e+23,  8.6837e+22,  1.1777e+23, -2.3204e+23]])

A poorly scaled initialization does exactly this to the gradient. No optimizer converges from here.
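The inline loop can be made runnable without any deep learning library. The sketch below substitutes `random.gauss` for `randn` so that it depends only on the standard library; the helpers `randn_matrix` and `matmul` and the seed are assumptions for illustration. The exact entries depend on the seed, so they will not match the tensors printed above.

```python
import random


def randn_matrix(n, rng):
    """n x n matrix with independent N(0, 1) entries, as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
M = randn_matrix(4, rng)
for _ in range(100):
    M = matmul(M, randn_matrix(4, rng))

largest = max(abs(v) for row in M for v in row)
print(f"largest |entry| after 100 products: {largest:.3e}")
```

Changing the seed changes the digits but not the conclusion: the entries grow by many orders of magnitude.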

The three crashes you will actually see

Unstable Gradients · in practice

Loss is NaN from step 1 \to exploding initialization (weights too large).

Loss spikes mid-training \to exploding gradient on a bad batch.

Loss refuses to drop \to vanishing gradient (saturated activations), or a learning rate that is simply too small.

Random init breaks a hidden symmetry

Unstable Gradients · symmetry

Set every weight in a layer to the same constant c:

Every hidden unit computes the same function.
Every unit gets the same gradient.
After each update the weights are still identical.

An h-unit layer is then stuck behaving like a single unit, forever.

Gradient descent alone never breaks this tie. Random initialization does; so does dropout. Only the weights need to be random: biases may still start at 0.
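The symmetry argument can be checked on a toy network. The sketch below assumes a one-hidden-layer MLP with tanh units, a single training example, and squared-error loss; none of these choices come from the original slides, and `train_constant_init` is a hypothetical helper. Gradients are written out by hand so the example needs no autograd library.

```python
import math


def train_constant_init(c=0.5, hidden=4, steps=20, lr=0.1):
    """Run gradient descent from an all-constant init and return the weights."""
    x, y = [1.0, -2.0, 0.5], 1.0                  # one made-up training example
    W1 = [[c] * len(x) for _ in range(hidden)]    # hidden weights, all equal to c
    w2 = [c] * hidden                             # output weights, all equal to c
    for _ in range(steps):
        # Forward pass.
        z = [sum(w * xj for w, xj in zip(row, x)) for row in W1]
        h = [math.tanh(zi) for zi in z]
        o = sum(wi * hi for wi, hi in zip(w2, h))
        # Backward pass for the loss 0.5 * (o - y)^2.
        d_o = o - y
        d_z = [d_o * w2[i] * (1.0 - h[i] ** 2) for i in range(hidden)]
        grad_w2 = [d_o * hi for hi in h]
        # Gradient descent update.
        for i in range(hidden):
            for j in range(len(x)):
                W1[i][j] -= lr * d_z[i] * x[j]
            w2[i] -= lr * grad_w2[i]
    return W1, w2


W1, w2 = train_constant_init()
print("hidden rows identical:", all(row == W1[0] for row in W1))
print("first hidden row after training:", W1[0])
```

The weights move away from c, yet every hidden unit's row stays identical to the others, so the layer still acts like a single unit.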

Variance-Preserving Init

keep the signal’s scale constant through depth

Keep the variance constant, layer to layer

Initialization

For a linear layer o_i = \sum_{j=1}^{n_\textrm{in}} w_{ij} x_j with i.i.d. zero-mean weights (\textrm{Var} = \sigma^2) and i.i.d. zero-mean inputs (\textrm{Var} = \gamma^2) that are independent of the weights:

\mathbb{E}[o_i] = 0, \qquad \textrm{Var}[o_i] = n_\textrm{in}\, \sigma^2\, \gamma^2.

To carry the input’s variance through unchanged (\textrm{Var}[o] = \gamma^2), the only knob is \sigma^2:

\sigma^2 = \frac{1}{n_\textrm{in}}.
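A Monte Carlo check of the variance formula, as a rough sketch: it assumes Gaussian weights and inputs (the derivation only needs zero mean, the stated variances, and independence), and the helper name, sample sizes, and seed are illustrative.

```python
import random


def empirical_output_variance(n_in, sigma, gamma, trials=5000, seed=0):
    """Estimate Var[o] for o = sum_j w_j * x_j with fresh w and x in every trial."""
    rng = random.Random(seed)
    outs = [
        sum(rng.gauss(0.0, sigma) * rng.gauss(0.0, gamma) for _ in range(n_in))
        for _ in range(trials)
    ]
    mean = sum(outs) / trials
    return sum((o - mean) ** 2 for o in outs) / trials


n_in, gamma = 32, 2.0

# Arbitrary weight scale: the output variance should be n_in * sigma^2 * gamma^2.
sigma = 0.5
measured = empirical_output_variance(n_in, sigma, gamma)
predicted = n_in * sigma**2 * gamma**2
print(f"sigma = 0.5:         measured {measured:.2f}, predicted {predicted:.2f}")

# Variance-preserving scale: sigma^2 = 1 / n_in should return gamma^2.
preserved = empirical_output_variance(n_in, (1.0 / n_in) ** 0.5, gamma)
print(f"sigma^2 = 1 / n_in:  measured {preserved:.2f}, input variance {gamma**2:.2f}")
```

The estimates carry sampling noise of a few percent, so they should land close to, but not exactly on, the predicted values.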

Forward and backward disagree, so compromise

The same variance count run backward through \mathbf{W}^\top sums over the n_\textrm{out} outputs instead:

\text{forward: } n_\textrm{in}\,\sigma^2 = 1 \qquad \text{backward: } n_\textrm{out}\,\sigma^2 = 1.

Both cannot hold at once unless n_\textrm{in} = n_\textrm{out}, so Xavier splits the difference by averaging the two fan sizes.

Preserve the activation scale on the way in and the gradient scale on the way out: one \sigma^2, two demands.
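The compromise can be made concrete with the two per-layer gains. Averaging the forward and backward conditions gives \tfrac{1}{2}(n_\textrm{in} + n_\textrm{out})\,\sigma^2 = 1. The sketch below evaluates both gains at that \sigma^2; the function names and the layer sizes are assumptions for illustration.

```python
def xavier_variance(n_in, n_out):
    """Weight variance that satisfies the averaged condition."""
    return 2.0 / (n_in + n_out)


def variance_gains(n_in, n_out, weight_var):
    """Per-layer variance multipliers: (forward, backward)."""
    return n_in * weight_var, n_out * weight_var


for n_in, n_out in [(128, 128), (256, 64)]:
    forward, backward = variance_gains(n_in, n_out, xavier_variance(n_in, n_out))
    print(f"n_in={n_in}, n_out={n_out}: forward {forward:.2f}, backward {backward:.2f}")
```

With equal fan sizes both gains are exactly 1. With unequal sizes neither is, but the two always average to 1, which is the sense in which the choice splits the difference.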

Xavier and He: one factor of two apart

Xavier / Glorot (2010), for \tanh and sigmoid:

\sigma^2 = \frac{2}{n_\textrm{in} + n_\textrm{out}}.

He / Kaiming (2015), for ReLU:

\sigma^2 = \frac{2}{n_\textrm{in}}.

ReLU zeroes half a symmetric signal, halving its second moment (\mathbb{E}[\textrm{ReLU}(z)^2] = \tfrac{1}{2}\mathbb{E}[z^2]), so He doubles the weight variance to compensate.

Rule of thumb: Xavier for \tanh/sigmoid, He for ReLU. Both ship as named initializers in most libraries.
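Both rules reduce to a one-line standard deviation, and the factor of one half behind the He rule can be checked numerically. This is a sketch with hypothetical helper names, assuming a standard normal pre-activation for the check.

```python
import math
import random


def xavier_std(n_in, n_out):
    return math.sqrt(2.0 / (n_in + n_out))


def he_std(n_in):
    return math.sqrt(2.0 / n_in)


def relu_second_moment_ratio(samples=100_000, seed=0):
    """Estimate E[ReLU(z)^2] / E[z^2] for z ~ N(0, 1)."""
    rng = random.Random(seed)
    zs = [rng.gauss(0.0, 1.0) for _ in range(samples)]
    e_z2 = sum(z * z for z in zs) / samples
    e_relu2 = sum(z * z for z in zs if z > 0) / samples
    return e_relu2 / e_z2


ratio = relu_second_moment_ratio()
print(f"E[ReLU(z)^2] / E[z^2] ~ {ratio:.3f}")
print(f"xavier_std(100, 100) = {xavier_std(100, 100):.4f}")
print(f"he_std(100)          = {he_std(100):.4f}")
```

For a square layer the two standard deviations differ by \sqrt{2}, which is the factor of two in variance. In PyTorch, for example, the corresponding named initializers are `torch.nn.init.xavier_normal_` and `torch.nn.init.kaiming_normal_`.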

The demonstration: 10⁸⁰ vs 10⁻¹⁵ vs flat

Initialization · payoff

All three regimes in one plot: push a unit-scale signal through 50 ReLU layers and track \mathbb{E}[(h^{(\ell)})^2] under three weight scales:

\mathcal{N}(0,1): each layer gains \approx n_\textrm{in}/2 = 50\times (width n_\textrm{in} = 100); explodes to \sim\!10^{80}.
Xavier: off by the rectifier’s \tfrac12 per layer; vanishes like 2^{-\ell} to \sim\!10^{-15}.
He: essentially flat across all fifty layers.
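The experiment can be approximated with a short script. This is a sketch, not the code behind the plot: it assumes square layers of width 100 (chosen so that n_\textrm{in}/2 = 50), no biases, a single random input vector, and a fixed seed, and it uses only the standard library, so it takes a few seconds. The function name `second_moment_by_layer` is hypothetical.

```python
import math
import random


def second_moment_by_layer(weight_std, width=100, depth=50, seed=0):
    """Push one random vector through `depth` ReLU layers; return mean(h^2) per layer."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-scale input
    moments = []
    for _ in range(depth):
        new_h = []
        for _ in range(width):
            # One output unit: fresh N(0, weight_std^2) weights, then ReLU.
            z = sum(rng.gauss(0.0, weight_std) * hj for hj in h)
            new_h.append(max(z, 0.0))
        h = new_h
        moments.append(sum(v * v for v in h) / width)
    return moments


width = 100
scales = {
    "naive N(0,1)": 1.0,
    "Xavier": math.sqrt(2.0 / (width + width)),
    "He": math.sqrt(2.0 / width),
}
finals = {
    name: second_moment_by_layer(std, width=width)[-1]
    for name, std in scales.items()
}
for name, value in finals.items():
    print(f"{name:>12}: mean h^2 after 50 layers = {value:.2e}")
```

The exact exponents depend on the width, the seed, and how the second moment is averaged, so they need not match the plot digit for digit; the three regimes (explode, vanish, roughly flat) should be unmistakable.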

Init is the floor, not the ceiling

Beyond

Good init buys a deep net that trains without NaNs. To reach hundreds of layers, modern architectures also re-normalize during training:

BatchNorm / LayerNorm rescale activations to unit variance each step, lifting the burden off init.
Residual connections \mathbf{h}^{(\ell+1)} = \mathbf{h}^{(\ell)} + f(\mathbf{h}^{(\ell)}) give the gradient a shortcut, so shrinkage stops compounding.
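
To see why the skip path helps, differentiate the residual update in the per-layer Jacobian notation used earlier. A brief derivation, assuming f is differentiable:

\mathbf{M}^{(\ell+1)} = \partial_{\mathbf{h}^{(\ell)}} \mathbf{h}^{(\ell+1)} = \mathbf{I} + \partial_{\mathbf{h}^{(\ell)}} f(\mathbf{h}^{(\ell)}).

The identity term passes the gradient through unchanged, so even when the Jacobian of f is small, each factor in the product stays near \mathbf{I} rather than near zero.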

We return to both in the chapters on modern CNNs.

Recap

Wrap-up

A deep gradient is a product of per-layer Jacobians, so it vanishes or explodes without care.
Vanishing: saturating activations (sigmoid/tanh) crush the signal; ReLU keeps it.
Exploding: over-large weights drive the product, and the loss, to NaN.

Fix the scale: init weights so \textrm{Var} is preserved, via Xavier (\tanh) and He (ReLU).
50-layer experiment: 10^{80} (naive) vs 10^{-15} (Xavier under ReLU) vs flat (He).
Break the symmetry: random init, never a constant.
At scale: normalization + residuals + careful init together reach 100+ layers.

Next, in the section on generalization in deep learning: the model trains, but why does an over-parametrized network generalize at all?