Numerical Stability and Initialization

Dive into Deep Learning · §4.4

Why deep nets once refused to train: the problem, the diagnosis, and the cure.

Why does the starting point matter so much?

Motivation

A deep net composes many layers before any loss is seen. The initial weights decide whether a signal survives the trip, forward and back.

Three ideas made deep training routine:

Non-saturating activations (ReLU).
Variance-preserving init (Xavier, He).
Symmetry breaking (random, never constant).

Get init wrong and the gradient dies or blows up before learning starts. Keep score: ten sigmoid layers tax the gradient to about 10^{-6}; a hundred random matrices blow up to entries near 10^{24}; and one closing plot shows 10^{80} vs 10^{-15} vs flat (naive, Xavier, He).

A 2-layer MLP: input \to affine + ReLU \to hidden \to affine \to logits.

Unstable Gradients

why the chain rule makes depth dangerous

The gradient is a product down the chain

Backprop multiplies one Jacobian per layer. For a weight in layer \ell,

\partial_{\mathbf{W}^{(\ell)}} \mathbf{o} = \underbrace{\mathbf{M}^{(L)} \cdots \mathbf{M}^{(\ell+1)}}_{L-\ell \text{ Jacobians}}\, \mathbf{v}^{(\ell)}, \qquad \mathbf{M}^{(k)} = \partial_{\mathbf{h}^{(k-1)}} \mathbf{h}^{(k)}, \qquad \mathbf{v}^{(\ell)} = \partial_{\mathbf{W}^{(\ell)}} \mathbf{h}^{(\ell)}.

Forward pass builds z, h, o, L; the backward sweep multiplies a gradient back through every node.

Two ways a long product misbehaves

Whether the product grows or shrinks is set by the Jacobians’ scale.

Factors that shrink every vector (largest singular value < 1) \Rightarrow the product shrinks geometrically \Rightarrow vanishing gradient: bottom layers stop learning.

Factors that stretch every vector (smallest singular value > 1) \Rightarrow the product grows geometrically \Rightarrow exploding gradient: updates overshoot, loss goes to NaN.

A constant per-layer factor \rho compounds to \rho^{\,L-\ell}. Only \rho \approx 1 stays usable across depth.
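
The compounding can be checked with a few lines of arithmetic. This is a minimal sketch that replaces each Jacobian with a single scalar \rho; the helper name compounded_factor and the sample values 0.9, 1.0 and 1.1 are illustrative assumptions, not part of the original material.

```python
def compounded_factor(rho: float, num_layers: int) -> float:
    """Scale left after multiplying num_layers copies of the scalar factor rho."""
    return rho ** num_layers


if __name__ == "__main__":
    # Three hypothetical per-layer factors: slightly below, at, and slightly above 1.
    for rho in (0.9, 1.0, 1.1):
        print(f"rho = {rho}: after 50 layers the scale is {compounded_factor(rho, 50):.3e}")
```

Even a ten percent deviation from 1 per layer changes the scale by orders of magnitude over fifty layers, which is why only \rho \approx 1 is workable.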

Vanishing: the sigmoid saturates

Unstable Gradients · vanishing

The sigmoid’s derivative peaks at 0.25 (at z = 0) and decays toward zero in both tails.

Stack ten such layers and the activation factors alone contribute at most 0.25^{10} \approx 10^{-6}: the bottom layer sees about a millionth of the gradient.

ReLU’s derivative is exactly 1 wherever a unit is active, so it does not attenuate the signal. Hence ReLU is the modern default.
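
The numbers on this slide can be reproduced directly. The sketch below uses only the standard library; the function names are illustrative, and the weight matrices are ignored so that only the activation derivatives are compared.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU: 1 where the unit is active, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


if __name__ == "__main__":
    for z in (0.0, 2.0, 5.0, 10.0):
        print(f"z = {z:5.1f}  sigmoid' = {sigmoid_grad(z):.2e}  relu' = {relu_grad(z):.0f}")
    # Best case for ten stacked sigmoids: every unit sits exactly at z = 0.
    print("ten sigmoid layers, best case:", sigmoid_grad(0.0) ** 10)
    print("ten active ReLU layers:", relu_grad(1.0) ** 10)
```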

Exploding: one hundred random matrices, entries near 10²⁴

Unstable Gradients · exploding

Multiply one hundred \mathcal{N}(0,1) matrices (for i in range(100): M = M @ randn(4, 4)), exactly what a deep linear stack does to a gradient. Each factor is a little too large, and the product compounds:

a single matrix 
 tf.Tensor(
[[ 0.88569    -0.53551334 -2.0258462  -1.6222503 ]
 [ 0.08253931  0.37377733 -1.1129707   0.7584501 ]
 [-0.30639806  0.44809902  1.0800251  -0.8498218 ]
 [-0.37185267 -1.3523052   0.4892188  -1.1749656 ]], shape=(4, 4), dtype=float32)
after multiplying 100 matrices
 [[-3.3796842e+23 -1.4009740e+23  4.3313110e+23  7.5757996e+23]
 [ 8.0702145e+22  3.3508249e+22 -1.0345620e+23 -1.8100430e+23]
 [-1.6039657e+23 -6.6450585e+22  2.0551867e+23  3.5942717e+23]
 [-7.8191073e+22 -3.2419027e+22  1.0021750e+23  1.7529337e+23]]

A poorly scaled initialization does exactly this to the gradient. No optimizer converges from here.
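
For readers without a framework at hand, the same experiment can be sketched in plain Python. This is an assumption-laden stand-in for the loop quoted above, not the code that produced the printed output: the helpers randn_matrix, matmul and max_abs are hypothetical names, and the seed is arbitrary, so the exact entries will differ from run to run and from the tensors shown.

```python
import random


def randn_matrix(n, rng):
    """n x n matrix with independent N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def matmul(a, b):
    """Plain triple-loop matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def product_of_random_matrices(count, n=4, seed=0):
    """Start from one random matrix and right-multiply by `count` more."""
    rng = random.Random(seed)
    m = randn_matrix(n, rng)
    for _ in range(count):
        m = matmul(m, randn_matrix(n, rng))
    return m


def max_abs(m):
    return max(abs(x) for row in m for x in row)


if __name__ == "__main__":
    print("largest entry of a single matrix:", max_abs(product_of_random_matrices(0)))
    print("largest entry after 100 products:", max_abs(product_of_random_matrices(100)))
```

The exact magnitude depends on the seed, but the growth by many orders of magnitude does not.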

The three crashes you will actually see

Unstable Gradients · in practice

Loss is NaN from step 1 \to exploding initialization (weights too large).

Loss spikes mid-training \to exploding gradient on a bad batch.

Loss refuses to drop \to vanishing gradient (saturated activations), or a learning rate that is simply too small.

Random init breaks a hidden symmetry

Unstable Gradients · symmetry

Set every weight in a layer to the same constant c:

Every hidden unit computes the same function.
Every unit gets the same gradient.
After each update the weights are still identical.

An h-unit layer is then stuck behaving like a single unit, forever.

Gradient descent alone never breaks this tie. Random initialization does; so does dropout. Biases can still start at 0, because random weights already make the units differ.
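
The symmetry argument can be verified on a toy network. The sketch below is an illustration under stated assumptions: a one-hidden-layer \tanh MLP with three hidden units, squared loss, a two-point dataset (DATA) and plain SGD, all chosen here for brevity rather than taken from the original.

```python
import math
import random

# Hypothetical toy dataset: (input vector, target) pairs.
DATA = [([1.0, 2.0], 1.0), ([-1.0, 0.5], -1.0)]


def sgd_steps(w1, w2, data, lr=0.1, steps=20):
    """Train y_hat = w2 . tanh(w1 x) with squared loss; returns updated copies."""
    w1 = [row[:] for row in w1]
    w2 = w2[:]
    for _ in range(steps):
        for x, y in data:
            a = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in w1]
            err = sum(v * ai for v, ai in zip(w2, a)) - y
            # Gradients of 0.5 * err**2, computed before any weight is changed.
            g2 = [err * ai for ai in a]
            g1 = [[err * v * (1.0 - ai * ai) * xj for xj in x] for v, ai in zip(w2, a)]
            w2 = [v - lr * g for v, g in zip(w2, g2)]
            w1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(w1, g1)]
    return w1, w2


def hidden_units_identical(w1, w2):
    """True if every hidden unit has the same incoming and outgoing weights."""
    return all(row == w1[0] for row in w1) and all(v == w2[0] for v in w2)


if __name__ == "__main__":
    const_w1, const_w2 = sgd_steps([[0.5, 0.5] for _ in range(3)], [0.5] * 3, DATA)
    rng = random.Random(0)
    rand_w1, rand_w2 = sgd_steps(
        [[rng.gauss(0.0, 0.5) for _ in range(2)] for _ in range(3)],
        [rng.gauss(0.0, 0.5) for _ in range(3)],
        DATA,
    )
    print("constant init, units still identical:", hidden_units_identical(const_w1, const_w2))
    print("random init, units still identical:  ", hidden_units_identical(rand_w1, rand_w2))
```

With constant initialization the weights move during training, yet all three hidden units move together, so the layer never gains capacity beyond a single unit.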

Variance-Preserving Init

keep the signal’s scale constant through depth

Keep the variance constant, layer to layer

Initialization

For a linear layer o_i = \sum_{j=1}^{n_\textrm{in}} w_{ij} x_j with i.i.d. zero-mean weights (\textrm{Var} = \sigma^2) and zero-mean inputs (\textrm{Var} = \gamma^2) that are independent of the weights and of each other:

\mathbb{E}[o_i] = 0, \qquad \textrm{Var}[o_i] = n_\textrm{in}\, \sigma^2\, \gamma^2.

To carry the input’s variance through unchanged (\textrm{Var}[o] = \gamma^2), the only knob is \sigma^2:

\sigma^2 = \frac{1}{n_\textrm{in}}.
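
A quick Monte Carlo run is one way to sanity-check the variance formula. The sketch below assumes Gaussian weights and inputs purely for convenience (the derivation only needs zero mean, the stated variances and independence); the function name, trial count and seed are arbitrary choices.

```python
import random


def empirical_output_variance(n_in, sigma2, gamma2, trials=2000, seed=0):
    """Sample o = sum_j w_j * x_j many times and return the sample variance of o."""
    rng = random.Random(seed)
    sigma, gamma = sigma2 ** 0.5, gamma2 ** 0.5
    outputs = [
        sum(rng.gauss(0.0, sigma) * rng.gauss(0.0, gamma) for _ in range(n_in))
        for _ in range(trials)
    ]
    mean = sum(outputs) / trials
    return sum((o - mean) ** 2 for o in outputs) / trials


if __name__ == "__main__":
    n_in, gamma2 = 50, 4.0
    for sigma2 in (1.0, 1.0 / n_in):
        predicted = n_in * sigma2 * gamma2
        measured = empirical_output_variance(n_in, sigma2, gamma2)
        print(f"sigma^2 = {sigma2:.3f}: predicted {predicted:.2f}, measured {measured:.2f}")
```

With \sigma^2 = 1/n_\textrm{in} the measured output variance should land close to the input variance \gamma^2, up to sampling noise.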

Forward and backward disagree, so compromise

The same variance count run backward through \mathbf{W}^\top sums over the n_\textrm{out} outputs instead:

\text{forward: } n_\textrm{in}\,\sigma^2 = 1 \qquad \text{backward: } n_\textrm{out}\,\sigma^2 = 1.

Both cannot hold at once unless n_\textrm{in} = n_\textrm{out}, so Xavier splits the difference by averaging the two fan sizes.

Preserve the activation scale on the way in and the gradient scale on the way out: one \sigma^2, two demands.
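
A small worked example shows what the compromise costs. The sketch assumes the averaged condition \tfrac{1}{2}(n_\textrm{in} + n_\textrm{out})\,\sigma^2 = 1 and a hypothetical layer with 300 inputs and 100 outputs; the function names are illustrative.

```python
def forward_gain(n_in: int, n_out: int) -> float:
    """n_in * sigma^2 when sigma^2 = 2 / (n_in + n_out)."""
    return 2.0 * n_in / (n_in + n_out)


def backward_gain(n_in: int, n_out: int) -> float:
    """n_out * sigma^2 when sigma^2 = 2 / (n_in + n_out)."""
    return 2.0 * n_out / (n_in + n_out)


if __name__ == "__main__":
    for n_in, n_out in ((300, 100), (100, 100)):
        f, b = forward_gain(n_in, n_out), backward_gain(n_in, n_out)
        print(f"n_in={n_in}, n_out={n_out}: forward x{f}, backward x{b}, mean {(f + b) / 2}")
```

Neither gain equals 1 for a non-square layer, but the two always average to 1, and both are exactly 1 when the fan sizes match.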

Xavier and He: one factor of two apart

Xavier / Glorot (2010), for \tanh and sigmoid:

\sigma^2 = \frac{2}{n_\textrm{in} + n_\textrm{out}}.

He / Kaiming (2015), for ReLU:

\sigma^2 = \frac{2}{n_\textrm{in}}.

ReLU zeroes half a symmetric signal, halving its second moment (E[\textrm{ReLU}(z)^2] = \tfrac{1}{2}E[z^2]), so He doubles the weight variance to compensate.
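
The halving of the second moment is easy to check numerically. The sketch below assumes a standard normal pre-activation as one example of a symmetric signal; the sample size and seed are arbitrary.

```python
import random


def relu_second_moment_ratio(samples=100_000, seed=0):
    """Estimate E[ReLU(z)^2] / E[z^2] for z ~ N(0, 1)."""
    rng = random.Random(seed)
    total_sq = 0.0
    relu_sq = 0.0
    for _ in range(samples):
        z = rng.gauss(0.0, 1.0)
        total_sq += z * z
        if z > 0:  # ReLU passes z unchanged when positive, outputs 0 otherwise
            relu_sq += z * z
    return relu_sq / total_sq


if __name__ == "__main__":
    print("E[ReLU(z)^2] / E[z^2] is approximately", relu_second_moment_ratio())
```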

Rule of thumb: Xavier for \tanh/sigmoid, He for ReLU. Both ship as named initializers in most libraries.
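
Written out by hand, the two rules differ only in the variance they plug in. The sketch below uses Gaussian sampling from the standard library; xavier_std, he_std and init_weights are hypothetical helper names for illustration and do not correspond to any particular library's API.

```python
import math
import random


def xavier_std(n_in: int, n_out: int) -> float:
    """Standard deviation for sigma^2 = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))


def he_std(n_in: int) -> float:
    """Standard deviation for sigma^2 = 2 / n_in."""
    return math.sqrt(2.0 / n_in)


def init_weights(n_out: int, n_in: int, std: float, rng: random.Random):
    """n_out x n_in weight matrix with independent N(0, std^2) entries."""
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


if __name__ == "__main__":
    rng = random.Random(0)
    tanh_layer = init_weights(64, 128, xavier_std(128, 64), rng)  # for a tanh layer
    relu_layer = init_weights(64, 128, he_std(128), rng)          # for a ReLU layer
    print("Xavier std:", xavier_std(128, 64), " He std:", he_std(128))
    print("shapes:", len(tanh_layer), "x", len(tanh_layer[0]))
```

For a square layer the He standard deviation is \sqrt{2} times the Xavier one, which is the factor of two in variance from the slide title.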

The demonstration: 10⁸⁰ vs 10⁻¹⁵ vs flat

Initialization · payoff

All three regimes in one experiment: push a unit-scale signal through 50 ReLU layers of width 100 and track the second moment E[(h^{(l)})^2] layer by layer, under three weight scales.

\mathcal{N}(0,1): each layer gains \approx n_\textrm{in}/2 = 50\times, compounding to an astronomical 50^{50}, well beyond 10^{80}, by layer 50: the exploding regime.
Xavier: derived for linear layers, off by exactly the rectifier’s \tfrac12 per layer, so the signal vanishes like 2^{-l}, reaching \sim\!10^{-15}.
He: compensates for the rectifier and holds the scale essentially flat across all fifty layers.

Only the He-initialized stack delivers usable forward signals here. A backward variance calculation gives the same scale under mean-field independence assumptions. Run the sweep yourself in the notebook.
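
A compact version of the sweep can be sketched without any framework. This is not the notebook's code: it is a plain-Python approximation that pushes a single random input vector through the stack, shares the same Gaussian draws across the three scales so that only the scale differs, and uses an arbitrary seed. Exact values depend on the seed; only the orders of magnitude are meaningful.

```python
import math
import random


def relu_stack_second_moments(depth=50, width=100, seed=0):
    """Track the mean of h^2 after each ReLU layer under three weight scales."""
    rng = random.Random(seed)
    stds = {
        "naive": 1.0,                             # N(0, 1) weights
        "xavier": math.sqrt(2.0 / (2 * width)),   # sigma^2 = 2 / (n_in + n_out)
        "he": math.sqrt(2.0 / width),             # sigma^2 = 2 / n_in
    }
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-scale input
    h = {name: x[:] for name in stds}
    history = {name: [] for name in stds}
    for _ in range(depth):
        # One standard-normal matrix per layer, rescaled separately for each scheme.
        g = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]
        for name, std in stds.items():
            z = [std * sum(w * v for w, v in zip(row, h[name])) for row in g]
            h[name] = [max(0.0, zi) for zi in z]
            history[name].append(sum(v * v for v in h[name]) / width)
    return history


if __name__ == "__main__":
    moments = relu_stack_second_moments()
    for name, values in moments.items():
        print(f"{name:>6}: layer 1 = {values[0]:.3e}, layer 50 = {values[-1]:.3e}")
```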

Init is the floor, not the ceiling

Beyond

Good init buys a deep net that trains without NaNs. To reach hundreds of layers, modern architectures also re-normalize during training:

BatchNorm / LayerNorm re-standardize activations (zero mean, unit variance, before a learned scale and shift) on every forward pass, taking much of the burden off init.
Residual connections \mathbf{h}^{(\ell+1)} = \mathbf{h}^{(\ell)} + f(\mathbf{h}^{(\ell)}) give the gradient a shortcut, so shrinkage stops compounding.

We return to both in the chapters on modern CNNs.

Recap

Wrap-up

A deep gradient is a product of per-layer Jacobians, so it vanishes or explodes without care.
Vanishing: saturating activations (sigmoid/tanh) crush the signal; ReLU keeps it.
Exploding: over-large weights drive the product, and the loss, to NaN.

Fix the scale: init weights so \textrm{Var} is preserved, via Xavier (\tanh) and He (ReLU).
50-layer experiment: 10^{80} (naive) vs 10^{-15} (Xavier under ReLU) vs flat (He).
Break the symmetry: random init, never a constant.
At scale: normalization + residuals + careful init together reach 100+ layers.

Next (the generalization-in-deep-learning section): the model trains, but why does an over-parametrized network generalize at all?