%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()Why was deep learning hard before 2012? Nobody could train deep networks reliably — gradients either died at zero or blew up to infinity.
Three ingredients fixed it:
This deck makes the failure modes concrete.
For an L-layer network with hidden states \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \ldots, the gradient of the loss with respect to a weight in layer \ell is
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}} = \underbrace{\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}}}_{\text{loss}} \cdot \underbrace{\frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(L-1)}}}_{\mathbf{M}_L} \cdots \underbrace{\frac{\partial \mathbf{h}^{(\ell+1)}}{\partial \mathbf{h}^{(\ell)}}}_{\mathbf{M}_{\ell+1}} \cdot \frac{\partial \mathbf{h}^{(\ell)}}{\partial \mathbf{W}^{(\ell)}}.
It’s a product of L - \ell Jacobian matrices. Two ways this product can misbehave:
The sigmoid’s derivative peaks at \sigma'(0) = 0.25 and collapses to zero at the tails. In a 10-layer stack 0.25^{10} \approx 10^{-6} — gradients at layer 1 are a millionth of those near the output. ReLU fixes this: derivative is exactly 1 wherever the unit is active.
Multiply 100 random 4\times4 Gaussian matrices and watch the entries:
Random Gaussian matrices have spectral radius > 1, so the product diverges. Same effect on gradients in a deep net with poorly scaled weights — loss goes to NaN in a few hundred steps.
Forward pass through a linear layer with n_{\text{in}} inputs:
o_i = \sum_{j=1}^{n_{\text{in}}} w_{ij}\, x_j.
If w_{ij} \sim \mathcal{N}(0, \sigma^2) and inputs are i.i.d. with variance \gamma^2:
\mathbb{E}[o_i] = 0,\quad \mathrm{Var}[o_i] = n_{\text{in}}\, \sigma^2\, \gamma^2.
For variance to be preserved layer-to-layer (\mathrm{Var}[o] = \gamma^2):
\boxed{\sigma^2 = \frac{1}{n_{\text{in}}}}.
Same argument for the backward pass gives \sigma^2 = 1/n_{\text{out}}. Can’t satisfy both — so Xavier averages them.
Xavier / Glorot (2010):
\sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}.
Preserves variance both forward and backward. Designed for \tanh / sigmoid.
Kaiming / He (2015):
\sigma^2 = \frac{2}{n_{\text{in}}}.
Same idea, but compensates for ReLU halving the post-activation variance. Default for modern CNNs and Transformers.
Both ship as defaults in every framework. Bias starts at 0.
Set every weight to the same constant c:
Initialize randomly — even tiny noise breaks the permutation symmetry between hidden units. (SGD alone doesn’t.)
Init alone gets you “trains a 10-layer net without NaN”. Modern best practice for hundreds of layers stacks more on top:
The chapter on Modern CNNs revisits these.