%matplotlib inline
from d2l import torch as d2l
import torchWhy was deep learning hard before 2012? Nobody could train deep networks reliably — gradients either died at zero or blew up to infinity.
Three ingredients fixed it:
This deck makes the failure modes concrete.
For an L-layer network with hidden states \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \ldots, the gradient of the loss with respect to a weight in layer \ell is
\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(\ell)}} = \underbrace{\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}}}_{\text{loss}} \cdot \underbrace{\frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(L-1)}}}_{\mathbf{M}_L} \cdots \underbrace{\frac{\partial \mathbf{h}^{(\ell+1)}}{\partial \mathbf{h}^{(\ell)}}}_{\mathbf{M}_{\ell+1}} \cdot \frac{\partial \mathbf{h}^{(\ell)}}{\partial \mathbf{W}^{(\ell)}}.
It’s a product of L - \ell Jacobian matrices. Two ways this product can misbehave:
The sigmoid’s derivative peaks at \sigma'(0) = 0.25 and collapses to zero at the tails. In a 10-layer stack 0.25^{10} \approx 10^{-6} — gradients at layer 1 are a millionth of those near the output. ReLU fixes this: derivative is exactly 1 wherever the unit is active.
Multiply 100 random 4\times4 Gaussian matrices and watch the entries:
a single matrix
tensor([[-1.3924, 0.2950, 0.7600, 0.6810],
[ 1.0807, 0.6017, -0.8665, -0.3220],
[ 0.1475, 0.5553, -0.0394, -0.4018],
[-0.3803, -0.4364, 2.2436, 0.2780]])
after multiplying 100 matrices
tensor([[ 4.2099e+26, 1.0123e+27, 2.4245e+27, 3.2120e+27],
[-4.4350e+26, -1.0664e+27, -2.5542e+27, -3.3838e+27],
[-3.2478e+25, -7.8095e+25, -1.8705e+26, -2.4780e+26],
[ 7.2068e+26, 1.7329e+27, 4.1504e+27, 5.4986e+27]])
Random Gaussian matrices have spectral radius > 1, so the product diverges. Same effect on gradients in a deep net with poorly scaled weights — loss goes to NaN in a few hundred steps.
Forward pass through a linear layer with n_{\text{in}} inputs:
o_i = \sum_{j=1}^{n_{\text{in}}} w_{ij}\, x_j.
If w_{ij} \sim \mathcal{N}(0, \sigma^2) and inputs are i.i.d. with variance \gamma^2:
\mathbb{E}[o_i] = 0,\quad \mathrm{Var}[o_i] = n_{\text{in}}\, \sigma^2\, \gamma^2.
For variance to be preserved layer-to-layer (\mathrm{Var}[o] = \gamma^2):
\boxed{\sigma^2 = \frac{1}{n_{\text{in}}}}.
Same argument for the backward pass gives \sigma^2 = 1/n_{\text{out}}. Can’t satisfy both — so Xavier averages them.
Xavier / Glorot (2010):
\sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}.
Preserves variance both forward and backward. Designed for \tanh / sigmoid.
Kaiming / He (2015):
\sigma^2 = \frac{2}{n_{\text{in}}}.
Same idea, but compensates for ReLU halving the post-activation variance. Default for modern CNNs and Transformers.
Both ship as defaults in every framework. Bias starts at 0.
Set every weight to the same constant c:
Initialize randomly — even tiny noise breaks the permutation symmetry between hidden units. (SGD alone doesn’t.)
Init alone gets you “trains a 10-layer net without NaN”. Modern best practice for hundreds of layers stacks more on top:
The chapter on Modern CNNs revisits these.