from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()

Dropout (Srivastava, Hinton et al., 2014) is one of the simplest and most widely used regularizers for neural networks:
During training, set each hidden unit to zero independently with probability p. Rescale the survivors by 1/(1-p). Turn it off at test time.
This is counterintuitive, since we deliberately damage the network during training, but the trick is robust in practice. It is still used in modern Transformers, where a rate of about 10% is common.
Modern networks are typically overparameterized: they have more weights than training examples. Without a regularizer, gradient descent can simply memorize the training set.
Two complementary reasons dropout helps:
On every minibatch we randomly zero a fraction of hidden units; the network on this iteration is a thinned subnetwork. Across iterations we sample many subnetworks:
Figure: two of the five hidden units are zeroed by a single dropout draw. Each iteration samples a different subset.
At test time dropout is off and we use the full network. This approximates averaging the predictions of exponentially many thinned subnetworks, a kind of cheap ensemble.
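The "many thinned subnetworks" picture can be made concrete with a small sketch in plain NumPy. It draws one keep-mask per iteration for a layer of five hidden units and counts how many distinct subnetworks appear. The layer size, the rate p = 0.5, the 1000 iterations, and the seed are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
num_hidden, p = 5, 0.5           # illustrative values (assumptions)

# One Bernoulli keep-mask per training iteration: True means the unit survives.
masks = rng.random((1000, num_hidden)) > p

# Each distinct row is one thinned subnetwork; at most 2**num_hidden exist.
distinct = {tuple(row) for row in masks.tolist()}
keep_rate = masks.mean()

print(f"distinct subnetworks seen: {len(distinct)} of {2 ** num_hidden} possible")
print(f"empirical keep rate: {keep_rate:.3f} (target {1 - p})")
```

With n hidden units there are 2^n possible masks, which is why the implicit ensemble grows exponentially with layer width even though only one set of weights is stored.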
For each hidden unit h, replace h with
h' = \begin{cases} 0 & \text{with probability } p, \\ \dfrac{h}{1 - p} & \text{otherwise.} \end{cases}
The rescaling by 1/(1-p) is what makes \mathbb{E}[h'] = h. Without it, expected activations shrink by a factor of (1-p) during training but return to full scale at test time, which creates a train/test mismatch.
This is “inverted dropout”, the variant that modern frameworks implement.
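The expectation claim can be checked in one line by weighting the two branches of h' by their probabilities. For contrast, the same computation is shown for the unscaled variant, written \tilde{h} here (a symbol introduced only for this comparison):

```latex
\mathbb{E}[h'] = p \cdot 0 + (1 - p) \cdot \frac{h}{1 - p} = h,
\qquad
\mathbb{E}[\tilde{h}] = p \cdot 0 + (1 - p) \cdot h = (1 - p)\,h .
```

The second expression is the shrinkage by (1-p) that the rescaling removes.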
Sample a Bernoulli mask, multiply, rescale:
Quick check on a 2×8 input:
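A minimal sketch of the mask, multiply, rescale recipe, together with a quick check on a 2×8 input. It uses plain NumPy rather than `mxnet.np` so that it runs without MXNet; the optional `rng` argument and the seed are assumptions added for repeatability, and the deck's own `dropout_layer` may differ in detail.

```python
import numpy as np

def dropout_layer(X, dropout, rng=None):
    """Inverted dropout: zero entries with probability `dropout`, rescale the rest."""
    assert 0 <= dropout <= 1
    if dropout == 1:
        # Everything is dropped; return early to avoid dividing by zero.
        return np.zeros_like(X)
    rng = np.random.default_rng() if rng is None else rng  # `rng` is an assumption
    # Bernoulli keep-mask: each entry survives with probability 1 - dropout.
    mask = rng.random(X.shape) > dropout
    return mask.astype(X.dtype) * X / (1.0 - dropout)

# Quick check on a 2x8 input.
X = np.arange(16, dtype=np.float32).reshape(2, 8)
rng = np.random.default_rng(0)
out_0 = dropout_layer(X, 0.0, rng)     # nothing dropped: same values as X
out_half = dropout_layer(X, 0.5, rng)  # each entry is either 0 or 2 * X
out_1 = dropout_layer(X, 1.0, rng)     # everything dropped: all zeros
print(out_0, out_half, out_1, sep="\n\n")
```

The three rates cover the edge cases: p = 0 must leave the input unchanged, p = 1 must zero it, and p = 0.5 must double every surviving entry.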
Place dropout after the activation and before the next linear layer:
Linear → ReLU → Dropout(p₁) → Linear → ReLU → Dropout(p₂) → Linear
Convention: use a lower rate on early layers (low-level features need to be reliable) and a higher rate on later layers (high-level features overfit more easily).
Typical values:
class DropoutMLPScratch(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        self.save_hyperparameters()
        self.lin1 = nn.Dense(num_hiddens_1, activation='relu')
        self.lin2 = nn.Dense(num_hiddens_2, activation='relu')
        self.lin3 = nn.Dense(num_outputs)
        self.initialize()

    def forward(self, X):
        H1 = self.lin1(X)
        # Apply dropout only in training mode; inference uses the full network.
        if autograd.is_training():
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.lin2(H1)
        if autograd.is_training():
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)

Two hidden layers (256 each), dropout 0.5 between them:
Validation accuracy is better than for the plain MLP from the previous deck, and the gap between training and test loss shrinks visibly. Dropout helps most when model capacity exceeds what the data can constrain.
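nn.Dropout(p) is a stock layer that also handles the switch between training and inference. In MXNet Gluon, the layer is active only in training mode (for example inside autograd.record()) and is a no-op during prediction; model.eval() is the equivalent switch in PyTorch.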
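To make the mode switch concrete, here is a framework-free sketch in plain NumPy. `TinyDropout` and its `training` flag are hypothetical names used only for illustration; they mimic the behaviour of a framework layer but are not the Gluon API.

```python
import numpy as np

class TinyDropout:
    """Hypothetical stand-in for a framework dropout layer (not a real API)."""

    def __init__(self, p, seed=0):
        assert 0 <= p < 1
        self.p = p
        self.training = True              # a framework would manage this flag
        self.rng = np.random.default_rng(seed)

    def __call__(self, X):
        if not self.training or self.p == 0:
            return X                      # inference: identity, no rescaling needed
        mask = self.rng.random(X.shape) > self.p
        return mask * X / (1.0 - self.p)  # training: inverted dropout

layer = TinyDropout(p=0.5)
X = np.ones((2, 4))

train_out = layer(X)       # training mode: entries are 0.0 or 2.0
layer.training = False
eval_out = layer(X)        # inference mode: returns X unchanged
print(train_out, eval_out, sep="\n\n")
```

Because the rescaling already happens during training, the inference branch needs no correction factor. In the Gluon model that follows, the framework handles the flag.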
class DropoutMLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential()
        self.net.add(nn.Dense(num_hiddens_1, activation="relu"),
                     nn.Dropout(dropout_1),
                     nn.Dense(num_hiddens_2, activation="relu"),
                     nn.Dropout(dropout_2),
                     nn.Dense(num_outputs))
        self.net.initialize()

Several complementary explanations, none complete on its own:
Many modern deep nets reduce or replace dropout with BatchNorm / LayerNorm, which can provide a similar regularizing effect “for free”.
But dropout remains alive and well: