from d2l import tensorflow as d2l
import tensorflow as tf

Dropout (Srivastava, Hinton et al., 2014) is the simplest and most widely used regularizer for neural networks:
During training, set each hidden unit to zero independently with probability p. Rescale the survivors by 1/(1-p). Turn it off at test time.
Counterintuitive, since we actively damage the network mid-training, but the trick is robust in practice. It still ships in modern Transformers, where a dropout rate of about 0.1 is common.
Modern networks are overparameterized — more weights than training examples. Without a regularizer, gradient descent happily memorizes the training set.
Two complementary reasons dropout helps:
On every minibatch we randomly zero a fraction of hidden units; the network on this iteration is a thinned subnetwork. Across iterations we sample many subnetworks:
Figure: two of the five hidden units are zeroed by a single dropout draw; each iteration samples a different subset.
At test time dropout is off — we use the full network. Effectively we average exponentially many thinned subnetworks (a kind of cheap ensemble).
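The ensemble reading can be checked exactly in a toy case. The sketch below is an illustration under stated assumptions, not the deck's own code: the activations `h` and weights `w` are made-up values, the layer after dropout is a single linear readout, and p = 0.5 so that every mask is equally likely. It enumerates all 2^5 thinned subnetworks and compares their average output with the full network.

```python
import itertools
import tensorflow as tf

p = 0.5  # drop probability; at 0.5 every keep/drop mask is equally likely
h = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical hidden activations
w = tf.constant([0.5, -1.0, 0.25, 2.0, -0.5])  # hypothetical next-layer weights

# All 2**5 = 32 keep/drop masks, one thinned subnetwork per row.
masks = tf.constant(list(itertools.product([0.0, 1.0], repeat=5)))

# Output of each thinned subnetwork, survivors rescaled by 1 / (1 - p).
thinned_outputs = tf.reduce_sum(masks * h / (1 - p) * w, axis=1)

ensemble_mean = tf.reduce_mean(thinned_outputs)  # uniform average over all masks
full_output = tf.reduce_sum(h * w)               # full network, dropout off

print(float(ensemble_mean), float(full_output))  # 4.75 4.75
```

The match is exact only because the readout is linear. With nonlinear layers after the dropout, the full network approximates the ensemble average rather than reproducing it.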
Per hidden unit h, replace it with

$$h' = \begin{cases} 0 & \text{with probability } p, \\ \dfrac{h}{1 - p} & \text{otherwise.} \end{cases}$$

The rescaling by 1/(1-p) is what makes $\mathbb{E}[h'] = h$. Without it, expected activations shrink by a factor of (1-p) during training but return to full scale at test time, which creates a train/test mismatch.
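Both expectations follow in one line from the two cases of the definition. The symbol \tilde{h} for the unscaled variant (h if kept, 0 otherwise) is introduced here only for illustration:

```latex
\begin{aligned}
\text{without rescaling:} \quad
\mathbb{E}[\tilde{h}] &= p \cdot 0 + (1 - p) \cdot h = (1 - p)\, h, \\
\text{with rescaling:} \quad
\mathbb{E}[h'] &= p \cdot 0 + (1 - p) \cdot \frac{h}{1 - p} = h.
\end{aligned}
```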
Rescaling at training time, rather than at test time, is called “inverted dropout”; it is the version modern frameworks use.
Sample a Bernoulli mask, multiply, rescale:
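The definition of `dropout_layer`, which the from-scratch model later calls, does not appear in this section. The block below is a minimal sketch that follows the three steps just named. The argument names and the early return for a rate of 1 are assumptions, not necessarily the original code.

```python
import tensorflow as tf

def dropout_layer(X, dropout):
    """Inverted dropout: zero each entry with probability `dropout`, rescale the rest."""
    assert 0 <= dropout <= 1
    if dropout == 1:
        # Every unit is dropped; return early to avoid dividing by zero below.
        return tf.zeros_like(X)
    # Bernoulli keep-mask: an entry is True (kept) with probability 1 - dropout.
    mask = tf.random.uniform(shape=tf.shape(X), minval=0, maxval=1) < 1 - dropout
    # Zero the dropped entries and scale the survivors by 1 / (1 - dropout).
    return tf.cast(mask, dtype=X.dtype) * X / (1.0 - dropout)

# 2x8 test input holding the values 0..15.
X = tf.reshape(tf.range(16, dtype=tf.float32), (2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))
```

The mask is random, so the zero positions for a rate of 0.5 change from run to run; every surviving entry is exactly twice its input.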
Quick check on a 2×8 input:
dropout_p = 0: tf.Tensor(
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]], shape=(2, 8), dtype=float32)
dropout_p = 0.5: tf.Tensor(
[[ 0. 2. 4. 6. 8. 0. 12. 14.]
[16. 18. 20. 0. 0. 26. 28. 30.]], shape=(2, 8), dtype=float32)
dropout_p = 1: tf.Tensor(
[[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]], shape=(2, 8), dtype=float32)
Apply dropout after the activation and before the next linear layer:
Linear → ReLU → Dropout(p₁) → Linear → ReLU → Dropout(p₂) → Linear
Convention: less on early layers (low-level features need to be reliable), more later (high-level features overfit).
Typical values:
class DropoutMLPScratch(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        self.save_hyperparameters()
        self.lin1 = tf.keras.layers.Dense(num_hiddens_1, activation='relu')
        self.lin2 = tf.keras.layers.Dense(num_hiddens_2, activation='relu')
        self.lin3 = tf.keras.layers.Dense(num_outputs)

    def forward(self, X):
        # Flatten each example before the first dense layer.
        H1 = self.lin1(tf.reshape(X, (tf.shape(X)[0], -1)))
        # Apply dropout only while training.
        if self.training:
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.lin2(H1)
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)

Two hidden layers (256 each), dropout 0.5 between them:
Validation accuracy is better than the plain MLP from the previous deck — the gap between train and test loss shrinks visibly. Dropout shines when capacity exceeds the data.
tf.keras.layers.Dropout(rate) is a stock layer. It also handles the train vs. inference switch: Keras calls the layer with training=False at evaluation time, and dropout becomes a no-op:
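As a standalone check of that switch, separate from the d2l class that follows, here is a minimal sketch assuming only stock Keras. The input of ones and the rate of 0.5 are chosen purely for illustration.

```python
import tensorflow as tf

layer = tf.keras.layers.Dropout(0.5)
x = tf.ones((4, 1000))

# Inference mode: the layer passes its input through unchanged.
y_eval = layer(x, training=False)

# Training mode: entries are either dropped (0) or rescaled by 1 / (1 - 0.5) = 2.
y_train = layer(x, training=True)
is_zero_or_two = tf.logical_or(y_train == 0.0, y_train == 2.0)

print(bool(tf.reduce_all(y_eval == x)))     # True
print(bool(tf.reduce_all(is_zero_or_two)))  # True
```

The concise model below places the same layer after each hidden Dense layer.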
class DropoutMLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(num_hiddens_1, activation=tf.nn.relu),
            tf.keras.layers.Dropout(dropout_1),
            tf.keras.layers.Dense(num_hiddens_2, activation=tf.nn.relu),
            tf.keras.layers.Dropout(dropout_2),
            tf.keras.layers.Dense(num_outputs)])

Several complementary explanations, none complete on its own:
Modern deep nets often replace dropout with BatchNorm / LayerNorm, which can provide some regularization “for free” as a side effect of normalization.
But dropout remains alive and well: