Dropout

Dive into Deep Learning · §4.6

Regularizing with dropout
Randomly silence hidden units during training, and a network that would have memorized instead generalizes.

A network with room to memorize

Motivation

Modern nets are overparameterized: more weights than training points. Past the interpolation threshold, plain gradient descent can drive training error to zero by memorizing.

We want a knob that keeps capacity but discourages the model from leaning too hard on the training set.

Test error past the interpolation threshold: capacity alone does not buy generalization.

Dropout: damage the network on purpose

The idea

Srivastava, Hinton et al. (2014) gave a simple recipe:

Each training step, set each hidden unit to zero independently with probability p, then rescale the survivors by 1/(1-p). At test time, turn it off.

Counterintuitive (we actively cripple the network mid-training), yet it is among the most reliable regularizers we have, and it still ships in modern Transformers.

Why It Works

three views: a thinned net, an ensemble, broken co-adaptation

View 1: each step trains a thinned subnetwork

Zeroing units removes them from this step’s forward and backward pass. What is left is a thinned subnetwork; the next step samples a different one.

Here h_2 and h_5 are dropped, so this step's output cannot depend on them. Because any unit may be missing on a given step, the network cannot afford to let a single unit dominate.

A single dropout draw: two of five hidden units zeroed, leaving a thinned network.
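One way to check the claim that dropped units leave both the forward and the backward pass is to differentiate through a fixed mask. The sketch below is illustrative only: the activations, readout weights, and the helper name `thinned_output` are assumptions made up for this example, with the mask chosen to drop h_2 and h_5 as in the figure.

```python
import jax
import jax.numpy as jnp

def thinned_output(h, w, mask, p):
    """Linear readout of hidden activations h under a fixed keep-mask."""
    h_dropped = mask * h / (1.0 - p)  # inverted dropout with a given mask
    return jnp.dot(w, h_dropped)

p = 0.4
h = jnp.array([0.5, -1.0, 2.0, 0.3, 1.5])     # hypothetical activations h_1..h_5
w = jnp.array([1.0, 2.0, -1.0, 0.5, 3.0])     # hypothetical readout weights
mask = jnp.array([1.0, 0.0, 1.0, 1.0, 0.0])   # keep h_1, h_3, h_4; drop h_2, h_5

# Gradients of the output with respect to the activations and the weights.
grad_h, grad_w = jax.grad(thinned_output, argnums=(0, 1))(h, w, mask, p)
print(grad_h)  # zero in positions 2 and 5: no signal flows back into dropped units
print(grad_w)  # zero in positions 2 and 5: their outgoing weights get no update
```

Only the surviving units receive gradient on this step, which is what "training a thinned subnetwork" means in practice.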

View 2: an exponentially large ensemble

A net with n hidden units has 2^n possible masks, so 2^n thinned subnetworks, all sharing one set of weights. Today’s model has two 256-unit layers: 2^{512} \approx 10^{154} subnetworks.
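The size of that number is easy to verify with exact integer arithmetic. This is a quick check using only the Python standard library; the variable names are chosen for this example.

```python
import math

n_units = 2 * 256                    # two hidden layers of 256 units each
num_masks = 2 ** n_units             # exact: Python integers have arbitrary precision
exponent = n_units * math.log10(2)   # log10 of the number of masks

print(len(str(num_masks)))           # number of decimal digits in 2^512
print(round(exponent, 2))            # the power of ten it corresponds to
```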

Train: sample one mask per step; the update nudges the shared weights to help that subnetwork.
Test: run the full net with dropout off, which approximates averaging all 2^n subnetworks.

Cheap model averaging: ensembles average away their members’ idiosyncrasies, so we expect variance reduction. The weight-scaling rule reproduces the ensemble average exactly only for a single linear layer; in deeper nets the full pass only approximates it, landing closer to a geometric mean of the subnetworks’ predictions.
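For a network small enough to enumerate, the averaging claim can be checked directly. The sketch below assumes a toy setup invented for this example (four hidden activations, a linear readout, and a tanh readout for contrast) and averages over all 2^4 masks, weighting each by its probability.

```python
import itertools
import jax.numpy as jnp

p = 0.5                                 # drop probability
h = jnp.array([0.5, -1.0, 2.0, 0.3])    # hypothetical hidden activations
w = jnp.array([1.0, 2.0, -1.0, 0.5])    # hypothetical readout weights

def ensemble_mean(readout):
    """Arithmetic mean of readout over all 2^n masks, weighted by mask probability."""
    n = h.shape[0]
    total = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=n):
        mask = jnp.array(bits)
        kept = mask.sum()
        prob = (1 - p) ** kept * p ** (n - kept)
        total += prob * readout(mask * h / (1 - p))  # inverted-dropout subnetwork
    return total

linear = lambda z: jnp.dot(w, z)
nonlinear = lambda z: jnp.tanh(jnp.dot(w, z))

print(ensemble_mean(linear), linear(h))        # these agree
print(ensemble_mean(nonlinear), nonlinear(h))  # these do not
```

With the linear readout the full, unmasked pass equals the ensemble average; once a nonlinearity follows, the single pass is only an approximation.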

View 3: noise breaks co-adaptation

Because no unit can count on any specific partner being present, each is pushed to learn a feature that is useful on its own:

Anti-co-adaptation: robust, redundant features instead of features that only work in specific combinations.
Smoothness: Bishop (1995) showed that, for small noise, training with input-noise injection is equivalent to adding a smoothness (Tikhonov) penalty on the learned function; dropout is the same idea moved inside the network.

Three lenses, one mechanism: structured noise during training.

The arithmetic: keep the expectation

Replace each activation h with the random variable

h' = \begin{cases} 0 & \text{with probability } p, \\[2pt] \dfrac{h}{1 - p} & \text{otherwise.} \end{cases}

The factor 1/(1-p) is the unique constant that keeps \mathbb{E}[h'] = p\cdot 0 + (1-p)\dfrac{h}{1-p} = h.

Rescaling during training is called inverted dropout, and it is what every modern framework implements. The original 2014 formulation instead multiplied the weights by 1-p at test time, which is equivalent in expectation. Inverting moves all the bookkeeping into training, which is exactly why a Dropout layer can be a no-op in eval.
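The expectation identity can also be checked numerically. The following is a small Monte Carlo sketch; the seed, the activation value, and the drop rate are arbitrary choices for illustration.

```python
import jax
import jax.numpy as jnp

p = 0.3                           # drop probability
h = 2.0                           # a single activation value
key = jax.random.PRNGKey(0)       # arbitrary seed

# Many independent keep/drop decisions for the same activation.
keep = jax.random.bernoulli(key, 1.0 - p, (100_000,))
h_prime = jnp.where(keep, h / (1.0 - p), 0.0)

print(h_prime.mean())   # close to h
print(keep.mean())      # close to 1 - p
```

Any single draw is either 0 or h/(1-p), never h itself; only the average over draws recovers h.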

From Scratch

mask, rescale, and drop in the forward pass

A dropout layer in a few lines

Sample a Bernoulli keep-mask from a uniform draw, multiply, then rescale the survivors by 1/(1-p) to restore the expectation:

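import jax
import jax.numpy as jnp

def dropout_layer(X, dropout, key):
    assert 0 <= dropout <= 1
    if dropout == 1:  # drop everything; also avoids dividing by zero below
        return jnp.zeros_like(X)
    mask = jax.random.uniform(key, X.shape) > dropout  # True = keep, with prob. 1 - dropout
    return jnp.asarray(mask, dtype=X.dtype) * X / (1.0 - dropout)  # rescale survivors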

Sanity check on a 2×8 input

X = jnp.arange(16, dtype=jnp.float32).reshape(2, 8)
keys = jax.random.split(d2l.get_key(), 3)
print('dropout_p = 0:', dropout_layer(X, 0, keys[0]))
print('dropout_p = 0.5:', dropout_layer(X, 0.5, keys[1]))
print('dropout_p = 1:', dropout_layer(X, 1, keys[2]))

dropout_p = 0: [[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  9. 10. 11. 12. 13. 14. 15.]]
dropout_p = 0.5: [[ 0.  0.  4.  0.  0.  0. 12. 14.]
 [ 0.  0. 20. 22. 24. 26.  0. 30.]]
dropout_p = 1: [[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]

p = 0 → identity, nothing dropped.
p = 0.5 → about half the entries zero, survivors doubled (1/(1-0.5)=2).
p = 1 → everything dropped (degenerate).
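Because JAX random functions are pure, the mask is fully determined by the key passed in. The sketch below illustrates that with a larger input; it repeats the logic of dropout_layer so it runs on its own, and the seed and shapes are arbitrary choices for this example.

```python
import jax
import jax.numpy as jnp

def dropout_layer(X, dropout, key):
    # Same logic as the function above, repeated so this block is standalone.
    if dropout == 1:
        return jnp.zeros_like(X)
    mask = jax.random.uniform(key, X.shape) > dropout
    return mask.astype(X.dtype) * X / (1.0 - dropout)

X = jnp.ones((4, 1000))
key_a, key_b = jax.random.split(jax.random.PRNGKey(42))  # arbitrary seed

out_a1 = dropout_layer(X, 0.5, key_a)
out_a2 = dropout_layer(X, 0.5, key_a)   # same key: identical mask
out_b = dropout_layer(X, 0.5, key_b)    # different key: a fresh mask

print(bool(jnp.array_equal(out_a1, out_a2)))
print(bool(jnp.array_equal(out_a1, out_b)))
print(float((out_a1 == 0).mean()))      # fraction dropped, near 0.5
```

This is why training code must supply a new key on every step: reusing a key would silently train the same thinned subnetwork again and again.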

Where dropout goes in an MLP

Apply it to each hidden layer’s output, after the activation:

Linear → ReLU → Dropout → Linear → ReLU → Dropout → Linear

Convention: a smaller rate near the input (low-level features must stay reliable), larger deeper in. Active in training only.

Dropout sits on the hidden activations of the MLP.

The model: two hidden layers, dropout gated on training

dropout_layer slots into forward right after each hidden activation, guarded by the deterministic flag (False while training) so evaluation always runs the full, unmasked network:

class DropoutMLPScratch(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr, num_inputs=784, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = nnx.Rngs(params=d2l.get_key(), dropout=d2l.get_key()) \
            if rngs is None else rngs
        self.lin1 = nnx.Linear(num_inputs, num_hiddens_1, rngs=rngs)
        self.lin2 = nnx.Linear(num_hiddens_1, num_hiddens_2, rngs=rngs)
        self.lin3 = nnx.Linear(num_hiddens_2, num_outputs, rngs=rngs)
        self.rngs = rngs
        self.deterministic = False

    def set_view(self, *, deterministic):
        self.deterministic = deterministic

    def forward(self, X):
        H1 = nnx.relu(self.lin1(X.reshape(X.shape[0], -1)))
        if not self.deterministic:
            H1 = dropout_layer(H1, self.dropout_1, self.rngs.dropout())
        H2 = nnx.relu(self.lin2(H1))
        if not self.deterministic:
            H2 = dropout_layer(H2, self.dropout_2, self.rngs.dropout())
        return self.lin3(H2)

The payoff: the train/val gap stays shut

From Scratch · payoff

Two 256-unit hidden layers, dropout 0.2 after the first and 0.5 after the second (the gentler-near-the-input convention in action), on Fashion-MNIST:

The train and validation curves track closely across 30 epochs: the gap a plain 256-256 MLP would open up is held in check.

Concise

one stock layer, train/eval handled for you

Just add a Dropout layer

nnx.Dropout(p) is a stock layer that also knows the train vs. eval switch: in eval mode it becomes a no-op, with no rescaling needed.

class DropoutMLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr, num_inputs=784, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = nnx.Rngs(params=d2l.get_key(), dropout=d2l.get_key()) \
            if rngs is None else rngs
        self.lin1 = nnx.Linear(num_inputs, num_hiddens_1, rngs=rngs)
        self.drop1 = nnx.Dropout(dropout_1, rngs=rngs)
        self.lin2 = nnx.Linear(num_hiddens_1, num_hiddens_2, rngs=rngs)
        self.drop2 = nnx.Dropout(dropout_2, rngs=rngs)
        self.lin3 = nnx.Linear(num_hiddens_2, num_outputs, rngs=rngs)

    def forward(self, X):
        X = X.reshape((X.shape[0], -1))
        X = self.drop1(nnx.relu(self.lin1(X)))
        X = self.drop2(nnx.relu(self.lin2(X)))
        return self.lin3(X)

Train the concise model

Same hyperparameters, same result; the layer does the masking and rescaling internally:

model = DropoutMLP(**hparams)
trainer.fit(model, data)

Dropout today

Currency

Dropout was transformative for the dense vision nets of the mid-2010s; its role has since narrowed.

CNNs mostly replace it with batch norm, which supplies similar noise-driven regularization.
Transformers use it lightly (rates 0.0–0.1), often only on the classifier head.

The two combine poorly. Dropout changes the variance of the activations between training and evaluation, so the running statistics that batch norm accumulates during training no longer match what it sees at eval time (the "variance shift" of Li et al., 2019). Avoid placing dropout before a BN layer.
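The mismatch is easy to reproduce on synthetic data. The sketch below assumes zero-mean, unit-variance activations and a drop rate of 0.5, both chosen for illustration, and compares the variance a following layer would see with dropout on and off.

```python
import jax
import jax.numpy as jnp

p = 0.5
key_x, key_mask = jax.random.split(jax.random.PRNGKey(0))  # arbitrary seed

x = jax.random.normal(key_x, (200_000,))        # zero-mean, unit-variance activations
keep = jax.random.bernoulli(key_mask, 1.0 - p, x.shape)
x_train = jnp.where(keep, x / (1.0 - p), 0.0)   # what the next layer sees in training
x_eval = x                                      # what it sees at eval (dropout off)

print(x_train.var())   # near 1 / (1 - p) = 2 for this input
print(x_eval.var())    # near 1
```

Inverted dropout preserves the mean but not the variance, so a batch norm layer downstream would store a variance roughly twice the one it meets at eval time in this setup.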

Still a cheap, reliable regularizer that combines well with weight decay and data augmentation, and the conceptual seed of a whole family of stochastic-regularization methods.

Summary

Wrap-up

Dropout zeros each hidden unit with probability p during training, then rescales survivors by 1/(1-p).
The rescaling keeps \mathbb{E}[h']=h (inverted dropout), so test-time code is unchanged.
Off at test time: the full network runs, unmasked.

Place it after the activation, before the next linear layer; gentler near the input (0.2, then 0.5 here).
Three views: a thinned subnetwork each step, an implicit 2^n ensemble, broken co-adaptation.
nnx.Dropout(p) does it all and respects train/eval.

Exercise 5 flips the switch: keep dropout on at test time, average 20 passes, and you get uncertainty estimates (MC dropout). Next, the Kaggle house-prices section puts everything in this chapter to work on a real competition.