Dropout

Dive into Deep Learning · §4.6

Regularizing with dropout
Randomly silence hidden units during training, and a network that would have memorized instead generalizes.

A network with room to memorize

Motivation

Modern nets are overparameterized: more weights than training points. Past the interpolation threshold, plain gradient descent can drive training error to zero by memorizing.

We want a knob that keeps capacity but discourages the model from leaning too hard on the training set.

Test error past the interpolation threshold: capacity alone does not buy generalization.

Dropout: damage the network on purpose

The idea

Srivastava, Hinton et al. (2014) gave a simple recipe:

Each training step, set each hidden unit to zero independently with probability p, then rescale the survivors by 1/(1-p). At test time, turn it off.

Counterintuitive (we actively cripple the network mid-training), yet it is among the most reliable regularizers we have, and it still ships in modern Transformers.

Why It Works

three views: a thinned net, an ensemble, broken co-adaptation

View 1: each step trains a thinned subnetwork

Zeroing units removes them from this step’s forward and backward pass. What is left is a thinned subnetwork; the next step samples a different one.

Here h_2 and h_5 are dropped, so the output cannot depend on them, and no single unit can dominate.

A single dropout draw: two of five hidden units zeroed, leaving a thinned network.
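One way to see the thinned subnetwork concretely is to check gradients. The sketch below is a toy example, not the model used later: the layer sizes, the random weights, and the fixed mask (dropping h_2 and h_5, as in the figure) are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, for repeatability

# Toy net: 3 inputs -> 5 hidden units -> 1 output (sizes are illustrative).
x = torch.randn(3)
W1 = torch.randn(5, 3, requires_grad=True)
w2 = torch.randn(5, requires_grad=True)

# Fixed mask that drops h_2 and h_5 (indices 1 and 4).
p = 0.4
mask = torch.tensor([1., 0., 1., 1., 0.])

h = torch.relu(W1 @ x)
h_thinned = mask * h / (1 - p)   # inverted-dropout rescaling
out = w2 @ h_thinned
out.backward()

# Weights into and out of the dropped units receive zero gradient this step.
print(W1.grad[[1, 4]])
print(w2.grad[[1, 4]])
```

For this step the dropped units are simply absent: nothing flows through them forward, and nothing flows back to their weights.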

View 2: an exponentially large ensemble

A net with n hidden units has 2^n possible masks, so 2^n thinned subnetworks, all sharing one set of weights. Today’s model has two 256-unit layers: 2^{512} \approx 10^{154} subnetworks.

Train: sample one mask per step; the update nudges the shared weights to help that subnetwork.
Test: run the full net with dropout off, which approximates averaging all 2^n subnetworks.

Cheap model averaging: ensembles average away their members’ idiosyncrasies, so we expect variance reduction. The weight-scaling rule is exact only for a single linear layer; in deeper nets the full pass computes closer to a geometric mean of the subnetworks’ predictions.
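The ensemble claim can be checked by brute force on a layer small enough to enumerate. The sketch below assumes a made-up four-unit layer (the activations and weights are arbitrary values, not taken from the model in this section) and averages over all 2^4 masks, first for a linear output and then with a ReLU on top.

```python
from itertools import product

p = 0.5                       # drop probability
h = [1.0, -2.0, 0.5, 3.0]     # hidden activations (made-up values)
w = [0.3, 0.8, -1.2, 0.1]     # output weights (made-up values)
n = len(h)

def relu(z):
    return max(z, 0.0)

def thinned_output(mask):
    # Inverted dropout: survivors are scaled by 1 / (1 - p).
    return sum(wi * mi * hi / (1 - p) for wi, mi, hi in zip(w, mask, h))

def mask_prob(mask):
    # Each unit is kept independently with probability 1 - p.
    prob = 1.0
    for mi in mask:
        prob *= (1 - p) if mi else p
    return prob

masks = list(product([0, 1], repeat=n))        # all 2^n masks
full = sum(wi * hi for wi, hi in zip(w, h))    # dropout off

avg_linear = sum(mask_prob(m) * thinned_output(m) for m in masks)
avg_relu = sum(mask_prob(m) * relu(thinned_output(m)) for m in masks)

print(len(masks))              # 16
print(full, avg_linear)        # agree up to rounding
print(relu(full), avg_relu)    # differ once a nonlinearity is involved
```

With a linear output, the single full pass matches the ensemble average. With the ReLU it does not, which is why the test-time pass is only an approximation in deeper nets.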

View 3: noise breaks co-adaptation

Because no unit can count on any specific partner being present, each is pushed to learn a feature that is useful on its own:

Anti-co-adaptation: robust, redundant features instead of features that only work in specific combinations.
Smoothness: Bishop (1995) showed that injecting small input noise is equivalent to a smoothness (Tikhonov) penalty on the learned function; dropout applies the same idea inside the network.

Three lenses, one mechanism: structured noise during training.

The arithmetic: keep the expectation

Replace each activation h with the random variable

h' = \begin{cases} 0 & \text{with probability } p, \\[2pt] \dfrac{h}{1 - p} & \text{otherwise.} \end{cases}

The factor 1/(1-p) is the unique constant that keeps \mathbb{E}[h'] = p\cdot 0 + (1-p)\dfrac{h}{1-p} = h.

Rescaling during training is called inverted dropout, and it is what modern frameworks implement. The original 2014 formulation instead multiplied the weights by 1-p at test time, which is equivalent in expectation. Inverting moves all the bookkeeping into training, which is why a Dropout layer can be a no-op in eval mode.
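The same calculation gives the second moment, which shows how much noise the mask injects. A short derivation, treating h as a fixed number and using only the definition of h' above:

```latex
\mathbb{E}[h'^2] = p \cdot 0 + (1-p)\left(\frac{h}{1-p}\right)^2 = \frac{h^2}{1-p},
\qquad
\operatorname{Var}[h'] = \frac{h^2}{1-p} - h^2 = \frac{p}{1-p}\,h^2 .
```

So the mean is preserved but the variance is not: at p = 0.5 the noise variance equals h^2, and it grows without bound as p approaches 1.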

From Scratch

mask, rescale, and drop in the forward pass

A dropout layer in three lines

Sample a Bernoulli keep-mask from a uniform draw, multiply, then rescale the survivors by 1/(1-p) to restore the expectation:

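import torch

def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # Everything dropped: return zeros and avoid dividing by zero below
    if dropout == 1:
        return torch.zeros_like(X)
    # Keep-mask: each entry survives with probability 1 - dropout
    mask = (torch.rand_like(X) > dropout).to(X.dtype)
    return mask * X / (1.0 - dropout)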

Sanity check on a 2×8 input

X = torch.arange(16, dtype=torch.float32).reshape((2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))

dropout_p = 0: tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
dropout_p = 0.5: tensor([[ 0.,  0.,  0.,  6.,  8., 10., 12.,  0.],
        [ 0., 18., 20., 22., 24.,  0., 28.,  0.]])
dropout_p = 1: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

p = 0 → identity, nothing dropped.
p = 0.5 → about half the entries zero, survivors doubled (1/(1-0.5)=2).
p = 1 → everything dropped (degenerate).
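Sixteen entries are too few to see the statistics, so a larger check may help. The sketch below repeats dropout_layer so that it runs on its own; the all-ones input, the tensor size, and the seed are arbitrary choices made for illustration.

```python
import torch

def dropout_layer(X, dropout):
    # Same function as above, repeated so this snippet is self-contained.
    assert 0 <= dropout <= 1
    if dropout == 1:
        return torch.zeros_like(X)
    mask = (torch.rand_like(X) > dropout).to(X.dtype)
    return mask * X / (1.0 - dropout)

torch.manual_seed(0)          # arbitrary seed, for repeatability
X = torch.ones(1000, 1000)    # one million entries, all equal to 1

for p in (0.2, 0.5, 0.8):
    Y = dropout_layer(X, p)
    zero_frac = (Y == 0).float().mean().item()
    print(f'p = {p}: zero fraction {zero_frac:.3f}, mean {Y.mean().item():.3f}')
```

The zero fraction should sit close to p and the mean close to 1 in every case, which is the expectation-preserving property at work.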

Where dropout goes in an MLP

Apply it to each hidden layer’s output, after the activation:

Linear → ReLU → Dropout → Linear → ReLU → Dropout → Linear

Convention: a smaller rate near the input (low-level features must stay reliable), larger deeper in. Active in training only.

Dropout sits on the hidden activations of the MLP.

The model: two hidden layers, dropout gated on training

dropout_layer slots into forward right after each hidden activation, guarded by the training flag so evaluation always runs the full, unmasked network:

from torch import nn
from d2l import torch as d2l

class DropoutMLPScratch(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        # Stores the constructor arguments (dropout_1, dropout_2, ...) as attributes
        self.save_hyperparameters()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        # Flatten each example, then apply the first hidden layer
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        if self.training:  # dropout is applied only while training
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        # No dropout on the output layer
        return self.lin3(H2)

The payoff: the train/val gap stays shut

From Scratch · payoff

Two 256-unit hidden layers, dropout 0.2 after the first and 0.5 after the second (the gentler-near-the-input convention in action), on Fashion-MNIST:

The train and validation curves track closely across 30 epochs: the gap a plain 256-256 MLP would open up is held in check.

Concise

one stock layer, train/eval handled for you

Just add a Dropout layer

nn.Dropout(p) is a stock layer that also knows the train vs. eval switch: in eval mode it becomes a no-op, with no rescaling needed.

class DropoutMLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(num_hiddens_1), nn.ReLU(), nn.Dropout(dropout_1),
            nn.LazyLinear(num_hiddens_2), nn.ReLU(), nn.Dropout(dropout_2),
            nn.LazyLinear(num_outputs))
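The train/eval switch is easy to check on the layer in isolation. This is a minimal sketch with a made-up eight-element input, chosen with no zeros so that dropped entries are easy to spot; it is separate from the model above.

```python
import torch
from torch import nn

layer = nn.Dropout(p=0.5)
X = torch.arange(1., 9.)      # entries 1..8, none of them zero

layer.train()                 # training mode: mask and rescale
Y_train = layer(X)
# Every entry is either dropped (0) or doubled (1 / (1 - 0.5) = 2).
print(Y_train)

layer.eval()                  # evaluation mode: identity
Y_eval = layer(X)
print(torch.equal(Y_eval, X))  # True
```

Calling model.train() or model.eval() on a containing module flips the same flag on every Dropout layer inside it, which is what makes the explicit training guard from the scratch version unnecessary here.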

Train the concise model

Same hyperparameters, comparable results; the layer does the masking and rescaling internally:

# hparams, trainer, and data are assumed to be set up as for the from-scratch model
model = DropoutMLP(**hparams)
trainer.fit(model, data)

Dropout today

Currency

Dropout was transformative for the dense vision nets of the mid-2010s; its role has since narrowed.

CNNs mostly replace it with batch norm, which supplies similar noise-driven regularization.
Transformers use it lightly (rates 0.0–0.1), often only on the classifier head.

The two combine poorly. Batch norm accumulates its running statistics while dropout is perturbing the variance of the activations, so those statistics no longer match what the layer sees at eval time, when dropout is off. Avoid placing dropout before a BN layer (Li et al., 2019).
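The variance mismatch is easy to observe numerically. The sketch below is only an illustration of the effect, not a reproduction of the analysis in Li et al.: the ReLU-of-Gaussian activations stand in for a hidden layer, and the sample size, rate, and seed are arbitrary assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)                       # arbitrary seed, for repeatability
h = torch.relu(torch.randn(1_000_000))     # stand-in for hidden activations

drop = nn.Dropout(p=0.5)

drop.train()
var_train = drop(h).var().item()   # variance a following BN layer sees in training

drop.eval()
var_eval = drop(h).var().item()    # variance it sees at evaluation time

print(f'train variance {var_train:.3f}, eval variance {var_eval:.3f}')
```

The training-time variance comes out noticeably larger, so running statistics collected in training would be miscalibrated for the unmasked activations seen in eval mode.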

Still a cheap, reliable regularizer that combines well with weight decay and data augmentation, and the conceptual seed of a whole family of stochastic-regularization methods.

Summary

Wrap-up

Dropout zeros each hidden unit with probability p during training, then rescales survivors by 1/(1-p).
The rescaling keeps \mathbb{E}[h']=h (inverted dropout), so test-time code is unchanged.
Off at test time: the full network runs, unmasked.

Place it after the activation, before the next linear layer; gentler near the input (0.2, then 0.5 here).
Three views: a thinned subnetwork each step, an implicit 2^n ensemble, broken co-adaptation.
nn.Dropout(p) does it all and respects train/eval.

Exercise 5 flips the switch: keep dropout on at test time, run 20 stochastic forward passes, and the spread of their predictions around the average gives an uncertainty estimate (MC dropout). Next, the Kaggle house-prices section puts everything in this chapter to work on a Kaggle competition.