A regression that overfits

Weight Decay

Weight decay limits overfitting

One of the simplest regularization techniques is to add a penalty on the squared \ell_2 norm of the weights to the loss:

L_{\text{reg}}(\mathbf{w}, b) = L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2.

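The gradient gains a +\lambda\mathbf{w} term, so each update subtracts an extra \eta\lambda\mathbf{w}: \mathbf{w} \leftarrow (1 - \eta\lambda)\mathbf{w} - \eta\nabla_{\mathbf{w}} L. The weights decay toward zero at every step, hence the name. A single hyperparameter \lambda \ge 0 (lambd or wd in code) sets how strong the pull is; \lambda = 0 recovers the unregularized loss.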
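The update rule above can be checked numerically. The following is a minimal NumPy sketch, not the d2l/MXNet code used in the rest of this section; the helper names (grad_mse, sgd_step_with_penalty, sgd_step_decay_form) are hypothetical, and the loss is assumed to be the mean of half squared errors of a linear model. It compares one step on the penalized loss with the equivalent "shrink, then step" form.

```python
import numpy as np

def grad_mse(w, b, X, y):
    """Gradient of mean(0.5 * (Xw + b - y)^2) with respect to w and b."""
    err = X @ w + b - y                      # shape (n, 1)
    return X.T @ err / len(X), err.mean()

def sgd_step_with_penalty(w, b, X, y, lr, lambd):
    """One step on L + (lambd / 2) * ||w||^2: the gradient gains + lambd * w."""
    gw, gb = grad_mse(w, b, X, y)
    return w - lr * (gw + lambd * w), b - lr * gb

def sgd_step_decay_form(w, b, X, y, lr, lambd):
    """Same step written as: scale w by (1 - lr * lambd), then step on L."""
    gw, gb = grad_mse(w, b, X, y)
    return (1 - lr * lambd) * w - lr * gw, b - lr * gb

# Arbitrary small problem, made up for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 1))
w = rng.normal(size=(3, 1))
b = 0.0

w1, b1 = sgd_step_with_penalty(w, b, X, y, lr=0.01, lambd=3.0)
w2, b2 = sgd_step_decay_form(w, b, X, y, lr=0.01, lambd=3.0)
print(np.allclose(w1, w2), np.isclose(b1, b2))
```

Both forms give the same weights, and the bias is untouched by the penalty because only \mathbf{w} appears in it.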

Why? An overparameterized model fit to a tiny dataset can memorize the noise. Penalizing large weights limits how flexibly the model can bend to the training points, which keeps the fit tame.

Setup

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()

Generate a tiny dataset (20 training and 100 validation examples) in which the target depends on 200 inputs, each with a small weight of 0.01:

y = 0.05 + \sum_{i=1}^{200} 0.01\,x_i + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 0.01^2).

Far more parameters than data — perfect overfitting setup:

class Data(d2l.DataModule):
    def __init__(self, num_train, num_val, num_inputs, batch_size):
        self.save_hyperparameters()                
        n = num_train + num_val 
        self.X = d2l.randn(n, num_inputs)
        noise = d2l.randn(n, 1) * 0.01
        w, b = d2l.ones((num_inputs, 1)) * 0.01, 0.05
        self.y = d2l.matmul(self.X, w) + b + noise

    def get_dataloader(self, train):
        i = slice(0, self.num_train) if train else slice(self.num_train, None)
        return self.get_tensorloader([self.X, self.y], train, i)
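The "more parameters than data" point can be illustrated without any framework. The sketch below is plain NumPy and is an assumption-laden stand-in for the Data class above: it draws data from the same formula with a fixed seed, fits a bias-free linear model with np.linalg.lstsq (the minimum-norm least-squares solution, not the SGD training used later), and compares training and validation error.

```python
import numpy as np

rng = np.random.default_rng(0)
num_train, num_val, num_inputs = 20, 100, 200
n = num_train + num_val

# y = 0.05 + sum_i 0.01 * x_i + noise, with noise ~ N(0, 0.01^2)
X = rng.normal(size=(n, num_inputs))
true_w = np.full((num_inputs, 1), 0.01)
y = X @ true_w + 0.05 + rng.normal(scale=0.01, size=(n, 1))

X_train, y_train = X[:num_train], y[:num_train]
X_val, y_val = X[num_train:], y[num_train:]

# 20 equations, 200 unknowns: lstsq returns the minimum-norm exact fit
# (bias term omitted for brevity)
w_fit, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = float(np.mean((X_train @ w_fit - y_train) ** 2))
val_mse = float(np.mean((X_val @ w_fit - y_val) ** 2))
print('train MSE:', train_mse)
print('val MSE:  ', val_mse)
```

With 200 free weights and only 20 examples, the training targets are matched to numerical precision, noise included, while the validation error remains far larger.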

The L2 penalty

The penalty itself is one line:

def l2_penalty(w):
    return d2l.reduce_sum(w**2) / 2
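As a quick sanity check, here is a NumPy counterpart of the function above (l2_penalty_np is a hypothetical name, not part of d2l). It shows that the penalty is half the squared \ell_2 norm rather than the norm itself, and uses central finite differences to confirm that its gradient with respect to \mathbf{w} is \mathbf{w}, which is where the \lambda\mathbf{w} term in the update comes from.

```python
import numpy as np

def l2_penalty_np(w):
    """NumPy counterpart of l2_penalty: half the squared L2 norm."""
    return np.sum(w ** 2) / 2

w = np.array([[3.0], [4.0]])
penalty = l2_penalty_np(w)      # (9 + 16) / 2 = 12.5
norm = np.linalg.norm(w)        # sqrt(9 + 16) = 5.0

# Central finite differences approximate d(penalty)/dw, which should equal w
eps = 1e-6
grad = np.zeros_like(w)
for i in range(w.shape[0]):
    step = np.zeros_like(w)
    step[i] = eps
    grad[i] = (l2_penalty_np(w + step) - l2_penalty_np(w - step)) / (2 * eps)

print(penalty, norm)
print(np.allclose(grad, w, atol=1e-6))
```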

Adding weight decay to the model

Subclass the from-scratch linear regression to add the penalty into the loss:

class WeightDecayScratch(d2l.LinearRegressionScratch):
    def __init__(self, num_inputs, lambd, lr, sigma=0.01):
        super().__init__(num_inputs, lr, sigma)
        self.save_hyperparameters()
        
    def loss(self, y_hat, y):
        return (super().loss(y_hat, y) +
                self.lambd * l2_penalty(self.w))

data = Data(num_train=20, num_val=100, num_inputs=200, batch_size=5)
trainer = d2l.Trainer(max_epochs=10)

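def train_scratch(lambd):
    model = WeightDecayScratch(num_inputs=200, lambd=lambd, lr=0.01)
    model.board.yscale = 'log'  # losses span several orders of magnitude
    trainer.fit(model, data)
    # l2_penalty returns ||w||^2 / 2, not the norm itself
    print('L2 penalty of w:', float(l2_penalty(model.w)))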

Without regularization → overfit

With \lambda = 0, the model drives the loss on the 20 training examples toward zero while the validation loss stays high, a textbook case of overfitting:

train_scratch(0)

With weight decay → controlled

\lambda = 3: training loss is higher, but validation loss is much lower. Generalization wins:

train_scratch(3)

The narrower gap between training and validation loss is the payoff of regularization.

The framework version

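Most frameworks let the optimizer apply weight decay directly, adding the \lambda\mathbf{w} term to the gradient so that no manual penalty code is needed. In Gluon the hyperparameter is passed to the trainer as wd, and setting wd_mult = 0 on the bias parameters leaves them undecayed: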

class WeightDecay(d2l.LinearRegression):
    def __init__(self, wd, lr):
        super().__init__(lr)
        self.save_hyperparameters()
        self.wd = wd
        
    def configure_optimizers(self):
        for p in self.collect_params('.*bias').values():
            p.wd_mult = 0
        return gluon.Trainer(self.collect_params(),
                             'sgd', 
                             {'learning_rate': self.lr, 'wd': self.wd})

model = WeightDecay(wd=3, lr=0.01)
model.board.yscale = 'log'
trainer.fit(model, data)

# l2_penalty returns ||w||^2 / 2, not the norm itself
print('L2 penalty of w:', float(l2_penalty(model.get_w_b()[0])))

(Note: a framework's weight decay setting typically applies to all parameters, biases included. The code above exempts the bias by setting its wd_mult to 0; other frameworks achieve the same thing with per-parameter optimizer groups.)
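To make the per-parameter behavior concrete, here is a small NumPy imitation of an SGD step with a decay multiplier per parameter. It is a sketch of the idea only, not Gluon's implementation: sgd_update and the dictionary layout are hypothetical, and the gradients are set to zero so that only the decay term acts.

```python
import numpy as np

def sgd_update(params, grads, lr, wd):
    """SGD step with per-parameter decay: p <- p - lr * (g + wd * wd_mult * p)."""
    for name, p in params.items():
        effective_wd = wd * p['wd_mult']     # wd_mult = 0 switches decay off
        p['value'] = p['value'] - lr * (grads[name] + effective_wd * p['value'])

params = {
    'weight': {'value': np.array([1.0, -2.0]), 'wd_mult': 1.0},
    'bias': {'value': np.array([0.5]), 'wd_mult': 0.0},
}
# Zero gradients isolate the effect of weight decay
grads = {'weight': np.zeros(2), 'bias': np.zeros(1)}

sgd_update(params, grads, lr=0.01, wd=3.0)
print(params['weight']['value'])   # scaled by 1 - 0.01 * 3 = 0.97
print(params['bias']['value'])     # unchanged
```

The weight shrinks by the factor 1 - \eta\lambda while the bias keeps its value, which is the behavior the wd_mult = 0 setting is meant to produce.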

Recap

  • \ell_2-regularized loss = original loss + \frac{\lambda}{2} \|\mathbf{w}\|_2^2.
  • Per-step effect: gradient gets +\lambda \mathbf{w}, so the update shrinks weights by subtracting \eta\lambda\mathbf{w}.
  • Hyperparameter \lambda (“wd” in code) trades training fit for generalization. Tune it on a validation set.
  • Frameworks expose this as an optimizer argument: wd in Gluon, weight_decay in PyTorch.
  • Other regularizers follow the same logic of constraining the model: \ell_1 penalties (which encourage sparsity), elastic net, dropout, and so on. \ell_2 is the usual first thing to try.
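The advice to tune \lambda on a validation set can be sketched as a small sweep. This example swaps in closed-form ridge regression (no bias term) for the minibatch SGD training used above, so its numbers will not match the d2l training curves; the data, the seed, the grid of \lambda values, and the helpers ridge_fit and mse are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_train, num_val, num_inputs = 20, 100, 200
n = num_train + num_val

# Same generating formula as the Data class, redone in NumPy
X = rng.normal(size=(n, num_inputs))
y = X @ np.full((num_inputs, 1), 0.01) + 0.05 + rng.normal(scale=0.01, size=(n, 1))
X_train, y_train = X[:num_train], y[:num_train]
X_val, y_val = X[num_train:], y[num_train:]

def ridge_fit(X, y, lambd):
    """Minimize mean((Xw - y)^2) / 2 + lambd * ||w||^2 / 2 in closed form."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lambd * np.eye(d), X.T @ y / m)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

lambdas = [0.01, 0.1, 1.0, 3.0, 10.0]
results = []
for lambd in lambdas:
    w = ridge_fit(X_train, y_train, lambd)
    results.append((lambd, mse(X_train, y_train, w), mse(X_val, y_val, w),
                    float(np.linalg.norm(w))))
    print('lambda=%5.2f  train=%.2e  val=%.2e  ||w||=%.4f' % results[-1])

# Model selection uses validation error only, never training error
best_lambd = min(results, key=lambda r: r[2])[0]
print('lambda with lowest validation MSE:', best_lambd)
```

For ridge regression, a larger \lambda always gives a smaller weight norm and a training error that is no lower. How the validation error responds depends on the data, which is why it is measured rather than assumed.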