Linear Regression Implementation from Scratch

End-to-end linear regression with nothing but tensor ops:

  1. Model — a Module with w and b parameters and a forward.
  2. Loss — squared error.
  3. Optimizer — minibatch SGD, written by hand.
  4. Training loop — the Trainer’s fit_epoch, also from scratch.

The next chapter does the same with nn.LazyLinear + MSELoss + SGD in two lines. This one shows what those two lines hide.

Parameters

Initialize w from a zero-mean Gaussian with a small standard deviation (sigma=0.01 by default) and b at zero:

%matplotlib inline
from d2l import torch as d2l
import torch
class LinearRegressionScratch(d2l.Module):
    """The linear regression model implemented from scratch."""
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.w = d2l.normal(0, sigma, (num_inputs, 1), requires_grad=True)
        self.b = d2l.zeros(1, requires_grad=True)

Both tensors are created with requires_grad=True (or the framework equivalent) so that autograd records the operations applied to them and can compute their gradients.
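As a minimal sketch of what that tracking buys, the snippet below uses plain PyTorch with made-up values (the names w, x, and y here are illustrative and unrelated to the model's attributes). Only the tensor created with requires_grad=True receives a gradient after backward().

```python
import torch

# A parameter-like leaf tensor: autograd records operations on it.
w = torch.tensor([1.0, 2.0], requires_grad=True)
# A plain data tensor: no tracking requested.
x = torch.tensor([3.0, 4.0])

y = (w * x).sum()   # y = 1*3 + 2*4 = 11
y.backward()        # fills w.grad with dy/dw, which equals x

print(w.grad)       # tensor([3., 4.])
print(x.grad)       # None, because x was not created with requires_grad=True
```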

Forward pass

The model is one matrix-vector product plus a bias — \hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b:

@d2l.add_to_class(LinearRegressionScratch)
def forward(self, X):
    return d2l.matmul(X, self.w) + self.b
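A quick shape check can make the forward pass concrete. The sketch below uses torch.matmul directly instead of the d2l wrapper, with an illustrative batch of four examples and two features; the sizes and values are assumptions chosen for the example.

```python
import torch

n, d = 4, 2                          # illustrative batch size and feature count
X = torch.ones(n, d)                 # minibatch of features, shape (4, 2)
w = torch.tensor([[2.0], [-3.4]])    # weights as a column, shape (2, 1)
b = torch.tensor([4.2])              # bias, shape (1,)

# (4, 2) @ (2, 1) -> (4, 1); the bias broadcasts across the batch dimension.
y_hat = torch.matmul(X, w) + b
print(y_hat.shape)                   # torch.Size([4, 1])
```

Keeping w as a column of shape (num_inputs, 1) means the predictions come out as a column too, which matters when they are later compared with the labels.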

Loss

Squared error per example, averaged across the batch:

\ell(\hat{y}, y) = \tfrac{1}{2}(\hat{y} - y)^2.

@d2l.add_to_class(LinearRegressionScratch)
def loss(self, y_hat, y):
    l = (y_hat - y) ** 2 / 2
    return d2l.reduce_mean(l)
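To see the loss and its gradient on concrete numbers, here is a small sketch in plain PyTorch with three made-up predictions and labels. It also shows a common pitfall, assuming the labels arrive as a flat vector instead of a column: broadcasting silently produces a 3x3 matrix of differences.

```python
import torch

y_hat = torch.tensor([[2.5], [0.0], [2.0]], requires_grad=True)
y = torch.tensor([[3.0], [-0.5], [2.0]])

l = (y_hat - y) ** 2 / 2     # per-example loss, shape (3, 1)
loss = l.mean()              # (0.125 + 0.125 + 0.0) / 3
loss.backward()

print(loss.item())           # approximately 0.0833
print(y_hat.grad)            # equals (y_hat - y) / 3, thanks to the factor 1/2

# Pitfall: a (3, 1) prediction minus a (3,) label broadcasts to (3, 3).
bad_shape = (y_hat.detach() - y.reshape(-1)).shape
print(bad_shape)             # torch.Size([3, 3])
```

The factor 1/2 in the loss cancels the 2 from differentiating the square, so each example's gradient is just its residual divided by the batch size.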

Optimizer: minibatch SGD

The update rule \theta \leftarrow \theta - \eta \nabla_\theta L written out by hand:

class SGD(d2l.HyperParameters):
    """Minibatch stochastic gradient descent."""
    def __init__(self, params, lr):
        self.save_hyperparameters()

    def step(self):
        for param in self.params:
            param -= self.lr * param.grad

    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()

The model class wires it up in configure_optimizers:

@d2l.add_to_class(LinearRegressionScratch)
def configure_optimizers(self):
    return SGD([self.w, self.b], self.lr)
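The in-place update in step only works when gradient tracking is switched off, and zero_grad matters because backward() accumulates into .grad. The sketch below illustrates both points on a single scalar parameter in plain PyTorch; the loss and learning rate are made up for the example.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
lr = 0.1                          # illustrative learning rate

loss = (w ** 2).sum()             # d(loss)/dw = 2w = 2.0
loss.backward()

# With tracking on, an in-place update of a leaf that requires grad is rejected.
try:
    w -= lr * w.grad
    raised = False
except RuntimeError:
    raised = True

with torch.no_grad():             # change the value without recording the op
    w -= lr * w.grad              # 1.0 - 0.1 * 2.0 = 0.8

w.grad.zero_()                    # otherwise the next backward() adds to it
print(raised, w.item(), w.grad.item())
```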

Training step

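Once per minibatch the Trainer runs forward, loss, backward, and step. The forward pass and the loss happen inside the model's training_step, while backward and step are called from fit_epoch further down. The only piece defined here is prepare_batch, a hook that returns the batch unchanged: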

@d2l.add_to_class(d2l.Trainer)
def prepare_batch(self, batch):
    return batch
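The four stages can be written out for a single minibatch in plain PyTorch. In this sketch, training_step is a hypothetical stand-in that assumes the batch is a (features, labels) pair; it is not the d2l implementation, and the data and learning rate are made up.

```python
import torch

def forward(X, w, b):
    return torch.matmul(X, w) + b

def squared_loss(y_hat, y):
    return ((y_hat - y) ** 2 / 2).mean()

def training_step(batch, w, b):
    """Hypothetical stand-in: forward pass plus loss on one minibatch."""
    X, y = batch                   # assumes a (features, labels) pair
    return squared_loss(forward(X, w, b), y)

w = torch.zeros(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
batch = (torch.tensor([[1.0, 2.0], [3.0, 4.0]]),
         torch.tensor([[1.0], [2.0]]))

loss = training_step(batch, w, b)  # forward + loss: (0.5 + 2.0) / 2 = 1.25
loss.backward()                    # backward: fills w.grad and b.grad
with torch.no_grad():              # step: w <- w - lr * grad, with lr = 0.1
    w -= 0.1 * w.grad
    b -= 0.1 * b.grad
print(loss.item(), w.flatten(), b)
```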

The whole epoch

The Trainer walks the train and val loaders once per epoch, calling the steps:

@d2l.add_to_class(d2l.Trainer)
def fit_epoch(self):
    self.model.train()
    for batch in self.train_dataloader:
        loss = self.model.training_step(self.prepare_batch(batch))
        self.optim.zero_grad()
        loss.backward()
        if self.gradient_clip_val > 0:  # To be discussed later
            self.clip_gradients(self.gradient_clip_val, self.model)
        # The `no_grad` only needs to wrap the parameter update; the
        # scratch `SGD.step` does an in-place `param -= lr * grad`,
        # which would otherwise be flagged as a leaf-tensor mutation.
        with torch.no_grad():
            self.optim.step()
        self.train_batch_idx += 1
    if self.val_dataloader is None:
        return
    self.model.eval()
    for batch in self.val_dataloader:
        with torch.no_grad():
            self.model.validation_step(self.prepare_batch(batch))
        self.val_batch_idx += 1
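The validation pass is wrapped in torch.no_grad() because no gradients are needed there. A small sketch with made-up values shows the difference: the same expression is tracked in the training branch and untracked inside the no_grad block.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])

train_out = w * x                 # tracked: part of the autograd graph
with torch.no_grad():
    val_out = w * x               # untracked: no graph is built

print(train_out.requires_grad, val_out.requires_grad)  # True False
```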

Run training on the synthetic dataset:

model = LinearRegressionScratch(2, lr=0.03)
data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
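For readers who want to see the same experiment without d2l, the sketch below approximates it in plain PyTorch. The dataset size, noise level, and batch size are assumptions chosen for illustration and may differ from what SyntheticRegressionData uses; the true parameters, learning rate, and epoch count follow the run above.

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = Xw + b + noise. Sizes and noise level are assumptions.
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2
n, batch_size, lr, epochs = 1000, 32, 0.03, 10
X = torch.randn(n, 2)
y = torch.matmul(X, true_w.reshape(-1, 1)) + true_b + 0.01 * torch.randn(n, 1)

# Parameters: small Gaussian weights, zero bias.
w = (0.01 * torch.randn(2, 1)).requires_grad_()
b = torch.zeros(1, requires_grad=True)

for epoch in range(epochs):
    perm = torch.randperm(n)                      # reshuffle every epoch
    for i in range(0, n, batch_size):
        idx = perm[i:i + batch_size]
        loss = ((torch.matmul(X[idx], w) + b - y[idx]) ** 2 / 2).mean()
        for p in (w, b):                          # zero_grad
            if p.grad is not None:
                p.grad.zero_()
        loss.backward()
        with torch.no_grad():                     # SGD step
            for p in (w, b):
                p -= lr * p.grad

w_err = true_w - w.detach().reshape(true_w.shape)
b_err = true_b - b.detach()
print(f'error in estimating w: {w_err}')
print(f'error in estimating b: {b_err}')
```

The exact errors depend on the seed and the assumed data settings, but they should be small compared with the true parameter values.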

Did it learn the right thing?

We know the true w and b — compare with the learned values:

with torch.no_grad():
    print(f'error in estimating w: {data.w - d2l.reshape(model.w, data.w.shape)}')
    print(f'error in estimating b: {data.b - model.b}')

error in estimating w: tensor([ 0.0006, -0.0003])
error in estimating b: tensor([-0.0005])

The small residual errors come from training on a finite, noisy sample for a fixed number of epochs; shrinking them further takes more data, more training, or a better optimizer.

Recap

  • A Module for linear regression boils down to __init__, forward, loss, configure_optimizers.
  • A hand-rolled SGD is ~10 lines.
  • Trainer.fit_epoch is the glue that the training APIs of PyTorch, TensorFlow, JAX, and MXNet hide.
  • Synthetic data lets us check that the optimizer recovered the ground-truth parameters.