Linear Regression Implementation from Scratch

End-to-end linear regression with nothing but tensor ops:

  1. Model — a Module with w and b parameters and a forward.
  2. Loss — squared error.
  3. Optimizer — minibatch SGD, written by hand.
  4. Training loop — the Trainer’s fit_epoch, also from scratch.

The next chapter does the same in two lines with framework built-ins (in PyTorch terms, nn.LazyLinear + MSELoss + SGD). This one shows what those two lines hide.

Parameters

Initialize w from a zero-mean Gaussian with standard deviation sigma (0.01 by default) and b at zero:

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()
class LinearRegressionScratch(d2l.Module):
    """The linear regression model implemented from scratch."""
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
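        self.w = d2l.normal(0, sigma, (num_inputs, 1))  # weights ~ N(0, sigma^2)
        self.b = d2l.zeros(1)                           # bias starts at zero
        # Allocate gradient buffers so autograd records gradients for w and b
        self.w.attach_grad()
        self.b.attach_grad()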

Calling attach_grad() on w and b (MXNet's counterpart of requires_grad=True in other frameworks) tells autograd to track them and store their gradients.

Forward pass

The model is one matrix-vector product plus a bias — \hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b:

@d2l.add_to_class(LinearRegressionScratch)
def forward(self, X):
    return d2l.matmul(X, self.w) + self.b
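
To make the shapes concrete, here is a minimal NumPy sketch of the same computation. NumPy stands in for the MXNet tensors used above, and the numbers are made up for illustration. X holds one example per row, w is a column vector, and the one-element bias broadcasts across all rows.

```python
import numpy as np

# Hypothetical values, chosen only to make the arithmetic easy to follow.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # shape (3, 2): 3 examples, 2 features
w = np.array([[2.0], [-1.0]])   # shape (2, 1): one weight per feature
b = np.array([0.5])             # shape (1,): broadcast over all rows

y_hat = X @ w + b               # shape (3, 1): one prediction per example
print(y_hat.shape)              # (3, 1)
print(y_hat.ravel())            # [0.5 2.5 4.5]
```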

Loss

Squared error per example, averaged across the batch:

\ell(\hat{y}, y) = \tfrac{1}{2}(\hat{y} - y)^2.

@d2l.add_to_class(LinearRegressionScratch)
def loss(self, y_hat, y):
    l = (y_hat - y) ** 2 / 2
    return d2l.reduce_mean(l)
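
A small worked example of the loss, again as a NumPy sketch with made-up numbers (squared_loss is a hypothetical stand-in for the method above). It also shows a common pitfall: if y_hat has shape (n, 1) and y has shape (n,), the subtraction broadcasts to an (n, n) matrix instead of raising an error.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Mean over the batch of (y_hat - y)^2 / 2."""
    return np.mean((y_hat - y) ** 2 / 2)

y_hat = np.array([[2.5], [0.0], [1.0]])
y = np.array([[3.0], [0.0], [-1.0]])

# Per-example errors are -0.5, 0, 2, so the halved squares are 0.125, 0, 2.
loss = squared_loss(y_hat, y)
print(loss)                      # 2.125 / 3, about 0.708

# Shape pitfall: (3, 1) minus (3,) silently broadcasts to (3, 3).
mismatched_shape = (y_hat - y.ravel()).shape
print(mismatched_shape)          # (3, 3)
```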

Optimizer: minibatch SGD

The update rule \theta \leftarrow \theta - \eta \nabla_\theta L written out by hand:

class SGD(d2l.HyperParameters):
    """Minibatch stochastic gradient descent."""
    def __init__(self, params, lr):
        self.save_hyperparameters()

    def step(self, _):
        for param in self.params:
            param -= self.lr * param.grad
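
The step relies on -= updating each array in place, so the arrays held by the model change too. A minimal NumPy sketch of that behaviour, with hard-coded gradients instead of autograd (sgd_step and the gradient values are illustrative assumptions, not part of d2l):

```python
import numpy as np

def sgd_step(params, grads, lr):
    """Apply theta <- theta - lr * grad to each parameter, in place."""
    for param, grad in zip(params, grads):
        param -= lr * grad      # in place: the caller's array is modified

w = np.array([[1.0], [2.0]])
b = np.array([0.0])
grad_w = np.array([[0.5], [-1.0]])   # made-up gradients for illustration
grad_b = np.array([2.0])

sgd_step([w, b], [grad_w, grad_b], lr=0.1)
# w is now [0.95, 2.1] and b is [-0.2]
print(w.ravel(), b)
```

Writing param = param - lr * grad inside the loop would only rebind the local name and leave w and b untouched.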

The model class wires it up in configure_optimizers:

@d2l.add_to_class(LinearRegressionScratch)
def configure_optimizers(self):
    return SGD([self.w, self.b], self.lr)

Training step

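Each minibatch goes through four phases: forward, loss, backward, step. The Trainer drives them in fit_epoch; the only hook needed here is prepare_batch, which returns the batch unchanged: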

@d2l.add_to_class(d2l.Trainer)
def prepare_batch(self, batch):
    return batch
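
The four phases of a minibatch update (forward, loss, backward, step) can be sketched without autograd by writing the gradients out analytically. For the batch-averaged squared loss they are \nabla_{\mathbf{w}} L = \tfrac{1}{n}\mathbf{X}^\top(\hat{\mathbf{y}} - \mathbf{y}) and \partial L / \partial b = \tfrac{1}{n}\sum_i (\hat{y}_i - y_i). The NumPy function below is a hypothetical illustration of that sequence, not the d2l implementation.

```python
import numpy as np

def training_step(X, y, w, b, lr):
    """One minibatch update; returns the loss measured before the update."""
    y_hat = X @ w + b                    # forward
    err = y_hat - y
    loss = np.mean(err ** 2 / 2)         # loss
    grad_w = X.T @ err / len(X)          # backward (analytic gradients)
    grad_b = np.array([err.mean()])
    w -= lr * grad_w                     # step (in place)
    b -= lr * grad_b
    return loss

# Tiny noise-free batch generated from w = [2, -3.4], b = 4.2.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([[2.0], [-3.4]]) + 4.2
w = np.zeros((2, 1))
b = np.zeros(1)

before = training_step(X, y, w, b, lr=0.1)
after = np.mean((X @ w + b - y) ** 2 / 2)   # loss on the same batch after the step
print(before, after)
```

With a small enough learning rate, the loss on the same batch is lower after the step than before it.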

The whole epoch

The Trainer walks the training loader and, if one exists, the validation loader once per epoch, calling training_step or validation_step on each batch:

@d2l.add_to_class(d2l.Trainer)
def fit_epoch(self):
    for batch in self.train_dataloader:
        with autograd.record():
            loss = self.model.training_step(self.prepare_batch(batch))
        loss.backward()
        if self.gradient_clip_val > 0:
            self.clip_gradients(self.gradient_clip_val, self.model)
        self.optim.step(1)  # the hand-written SGD.step ignores its argument
        self.train_batch_idx += 1
    if self.val_dataloader is None:
        return
    for batch in self.val_dataloader:
        self.model.validation_step(self.prepare_batch(batch))
        self.val_batch_idx += 1
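
The same epoch structure, one updating pass over training minibatches followed by a read-only pass over validation minibatches, can be sketched in plain NumPy. The minibatches and fit_epoch functions below are hypothetical stand-ins for the data loaders and the Trainer method, and the dataset size, split, and seed are assumptions chosen for illustration.

```python
import numpy as np

def minibatches(X, y, batch_size, rng=None):
    """Yield (X_batch, y_batch) pairs; shuffle only if rng is given."""
    idx = np.arange(len(X))
    if rng is not None:
        rng.shuffle(idx)
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

def fit_epoch(train, val, w, b, lr, batch_size, rng):
    """One updating pass over train, then one evaluation pass over val."""
    for Xb, yb in minibatches(*train, batch_size, rng):
        err = Xb @ w + b - yb
        w -= lr * (Xb.T @ err) / len(Xb)
        b -= lr * err.mean()
    # Validation: no parameter updates, only the average of per-batch losses.
    val_losses = [np.mean((Xb @ w + b - yb) ** 2 / 2)
                  for Xb, yb in minibatches(*val, batch_size)]
    return float(np.mean(val_losses))

rng = np.random.default_rng(0)
true_w, true_b = np.array([[2.0], [-3.4]]), 4.2
X = rng.normal(size=(200, 2))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=(200, 1))
train, val = (X[:150], y[:150]), (X[150:], y[150:])

w, b = np.zeros((2, 1)), np.zeros(1)
history = [fit_epoch(train, val, w, b, lr=0.03, batch_size=32, rng=rng)
           for _ in range(10)]
print(history[0], history[-1])   # validation loss after the first and last epoch
```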

Run training on the synthetic dataset:

model = LinearRegressionScratch(2, lr=0.03)
data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
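
For readers who want to see what such a dataset looks like, here is a hedged NumPy sketch that assumes the data follows y = Xw + b + Gaussian noise with standard normal features. The function name, noise scale, and example count are illustrative assumptions and are not necessarily the defaults of d2l.SyntheticRegressionData.

```python
import numpy as np

def synthetic_regression_data(w, b, num_examples, noise=0.01, seed=0):
    """Return (X, y) with y = X w + b + Gaussian noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(num_examples, len(w)))
    y = X @ np.reshape(w, (-1, 1)) + b
    y += rng.normal(scale=noise, size=y.shape)
    return X, y

X, y = synthetic_regression_data(w=np.array([2.0, -3.4]), b=4.2,
                                 num_examples=1000)
print(X.shape, y.shape)          # (1000, 2) (1000, 1)

# The residual against the noise-free line has a spread close to `noise`.
residual = y - (X @ np.array([[2.0], [-3.4]]) + 4.2)
print(residual.std())
```

Because the generating w and b are known, any fitted model can be scored directly against them, which is what the comparison in the next section does.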

Did it learn the right thing?

We know the true w and b — compare with the learned values:

print(f'error in estimating w: {data.w - d2l.reshape(model.w, data.w.shape)}')
print(f'error in estimating b: {data.b - model.b}')

The small remaining errors come from the noise in the data, the finite number of training examples, and the finite number of SGD steps; shrinking them further requires more data, more epochs, or a better optimizer.

Recap

  • A Module for linear regression boils down to __init__, forward, loss, configure_optimizers.
  • A hand-rolled SGD takes fewer than 10 lines.
  • The Trainer.fit_epoch glue is what the training APIs of PyTorch, TensorFlow, JAX, and MXNet hide.
  • Synthetic data lets us check that the optimizer recovered the ground-truth parameters.