Linear Regression Implementation from Scratch

End-to-end linear regression with nothing but tensor ops:

  1. Model — a Module with w and b parameters and a forward.
  2. Loss — squared error.
  3. Optimizer — minibatch SGD, written by hand.
  4. Training loop — the Trainer’s fit_epoch, also from scratch.

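The next chapter does the same in a couple of lines with built-in components: nn.LazyLinear, MSELoss, and SGD in PyTorch, or Keras counterparts such as tf.keras.layers.Dense, tf.keras.losses.MeanSquaredError, and tf.keras.optimizers.SGD in TensorFlow. This chapter shows what those lines hide.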

Parameters

Initialize w from a zero-mean Gaussian with a small standard deviation (0.01 by default), and set b to zero:

%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf
class LinearRegressionScratch(d2l.Module):
    """The linear regression model implemented from scratch."""
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
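        # Draw the initial weights from N(0, sigma^2); start the bias at zero.
        w = tf.random.normal((num_inputs, 1), mean=0, stddev=sigma)
        b = tf.zeros(1)
        # tf.Variable turns both tensors into trainable state.
        self.w = tf.Variable(w, trainable=True)
        self.b = tf.Variable(b, trainable=True)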

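Wrapping w and b in tf.Variable with trainable=True (TensorFlow's counterpart of requires_grad=True in PyTorch) lets tf.GradientTape record operations on them and compute their gradients.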

Forward pass

The model is one matrix-vector product plus a bias — \hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b:

@d2l.add_to_class(LinearRegressionScratch)
def forward(self, X):
    return d2l.matmul(X, self.w) + self.b
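
To make the shapes concrete, here is a minimal sketch of the same computation in NumPy, which stands in for the TensorFlow ops. The values of X, w, and b are made up for illustration. X has shape (n, d), w has shape (d, 1), and the single bias value broadcasts across all n rows.

```python
import numpy as np

# Hypothetical minibatch: n = 3 examples, d = 2 features.
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
w = np.array([[2.0], [-3.4]])   # shape (d, 1), like self.w
b = np.array([4.2])             # shape (1,), like self.b

# The matrix product has shape (n, 1); b broadcasts over the rows.
y_hat = X @ w + b
print(y_hat.shape)  # (3, 1)
```

Keeping w as a column of shape (d, 1) rather than a flat vector means the predictions come out as a column too, which matters once they are subtracted from the targets in the loss.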

Loss

Squared error per example, averaged across the batch:

\ell(\hat{y}, y) = \tfrac{1}{2}(\hat{y} - y)^2.

@d2l.add_to_class(LinearRegressionScratch)
def loss(self, y_hat, y):
    l = (y_hat - y) ** 2 / 2
    return d2l.reduce_mean(l)
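
A small worked example can help check the arithmetic. The sketch below recomputes the loss in NumPy (a stand-in for the d2l helpers) on made-up predictions and targets. The function name squared_loss is hypothetical. It also shows a shape pitfall worth guarding against: if the targets arrive as a flat vector while the predictions are a column, broadcasting silently produces an (n, n) matrix instead of n per-example errors.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Mean over the batch of (y_hat - y)^2 / 2."""
    return np.mean((y_hat - y) ** 2 / 2)

y_hat = np.array([[1.0], [2.0], [4.0]])   # predictions, shape (3, 1)
y = np.array([[1.0], [0.0], [2.0]])       # targets, shape (3, 1)

# Per-example losses are 0, 2, and 2, so the batch mean is 4/3.
loss = squared_loss(y_hat, y)

# Pitfall: a target of shape (3,) broadcasts against (3, 1) into (3, 3).
mismatched_shape = (y_hat - y.reshape(-1)).shape
```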

Optimizer: minibatch SGD

The update rule \theta \leftarrow \theta - \eta \nabla_\theta L written out by hand:

class SGD(d2l.HyperParameters):
    """Minibatch stochastic gradient descent."""
    def __init__(self, lr):
        self.save_hyperparameters()

    def apply_gradients(self, grads_and_vars):
        for grad, param in grads_and_vars:
            param.assign_sub(self.lr * grad)

The model class wires it up in configure_optimizers:

@d2l.add_to_class(LinearRegressionScratch)
def configure_optimizers(self):
    return SGD(self.lr)
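
For this model the gradients have a closed form, which makes it possible to watch a single update without autograd. With the mean squared-error loss over n examples, \nabla_{\mathbf{w}} L = \tfrac{1}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} + b - \mathbf{y}) and \partial L / \partial b = \tfrac{1}{n}\sum_i (\hat{y}_i - y_i). The sketch below is a NumPy illustration under those formulas, not the d2l implementation. The helper names loss_fn, grads_fn, and sgd_step and the data sizes are assumptions.

```python
import numpy as np

def loss_fn(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2 / 2)

def grads_fn(X, y, w, b):
    """Analytic gradients of the mean squared-error loss."""
    err = X @ w + b - y            # residuals, shape (n, 1)
    grad_w = X.T @ err / len(X)    # shape (d, 1)
    grad_b = err.mean(axis=0)      # shape (1,)
    return grad_w, grad_b

def sgd_step(params, grads, lr):
    """In-place update, mirroring param.assign_sub(lr * grad)."""
    for param, grad in zip(params, grads):
        param -= lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                    # one hypothetical minibatch
y = X @ np.array([[2.0], [-3.4]]) + 4.2
w, b = np.zeros((2, 1)), np.zeros(1)

before = loss_fn(X, y, w, b)
sgd_step([w, b], grads_fn(X, y, w, b), lr=0.03)
after = loss_fn(X, y, w, b)
print(before, after)   # the loss should drop after the step
```

The update mutates the parameters in place, just as assign_sub does on a tf.Variable, so the caller never has to rebind w or b.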

Training step

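Each minibatch goes through four stages: forward pass, loss, backward pass, and parameter update. Those stages live in train_step, defined in the next section; the only hook needed here is prepare_batch, which passes the batch through unchanged: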

@d2l.add_to_class(d2l.Trainer)
def prepare_batch(self, batch):
    return batch

The whole epoch

The Trainer walks the train and val loaders once per epoch, calling the steps:

@d2l.add_to_class(d2l.Trainer)
def _compile_steps(self):
    model, optim = self.model, self.optim
    grad_clip = self.gradient_clip_val
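    # Run one batch through the model first so that any lazily created
    # variables exist before tf.function traces the steps below.
    for batch in self.train_dataloader:
        model(*self.prepare_batch(batch)[:-1], training=True)
        break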

    def train_step(batch):
        with tf.GradientTape() as tape:
            loss = model.loss(model(*batch[:-1], training=True),
                              batch[-1])
        params = model.trainable_variables
        # Fall back to the variables the tape recorded if the model lists none.
        if not params:
            params = list(tape.watched_variables())
        grads = tape.gradient(loss, params)
        if grad_clip > 0:
            grads = self.clip_gradients(grad_clip, grads)
        optim.apply_gradients(zip(grads, params))
        return loss

    def val_step(batch):
        return model(*batch[:-1], training=False)

    train_step = tf.function(train_step, reduce_retracing=True)
    val_step = tf.function(val_step, reduce_retracing=True)

    self._train_step = train_step
    self._val_step = val_step

@d2l.add_to_class(d2l.Trainer)
def fit_epoch(self):
    self.model.training = True
    for batch in self.train_dataloader:
        loss = self._train_step(self.prepare_batch(batch))
        self.model._report_train(loss)
        self.train_batch_idx += 1
    if self.val_dataloader is None:
        return
    self.model.training = False
    for batch in self.val_dataloader:
        b = self.prepare_batch(batch)
        y_hat = self._val_step(b)
        self.model._report_val(y_hat, b)
        self.val_batch_idx += 1
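
Stripped of the Trainer bookkeeping, the training half of fit_epoch reduces to a short loop. The following is a rough NumPy sketch of that loop, with hand-written gradients taking the place of tf.GradientTape. The dataset size, noise level, and batch size are assumptions chosen to resemble the synthetic dataset, not values read from the d2l code.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w, true_b = np.array([[2.0], [-3.4]]), 4.2

# Synthetic data: y = Xw + b + small Gaussian noise (sizes are assumptions).
X = rng.normal(size=(1000, 2))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=(1000, 1))

w = rng.normal(scale=0.01, size=(2, 1))
b = np.zeros(1)
lr, batch_size, max_epochs = 0.03, 32, 10

for epoch in range(max_epochs):
    order = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]        # forward pass and residual
        grad_w = X[idx].T @ err / len(idx)   # backward pass, done by hand
        grad_b = err.mean(axis=0)
        w -= lr * grad_w                     # SGD step
        b -= lr * grad_b

# Largest absolute error in w, and the error in b.
print(np.abs(w - true_w).max(), abs(b[0] - true_b))
```

Everything else in the Trainer (graph compilation, gradient clipping, validation, and metric reporting) is layered around this core.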

Run training on the synthetic dataset:

model = LinearRegressionScratch(2, lr=0.03)
data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)

Did it learn the right thing?

We know the true w and b — compare with the learned values:

print(f'error in estimating w: {data.w - d2l.reshape(model.w, data.w.shape)}')
print(f'error in estimating b: {data.b - model.b}')
error in estimating w: [0.00043726 0.00010633]
error in estimating b: [0.00016832]

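The small remaining errors come from two sources: the training set is finite and noisy, and SGD with a fixed learning rate keeps jittering around the minimum instead of settling exactly on it. Getting closer takes more data, more epochs, or a decaying learning rate.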
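
One way to gauge how much of the remaining error is due to the data rather than the optimizer is to solve the same least-squares problem exactly and see what is left. The sketch below does this with np.linalg.lstsq on data generated along the lines of the synthetic set. The 1000 examples and the noise level of 0.01 are assumptions. The leftover error is typically on the order of noise / sqrt(n), and it marks roughly how close any minimizer of the training loss can get to the true parameters on this data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w, true_b, noise = np.array([2.0, -3.4]), 4.2, 0.01

X = rng.normal(size=(1000, 2))
y = X @ true_w + true_b + rng.normal(scale=noise, size=1000)

# Append a column of ones so the bias is estimated along with the weights.
A = np.hstack([X, np.ones((len(X), 1))])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

w_err = true_w - theta[:2]
b_err = true_b - theta[2]
print(w_err, b_err)   # small but nonzero, even for the exact solution
```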

Recap

  • A Module for linear regression boils down to __init__, forward, loss, configure_optimizers.
  • A hand-rolled SGD is ~10 lines.
  • The Trainer.fit_epoch glue is what the training APIs of PyTorch, TensorFlow, JAX, and MXNet hide.
  • Synthetic data lets us check that the optimizer recovered the ground-truth parameters.