Concise Implementation of Linear Regression

Dive into Deep Learning · §2.5

The same model, the concise way
batteries-included layers, losses, and optimizers replace the hand-rolled parts.

From hand-rolled to high-level

Motivation

In the last section we wrote every piece by hand: the weight vector, the forward pass, the squared error, and the update step.

Those pieces are so universal that frameworks ship them, tuned and tested. We swap each one for its built-in counterpart:

Layer replaces w, b · loss replaces our squared error · optimizer replaces the update loop.

By hand	Built-in
`w`, `b`	a layer
MSE math	a loss
update step	an optimizer

The Model

a single linear layer

The layer already is the model

What we hand-rolled as w, b, and a matrix–vector product, every framework ships as a fully connected layer. Each input is wired to the single output, which is exactly the picture of linear regression.

The layer owns its parameters. We no longer have to allocate them, initialize them, or even know their shapes ahead of time.

One fully connected layer with a single output is linear regression.
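
As a quick check of that claim, the sketch below compares a one-output layer with the hand-written expression Xw + b. It uses the eager nn.Linear with an assumed input size of 2 instead of the lazy variant, and the tensor names are illustrative only.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A fully connected layer with 2 inputs and 1 output (sizes chosen for illustration).
layer = nn.Linear(2, 1)
X = torch.randn(5, 2)  # a minibatch of 5 examples

# The layer stores its weight with shape (out_features, in_features) = (1, 2).
w, b = layer.weight, layer.bias

by_hand = X @ w.T + b   # the hand-rolled forward pass
built_in = layer(X)     # the same computation, done by the layer

print(torch.allclose(by_hand, built_in, atol=1e-6))
```

Note the transpose: the layer keeps its weight as a row per output, so the hand-written column vector w corresponds to layer.weight.T.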

One layer, not a weight vector

LazyLinear(1) is the whole model. The lazy variant infers the input dimension at the first forward pass, so its parameters can only be initialized by hand after that pass. The class itself just wraps the layer:

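import torch
from torch import nn
from d2l import torch as d2l

class LinearRegression(d2l.Module):
    """The linear regression model implemented with high-level APIs."""
    def __init__(self, lr):
        super().__init__()
        self.save_hyperparameters()  # stores lr as self.lr
        self.net = nn.LazyLinear(1)  # one output; input size inferred later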

Lazy shape inference pays off in deep nets (conv layers, variable-length sequences) where the input size is tedious to work out.
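
To make the deferred shape concrete, here is a minimal sketch in plain PyTorch (no d2l). The input width of 3 is an arbitrary assumption; the point is only that the weight has no shape until the layer sees data.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

net = nn.LazyLinear(1)

# Before any data is seen, the weight is a placeholder without a shape.
lazy_before = isinstance(net.weight, UninitializedParameter)

# The first forward pass sees 3 input features and materializes the parameters.
X = torch.randn(4, 3)
out = net(X)

lazy_after = isinstance(net.weight, UninitializedParameter)
print(lazy_before, lazy_after, tuple(net.weight.shape), tuple(out.shape))
```

This is why any in-place initialization such as normal_ has to come after a first call on real or dummy input.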

The forward pass is a one-liner

forward just calls the layer. All the matrix–vector arithmetic we wrote by hand now lives inside it:

@d2l.add_to_class(LinearRegression)
def forward(self, X):
    return self.net(X)

Loss & Optimizer

two more pieces, off the shelf

Loss: built-in mean squared error

The framework’s MSE replaces our hand-written squared error:

@d2l.add_to_class(LinearRegression)
def loss(self, y_hat, y):
    fn = nn.MSELoss()
    return fn(y_hat, y)

It omits the factor of 1/2 that we used by hand and, by default, averages the squared errors over the minibatch.
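
The difference between the two conventions is easy to see on a tiny example. The numbers below are made up for illustration; the sketch compares nn.MSELoss with a hand-written loss that keeps the 1/2 factor.

```python
import torch
from torch import nn

y_hat = torch.tensor([[2.5], [0.0], [2.0], [8.0]])
y = torch.tensor([[3.0], [-0.5], [2.0], [7.0]])

built_in = nn.MSELoss()(y_hat, y)        # mean of the squared errors
halved = ((y_hat - y) ** 2 / 2).mean()   # hand-written version with the 1/2 factor

print(built_in.item(), halved.item())    # 0.375 0.1875
```

Because the built-in loss is twice the halved one, its gradient is twice as large too, so the same learning rate takes steps twice as big as in the hand-written version.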

Optimizer: minibatch SGD in one call

The update loop becomes a single optimizer object, handed the parameters and the learning rate:

@d2l.add_to_class(LinearRegression)
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), self.lr)

The same torch.optim family also offers momentum, Adam, and more by swapping one line.
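
To see that the optimizer object performs the same update we wrote by hand, the sketch below takes one SGD step on a toy loss and compares it with w - lr * grad. The loss and the starting values are assumptions chosen so the gradient is easy to read off.

```python
import torch

lr = 0.1
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()   # toy loss whose gradient is 2 * w
optimizer.zero_grad()   # clear any stale gradients
loss.backward()         # fills w.grad

expected = w.detach() - lr * w.grad   # the manual update, computed before stepping
optimizer.step()                      # the built-in update, applied in place

print(torch.allclose(w.detach(), expected))
```

Swapping the algorithm means changing only the constructor, for example torch.optim.SGD([w], lr=lr, momentum=0.9) or torch.optim.Adam([w], lr=lr).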

Training

the scaffold never changed

The same Trainer drives it all

Our Trainer, Module, and DataModule from the object-oriented-design section don’t care that the model is now a built-in layer.

The training loop is identical to the from-scratch version.
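
For readers without the d2l scaffold at hand, the loop can be sketched in plain PyTorch. This is not the Trainer's actual code, only an assumed equivalent: the data sizes, noise level, and batch size of 32 are illustrative choices, and the eager nn.Linear stands in for the lazy layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic data: y = Xw + b + noise (sizes and noise level are illustrative).
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2
X = torch.randn(1000, 2)
y = X @ true_w.reshape(-1, 1) + true_b + 0.01 * torch.randn(1000, 1)

net = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.03)

batch_size = 32
for epoch in range(10):
    perm = torch.randperm(len(X))            # reshuffle every epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        loss = loss_fn(net(X[idx]), y[idx])  # forward pass and loss
        optimizer.zero_grad()                # clear old gradients
        loss.backward()                      # backpropagate
        optimizer.step()                     # SGD update

w_err = (true_w - net.weight.detach().reshape(-1)).abs().max().item()
b_err = abs(true_b - net.bias.item())
print(f'max |w error|: {w_err:.4f}, |b error|: {b_err:.4f}')
```

The four lines inside the inner loop are the whole algorithm; everything else is data handling.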

Fit: same data, same curve, a fraction of the code

Same synthetic data, same ten epochs, same fit call as the linear-regression-from-scratch section:

model = LinearRegression(lr=0.03)
data = d2l.SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)
# Materialize lazy parameters before replacing their default initialization.
model(data.X[:1])
with torch.no_grad():
    model.net.weight.normal_(0, 0.01)
    model.net.bias.fill_(0)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)

Nothing in the training run distinguishes the two implementations; only the amount of code we wrote has changed.

Where the parameters live now

Training · payoff

The parameters no longer hang off our class as self.w and self.b; they live inside the layer, so get_w_b reaches through net:

@d2l.add_to_class(LinearRegression)
def get_w_b(self):
    return (self.net.weight.detach(), self.net.bias.detach())

w, b = model.get_w_b()

print(f'error in estimating w: {data.w - d2l.reshape(w, data.w.shape)}')
print(f'error in estimating b: {data.b - b}')

error in estimating w: tensor([-0.0003,  0.0008])
error in estimating b: tensor([0.0004])

Same verdict as the linear-regression-from-scratch section: the true \mathbf{w}^* = [2, -3.4], b^* = 4.2 are recovered to within a few times 10^{-4}. The built-in pieces really do compute the same thing our hand-rolled ones did.
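
Where the parameters sit can be inspected directly. The sketch below uses a hypothetical stand-in class, TinyRegression, built on plain nn.Module (it is not the d2l class), and lists the parameter names the framework registers through the net attribute.

```python
import torch
from torch import nn

class TinyRegression(nn.Module):   # hypothetical stand-in, not the d2l class
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2, 1)

    def forward(self, X):
        return self.net(X)

model = TinyRegression()

# Parameters are registered under the attribute that holds the layer.
names = [name for name, _ in model.named_parameters()]
print(names)   # ['net.weight', 'net.bias']

# detach() returns a tensor that shares data but is cut off from autograd.
w = model.net.weight.detach()
print(w.requires_grad, model.net.weight.requires_grad)   # False True
```

Detaching before reporting, as get_w_b does, keeps the comparison with the true parameters out of the computation graph.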

Summary

Wrap-up

From scratch showed what happens; concise is what we actually use day to day.
A single layer stands in for w, b; a built-in loss and optimizer replace the rest.

The Module / Trainer / DataModule scaffold is unchanged; only the model’s internals got shorter.
Same minibatch loop, same convergence: about 5 lines of model code, error on the order of 10^{-4}.