Implementation of Multilayer Perceptrons

Dive into Deep Learning · §4.2

Implementing a multilayer perceptron
One hidden layer, two ways to build it, reaching \approx 0.87 validation accuracy.

The whole model on one slide

What we are building

A batched image is flattened to 784 features, mapped by an affine layer + ReLU to a 256-dim hidden vector, then by a second affine layer to 10 logits.

One hidden layer, one nonlinearity.
Same loss, loaders, and Trainer as softmax regression.

That ReLU between the two affine maps, together with the hidden layer, is the entire difference from a linear classifier like softmax regression.
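A short derivation makes the point concrete. Write \mathbf{W}^{(1)}, \mathbf{b}^{(1)} for the first affine layer and \mathbf{W}^{(2)}, \mathbf{b}^{(2)} for the second (the labels \mathbf{W} and \mathbf{b} under the braces are introduced here only for illustration). If the ReLU is dropped, the two layers compose into a single affine map, so the hidden layer alone would add nothing over a linear classifier:

```latex
\mathbf{O}
= \bigl(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\bigr)\mathbf{W}^{(2)} + \mathbf{b}^{(2)}
= \mathbf{X}\,\underbrace{\mathbf{W}^{(1)}\mathbf{W}^{(2)}}_{\mathbf{W}}
  + \underbrace{\mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}}_{\mathbf{b}}
```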

Why these sizes?

Design choices

Fashion-MNIST: 784 inputs, 10 classes. We pick 256 hidden units, giving \approx 200\text{k} parameters.

Width 256: big enough to fit the data, small enough to train in seconds.
A power of 2 because matmul kernels tend to run more efficiently at such widths; nothing breaks at 250.
One hidden layer suffices here; spatial structure waits for convolutions.

Depth, width, and learning rate are hyperparameters: chosen by hand, not learned.

From Scratch

parameters, ReLU, and forward by hand

Parameters: two weights, two biases

import torch
from torch import nn
from d2l import torch as d2l

class MLPScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        # Hidden layer: small Gaussian weights, zero biases
        self.W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
        self.b1 = nn.Parameter(torch.zeros(num_hiddens))
        # Output layer: same scheme
        self.W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * sigma)
        self.b2 = nn.Parameter(torch.zeros(num_outputs))

Weights start as small Gaussian noise (\sigma=0.01) to break symmetry, biases at zero:

\mathbf{W}^{(1)}\in\mathbb{R}^{784\times256},\quad \mathbf{b}^{(1)}\in\mathbb{R}^{256},\qquad \mathbf{W}^{(2)}\in\mathbb{R}^{256\times10},\quad \mathbf{b}^{(2)}\in\mathbb{R}^{10}

784\cdot256 + 256 + 256\cdot10 + 10 = 203{,}530 learnable numbers.
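The count above is easy to reproduce. The sketch below uses a hypothetical helper, mlp_param_count (not part of d2l), and also evaluates the width-250 alternative mentioned earlier for comparison:

```python
def mlp_param_count(num_inputs, num_hiddens, num_outputs):
    """Count weights and biases of a one-hidden-layer MLP."""
    hidden = num_inputs * num_hiddens + num_hiddens    # W1 and b1
    output = num_hiddens * num_outputs + num_outputs   # W2 and b2
    return hidden + output

print(mlp_param_count(784, 256, 10))  # 203530
print(mlp_param_count(784, 250, 10))  # 198760
```

Almost all of the parameters sit in the first weight matrix, since 784 inputs dominate 10 outputs.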

ReLU, by hand

To see there is no magic, we write the activation ourselves rather than calling the built-in. It is just \max(x, 0), applied elementwise:

def relu(X):
    return torch.maximum(X, torch.zeros_like(X))

Zero out the negatives, pass the positives through. That one kink is what lets a stack of affine maps bend.
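As a quick sanity check, the hand-written function can be compared with the built-in on a few values. This is a minimal sketch; the test tensor is an arbitrary choice made for illustration:

```python
import torch

def relu(X):
    # Elementwise max(x, 0), as defined above
    return torch.maximum(X, torch.zeros_like(X))

X = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu(X)
print(out.tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]

# The built-in should agree exactly on the same input
matches_builtin = torch.equal(out, torch.relu(X))
print(matches_builtin)  # True
```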

The forward pass is two lines

Flatten, then an affine-ReLU, then a second affine, exactly the data flow in the diagram:

\mathbf{H} = \mathrm{ReLU}(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \qquad \mathbf{O} = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}.

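@d2l.add_to_class(MLPScratch)  # attach forward to the class defined earlier
def forward(self, X):
    X = d2l.reshape(X, (-1, self.num_inputs))   # flatten each image to 784 features
    H = relu(d2l.matmul(X, self.W1) + self.b1)  # hidden activations
    return d2l.matmul(H, self.W2) + self.b2     # logits, no activation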
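To follow the shapes through the forward pass without the d2l wrappers, here is a minimal plain-PyTorch sketch. The batch size of 32 and the random stand-in images are assumptions made for illustration; the parameter shapes and \sigma match the slides:

```python
import torch

torch.manual_seed(0)
num_inputs, num_hiddens, num_outputs, sigma = 784, 256, 10, 0.01
W1 = torch.randn(num_inputs, num_hiddens) * sigma
b1 = torch.zeros(num_hiddens)
W2 = torch.randn(num_hiddens, num_outputs) * sigma
b2 = torch.zeros(num_outputs)

X = torch.randn(32, 1, 28, 28)       # stand-in batch of 32 grayscale images
X_flat = X.reshape(-1, num_inputs)   # (32, 784)
H = torch.relu(X_flat @ W1 + b1)     # (32, 256); the bias broadcasts over the batch
O = H @ W2 + b2                      # (32, 10) logits
print(X_flat.shape, H.shape, O.shape)
```

Each bias is a 1-D tensor, so broadcasting adds it to every row of the batch.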

Result: ≈0.87 validation accuracy

The loss, the loaders, and the Trainer are unchanged from softmax regression. Only the model class is new.

Validation accuracy typically settles at \approx 0.87 over 30 epochs, a modest gain over the softmax regression baseline on the same data, bought by one hidden layer and its ReLU.

Concise

let the framework hold the parameters

The same model, declared

Stack the layers in the framework’s container. Lazy linear layers infer their input size; ReLU and Flatten come built in:

class MLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_hiddens),
                                 nn.ReLU(), nn.LazyLinear(num_outputs))

Sequential is the forward pass: apply each layer in turn. No hand-written forward, no parameter bookkeeping.

Same architecture as the diagram, in four lines of __init__ instead of four parameter tensors, a hand-written ReLU, and a forward method.
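To check that the declared stack really matches the scratch model, one option is to push a dummy batch through it so the lazy layers can infer their input sizes, then inspect the parameters. This is a sketch in plain PyTorch without the d2l wrapper; the dummy batch is an assumption made for illustration:

```python
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(), nn.LazyLinear(256),
                    nn.ReLU(), nn.LazyLinear(10))

# Lazy layers have no weight shapes until they see data;
# one dry-run batch fixes them.
X = torch.randn(2, 1, 28, 28)   # dummy batch, values irrelevant
logits = net(X)

shapes = [tuple(p.shape) for p in net.parameters()]
num_params = sum(p.numel() for p in net.parameters())
print(shapes)      # [(256, 784), (256,), (10, 256), (10,)]
print(num_params)  # 203530
```

The total agrees with the scratch count. The weight shapes appear transposed relative to W1 and W2 because PyTorch's linear layers store weights as (out_features, in_features).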

Same loop, same result

Reusing the very same trainer and data, the concise model converges just like the scratch one:

model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
trainer.fit(model, data)

Same architecture, different init (framework default vs \mathcal{N}(0, 0.01^2)), so trajectories differ slightly.
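The size of that gap can be measured directly. The sketch below assumes current PyTorch behavior, where nn.Linear draws its weights uniformly from \pm 1/\sqrt{\text{fan\_in}}, and uses nn.Linear as a stand-in for the lazy layer, on the further assumption that LazyLinear applies the same scheme once its shape is known:

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
scratch_W1 = torch.randn(784, 256) * 0.01          # N(0, 0.01^2), as in MLPScratch
default_W1 = nn.Linear(784, 256).weight.detach()   # framework default init

bound = 1 / math.sqrt(784)   # assumed default range: U(-bound, bound)
scratch_std = scratch_W1.std().item()
default_std = default_W1.std().item()
print(f"scratch std: {scratch_std:.4f}")
print(f"default std: {default_std:.4f}")
print(f"default max |w|: {default_W1.abs().max().item():.4f} (bound {bound:.4f})")
```

Under these assumptions the default first-layer weights have roughly twice the spread of the scratch ones, which is enough to make the two loss curves differ early in training.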

What’s next

from a working MLP to a reliable one

Four open questions

Where this goes

We have a working MLP. Making it reliable is the rest of this chapter:

Backprop (the forward/backward-propagation section): how gradients flow through an arbitrary stack.
Initialization (the numerical-stability section): choose \sigma so signals neither vanish nor explode through depth.
Generalization (the generalization-in-deep-learning section): why a flexible model does well on unseen data at all.
Regularization (the dropout section): dropout and friends.

Each question gets its own section, and exercise 2 hands you the cliffhanger: add a second hidden layer while keeping \sigma = 0.01, and the deeper net trains worse. The numerical-stability section explains.
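One plausible ingredient of that cliffhanger can be previewed numerically. The sketch below pushes unit-variance stand-in inputs through a hypothetical two-hidden-layer stack initialized with \sigma = 0.01 and records the spread of the pre-activations at each layer. It is an illustration under these assumptions, not the exercise solution:

```python
import torch

torch.manual_seed(0)
sigma = 0.01
X = torch.randn(256, 784)        # stand-in inputs with unit variance
widths = [784, 256, 256, 10]     # hypothetical two-hidden-layer net

H, scales = X, []
for i, (fan_in, fan_out) in enumerate(zip(widths[:-1], widths[1:])):
    Z = H @ (torch.randn(fan_in, fan_out) * sigma)    # biases are zero at init
    scales.append(Z.std().item())
    H = torch.relu(Z) if i < len(widths) - 2 else Z   # no ReLU on the logits
print([f"{s:.4f}" for s in scales])
```

With these widths the spread shrinks by roughly an order of magnitude per layer, so the signal reaching the output gets weaker as depth is added.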

Recap

Wrap-up

An MLP = a linear classifier plus a hidden layer and a nonlinearity between affine maps.
From scratch: four parameter tensors, a hand-rolled ReLU, a two-line forward. Concrete, but tedious to ship.
Concise: declare the layer stack; Sequential holds the parameters and defines the forward pass.

Both forms declare the same architecture (inits differ).
The training loop is unchanged from softmax regression (modularity paying off).
Hyperparameters (depth, width, lr) are chosen by hand, not learned; the same loop trains any setting of them.
Validation accuracy settles at \approx 0.87, a modest gain from the hidden layer and its ReLU.

Next (the forward/backward-propagation section): open the black box and see what backward() actually computes, first by hand and then verified.