Convolutional Neural Networks (LeNet)

LeNet sets the CNN template

LeNet-5 (LeCun et al., 1998, building on LeCun's 1989 convolutional network for handwritten zip codes) was among the first convolutional neural networks deployed at production scale: it read handwritten digits on U.S. bank checks. Some ATMs reportedly still run code derived from that 1990s implementation.

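It defined the architectural template that later CNNs refine: a convolutional encoder (spatial dims shrink, channels grow) feeding a dense head. ResNet and EfficientNet keep the same skeleton with different components. ViT is a Transformer rather than a CNN: it keeps the encoder + head split but not the convolutional pyramid.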

LeNet-5 architecture

LeNet-5 data flow on a 28×28 handwritten digit. Spatial dims shrink; channels grow.

Layer-by-layer

Conv1: 1→6 channels, 5×5 kernel, padding 2 (28→28)
AvgPool: stride 2 → 14×14
Conv2: 6→16 channels, 5×5, no padding → 10×10
AvgPool: stride 2 → 5×5
Flatten → 16·5·5 = 400 → 120 → 84 → 10

Two conv→sigmoid→avgpool blocks, three FC layers, 10 logits.
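
The spatial sizes in the list above follow from the standard output-size formula, out = floor((n + 2·padding − kernel) / stride) + 1. A minimal sketch in plain Python (no framework needed); out_size, LAYERS and trace are illustrative names, not part of d2l, and the pooling window is assumed to be 2×2:

```python
# Sketch: reproduce the layer-by-layer sizes with the output-size formula
#   out = floor((n + 2*padding - kernel) / stride) + 1
# `out_size`, `LAYERS` and `trace` are illustrative names, not d2l APIs.

def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, output channels; None = channels unchanged)
LAYERS = [
    ("Conv1",   5, 1, 2, 6),
    ("AvgPool", 2, 2, 0, None),   # assumed 2x2 window, stride 2
    ("Conv2",   5, 1, 0, 16),
    ("AvgPool", 2, 2, 0, None),
]

def trace(n=28, channels=1):
    """Return [(name, channels, size), ...] for a square n x n input."""
    shapes = []
    for name, k, s, p, out_ch in LAYERS:
        n = out_size(n, k, s, p)
        channels = out_ch if out_ch is not None else channels
        shapes.append((name, channels, n))
    return shapes

shapes = trace()
for name, c, n in shapes:
    print(f"{name:8s} -> {c} x {n} x {n}")

# Size of the vector handed to the first dense layer
flat = shapes[-1][1] * shapes[-1][2] ** 2
print("flatten  ->", flat)
```

Padding 2 on a 5×5 kernel is what keeps Conv1 at 28×28; without it the first conv would already shrink the map to 24×24.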

Compressed view

Same network, vertical schematic (the textbook version):

Compact LeNet-5 schematic.

Two takeaways

Pyramid shape: spatial size halves at each pool while channels grow (1→6→16). Most successor CNNs preserve this.
The flatten is the parameter bottleneck: 400 × 120 = 48,000 weights (plus 120 biases) connect the conv block to the first dense layer, the bulk of LeNet's roughly 60k parameters. Many modern CNNs replace the dense stack with global average pooling followed by a single linear layer, which is much cheaper.
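
To see how much of the network sits in that first dense layer, the parameters can be counted by hand. A rough sketch, assuming every layer has a bias term; conv_params and dense_params are illustrative helpers, and the "GAP head" is a hypothetical variant of this LeNet (average each of the 16 channels to one value, then a single 16→10 linear layer), not something the chapter builds:

```python
# Sketch: per-layer parameter counts for the LeNet described above.
# Helper names are illustrative; biases are assumed on every layer.

def conv_params(c_in, c_out, k):
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

lenet = {
    "Conv1":  conv_params(1, 6, 5),
    "Conv2":  conv_params(6, 16, 5),
    "Dense1": dense_params(16 * 5 * 5, 120),  # the flatten -> 120 layer
    "Dense2": dense_params(120, 84),
    "Dense3": dense_params(84, 10),
}
total = sum(lenet.values())

for name, n in lenet.items():
    print(f"{name:7s} {n:6d}  ({n / total:.0%} of total)")
print("total  ", total)

# Dense head as built vs. a hypothetical global-average-pool head
dense_head = lenet["Dense1"] + lenet["Dense2"] + lenet["Dense3"]
gap_head = dense_params(16, 10)  # 16 pooled channel means -> 10 logits
print("dense head:", dense_head, "| hypothetical GAP head:", gap_head)
```

The two conv layers together hold only a few thousand parameters; almost everything else is the dense head, which is why swapping it for pooling plus one linear layer is such a large saving.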

Implementation setup

Translating the figure into a Sequential is almost mechanical. Xavier initialization scales the initial weights by layer width, so the sigmoid layers are less likely to saturate early in training:

from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()
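
What the Xavier scaling does can be sketched without any framework. The version below is the uniform (Glorot) variant, with weights drawn from U(−a, a) and a = sqrt(6 / (fan_in + fan_out)); this is assumed to correspond to the defaults of init.Xavier(), so check the MXNet documentation before relying on that. Function names are illustrative:

```python
# Sketch of Xavier (Glorot) uniform initialization for one dense layer.
# Assumption: uniform variant with bound sqrt(6 / (fan_in + fan_out)).
import math
import random

def xavier_uniform_bound(fan_in, fan_out):
    """Half-width a of the uniform distribution U(-a, a)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def xavier_uniform(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix as a list of rows."""
    a = xavier_uniform_bound(fan_in, fan_out)
    return [[rng.uniform(-a, a) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
fan_in, fan_out = 400, 120        # the flatten -> 120 layer of LeNet
W = xavier_uniform(fan_in, fan_out, rng)

bound = xavier_uniform_bound(fan_in, fan_out)
flat_w = [w for row in W for w in row]
sample_var = sum(w * w for w in flat_w) / len(flat_w)  # mean is ~0
target_var = 2.0 / (fan_in + fan_out)                  # a^2 / 3

print(f"bound a          = {bound:.4f}")
print(f"sample variance  = {sample_var:.5f}")
print(f"target 2/(in+out) = {target_var:.5f}")
```

Wider layers get smaller weights, so the pre-activations stay in the range where the sigmoid still has a usable slope.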

LeNet in code

LeNet initialization

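class LeNet(d2l.Classifier):
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential()
        self.net.add(
            # Block 1: 1→6 channels; padding=2 keeps 28×28, pooling halves to 14×14
            nn.Conv2D(channels=6, kernel_size=5, padding=2,
                      activation='sigmoid'),
            nn.AvgPool2D(pool_size=2, strides=2),
            # Block 2: 6→16 channels; no padding gives 10×10, pooling halves to 5×5
            nn.Conv2D(channels=16, kernel_size=5, activation='sigmoid'),
            nn.AvgPool2D(pool_size=2, strides=2),
            # Dense head: Gluon's Dense flattens its input by default,
            # so no explicit flatten layer is needed (16·5·5 = 400 inputs)
            nn.Dense(120, activation='sigmoid'),
            nn.Dense(84, activation='sigmoid'),
            nn.Dense(num_classes))  # raw logits, no activation
        self.net.initialize(init.Xavier())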

Tracing shapes through the network

Critical debugging tool: walk a dummy (1, 1, 28, 28) input through the layers and print the shape after each. Match this against the figure to verify the architecture is wired correctly:

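@d2l.add_to_class(d2l.Classifier)
def layer_summary(self, X_shape):
    """Print the output shape of every layer for a random input of shape X_shape."""
    X = d2l.randn(*X_shape)
    for layer in self.net:
        X = layer(X)  # feed the previous layer's output forward
        print(layer.__class__.__name__, 'output shape:\t', X.shape)

model = LeNet()
model.layer_summary((1, 1, 28, 28))  # (batch, channels, height, width)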

Conv2D output shape:     (1, 6, 28, 28)
AvgPool2D output shape:  (1, 6, 14, 14)
Conv2D output shape:     (1, 16, 10, 10)
AvgPool2D output shape:  (1, 16, 5, 5)
Dense output shape:  (1, 120)
Dense output shape:  (1, 84)
Dense output shape:  (1, 10)

Confirms 28→28→14→10→5→flatten→120→84→10: exactly the pyramid in the diagram.

Training on Fashion-MNIST

Cross-entropy loss + SGD + 10 epochs. Same Trainer API as every previous chapter; only the model changes:

trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
trainer.fit(model, data)

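LeNet’s convolutional inductive bias makes it competitive with the dense MLP from the previous chapter on the same data, even with 1990s components (sigmoid, average pooling) that slow convergence. Compare the validation curves of the two models rather than assuming a margin.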

What 30 years of progress changed

LeNet’s 1998 architecture vs. modern best practice:

LeNet (1998)	Modern (2020s)
sigmoid activation	ReLU / GELU
average pooling	max pool / strided conv
no normalization	BatchNorm / LayerNorm
dense head	global average pool + 1 linear
Xavier init	He init
5 layers, ~60k params	50+ layers, millions of params

Each substitution gets its own section in the next chapter, Modern CNNs; He init already appeared in the builder’s guide. The skeleton (conv encoder + head) is unchanged.
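
One reason behind the first row of the table (sigmoid replaced by ReLU) can be checked numerically: the sigmoid's derivative never exceeds 0.25, so each sigmoid layer shrinks the backpropagated gradient, while ReLU passes it through unchanged for positive inputs. A small sketch in plain Python; the function names are illustrative, and the four-layer product only bounds the activation factors (it ignores the weights):

```python
# Sketch: compare the local gradients of sigmoid and ReLU.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """d/dx sigmoid(x) = s * (1 - s); maximal (0.25) at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """d/dx max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0):
    print(f"x = {x:3.1f}  sigmoid' = {sigmoid_grad(x):.4f}  relu' = {relu_grad(x):.1f}")

# LeNet applies a sigmoid after conv1, conv2, dense1 and dense2.
# Upper bound on the product of those four activation derivatives:
bound = sigmoid_grad(0.0) ** 4
print("best-case product over 4 sigmoid layers:", bound)
```

Even in the best case the activation derivatives alone scale the gradient reaching the first conv layer by less than half a percent, which is part of why deeper networks moved away from sigmoid.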

Recap

LeNet-5 = among the first CNNs that worked at production scale.
Architectural template: conv encoder (spatial ↓, channels ↑) → flatten → dense head.
Same template scales up to ResNet and EfficientNet: the modern variants change components, not the shape. ViT is a Transformer, not a CNN; it shares only the encoder + head split.
Competitive with MLPs on the same data despite 1990s components: convolutional inductive bias is a real win.
The next chapter swaps every component for its modern equivalent and goes much deeper.