from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()
LeNet-5 (Yann LeCun et al., 1989; productionized 1998) was one of the first convolutional neural networks deployed at production scale, reading handwritten digits on U.S. bank checks. Some ATMs reportedly still run derivatives of the original code today.
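It defined the architectural template that later CNNs refine: a convolutional encoder (spatial dims shrink, channels grow) feeding a dense head. ResNet and EfficientNet keep the same skeleton with different components; ViT keeps the encoder-plus-head split but replaces convolutions with attention over image patches.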
LeNet-5 data flow on a 28×28 handwritten digit. Spatial dims shrink; channels grow.
Two conv→sigmoid→avgpool blocks, three FC layers, 10 logits.
Compact LeNet-5 schematic: the same network drawn vertically (the textbook version).
The second conv block outputs 16 × 5 × 5 = 400 features, so the connection to the first dense layer alone has 400 × 120 = 48,000 weights. Modern CNNs replace the dense stack with global average pooling, which needs far fewer parameters.
Translating the figure into a Sequential is almost mechanical. Xavier init helps keep the sigmoid layers from saturating early in training:
class LeNet(d2l.Classifier):
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential()
        self.net.add(
            # Block 1: padding=2 keeps 28x28; pooling halves it to 14x14
            nn.Conv2D(channels=6, kernel_size=5, padding=2,
                      activation='sigmoid'),
            nn.AvgPool2D(pool_size=2, strides=2),
            # Block 2: unpadded 5x5 conv gives 10x10; pooling halves it to 5x5
            nn.Conv2D(channels=16, kernel_size=5, activation='sigmoid'),
            nn.AvgPool2D(pool_size=2, strides=2),
            # Dense head: Gluon's Dense flattens its 16x5x5 input to 400 features
            nn.Dense(120, activation='sigmoid'),
            nn.Dense(84, activation='sigmoid'),
            nn.Dense(num_classes))
        self.net.initialize(init.Xavier())
A critical debugging tool: walk a dummy (1, 1, 28, 28) input through the layers and print the shape after each. Match the output against the figure to verify that the architecture is wired correctly:
This confirms the spatial sizes 28→28→14→10→5, followed by flatten→120→84→10 features, which is exactly the pyramid in the diagram.
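The same shape sequence can be reproduced with arithmetic alone, which is a handy cross-check when no framework is at hand. The sketch below is not the chapter's own shape-printing code: `window_out`, `LENET_LAYERS`, and `trace_shapes` are hypothetical names introduced here, and the layer list is a hand-written mirror of the `LeNet` class, assuming square kernels and the usual output-size formula `(size + 2 * padding - kernel) // stride + 1`.

```python
def window_out(size, kernel, padding=0, stride=1):
    """Output length along one spatial axis for a conv or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical plain-Python description of the layers in the LeNet class.
LENET_LAYERS = [
    ("Conv2D", {"channels": 6, "kernel": 5, "padding": 2, "stride": 1}),
    ("AvgPool2D", {"kernel": 2, "padding": 0, "stride": 2}),
    ("Conv2D", {"channels": 16, "kernel": 5, "padding": 0, "stride": 1}),
    ("AvgPool2D", {"kernel": 2, "padding": 0, "stride": 2}),
    ("Dense", {"units": 120}),
    ("Dense", {"units": 84}),
    ("Dense", {"units": 10}),
]


def trace_shapes(input_shape, layers):
    """Return [(layer_name, output_shape), ...] without running any network."""
    shape = tuple(input_shape)
    trace = []
    for name, cfg in layers:
        if name == "Dense":
            # A dense layer flattens everything after the batch axis.
            shape = (shape[0], cfg["units"])
        else:
            batch, channels, height, width = shape
            height = window_out(height, cfg["kernel"], cfg["padding"], cfg["stride"])
            width = window_out(width, cfg["kernel"], cfg["padding"], cfg["stride"])
            # Pooling keeps the channel count; a convolution sets a new one.
            channels = cfg.get("channels", channels)
            shape = (batch, channels, height, width)
        trace.append((name, shape))
    return trace


for name, shape in trace_shapes((1, 1, 28, 28), LENET_LAYERS):
    print(f"{name:10s} output shape: {shape}")
```

If the traced shapes ever disagree with what the real network prints, the mismatch usually points to a wrong padding or stride in one of the two descriptions.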
Training uses cross-entropy loss, SGD, and 10 epochs. The Trainer API is the same as in every previous chapter; only the model changes:
LeNet, with its convolutional inductive bias, clearly outperforms the dense MLP from the previous chapter on the same data, even with 1990s components (sigmoid, average pooling).
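One way to make the inductive bias concrete is to count what weight sharing saves. The sketch below compares LeNet's first convolution with a hypothetical dense layer that maps the same 28×28 input to the same 6×28×28 output. It illustrates parameter economy only and does not by itself explain the accuracy gap; `conv2d_params` and `dense_params` are helper names introduced here for illustration.

```python
def conv2d_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square-kernel 2D convolution."""
    return out_channels * in_channels * kernel * kernel + out_channels


def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# First LeNet layer: 1 input channel, 6 output channels, 5x5 kernel.
conv_first = conv2d_params(1, 6, 5)

# Hypothetical dense layer producing the same 6x28x28 output from 28x28 pixels.
dense_equivalent = dense_params(28 * 28, 6 * 28 * 28)

print("conv parameters: ", conv_first)
print("dense parameters:", dense_equivalent)
print("ratio:           ", dense_equivalent // conv_first)
```

The convolution reuses the same 5×5 filters at every position, so its parameter count is independent of the image size, while the dense layer's count grows with the product of input and output sizes.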
LeNet’s 1998 architecture vs. modern best practice:
| LeNet (1998) | Modern (2020s) |
|---|---|
| sigmoid activation | ReLU / GELU |
| average pooling | max pool / strided conv |
| no normalization | BatchNorm / LayerNorm |
| Xavier init | He init |
| 5 layers, ~60k params | 50+ layers, millions of params |
| dense head | global average pool + 1 linear |
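
The "~60k params" entry and the cost of the dense head can be checked by hand. The following sketch counts weights and biases per layer from the layer sizes in the `LeNet` class. The global-average-pooling head at the end is a hypothetical variant (pooling LeNet's 16 channels and attaching a single linear layer), included only to show the scale of the saving; real modern networks pool over many more channels.

```python
def conv2d_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square-kernel 2D convolution."""
    return out_channels * in_channels * kernel * kernel + out_channels


def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


lenet_params = {
    "conv1": conv2d_params(1, 6, 5),
    "conv2": conv2d_params(6, 16, 5),
    "dense1": dense_params(16 * 5 * 5, 120),  # 400 flattened features
    "dense2": dense_params(120, 84),
    "dense3": dense_params(84, 10),
}
total = sum(lenet_params.values())
dense_head = sum(v for k, v in lenet_params.items() if k.startswith("dense"))

# Hypothetical replacement head: global average pool (no parameters)
# over 16 channels, followed by one linear layer to 10 classes.
gap_head = dense_params(16, 10)

for name, count in lenet_params.items():
    print(f"{name:7s} {count:6d}")
print(f"total   {total:6d}")
print(f"dense head: {dense_head}, GAP head: {gap_head}")
```

Almost all of the parameters sit in the dense head, and most of those in the single 400-to-120 connection, which is why swapping the head changes the parameter budget so much.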
Each substitution is the subject of a section in the next chapter (Modern CNNs). The skeleton — conv encoder + head — is unchanged.
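To make one row of the table concrete, the sketch below compares the weight scale that Xavier and He initialization would assign to each LeNet layer. It assumes the standard Glorot uniform form, with bound sqrt(6 / (fan_in + fan_out)), and the He normal form, with standard deviation sqrt(2 / fan_in); whether a given framework's defaults match these exactly is an assumption to verify against its documentation. The function names and the `LAYER_FANS` list are introduced here for illustration.

```python
import math


def xavier_uniform_bound(fan_in, fan_out):
    """Half-width a of the Uniform(-a, a) Glorot/Xavier distribution."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_std(fan_in, fan_out):
    """Standard deviation of Uniform(-a, a), which is a / sqrt(3)."""
    return xavier_uniform_bound(fan_in, fan_out) / math.sqrt(3.0)


def he_std(fan_in):
    """Standard deviation of He (Kaiming) normal initialization."""
    return math.sqrt(2.0 / fan_in)


# (name, fan_in, fan_out); for convolutions, fan = channels * kernel area.
LAYER_FANS = [
    ("conv1", 1 * 25, 6 * 25),
    ("conv2", 6 * 25, 16 * 25),
    ("dense1", 400, 120),
    ("dense2", 120, 84),
    ("dense3", 84, 10),
]

for name, fan_in, fan_out in LAYER_FANS:
    print(f"{name:7s} xavier std={xavier_std(fan_in, fan_out):.4f} "
          f"he std={he_std(fan_in):.4f}")
```

Xavier balances fan-in and fan-out, which suits activations like sigmoid that are roughly linear near zero; He scales by fan-in only and is designed around ReLU zeroing about half of its inputs.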