LeNet in code

Convolutional Neural Networks (LeNet)

LeNet sets the CNN template

LeNet-5 (LeCun et al., 1998), the culmination of CNN work that began with LeCun et al. (1989), was the first convolutional neural network deployed at production scale: it read handwritten digits on U.S. bank checks. Some ATMs reportedly still run code derived from the original implementation.

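It defined the architectural template that later CNNs refine: a convolutional encoder (spatial dims shrink, channels grow) feeding a dense head. ResNet and EfficientNet keep the same skeleton with different components; ViT keeps the encoder-plus-head split but replaces convolutions with attention and does not shrink the spatial grid.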

LeNet-5 architecture

LeNet-5 data flow on a 28×28 handwritten digit. Spatial dims shrink; channels grow.

Layer-by-layer

  • Conv1: 1→6 channels, 5×5 kernel, padding 2 (28→28)
  • AvgPool: stride 2 → 14×14
  • Conv2: 6→16 channels, 5×5, no padding → 10×10
  • AvgPool: stride 2 → 5×5
  • Flatten → 16·5·5 = 400 → 120 → 84 → 10

Two conv→sigmoid→avgpool blocks, three FC layers, 10 logits.
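
The sizes in the list above follow from the standard output-size rule floor((n + 2p − k) / s) + 1. A minimal sketch that reproduces them; the helper name conv_out is made up for illustration, and the 2×2 pooling window is taken from the implementation further down:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a conv or pooling window along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28
trace = [size]
# (kernel, stride, padding) for Conv1, AvgPool, Conv2, AvgPool
for kernel, stride, padding in [(5, 1, 2), (2, 2, 0), (5, 1, 0), (2, 2, 0)]:
    size = conv_out(size, kernel, stride, padding)
    trace.append(size)

flat = 16 * size * size  # 16 channels times the final 5x5 map
print(trace, flat)  # [28, 28, 14, 10, 5] 400
```

Padding 2 on a 5×5 kernel is what keeps Conv1 at 28×28; Conv2 has no padding, so it loses 4 pixels per axis.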

Compressed view

Same network, vertical schematic — the textbook version:

Compact LeNet-5 schematic.

Two takeaways

  • Pyramid shape: spatial size halves at each pool while channels grow (1→6→16). Most successor CNNs keep this shape.
  • The bottleneck is the flatten: 400 × 120 = 48,000 weights connect the conv block to the first dense layer. Modern CNNs replace the dense stack with global average pooling, which is much cheaper.
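
To put the second takeaway in numbers, here is a rough parameter count (weights plus biases) done by hand in plain Python. The helper names are illustrative, and the 16→10 "gap_head" is a hypothetical global-average-pool head used only for comparison, not part of LeNet:

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layers = {
    "conv1": conv_params(1, 6, 5),
    "conv2": conv_params(6, 16, 5),
    "fc1": linear_params(400, 120),  # the flatten -> 120 layer
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}
total = sum(layers.values())
fc1_share = layers["fc1"] / total

# Hypothetical alternative: average each of the 16 maps, then one linear layer.
gap_head = linear_params(16, 10)

print(total)                # 61706
print(f"{fc1_share:.1%}")   # 78.0%
print(gap_head)             # 170
```

Under these assumptions the first dense layer alone holds roughly four fifths of the network's parameters.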

Implementation setup

Almost mechanical translation from the figure to a Sequential. Xavier init keeps the sigmoid layers from saturating early in training:

import torch
from torch import nn
from d2l import torch as d2l


def init_cnn(module):
    """Initialize weights for CNNs."""
    # Xavier-initialize only the layers that carry weight matrices or kernels.
    if type(module) in (nn.Linear, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight)
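
What Xavier uniform init does, as a standard-library sketch: weights are drawn from U(−a, a) with a = sqrt(6 / (fan_in + fan_out)), which gives variance 2 / (fan_in + fan_out). The helper names below are illustrative and not part of PyTorch or d2l; the gain is assumed to be 1, the default of nn.init.xavier_uniform_:

```python
import math
import random


def xavier_bound(fan_in, fan_out, gain=1.0):
    """Half-width a of the uniform distribution U(-a, a)."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def conv_fans(c_in, c_out, k):
    """Fan-in and fan-out of a k x k convolution kernel."""
    return c_in * k * k, c_out * k * k


random.seed(0)
fan_in, fan_out = 400, 120            # the flatten -> 120 dense layer
a = xavier_bound(fan_in, fan_out)     # roughly 0.107
weights = [random.uniform(-a, a) for _ in range(fan_in * fan_out)]

# Empirical variance (the mean is zero by symmetry) against the target.
var = sum(w * w for w in weights) / len(weights)
target = 2.0 / (fan_in + fan_out)
print(f"bound={a:.4f} var={var:.6f} target={target:.6f}")
```

Keeping the weight variance near 2 / (fan_in + fan_out) keeps pre-activations small, so the sigmoids start in their near-linear range rather than on the flat tails.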

LeNet initialization

class LeNet(d2l.Classifier):
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(num_classes))

Tracing shapes through the network

Critical debugging tool: walk a dummy (1, 1, 28, 28) input through the layers and print the shape after each. Match this against the figure to verify the architecture is wired correctly:

@d2l.add_to_class(d2l.Classifier)
def layer_summary(self, X_shape):
    X = d2l.randn(*X_shape)
    for layer in self.net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)
        
model = LeNet()
model.layer_summary((1, 1, 28, 28))

Output:

Conv2d output shape:     torch.Size([1, 6, 28, 28])
Sigmoid output shape:    torch.Size([1, 6, 28, 28])
AvgPool2d output shape:  torch.Size([1, 6, 14, 14])
Conv2d output shape:     torch.Size([1, 16, 10, 10])
Sigmoid output shape:    torch.Size([1, 16, 10, 10])
AvgPool2d output shape:  torch.Size([1, 16, 5, 5])
Flatten output shape:    torch.Size([1, 400])
Linear output shape:     torch.Size([1, 120])
Sigmoid output shape:    torch.Size([1, 120])
Linear output shape:     torch.Size([1, 84])
Sigmoid output shape:    torch.Size([1, 84])
Linear output shape:     torch.Size([1, 10])

Confirms 28→28→14→10→5 for the spatial size, then flatten (400)→120→84→10 for the dense head: exactly the pyramid in the diagram.
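
The same check can be turned into assertions without d2l. A sketch in plain PyTorch, assuming explicit input sizes (1 input channel, 16·5·5 flattened features) in place of the lazy layers; the expected list is copied from the printed trace:

```python
import torch
from torch import nn

# Plain-PyTorch stand-in for the d2l-based model, with explicit input sizes.
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

# One expected output shape per layer, in order.
expected = [
    (1, 6, 28, 28), (1, 6, 28, 28), (1, 6, 14, 14),
    (1, 16, 10, 10), (1, 16, 10, 10), (1, 16, 5, 5),
    (1, 400), (1, 120), (1, 120), (1, 84), (1, 84), (1, 10),
]

X = torch.randn(1, 1, 28, 28)
with torch.no_grad():
    for layer, shape in zip(net, expected):
        X = layer(X)
        # Fail loudly at the first layer whose output shape is off.
        assert tuple(X.shape) == shape, (layer, tuple(X.shape))
```

A wiring mistake (say, dropping padding=2 from the first conv) then fails at the offending layer instead of surfacing later as a shape error in the dense head.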

Training on Fashion-MNIST

Cross-entropy loss + SGD + 10 epochs. Same Trainer API as every previous chapter — only the model changes:

trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], init_cnn)
trainer.fit(model, data)

LeNet’s convolutional inductive bias clearly beats the dense MLP from the previous chapter on the same data — even with 1990s components (sigmoid, average pooling).

What 30 years of progress changed

LeNet as implemented above vs. modern best practice (the Xavier init used here postdates the 1998 paper):

LeNet (1998)           | Modern (2020s)
sigmoid activation     | ReLU / GELU
average pooling        | max pool / strided conv
no normalization       | BatchNorm / LayerNorm
Xavier init            | He init
5 layers, ~60k params  | 50+ layers, millions of params
dense head             | global average pool + 1 linear

Each substitution is the subject of a section in the next chapter (Modern CNNs). The skeleton — conv encoder + head — is unchanged.
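
To make the table concrete, here is a hypothetical LeNet-sized network with the substitutions applied in plain PyTorch. The names modern_lenet and init_he are made up for illustration; this is a sketch of the swaps, not a tuned or benchmarked model:

```python
import torch
from torch import nn


def modern_lenet(num_classes=10):
    """Hypothetical LeNet-sized net with the modern substitutions applied."""
    return nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.BatchNorm2d(6), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.AdaptiveAvgPool2d(1),  # global average pool: (N, 16, 5, 5) -> (N, 16, 1, 1)
        nn.Flatten(),
        nn.Linear(16, num_classes),  # single linear layer replaces the dense stack
    )


def init_he(module):
    """He (Kaiming) init for layers with weight matrices or kernels."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")


net = modern_lenet()
net.apply(init_he)
logits = net(torch.randn(2, 1, 28, 28))
print(logits.shape)  # torch.Size([2, 10])
```

The encoder-plus-head skeleton is untouched; only the components inside it change.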

Recap

  • LeNet-5 = first CNN that worked at production scale.
  • Architectural template: conv encoder (spatial ↓, channels ↑) → flatten → dense head.
  • Same template scales up to ResNet and EfficientNet: the modern variants change components, not the shape. ViT keeps the encoder-plus-head split but is not a convolutional pyramid.
  • Beats MLPs on the same data — convolutional inductive bias is a real win.
  • The next chapter swaps every component for its modern equivalent and goes much deeper.