LeNet in code

Convolutional Neural Networks (LeNet)

LeNet sets the CNN template

LeNet-5 (Yann LeCun et al.; first version 1989, productionized 1998) was among the first convolutional neural networks deployed at production scale, reading handwritten digits on U.S. bank checks. Some ATMs reportedly still run derivatives of the original C++ today.

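It defined the architectural template that later CNNs refine: a convolutional encoder (spatial dims shrink, channels grow) feeding a dense head. ResNet and EfficientNet keep the same skeleton with different components. ViT, a transformer rather than a CNN, swaps out the convolutional encoder but keeps the encoder-plus-head split.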

LeNet-5 architecture

LeNet-5 data flow on a 28×28 handwritten digit. Spatial dims shrink; channels grow.

Layer-by-layer

  • Conv1: 1→6 channels, 5×5 kernel, padding 2 (28→28)
  • AvgPool: stride 2 → 14×14
  • Conv2: 6→16 channels, 5×5, no padding → 10×10
  • AvgPool: stride 2 → 5×5
  • Flatten → 16·5·5 = 400 → 120 → 84 → 10

Two conv→sigmoid→avgpool blocks, three FC layers, 10 logits.
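
The sizes in the list follow from the standard output-size formula floor((n + 2p − k) / s) + 1, for input size n, kernel k, padding p, and stride s. A minimal sketch in plain Python, assuming a 2×2 pooling window. The helper names `out_size` and `trace_sizes` are made up for illustration and are not part of d2l or Keras:

```python
def out_size(n, kernel, padding=0, stride=1):
    """Spatial output size of a conv or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, padding, stride) for the encoder in the list above.
# The 2x2 pooling window is an assumption; the list only states stride 2.
LENET_ENCODER = [
    ("Conv1", 5, 2, 1),
    ("AvgPool1", 2, 0, 2),
    ("Conv2", 5, 0, 1),
    ("AvgPool2", 2, 0, 2),
]

def trace_sizes(n, layers=LENET_ENCODER):
    """Return the spatial size after each layer, starting from n."""
    sizes = []
    for _name, kernel, padding, stride in layers:
        n = out_size(n, kernel, padding, stride)
        sizes.append(n)
    return sizes

print(trace_sizes(28))  # [28, 14, 10, 5]
```

Padding 2 with a 5×5 kernel is what keeps Conv1 at 28×28. Without it the first layer would already shrink the image to 24×24.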

Compressed view

Same network, vertical schematic — the textbook version:

Compact LeNet-5 schematic.

Two takeaways

  • Pyramid shape: spatial resolution halves at each pool while the channel count grows (1→6→16). Most successor CNNs preserve this.
  • The bottleneck is the flatten: 400 × 120 = 48,000 weights from the conv block to the first dense layer, the bulk of the network's roughly 60k parameters. Modern CNNs replace the dense stack with global average pooling, which is much cheaper.
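
To make the bottleneck concrete, the parameter count can be tallied layer by layer. This is a back-of-the-envelope sketch that counts weights plus biases. The helper names are illustrative, and the global-average-pooling head at the end is a hypothetical replacement, not something LeNet-5 contains:

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

lenet = {
    "conv1": conv_params(1, 6, 5),          # 156
    "conv2": conv_params(6, 16, 5),         # 2,416
    "fc1": dense_params(16 * 5 * 5, 120),   # 48,120
    "fc2": dense_params(120, 84),           # 10,164
    "fc3": dense_params(84, 10),            # 850
}
total = sum(lenet.values())                 # 61,706
fc1_share = lenet["fc1"] / total            # about 0.78

# Hypothetical alternative head: global average pooling over each 5x5 map
# leaves 16 features, followed by one linear layer to 10 classes.
gap_head = dense_params(16, 10)             # 170

print(f"total={total}, fc1 share={fc1_share:.2f}, GAP head={gap_head}")
```

Under these assumptions, the first dense layer alone holds roughly three quarters of all parameters, while the two convolutions together hold under 5%.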

Implementation setup

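The figure translates almost mechanically into a Sequential. Keras Conv2D and Dense layers default to Xavier (Glorot) uniform initialization, which keeps the sigmoid layers from saturating early in training, so no explicit initializer is needed: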

import tensorflow as tf
from d2l import tensorflow as d2l

LeNet initialization

class LeNet(d2l.Classifier):
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(filters=6, kernel_size=5,
                                   activation='sigmoid', padding='same'),
            tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
            tf.keras.layers.Conv2D(filters=16, kernel_size=5,
                                   activation='sigmoid'),
            tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(120, activation='sigmoid'),
            tf.keras.layers.Dense(84, activation='sigmoid'),
            tf.keras.layers.Dense(num_classes)])
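
For reference, Xavier (Glorot) uniform initialization draws each weight from U(−a, a) with a = sqrt(6 / (fan_in + fan_out)). The sketch below only computes that bound for two of the layers above. It assumes the usual convention that a convolution's fan is the kernel area times the channel count, and the helper names are illustrative rather than Keras API:

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Bound a such that weights are drawn from U(-a, a)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def conv_fans(c_in, c_out, k):
    """Fan-in and fan-out of a k x k convolution (kernel area x channels)."""
    return k * k * c_in, k * k * c_out

conv1_limit = glorot_uniform_limit(*conv_fans(1, 6, 5))  # about 0.185
fc1_limit = glorot_uniform_limit(400, 120)               # about 0.107

print(conv1_limit, fc1_limit)
```

Wider layers get a smaller bound, which keeps pre-activations near zero, where the sigmoid still has a usable gradient.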

Tracing shapes through the network

Critical debugging tool: walk a dummy (1, 28, 28, 1) input (batch, height, width, channels; TensorFlow is channels-last) through the layers and print the shape after each. Match this against the figure to verify the architecture is wired correctly:

@d2l.add_to_class(d2l.Classifier)
def layer_summary(self, X_shape):
    X = d2l.normal(X_shape)
    for layer in self.net.layers:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)

model = LeNet()
model.layer_summary((1, 28, 28, 1))
Conv2D output shape:     (1, 28, 28, 6)
AveragePooling2D output shape:   (1, 14, 14, 6)
Conv2D output shape:     (1, 10, 10, 16)
AveragePooling2D output shape:   (1, 5, 5, 16)
Flatten output shape:    (1, 400)
Dense output shape:  (1, 120)
Dense output shape:  (1, 84)
Dense output shape:  (1, 10)

Confirms 28→28→14→10→5→flatten→120→84→10 — exactly the pyramid in the diagram.

Training on Fashion-MNIST

Cross-entropy loss + SGD + 10 epochs. Same Trainer API as every previous chapter — only the model changes:

trainer = d2l.Trainer(max_epochs=10)
data = d2l.FashionMNIST(batch_size=128)
with d2l.try_gpu():
    model = LeNet(lr=0.1)
    trainer.fit(model, data)

LeNet’s convolutional inductive bias clearly beats the dense MLP from the previous chapter on the same data — even with 1990s components (sigmoid, average pooling).
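
The final Dense layer has no activation, so the network emits raw logits and the loss is expected to apply the softmax itself. A minimal pure-Python sketch of cross-entropy computed from logits follows. It illustrates the formula only, is not the d2l or Keras implementation, and the function name is made up:

```python
import math

def cross_entropy_from_logits(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Ten equal logits give a uniform prediction, so the loss is log(10).
untrained = cross_entropy_from_logits([0.0] * 10, label=3)

# A confident, correct prediction gives a loss close to zero.
confident = cross_entropy_from_logits([0.0] * 9 + [20.0], label=9)

print(untrained, confident)
```

A training loss that sits near log(10) ≈ 2.30 means the model is still predicting uniformly over the 10 classes, a useful sanity check in the first epochs of a sigmoid network.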

What 30 years of progress changed

LeNet’s 1998 architecture vs. modern best practice:

LeNet (1998)             Modern (2020s)
sigmoid activation       ReLU / GELU
average pooling          max pool / strided conv
no normalization         BatchNorm / LayerNorm
Xavier init              He init
5 layers, ~60k params    50+ layers, millions of params
dense head               global average pool + 1 linear

Each substitution is the subject of a section in the next chapter (Modern CNNs). The skeleton — conv encoder + head — is unchanged.
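
As a small illustration of one row in the comparison, the sketch below applies 2×2, stride-2 average and max pooling to the same toy feature map. The grid values are invented for the example and the helper `pool2x2` is hypothetical, not a library function:

```python
def pool2x2(grid, reduce_fn):
    """Apply a 2x2 window with stride 2 to a 2D list with even dimensions."""
    out = []
    for i in range(0, len(grid), 2):
        row = []
        for j in range(0, len(grid[0]), 2):
            window = [grid[i][j], grid[i][j + 1],
                      grid[i + 1][j], grid[i + 1][j + 1]]
            row.append(reduce_fn(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

# Toy 4x4 feature map with two isolated strong activations (8 and 4).
feature_map = [
    [0, 0, 1, 1],
    [0, 8, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 4],
]

avg_pooled = pool2x2(feature_map, mean)  # [[2.0, 1.0], [2.0, 1.0]]
max_pooled = pool2x2(feature_map, max)   # [[8, 1], [2, 4]]

print(avg_pooled, max_pooled)
```

Both halve the spatial size, but averaging dilutes the isolated peaks into their neighbourhoods while max pooling passes them through unchanged.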

Recap

  • LeNet-5 = first CNN that worked at production scale.
  • Architectural template: conv encoder (spatial ↓, channels ↑) → flatten → dense head.
  • Same template scales up to ResNet and EfficientNet: the modern variants change components, not the shape.
  • Beats MLPs on the same data — convolutional inductive bias is a real win.
  • The next chapter swaps every component for its modern equivalent and goes much deeper.