The ImageNet Moment: AlexNet

Before 2012: features were crafted

Classical vision pipelines never fed raw pixels to a classifier:

hand-engineered extractors (SIFT, SURF, bags of visual words) computed the representation;
a linear model or kernel method handled the final classification.

Progress meant inventing better features, not better learning.

The bet: learn the representation

LeCun, Hinton, Bengio, Ng, Amari, Schmidhuber: features should be learned, hierarchically, layer by layer.

AlexNet’s first layer learned filters that resemble the hand-crafted ones:

First-layer filters learned by AlexNet.

What changed: data and compute

ImageNet (2009): the ILSVRC subset has about 1.2 M labeled training images in 1000 classes; images vary in size and are typically cropped to 224×224 for training.
GPUs: from 1999 to 2012, throughput grew by roughly three orders of magnitude.
Plus the missing training tricks: ReLU, dropout, augmentation, better initialization.

AlexNet (Krizhevsky, Sutskever, Hinton, 2012) put them together and won ILSVRC 2012 by a large margin: 15.3% top-5 error versus 26.2% for the runner-up.
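
Of the training tricks listed above, the benefit of ReLU is easy to check numerically. The sketch below is an illustration only, not code from the AlexNet paper: it compares the local gradient of a sigmoid with that of a ReLU at a hypothetical pre-activation value, then multiplies that gradient across a hypothetical stack of 8 layers (weights are ignored). The function names and the chosen values are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); never exceeds 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    # Derivative of max(0, x); the value at exactly 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0


def chained_grad(local_grad: float, depth: int) -> float:
    # Product of identical local gradients across `depth` layers.
    # Weights are ignored, so this is purely illustrative.
    return local_grad ** depth


pre_activation = 4.0  # hypothetical value in the sigmoid's saturated region
depth = 8             # hypothetical number of stacked layers

sigmoid_chain = chained_grad(sigmoid_grad(pre_activation), depth)
relu_chain = chained_grad(relu_grad(pre_activation), depth)

print(f"sigmoid gradient through {depth} layers: {sigmoid_chain:.3e}")
print(f"ReLU gradient through {depth} layers:    {relu_chain:.3e}")
```

With a saturated sigmoid the chained gradient all but vanishes, while the ReLU passes it through unchanged for positive inputs. This is one reason deeper networks became trainable.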

From LeNet to AlexNet

Same design, scaled up: convolutional stages, then a fully connected head.

LeNet and AlexNet side by side.

The architecture in code

Five conv layers (11×11 → 5×5 → three 3×3) with max-pooling, then two 4096-wide dense layers with dropout:

from d2l import torch as d2l
import torch
from torch import nn

class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(96, kernel_size=11, stride=4),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(num_classes))
        # Note: lazy layers have no parameters at construction time, so weight
        # initialization (d2l.init_cnn) is applied later via apply_init after
        # a dummy forward pass materializes the parameters.

Shape inspection

Walk a single 224×224 image through the network and print each layer’s output shape, from 224×224 down to 5×5 at 256 channels (6400 features after flattening):

AlexNet().layer_summary((1, 1, 224, 224))

Conv2d output shape:     torch.Size([1, 96, 54, 54])
ReLU output shape:   torch.Size([1, 96, 54, 54])
MaxPool2d output shape:  torch.Size([1, 96, 26, 26])
Conv2d output shape:     torch.Size([1, 256, 26, 26])
ReLU output shape:   torch.Size([1, 256, 26, 26])
MaxPool2d output shape:  torch.Size([1, 256, 12, 12])
...
ReLU output shape:   torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:   torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 10])
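
The spatial sizes printed above can be checked by hand with the standard output-size formula floor((n + 2p − k) / s) + 1 for input size n, kernel k, stride s, and padding p. The sketch below applies it to the convolution and pooling settings from the code above, assuming a 1-channel 224×224 input as in the layer_summary call; the helper names (out_size, trace_shapes, LAYERS) are made up for this example and are not part of d2l or PyTorch.

```python
def out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer (floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, out_channels); None marks a pooling layer,
# which leaves the channel count unchanged.
LAYERS = [
    ("conv 11x11", 11, 4, 0, 96),
    ("pool 3x3", 3, 2, 0, None),
    ("conv 5x5", 5, 1, 2, 256),
    ("pool 3x3", 3, 2, 0, None),
    ("conv 3x3", 3, 1, 1, 384),
    ("conv 3x3", 3, 1, 1, 384),
    ("conv 3x3", 3, 1, 1, 256),
    ("pool 3x3", 3, 2, 0, None),
]


def trace_shapes(size: int = 224, channels: int = 1):
    """Return (name, channels, height, width) after each layer."""
    shapes = []
    for name, kernel, stride, padding, out_channels in LAYERS:
        size = out_size(size, kernel, stride, padding)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, channels, size, size))
    return shapes


shapes = trace_shapes()
for name, c, h, w in shapes:
    print(f"{name:<10} -> {c} x {h} x {w}")

_, last_c, last_h, last_w = shapes[-1]
flat_features = last_c * last_h * last_w
print("flattened features:", flat_features)
```

For example, the first convolution gives floor((224 − 11) / 4) + 1 = 54, matching the first line of the summary.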

Training on Fashion-MNIST

Upsample the 28×28 Fashion-MNIST images to the 224×224 input AlexNet expects, then train with a smaller learning rate than LeNet:

model = AlexNet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
# Lazy layers have no weights at construction time; apply_init runs a
# dummy forward pass to materialize parameters and then applies init_cnn.
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

Recap

AlexNet is LeNet’s recipe at scale: 8 learned layers (5 convolutional, 3 fully connected), ~60 M parameters in the original ImageNet model, ReLU, dropout, GPU training.
Learned features displaced a decade of hand-crafted pipelines.
Its huge dense head is costly; the architectures in the next sections trim it away step by step.
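
To put a number on the cost of the dense head, the parameters of the network defined earlier can be counted by hand. This is a rough sketch under stated assumptions: a 1-channel input, 10 output classes, and 6400 flattened features (256 channels at 5×5), all of which follow from the Fashion-MNIST setup in this section. The helper names are invented for this example.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    # Each output channel has a k x k kernel per input channel, plus one bias.
    return c_out * (c_in * k * k + 1)


def dense_params(n_in: int, n_out: int) -> int:
    # Each output unit has one weight per input, plus one bias.
    return n_out * (n_in + 1)


conv_counts = [
    conv_params(1, 96, 11),    # assumes 1 input channel (Fashion-MNIST)
    conv_params(96, 256, 5),
    conv_params(256, 384, 3),
    conv_params(384, 384, 3),
    conv_params(384, 256, 3),
]
dense_counts = [
    dense_params(6400, 4096),  # assumes 256 * 5 * 5 flattened features
    dense_params(4096, 4096),
    dense_params(4096, 10),    # assumes 10 classes
]

conv_total = sum(conv_counts)
dense_total = sum(dense_counts)
total = conv_total + dense_total
head_share = dense_total / total

print(f"conv layers:  {conv_total:,}")
print(f"dense layers: {dense_total:,}")
print(f"total:        {total:,}")
print(f"dense share:  {head_share:.1%}")
```

Under these assumptions the three dense layers hold over 90% of the parameters. The total is lower than the ~60 M of the original ImageNet model, which has 3 input channels, 1000 classes, and a larger flattened feature map.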