The ImageNet Moment: AlexNet

Before 2012: features were crafted

Classical vision pipelines never fed raw pixels to a classifier:

hand-engineered extractors (SIFT, SURF, bags of visual words) computed the representation;
a linear model or kernel method handled the final classification.

Progress meant inventing better features, not better learning.

The bet: learn the representation

LeCun, Hinton, Bengio, Ng, Amari, Schmidhuber: features should be learned, hierarchically, layer by layer.

AlexNet’s first layer learned filters that resemble the hand-crafted ones:

First-layer filters learned by AlexNet.

What changed: data and compute

ImageNet (2009): the ILSVRC subset has about 1.2 M labeled training images in 1000 classes; AlexNet trained on 224×224 crops.
GPUs: from 1999 to 2012, throughput grew by roughly three orders of magnitude.
Plus the missing training tricks: ReLU, dropout, augmentation, better initialization.

AlexNet (Krizhevsky, Sutskever, Hinton, 2012) put them together and won ILSVRC 2012 by a large margin: 15.3% top-5 error versus 26.2% for the runner-up.
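Two of the tricks listed above, ReLU and dropout, are small enough to sketch directly. The code below is a minimal illustration in JAX, not the code AlexNet or Flax actually uses. The helper names relu and inverted_dropout are made up for this example, and the "inverted" dropout variant shown (rescaling the kept units at training time) is an assumption about which form to illustrate, not a quote from the 2012 paper.

```python
import jax
from jax import numpy as jnp

def relu(x):
    # Pass positive values through unchanged and clamp negatives to zero.
    return jnp.maximum(x, 0.0)

def inverted_dropout(key, x, rate):
    # Keep each unit with probability 1 - rate, and rescale the survivors
    # so the expected activation stays the same as without dropout.
    keep = jax.random.bernoulli(key, 1.0 - rate, x.shape)
    return jnp.where(keep, x / (1.0 - rate), 0.0)

x = jnp.array([-2.0, -0.5, 0.0, 1.0, 3.0])
h = relu(x)

# Hypothetical input: a vector of ones, so survivors are easy to spot.
dropped = inverted_dropout(jax.random.PRNGKey(0), jnp.ones((1000,)), rate=0.5)
```

With rate=0.5, every entry of dropped is either 0 or 2, and the mean stays close to 1.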

From LeNet to AlexNet

Same design, scaled up: convolutional stages, then a fully connected head.

LeNet and AlexNet side by side.

The architecture in code

Five conv layers (11×11 → 5×5 → three 3×3) with max-pooling, then two 4096-wide dense layers with dropout and a final linear output layer:

from d2l import jax as d2l
from flax import nnx
from jax import numpy as jnp

class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = (nnx.Rngs(params=d2l.get_key(), dropout=d2l.get_key())
                if rngs is None else rngs)
        self.net = nnx.Sequential(
            nnx.Conv(1, 96, kernel_size=(11, 11), strides=4,
                     padding='VALID', rngs=rngs),
            nnx.relu,
            lambda x: nnx.max_pool(x, window_shape=(3, 3), strides=(2, 2)),
            nnx.Conv(96, 256, kernel_size=(5, 5), rngs=rngs),
            nnx.relu,
            lambda x: nnx.max_pool(x, window_shape=(3, 3), strides=(2, 2)),
            nnx.Conv(256, 384, kernel_size=(3, 3), rngs=rngs), nnx.relu,
            nnx.Conv(384, 384, kernel_size=(3, 3), rngs=rngs), nnx.relu,
            nnx.Conv(384, 256, kernel_size=(3, 3), rngs=rngs), nnx.relu,
            lambda x: nnx.max_pool(x, window_shape=(3, 3), strides=(2, 2)),
            lambda x: x.reshape((x.shape[0], -1)),  # flatten
            nnx.Linear(5 * 5 * 256, 4096, rngs=rngs),
            nnx.relu,
            nnx.Dropout(0.5, rngs=rngs),
            nnx.Linear(4096, 4096, rngs=rngs),
            nnx.relu,
            nnx.Dropout(0.5, rngs=rngs),
            nnx.Linear(4096, num_classes, rngs=rngs))

Shape inspection

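Walk a single 224×224 image through the network and print each block’s output shape. The spatial size shrinks from 224×224 to 5×5 at 256 channels before flattening. In the printout, ReLU appears as custom_jvp and the pooling and flatten lambdas appear as function: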

AlexNet().layer_summary((1, 224, 224, 1))

Conv output shape:   (1, 54, 54, 96)
custom_jvp output shape:     (1, 54, 54, 96)
function output shape:   (1, 26, 26, 96)
Conv output shape:   (1, 26, 26, 256)
custom_jvp output shape:     (1, 26, 26, 256)
function output shape:   (1, 12, 12, 256)
...
custom_jvp output shape:     (1, 4096)
Dropout output shape:    (1, 4096)
Linear output shape:     (1, 4096)
custom_jvp output shape:     (1, 4096)
Dropout output shape:    (1, 4096)
Linear output shape:     (1, 10)
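The spatial sizes in the summary can be checked by hand. The sketch below assumes the usual conventions: a 'VALID' window of size k with stride s maps a length n to floor((n - k) / s) + 1, and a stride-1 'SAME' convolution leaves n unchanged. It also assumes that nnx.Conv defaults to 'SAME' padding and nnx.max_pool to 'VALID', which is consistent with the printed shapes. The helper valid_out is a name made up for this example.

```python
def valid_out(n, k, s):
    # Output length of an unpadded ('VALID') window of size k and stride s.
    return (n - k) // s + 1

sizes = [224]
sizes.append(valid_out(sizes[-1], 11, 4))  # conv 11x11, stride 4, VALID
sizes.append(valid_out(sizes[-1], 3, 2))   # max-pool 3x3, stride 2
# conv 5x5, stride 1, SAME: size unchanged
sizes.append(valid_out(sizes[-1], 3, 2))   # max-pool 3x3, stride 2
# three 3x3 convs, stride 1, SAME: size unchanged
sizes.append(valid_out(sizes[-1], 3, 2))   # max-pool 3x3, stride 2

flat = sizes[-1] ** 2 * 256  # features entering the first dense layer
print(sizes, flat)  # [224, 54, 26, 12, 5] 6400
```

The final 5 × 5 × 256 = 6400 matches the input width of the first nnx.Linear layer in the model definition.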

Training on Fashion-MNIST

Upsample the 28×28 Fashion-MNIST images to the 224×224 input AlexNet expects, then train with a smaller learning rate than LeNet:

model = AlexNet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
trainer.fit(model, data)
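Because 224 = 8 × 28, the upsampling step can be pictured as repeating every pixel 8 times along each spatial axis. The sketch below does this with jnp.repeat on a dummy batch in NHWC layout. It only illustrates nearest-neighbor upsampling; the interpolation that d2l.FashionMNIST actually applies for resize=(224, 224) may differ. The function name upsample_nearest is made up for this example.

```python
from jax import numpy as jnp

def upsample_nearest(x, factor):
    # x has shape (batch, height, width, channels).
    # Repeat each row, then each column, `factor` times.
    x = jnp.repeat(x, factor, axis=1)
    return jnp.repeat(x, factor, axis=2)

# Dummy stand-in for two 28x28 grayscale images.
batch = jnp.arange(2 * 28 * 28, dtype=jnp.float32).reshape((2, 28, 28, 1))
big = upsample_nearest(batch, 8)
print(big.shape)  # (2, 224, 224, 1)
```

Each upsampled image carries 64 times as many pixels as the original, so every epoch costs far more compute than LeNet on the raw 28×28 inputs.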

Recap

AlexNet is LeNet’s recipe at scale: 8 learned layers (5 conv, 3 dense), ~60 M parameters, ReLU, dropout, GPU training, ImageNet.
Learned features displaced a decade of hand-crafted pipelines.
Its huge dense head holds most of the parameters; the architectures in the next sections trim it away step by step.
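To see how much the dense head weighs, the parameters of the model defined earlier can be counted by hand. The sketch assumes every layer has a bias, a single input channel, 10 classes, and a 5 × 5 × 256 flattened feature map, as in the code above. The helper names conv_params and dense_params are made up for this example.

```python
def conv_params(k, c_in, c_out):
    # One k x k kernel per (input, output) channel pair, plus one bias per output.
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

conv = sum(conv_params(k, c_in, c_out) for k, c_in, c_out in [
    (11, 1, 96), (5, 96, 256), (3, 256, 384), (3, 384, 384), (3, 384, 256)])
dense = sum(dense_params(n_in, n_out) for n_in, n_out in [
    (5 * 5 * 256, 4096), (4096, 4096), (4096, 10)])

print(conv, dense)  # 3723968 43040778
print(dense / (conv + dense))
```

In this configuration the dense layers account for roughly 92% of the ~46.8 M parameters. The ~60 M figure in the recap refers to the original ImageNet model, which has three input channels, 1000 classes, and a larger flattened feature map.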