The ImageNet Moment: AlexNet

Before 2012: features were crafted

Classical vision pipelines never fed raw pixels to a classifier:

hand-engineered extractors (SIFT, SURF, bags of visual words) computed the representation;
a linear model or kernel method handled the final classification.

Progress meant inventing better features, not better learning.

The bet: learn the representation

LeCun, Hinton, Bengio, Ng, Amari, Schmidhuber: features should be learned, hierarchically, layer by layer.

AlexNet’s first layer learned filters that resemble classical hand-crafted ones (oriented edge detectors and color blobs):

First-layer filters learned by AlexNet.

What changed: data and compute

ImageNet (2009): 1.2 M labeled training images in the 1000-class ILSVRC subset, commonly cropped to 224×224 for training.
GPUs: from 1999 to 2012, throughput grew by roughly three orders of magnitude.
Plus the missing training tricks: ReLU, dropout, augmentation, better initialization.

AlexNet (Krizhevsky, Sutskever, Hinton, 2012) put them together and won ILSVRC 2012 by a large margin: 15.3% top-5 error versus 26.2% for the runner-up.
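Two of the training tricks listed above are small enough to sketch in plain Python. This is an illustration only, not the Keras implementation: the helper names relu and dropout are hypothetical, and the dropout shown is the "inverted" variant, which rescales surviving units by 1 / (1 - rate) during training so that nothing changes at inference time.

```python
import random

def relu(xs):
    """Element-wise max(0, x): positive inputs pass through unchanged."""
    return [max(0.0, x) for x in xs]

def dropout(xs, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`,
    and rescale the survivors so the expected value is preserved."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
dropped = dropout(activations, rate=0.5, rng=rng)
print(activations)
print(dropped)
```

With rate=0.5, every surviving activation is doubled and the rest are set to zero; ReLU has already removed the negative inputs, and its gradient stays at 1 for every positive input.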

From LeNet to AlexNet

Same design, scaled up: convolutional stages, then a fully connected head.

LeNet and AlexNet side by side.

The architecture in code

Five conv layers (11×11 → 5×5 → three 3×3) with max-pooling after the first, second, and fifth, then two 4096-wide dense layers with dropout and a linear output layer:

from d2l import tensorflow as d2l
import tensorflow as tf

class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential([
            # Large 11x11 window with stride 4 shrinks the input quickly
            tf.keras.layers.Conv2D(filters=96, kernel_size=11, strides=4,
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            # Smaller window; 'same' padding keeps height and width
            tf.keras.layers.Conv2D(filters=256, kernel_size=5, padding='same',
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            # Three consecutive 3x3 conv layers with no pooling in between
            tf.keras.layers.Conv2D(filters=384, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.Conv2D(filters=384, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.Conv2D(filters=256, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            tf.keras.layers.Flatten(),
            # Dense head: two 4096-wide layers, each followed by dropout
            tf.keras.layers.Dense(4096, activation='relu'),
            tf.keras.layers.Dropout(0.5),
            tf.keras.layers.Dense(4096, activation='relu'),
            tf.keras.layers.Dropout(0.5),
            # Output layer returns logits (no softmax); 10 classes by default
            tf.keras.layers.Dense(num_classes)])

Shape inspection

Walk a single-channel 224×224 image through the network and print each layer’s output shape, from 224×224 down to 5×5 at 256 channels (6400 values once flattened):

AlexNet().layer_summary((1, 224, 224, 1))

Conv2D output shape:        (1, 54, 54, 96)
MaxPooling2D output shape:  (1, 26, 26, 96)
Conv2D output shape:        (1, 26, 26, 256)
MaxPooling2D output shape:  (1, 12, 12, 256)
Conv2D output shape:        (1, 12, 12, 384)
Conv2D output shape:        (1, 12, 12, 384)
...
Flatten output shape:       (1, 6400)
Dense output shape:         (1, 4096)
Dropout output shape:       (1, 4096)
Dense output shape:         (1, 4096)
Dropout output shape:       (1, 4096)
Dense output shape:         (1, 10)
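The spatial sizes in this summary can be reproduced by hand. The sketch below assumes the Keras padding conventions: 'valid' gives floor((n - k) / s) + 1 and 'same' gives ceil(n / s). The helper conv_out and the layers table are hypothetical names introduced only for this illustration; the kernel sizes and strides are copied from the model definition.

```python
def conv_out(n, kernel, stride=1, padding="valid"):
    """Spatial output size of a conv or pooling layer (Keras-style padding)."""
    if padding == "same":
        return -(-n // stride)          # ceil(n / stride)
    return (n - kernel) // stride + 1   # 'valid': no padding

# (name, kernel, stride, padding) for each spatial layer of the model
layers = [
    ("Conv2D", 11, 4, "valid"),
    ("MaxPooling2D", 3, 2, "valid"),
    ("Conv2D", 5, 1, "same"),
    ("MaxPooling2D", 3, 2, "valid"),
    ("Conv2D", 3, 1, "same"),
    ("Conv2D", 3, 1, "same"),
    ("Conv2D", 3, 1, "same"),
    ("MaxPooling2D", 3, 2, "valid"),
]

size = 224
sizes = []
for name, kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name:<13} -> {size}x{size}")

flat = sizes[-1] ** 2 * 256  # 256 channels after the last conv layer
print("Flatten ->", flat)
```

The final 5×5 map at 256 channels flattens to 5 · 5 · 256 = 6400 values, matching the Flatten line in the summary.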

Training on Fashion-MNIST

Upsample the 28×28 Fashion-MNIST images to the 224×224 input AlexNet expects, then train with a smaller learning rate than LeNet:

trainer = d2l.Trainer(max_epochs=10)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
with d2l.try_gpu():
    model = AlexNet(lr=0.01)
    trainer.fit(model, data)

Recap

AlexNet is LeNet’s recipe at scale: 8 learned layers (5 convolutional, 3 dense), ~60 M parameters in the ImageNet version, ReLU, dropout, GPU training, ImageNet.
Learned features displaced a decade of hand-crafted pipelines.
Its huge dense head is costly; the architectures in the next sections trim it away step by step.
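To make the cost of the dense head concrete, here is a back-of-the-envelope parameter count for the variant defined in this section (1 input channel, 5×5×256 features into the head, 10 classes). It is a sketch under those assumptions, so the total differs from the ~60 M of the ImageNet model; conv_params and dense_params are hypothetical helper names.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

conv = [
    conv_params(11, 1, 96),     # 1 input channel (Fashion-MNIST is grayscale)
    conv_params(5, 96, 256),
    conv_params(3, 256, 384),
    conv_params(3, 384, 384),
    conv_params(3, 384, 256),
]
dense = [
    dense_params(5 * 5 * 256, 4096),  # flattened 5x5x256 feature map
    dense_params(4096, 4096),
    dense_params(4096, 10),           # num_classes=10
]

total = sum(conv) + sum(dense)
print(f"conv: {sum(conv):,}  dense: {sum(dense):,}  total: {total:,}")
print(f"dense share: {sum(dense) / total:.1%}")
```

Under these assumptions the three dense layers hold about 43 M of the roughly 46.8 M parameters, while the five conv layers together hold under 4 M.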