from d2l import torch as d2l
import torch
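from torch import nn

AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) is the network that made deep learning the dominant approach to computer vision. It won the ImageNet 2012 challenge by a wide margin and started the modern era.
AlexNet is best understood alongside LeNet, its predecessor from more than a decade earlier.
The architecture itself is straightforward; what changed was the scale.
Five convolutional layers (11×11, then 5×5, then three 3×3) interleaved with three max-pooling layers, followed by three fully connected layers. The original network ends in 1000 ImageNet classes; the version below defaults to 10 classes for Fashion-MNIST:
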
class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(96, kernel_size=11, stride=4, padding=1),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(num_classes))
        # Note: lazy layers have no parameters at construction time, so weight
        # initialization (d2l.init_cnn) is applied later via apply_init after
        # a dummy forward pass materializes the parameters.

Pass a single 1×1×224×224 image through the network and print each block’s output shape. The feature maps shrink from 224×224×1 to 5×5×256 before flattening:

AlexNet().layer_summary((1, 1, 224, 224))
Conv2d output shape: torch.Size([1, 96, 54, 54])
ReLU output shape: torch.Size([1, 96, 54, 54])
MaxPool2d output shape: torch.Size([1, 96, 26, 26])
Conv2d output shape: torch.Size([1, 256, 26, 26])
ReLU output shape: torch.Size([1, 256, 26, 26])
MaxPool2d output shape: torch.Size([1, 256, 12, 12])
...
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 4096])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 10])
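
The spatial sizes in the listing above can be checked by hand with the usual output-size rule for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The following is a minimal sketch in plain Python, not part of the d2l library: the helper names `out_size` and `trace_shapes` and the `layers` table are hypothetical, and the table simply restates the kernel, stride, and padding values from the AlexNet definition, assuming a single-channel 224×224 input.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling layer (floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding, out_channels); out_channels is None for
# pooling layers, which keep the channel count unchanged.
# Values copied from the AlexNet definition; the table itself is illustrative.
layers = [
    ("conv1", 11, 4, 1, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

def trace_shapes(size=224, channels=1):
    """Return (name, channels, height, width) after every layer."""
    shapes = []
    for name, k, s, p, out_ch in layers:
        size = out_size(size, k, s, p)
        if out_ch is not None:
            channels = out_ch
        shapes.append((name, channels, size, size))
    return shapes

shapes = trace_shapes()
for name, c, h, w in shapes:
    print(f"{name}: {c} x {h} x {w}")

# Number of features that nn.Flatten hands to the first linear layer.
flat_features = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print("flattened features:", flat_features)
```

The first three rows reproduce the 54, 26, and 12 seen in the printed shapes, and the last pooling layer ends at 256 × 5 × 5, i.e. 6400 features after flattening.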
For demonstration, upsample the 28×28 Fashion-MNIST images to the 224×224 input AlexNet expects, then train at lr=0.01:
model = AlexNet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
# Lazy layers have no weights at construction time; apply_init runs a
# dummy forward pass to materialize parameters and then applies init_cnn.
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
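trainer.fit(model, data)

Training is slow even on a GPU: AlexNet has roughly ten times as many convolution channels as LeNet, tens of millions of parameters rather than tens of thousands, and much larger 224×224 inputs. The architecture’s lasting contribution was to show that scaling a network up pays off when it is paired with enough data and compute.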
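
To put a number on the size of the model defined earlier, its parameters can be tallied layer by layer. This is a rough sketch in plain Python rather than anything from d2l: `conv_params` and `linear_params` are hypothetical helpers, and the count assumes a single input channel, a 224×224 input (so the flattened feature size is 256 × 5 × 5), and `num_classes=10`.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a 2-D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Channel and kernel sizes follow the AlexNet definition; the input channel
# count (1) and flattened size (256 * 5 * 5) are assumptions stated above.
counts = {
    "conv1": conv_params(1, 96, 11),
    "conv2": conv_params(96, 256, 5),
    "conv3": conv_params(256, 384, 3),
    "conv4": conv_params(384, 384, 3),
    "conv5": conv_params(384, 256, 3),
    "fc1": linear_params(256 * 5 * 5, 4096),
    "fc2": linear_params(4096, 4096),
    "fc3": linear_params(4096, 10),
}

total = sum(counts.values())
fc_share = (counts["fc1"] + counts["fc2"] + counts["fc3"]) / total

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}  (fully connected share: {fc_share:.1%})")
```

Under these assumptions the total comes to about 46.8 million parameters, with more than 90% of them in the three fully connected layers.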