from d2l import tensorflow as d2l
import tensorflow as tf

AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) established deep learning as the dominant approach to computer vision. It won the ImageNet 2012 challenge by a large margin and started the modern era.
Compare AlexNet with LeNet, introduced more than a decade earlier.
The architecture itself is straightforward; what changed was the scale.
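Five convolutional layers (one 11×11, one 5×5, then three 3×3) are interleaved with three max-pooling layers and followed by three fully connected layers. The original network ends in 1000 ImageNet classes; the implementation here defaults to num_classes=10 for Fashion-MNIST: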
class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential([
            # Large 11x11 window with stride 4 to shrink the input quickly
            tf.keras.layers.Conv2D(filters=96, kernel_size=11, strides=4,
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            # Smaller window; 'same' padding keeps height and width unchanged
            tf.keras.layers.Conv2D(filters=256, kernel_size=5, padding='same',
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            # Three consecutive 3x3 convolutions with no pooling in between
            tf.keras.layers.Conv2D(filters=384, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.Conv2D(filters=384, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.Conv2D(filters=256, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            tf.keras.layers.Flatten(),
            # Two wide fully connected layers, each followed by dropout
            tf.keras.layers.Dense(4096, activation='relu'),
            tf.keras.layers.Dropout(0.5),
            tf.keras.layers.Dense(4096, activation='relu'),
            tf.keras.layers.Dropout(0.5),
            # Output layer: one logit per class
            tf.keras.layers.Dense(num_classes)])

Pass a single 224×224 single-channel image (shape (1, 224, 224, 1) in TensorFlow's channels-last layout) through the network and print each layer's output shape. The feature maps shrink from 224×224×1 to 5×5×256 before flattening, which gives the 6400 = 5·5·256 features reported by Flatten:
Conv2D output shape: (1, 54, 54, 96)
MaxPooling2D output shape: (1, 26, 26, 96)
Conv2D output shape: (1, 26, 26, 256)
MaxPooling2D output shape: (1, 12, 12, 256)
Conv2D output shape: (1, 12, 12, 384)
Conv2D output shape: (1, 12, 12, 384)
...
Flatten output shape: (1, 6400)
Dense output shape: (1, 4096)
Dropout output shape: (1, 4096)
Dense output shape: (1, 4096)
Dropout output shape: (1, 4096)
Dense output shape: (1, 10)
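The spatial sizes in this listing can be reproduced by hand. The following is a minimal sketch in plain Python, not part of the original code: the helper names `conv_out` and `trace_shapes` are hypothetical, and it assumes the Keras defaults used above (no padding unless padding='same' is given, and fractional sizes rounded down).

```python
import math

def conv_out(size, kernel, stride=1, padding="valid"):
    """Output size along one spatial axis for a conv or pooling layer."""
    if padding == "same":
        # 'same' padding keeps ceil(size / stride) positions
        return math.ceil(size / stride)
    # 'valid': no padding, so the window must fit entirely inside the input
    return (size - kernel) // stride + 1

# (layer name, kernel, stride, padding, output channels or None to keep them)
LAYERS = [
    ("Conv2D", 11, 4, "valid", 96),
    ("MaxPool2D", 3, 2, "valid", None),
    ("Conv2D", 5, 1, "same", 256),
    ("MaxPool2D", 3, 2, "valid", None),
    ("Conv2D", 3, 1, "same", 384),
    ("Conv2D", 3, 1, "same", 384),
    ("Conv2D", 3, 1, "same", 256),
    ("MaxPool2D", 3, 2, "valid", None),
]

def trace_shapes(size=224, channels=1):
    """Return (name, height, width, channels) after each layer."""
    shapes = []
    for name, kernel, stride, padding, out_channels in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, size, size, channels))
    return shapes

shapes = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name:10s} -> ({h}, {w}, {c})")

_, h, w, c = shapes[-1]
flat_features = h * w * c  # 5 * 5 * 256 = 6400
print("Flatten    ->", flat_features)
```

The first convolution gives floor((224 - 11) / 4) + 1 = 54, and each 3×3 pooling layer with stride 2 maps 54 to 26, 26 to 12, and 12 to 5, so the last feature map is 5×5×256.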
For demonstration, upsample the 28×28 Fashion-MNIST images to the 224×224 input size that AlexNet expects, then train with a learning rate of 0.01 (overriding the constructor default of lr=0.1).
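To make the upsampling step concrete, here is a dependency-free sketch. It is an illustration under stated assumptions, not the pipeline used for the reported run: it uses nearest-neighbour repetition, which works because 224 / 28 = 8 exactly, whereas a real data loader would more likely call a library resize with interpolation such as `tf.image.resize`. The function name `upsample_nearest` and the gradient image are hypothetical stand-ins.

```python
def upsample_nearest(image, factor):
    """Repeat every pixel `factor` times along both axes (nearest neighbour)."""
    out = []
    for row in image:
        # Stretch the row horizontally, then emit it `factor` times vertically
        wide_row = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide_row) for _ in range(factor))
    return out

# Hypothetical 28x28 grayscale image: a simple gradient as stand-in data
small = [[(r + c) % 256 for c in range(28)] for r in range(28)]

large = upsample_nearest(small, factor=224 // 28)  # factor is 8
print(len(large), len(large[0]))
```

Upsampling adds no information: each 8×8 patch of the enlarged image repeats one source pixel. It only makes the input large enough for the stride-4 first layer and the three pooling stages.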
Training is slow even on a GPU: AlexNet has tens of millions of parameters, hundreds of times more than LeNet's roughly 60 thousand. The architecture's lasting contribution was to show that larger, deeper networks win when paired with enough data and compute.
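The size gap can be checked with a short parameter count. This sketch counts weights and biases for the AlexNet variant defined above (single-channel input, 10 classes, 6400 flattened features). The LeNet layout is an assumption, since the source does not list it: the standard LeNet-5 configuration of two 5×5 convolutions with 6 and 16 filters followed by dense layers of 120, 84, and 10 units. The helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square-kernel convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# AlexNet variant from this section: 1 input channel, 10 output classes
alexnet_params = sum([
    conv_params(11, 1, 96),
    conv_params(5, 96, 256),
    conv_params(3, 256, 384),
    conv_params(3, 384, 384),
    conv_params(3, 384, 256),
    dense_params(5 * 5 * 256, 4096),
    dense_params(4096, 4096),
    dense_params(4096, 10),
])

# Assumed LeNet-5 layout (not given in this section)
lenet_params = sum([
    conv_params(5, 1, 6),
    conv_params(5, 6, 16),
    dense_params(16 * 5 * 5, 120),
    dense_params(120, 84),
    dense_params(84, 10),
])

print("AlexNet parameters:", alexnet_params)
print("LeNet parameters:  ", lenet_params)
print("Ratio:", round(alexnet_params / lenet_params))
```

Under these assumptions the count is about 46.8 million for AlexNet against about 62 thousand for LeNet. The two 4096-unit dense layers account for most of it, with the first alone holding over 26 million parameters.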