The VGG network

Networks Using Blocks (VGG)

VGG: regular blocks at scale

VGG (Simonyan & Zisserman, 2014) is AlexNet taken seriously: stack more layers, but make them regular.

The contribution wasn’t a clever architecture — it was a design principle: regular blocks of 3×3 conv + ReLU, ending in a 2×2 max-pool. Whole network = a sequence of such blocks at growing channel counts.

From AlexNet’s hand-tuned layers to VGG’s repeated 3×3 blocks.

Why 3×3 convs only

Two stacked 3×3 convs cover the same receptive field as one 5×5 with fewer parameters (18C² versus 25C² weights for C channels in and out) and one extra nonlinearity.
All convs are stride 1 with padding 1, so only the pooling layers change the spatial size: easier to reason about, and surprisingly competitive with AlexNet's hand-tuned kernel sizes.
The architecture becomes a tuple of (num_convs, out_channels) pairs; pass a different tuple for VGG-13/16/19.
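
The parameter saving in the first point is easy to check by hand. A minimal sketch in plain Python, assuming the same channel count C into and out of every layer and ignoring biases (both simplifying assumptions; conv_weights is a helper name introduced here, not part of d2l or PyTorch):

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Weight count of one 2D convolution, biases ignored."""
    return kernel_size * kernel_size * in_channels * out_channels

C = 64  # illustrative channel width; the ratio does not depend on it

two_3x3 = 2 * conv_weights(3, C, C)  # 2 * 9 * C^2 = 18 C^2
one_5x5 = conv_weights(5, C, C)      # 25 * C^2

print(two_3x3, one_5x5, two_3x3 / one_5x5)
```

The ratio is 18/25 = 0.72 whatever C is, so the stacked pair needs 28% fewer weights for the same receptive field.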

Receptive field arithmetic

Stacking small kernels grows the visible patch without paying for a large kernel in one step.

For stride 1 and no dilation:

r_L = 1 + \sum_{\ell=1}^L (k_\ell - 1).

Two 3×3 convolutions see

1 + (3 - 1) + (3 - 1) = 5

pixels across: the same 5×5 receptive field as one 5×5 conv, but with two ReLUs and fewer weights.
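
The formula can be turned into a few lines of Python to try other stacks. This is only a sketch of the stride-1, no-dilation case stated above; it does not account for the pooling layers between blocks, and receptive_field is a name chosen here for illustration:

```python
def receptive_field(kernel_sizes):
    """r_L = 1 + sum(k - 1) for stride-1, undilated convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)

# Compare stacks of 3x3 kernels against single larger kernels.
for stack in ([3], [3, 3], [3, 3, 3], [5], [7]):
    print(stack, "->", receptive_field(stack))
```

By the same arithmetic, three stacked 3×3 convs match the receptive field of a single 7×7.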

The VGG block

A reusable subunit: num_convs consecutive Conv-ReLU pairs, each with out_channels output channels, followed by a 2×2 MaxPool with stride 2:

from d2l import torch as d2l
import torch
from torch import nn

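def vgg_block(num_convs, out_channels):
    """Build one VGG block: num_convs x (3x3 conv + ReLU), then a 2x2 max-pool."""
    layers = []
    for _ in range(num_convs):
        # padding=1 keeps height and width unchanged; LazyConv2d infers
        # the input channel count on the first forward pass.
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    # Halve the spatial resolution once per block.
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)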

A whole VGG-11 (the smallest variant) is just five blocks at growing channel counts (64, 128, 256, 512, 512) plus a 3-layer dense head:

class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        # One vgg_block per (num_convs, out_channels) pair in arch.
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        # Dense head: two 4096-unit hidden layers with dropout, then the classifier.
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)

VGG(arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))).layer_summary(
    (1, 1, 224, 224))

Sequential output shape:     torch.Size([1, 64, 112, 112])
Sequential output shape:     torch.Size([1, 128, 56, 56])
Sequential output shape:     torch.Size([1, 256, 28, 28])
Sequential output shape:     torch.Size([1, 512, 14, 14])
Sequential output shape:     torch.Size([1, 512, 7, 7])
Flatten output shape:    torch.Size([1, 25088])
...
ReLU output shape:   torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 4096])
ReLU output shape:   torch.Size([1, 4096])
Dropout output shape:    torch.Size([1, 4096])
Linear output shape:     torch.Size([1, 10])
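
The spatial sizes in this summary follow from the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. A small plain-Python sketch that reproduces them without building the network (conv_out is a helper introduced here for illustration, not a d2l or PyTorch function):

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output side length of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))  # VGG-11
size = 224
shapes = []
for num_convs, out_channels in arch:
    for _ in range(num_convs):
        size = conv_out(size)  # 3x3, stride 1, padding 1: size unchanged
    size = conv_out(size, kernel=2, stride=2, padding=0)  # 2x2 max-pool halves it
    shapes.append((out_channels, size))
    print(out_channels, size, size)

flat = out_channels * size * size  # input width of the first dense layer
print(flat)
```

Five halvings take 224 down to 7, and 512 × 7 × 7 = 25088 is the Flatten width shown above.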

The “named architecture” is just a tuple of (num_convs, out_channels) pairs; passing a different tuple gives you VGG-13/16/19.
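
For reference, the deeper variants can be written in the same form. The tuples below follow configurations A, B, D and E of the original paper; the names VGG_ARCHS and weight_layers are introduced here for illustration and are not part of d2l:

```python
# (num_convs, out_channels) per block, one entry per named variant.
VGG_ARCHS = {
    "VGG-11": ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)),
    "VGG-13": ((2, 64), (2, 128), (2, 256), (2, 512), (2, 512)),
    "VGG-16": ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512)),
    "VGG-19": ((2, 64), (2, 128), (4, 256), (4, 512), (4, 512)),
}

def weight_layers(arch, num_dense=3):
    """Count conv layers plus the dense layers of the head."""
    return sum(num_convs for num_convs, _ in arch) + num_dense

for name, arch in VGG_ARCHS.items():
    print(name, weight_layers(arch))
```

The number in each name is the count of weight layers: all the convolutions plus the three dense layers. Under these assumptions, something like VGG(arch=VGG_ARCHS["VGG-16"]) would build the deeper model with the class above.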

Training (a thin VGG)

Full VGG-11 is heavy for a notebook. Train a thinned version (channels 16/32/64/128/128) on Fashion-MNIST as a smoke test:

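# Thinned VGG-11: same block layout, a quarter of the channels per block.
model = VGG(arch=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)), lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
# Run one batch through the net so the lazy layers infer their shapes,
# then apply the initializer.
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)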

This exercises the block-at-scale design end to end without melting your GPU.
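
To put "heavy" in numbers, the parameter counts of the full and thinned models can be tallied by hand. A sketch in plain Python, assuming the VGG class as defined above with a 1-channel 224×224 input and 10 classes (count_params is a helper written for this illustration, not a d2l function):

```python
def count_params(arch, in_channels=1, image_size=224, num_classes=10, hidden=4096):
    """Return (conv, dense) learnable parameter counts, weights plus biases."""
    conv, size = 0, image_size
    for num_convs, out_channels in arch:
        for _ in range(num_convs):
            conv += 3 * 3 * in_channels * out_channels + out_channels
            in_channels = out_channels
        size //= 2  # each block ends in a 2x2, stride-2 max-pool
    flat = in_channels * size * size  # width after nn.Flatten
    dense = ((flat * hidden + hidden)
             + (hidden * hidden + hidden)
             + (hidden * num_classes + num_classes))
    return conv, dense

full = count_params(((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)))
thin = count_params(((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)))
print("full VGG-11 (conv, dense):", full, "total:", sum(full))
print("thin VGG    (conv, dense):", thin, "total:", sum(thin))
```

Under these assumptions the full model has roughly 129 million parameters and the thinned one roughly 43 million. In both, the dense head dominates; thinning cuts the convolutional parameters by about 16× and the first dense layer by 4×.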

Recap

VGG = “stack identical, regular blocks.” A block is n × (3×3 conv + ReLU) followed by one 2×2 max-pool.
Two 3×3 convs ≈ one 5×5 receptive field, with fewer params and more nonlinearity.
The architecture-as-a-tuple-of-blocks pattern (((1, 64), (1, 128), (2, 256), …)) outlived VGG: ResNet, EfficientNet, and ConvNeXt are likewise specified as stages of repeated blocks with a per-stage depth and width.