Design Spaces and the Big Picture

From hand design to design spaces

We’ve seen a sequence of hand-designed architectures (LeNet → AlexNet → VGG → GoogLeNet → ResNet → DenseNet), each a hypothesis about what makes nets work.

Can we design networks more systematically?

RegNet: design space search

RegNet (Radosavovic et al., 2020):

Define a parametric design space (AnyNet): same template, free hyperparameters.
Sample many networks, train each briefly, see how accuracy correlates with hyperparameter choices.
Constrain the design space based on what works.

Simple closed-form rules (“width grows linearly with stage”) match or outperform architectures that took years of expert tuning.
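The sample-then-constrain loop above can be sketched without any training. This is a minimal illustration, not the paper's code: the function names are hypothetical, widths are sampled uniformly for simplicity, and the two constraints (non-decreasing depths and widths across stages) stand in for the kind of rule the search arrives at.

```python
import random

def sample_anynet_config(rng, num_stages=4):
    """Draw one unconstrained configuration: per-stage depths and widths."""
    depths = [rng.randint(1, 16) for _ in range(num_stages)]
    # Widths are multiples of 8 up to 1024 (uniform here; an assumption)
    widths = [8 * rng.randint(1, 128) for _ in range(num_stages)]
    return depths, widths

def is_non_decreasing(values):
    """True if every entry is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))

rng = random.Random(0)
population = [sample_anynet_config(rng) for _ in range(10_000)]

# Keep only configurations that satisfy both constraints.
# In the real procedure each sample would also be trained briefly and scored.
constrained = [(d, w) for d, w in population
               if is_non_decreasing(d) and is_non_decreasing(w)]
fraction = len(constrained) / len(population)
print(f"{len(constrained)} of {len(population)} samples satisfy both constraints")
```

Only a small fraction of random samples survive, which is the point: each constraint removes a large part of the space that never needs to be explored.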

The AnyNet design space


Stem (low-level conv) → 4 stages of residual blocks → head (global pool + linear). Each stage’s depth, width, group setting, and bottleneck ratio are free parameters:

from d2l import torch as d2l
import torch
from torch import nn
from torch.nn import functional as F
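To get a feel for how large this design space is, the sketch below counts configurations. The per-stage ranges are assumptions taken from my reading of the RegNet paper (depth up to 16, width up to 1024 in multiples of 8, three bottleneck ratios, six group widths); treat them as illustrative rather than authoritative.

```python
# Assumed per-stage choices (illustrative, see lead-in)
depth_choices = 16             # d_i in 1..16
width_choices = 1024 // 8      # w_i in 8, 16, ..., 1024
bottleneck_choices = 3         # b_i in {1, 2, 4}
group_choices = 6              # g_i in {1, 2, 4, 8, 16, 32}
num_stages = 4

per_stage = (depth_choices * width_choices
             * bottleneck_choices * group_choices)
total = per_stage ** num_stages
print(f"{per_stage} choices per stage, about {total:.2e} networks overall")
```

A space of this size cannot be enumerated, which is why the procedure samples networks and studies the population instead.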

AnyNet stem

The stem is deliberately plain: one stride-2 3×3 convolution, BatchNorm, ReLU. Its job is to halve resolution and create the first feature channels before the repeated stages begin.

class AnyNet(d2l.Classifier):
    def stem(self, num_channels):
        return nn.Sequential(
            nn.LazyConv2d(num_channels, kernel_size=3, stride=2, padding=1),
            nn.LazyBatchNorm2d(), nn.ReLU())

AnyNet stage

Each stage repeats the same ResNeXt block. The first block uses stride 2 and a 1×1 skip projection to change resolution and channel count; the rest preserve shape.

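@d2l.add_to_class(AnyNet)
def stage(self, depth, num_channels, groups, bot_mul):
    blk = []
    for i in range(depth):
        if i == 0:
            # First block: halve the resolution and project the skip path
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul,
                use_1x1conv=True, strides=2))
        else:
            # Remaining blocks keep resolution and channel count
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul))
    return nn.Sequential(*blk)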
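Why grouped convolutions matter for the block's cost can be seen from a weight count. The helper below is a hypothetical illustration (biases ignored); how the stage's group setting maps to the number of groups depends on the block implementation, so the number of groups is passed explicitly here.

```python
def conv_weight_count(c_in, c_out, kernel_size, num_groups=1):
    """Weights in a 2D convolution with `num_groups` groups (no bias)."""
    if c_in % num_groups or c_out % num_groups:
        raise ValueError("channels must be divisible by num_groups")
    # Each output channel only sees c_in / num_groups input channels
    return kernel_size * kernel_size * (c_in // num_groups) * c_out

dense = conv_weight_count(80, 80, kernel_size=3)
grouped = conv_weight_count(80, 80, kernel_size=3, num_groups=5)
print(dense, grouped, dense // grouped)
```

Splitting 80 channels into 5 groups of width 16 cuts the 3×3 convolution's weights by a factor of 5.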

AnyNet assembly

The architecture tuple supplies (depth, channels, groups, bottleneck) per stage. The head is the now-standard global average pool + linear classifier.

@d2l.add_to_class(AnyNet)
def __init__(self, arch, stem_channels, lr=0.1, num_classes=10):
    super(AnyNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.stem(stem_channels))
    # One stage per (depth, num_channels, groups, bot_mul) tuple in arch
    for i, s in enumerate(arch):
        self.net.add_module(f'stage{i+1}', self.stage(*s))
    # Head: global average pool, flatten, linear classifier
    self.net.add_module('head', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)

RegNet design-space evidence

Comparing error empirical distribution functions of design spaces.
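An error empirical distribution function (EDF) reports, for a threshold e, the fraction of sampled models whose error is below e. A minimal sketch follows; the function name is hypothetical and the error values are made-up placeholders, not measurements.

```python
def error_edf(errors, threshold):
    """Fraction of sampled models with error strictly below `threshold`."""
    return sum(e < threshold for e in errors) / len(errors)

# Made-up placeholder errors (percent) for two hypothetical design spaces
space_a = [48.0, 52.5, 55.0, 61.0, 70.0]
space_b = [44.0, 46.5, 50.0, 53.0, 58.0]

for t in (50.0, 60.0):
    print(t, error_edf(space_a, t), error_edf(space_b, t))
```

A design space whose EDF lies above another's at every threshold contains a larger share of good models, which is the criterion used to compare spaces.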

RegNet narrows AnyNet with simple constraints: stage widths grow approximately linearly, bottleneck ratios stay fixed, and group widths are shared across stages. The result is a smaller search space with a higher probability of sampling a good model.
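The linear width rule can be sketched as follows, assuming the parameterization I recall from the RegNet paper: a linear width per block, snapped to powers of a multiplier w_m and rounded to a multiple of 8. The function name and the parameter values are illustrative assumptions.

```python
import math
from itertools import groupby

def regnet_widths(w_0, w_a, w_m, depth, q=8):
    """Per-stage widths and depths from a quantized linear width rule."""
    block_widths = []
    for j in range(depth):
        u_j = w_0 + w_a * j                               # linear width for block j
        s_j = round(math.log(u_j / w_0) / math.log(w_m))  # nearest power of w_m
        w_j = w_0 * w_m ** s_j
        block_widths.append(int(round(w_j / q)) * q)      # snap to multiple of q
    # Consecutive blocks with equal width form one stage
    stage_widths = [w for w, _ in groupby(block_widths)]
    stage_depths = [len(list(g)) for _, g in groupby(block_widths)]
    return stage_widths, stage_depths

widths, depths = regnet_widths(w_0=24, w_a=36.44, w_m=2.49, depth=13)
print(widths, depths)
```

Four numbers (w_0, w_a, w_m, depth) replace the per-stage widths and depths that AnyNet leaves free; with these illustrative values the 13 blocks fall into four stages.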

A RegNetX-3.2GF instance

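The paper’s empirical findings collapse to a few rules: width grows roughly linearly with block index, bottleneck ratio and group width are shared across stages (ResNeXt-style grouped convolutions), and total depth stays roughly constant across compute budgets. A scaled-down two-stage version for Fashion-MNIST, with 4 and 6 blocks at 32 and 80 channels: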

class RegNetX32(AnyNet):
    def __init__(self, lr=0.1, num_classes=10):
        stem_channels, groups, bot_mul = 32, 16, 1
        depths, channels = (4, 6), (32, 80)
        super().__init__(
            ((depths[0], channels[0], groups, bot_mul),
             (depths[1], channels[1], groups, bot_mul)),
            stem_channels, lr, num_classes)

RegNetX32().layer_summary((1, 1, 96, 96))

Sequential output shape:     torch.Size([1, 32, 48, 48])
Sequential output shape:     torch.Size([1, 32, 24, 24])
Sequential output shape:     torch.Size([1, 80, 12, 12])
Sequential output shape:     torch.Size([1, 10])
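The spatial sizes in this summary follow from the standard convolution output-size formula. A small check, assuming the stem and the first block of each stage each apply one 3×3, stride-2, padding-1 convolution and all other layers preserve resolution:

```python
def conv_out_size(n, kernel_size, stride, padding):
    """Spatial output size of a convolution on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

size = 96
sizes = []
for name in ("stem", "stage1", "stage2"):
    # One stride-2 3x3 convolution per module halves the resolution
    size = conv_out_size(size, kernel_size=3, stride=2, padding=1)
    sizes.append((name, size))
print(sizes)
```

Each module halves the resolution, so a 96×96 input reaches the head at 12×12 before global average pooling.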

Training

model = RegNetX32(lr=0.05)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)

The architecture is competitive with hand-designed ResNets at similar parameter counts, and the discovery process parallelizes easily: every sampled network trains independently, so more compute simply buys more samples.

The big picture: ConvNets vs. Transformers

Vision Transformers overtook CNNs on large-scale classification around 2021.
The resolution came from scaling: with modern training recipes and equal compute, ConvNets match ViTs (NFNets at JFT-4B scale; ConvNeXt).
The 2021 gap was mostly recipe + scale, not representational power.

The division of labor (2026)

Transformers: foundation-scale pretraining, multimodal stacks, billion-scale image-text corpora.
ConvNets: edge/latency, small data, dense prediction (nnU-Net still wins medical segmentation).
Conv stems persist inside Transformers (Whisper); at the frontier, diffusion moved from U-Net to DiT, while conv U-Nets remain widely deployed.
Inductive bias is a data-efficiency dial, not a ceiling.

Recap

Modern architecture design = search over a parametric design space, not heroic engineering.
AnyNet specifies the template (stem / 4 stages / head); the empirical search picks widths, depths, and groups.
Resulting networks (RegNet) match or beat hand-designed rivals with simpler, more interpretable rules.
The same philosophy (fit simple laws to populations of models) drives today’s scaling-law-guided design.
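Fitting a simple law to a population of models can be as small as a least-squares line in log-log space. The sketch below is illustrative only: the function name is hypothetical and the data points are synthetic, generated from a known power law rather than measured.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic points drawn exactly from loss = 5 * size**(-0.1)
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in model_sizes]
a, b = fit_power_law(model_sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real measurements the fit would be noisy, and the fitted law would be used to extrapolate to model sizes that have not been trained.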