Design Spaces and the Big Picture

From hand design to design spaces

We’ve seen a sequence of hand-designed architectures (LeNet → AlexNet → VGG → GoogLeNet → ResNet → DenseNet), each a hypothesis about what makes nets work.

Can we design networks more systematically?

RegNet: design space search

RegNet (Radosavovic et al., 2020):

Define a parametric design space (AnyNet): same template, free hyperparameters.
Sample many networks, train each briefly, see how accuracy correlates with hyperparameter choices.
Constrain the design space based on what works.

Simple closed-form rules (“width grows linearly with stage”) turn out to match or beat carefully hand-tuned architectures.
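
The sample-then-constrain loop can be sketched without any training. The snippet below is a minimal illustration, not the paper’s code: the sampling ranges, the helper names `sample_anynet_arch` and `widths_non_decreasing`, and the choice of constraint are all assumptions made for the example. It only shows how a constraint shrinks a sampled population; in the real procedure each sampled network is also trained briefly and its error recorded.

```python
import random

def sample_anynet_arch(rng, num_stages=4):
    """Draw one architecture: a (depth, width, groups, bot_mul) tuple per stage.

    The ranges below are illustrative assumptions, not the paper's.
    """
    arch = []
    for _ in range(num_stages):
        depth = rng.randint(1, 16)                 # blocks in this stage
        groups = rng.choice([1, 2, 4, 8, 16, 32])  # group setting
        width = groups * rng.randint(1, 32)        # keep width divisible by groups
        bot_mul = rng.choice([1, 0.5, 0.25])       # bottleneck multiplier
        arch.append((depth, width, groups, bot_mul))
    return tuple(arch)

def widths_non_decreasing(arch):
    """Hypothetical constraint: later stages are at least as wide as earlier ones."""
    widths = [stage[1] for stage in arch]
    return all(a <= b for a, b in zip(widths, widths[1:]))

rng = random.Random(0)
population = [sample_anynet_arch(rng) for _ in range(1000)]
constrained = [a for a in population if widths_non_decreasing(a)]
print(len(population), len(constrained))  # the constrained space is a subset
```

Each tuple has the same layout as the `arch` argument used by the AnyNet class in this section, so a sampled architecture could in principle be passed straight to it.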

The AnyNet design space


Stem (low-level conv) → 4 stages of residual blocks → head (global pool + linear). Each stage’s depth, width, group count, and bottleneck ratio are free parameters:

from d2l import jax as d2l
from flax import nnx
import jax

AnyNet stem

The stem is deliberately plain: one stride-2 3×3 convolution, BatchNorm, ReLU. Its job is to halve resolution and create the first feature channels before the repeated stages begin.

class AnyNet(d2l.Classifier):
    def __init__(self, arch, stem_channels, lr=0.1, num_classes=10,
                 in_channels=1, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        self.net = self.create_net(in_channels, rngs)

    def stem(self, in_channels, num_channels, rngs):
        return nnx.Sequential(
            nnx.Conv(in_channels, num_channels, kernel_size=(3, 3),
                     strides=(2, 2), padding=(1, 1), rngs=rngs),
            nnx.BatchNorm(num_channels, rngs=rngs), nnx.relu)

AnyNet stage

Each stage repeats the same ResNeXt block. The first block uses stride 2 and a 1×1 skip projection to change resolution and channel count; the rest preserve shape.

@d2l.add_to_class(AnyNet)  # register as a method of the AnyNet class above
def stage(self, depth, num_channels, groups, bot_mul, in_channels, rngs):
    blk = []
    for i in range(depth):
        if i == 0:
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul,
                use_1x1conv=True, strides=(2, 2), in_channels=in_channels,
                rngs=rngs))
        else:
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul,
                                        in_channels=num_channels, rngs=rngs))
    return nnx.Sequential(*blk)
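
Why grouped convolutions matter for the design space: splitting a convolution into groups divides its weight count by the number of groups, so the group setting trades capacity against cost at fixed width. The following is a back-of-the-envelope sketch; `conv_weight_count` and its `num_groups` argument are hypothetical names for this example and need not correspond to how `d2l.ResNeXtBlock` interprets its `groups` argument.

```python
def conv_weight_count(c_in, c_out, kernel_size=3, num_groups=1):
    """Weights (bias excluded) of a 2-D convolution split into num_groups groups.

    Each output channel only sees c_in // num_groups input channels.
    """
    assert c_in % num_groups == 0 and c_out % num_groups == 0
    return kernel_size * kernel_size * (c_in // num_groups) * c_out

dense = conv_weight_count(80, 80)                  # ordinary 3x3 convolution
grouped = conv_weight_count(80, 80, num_groups=5)  # 5 groups of 16 channels
print(dense, grouped, dense // grouped)            # 57600 11520 5
```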

AnyNet assembly

The architecture tuple supplies (depth, channels, groups, bottleneck) per stage. The head is the now-standard global average pool + linear classifier.

@d2l.add_to_class(AnyNet)  # register as a method of the AnyNet class above
def create_net(self, in_channels, rngs):
    layers = [self.stem(in_channels, self.stem_channels, rngs)]
    stage_channels = self.stem_channels
    for s in self.arch:
        layers.append(self.stage(*s, stage_channels, rngs))
        stage_channels = s[1]
    layers.append(nnx.Sequential(
        lambda x: x.mean(axis=(1, 2)),  # global avg pooling over H, W (NHWC)
        nnx.Linear(stage_channels, self.num_classes, rngs=rngs)))
    return nnx.Sequential(*layers)

RegNet design-space evidence

Comparing design spaces by their error empirical distribution functions (EDFs).
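
An error EDF reports, for each error threshold, the fraction of sampled models whose error falls below it; a curve that rises earlier indicates a design space with more good models. A minimal sketch follows. The error values are synthetic placeholders chosen for illustration, not measured results, and `error_edf` is a name made up for this example.

```python
def error_edf(errors, thresholds):
    """Fraction of models with error strictly below each threshold."""
    n = len(errors)
    return [sum(e < t for e in errors) / n for t in thresholds]

# Synthetic placeholder errors (percent) for two hypothetical design spaces.
space_a = [42.0, 45.5, 47.0, 51.0, 58.0]
space_b = [40.0, 41.5, 43.0, 44.0, 49.0]
thresholds = [40.0, 45.0, 50.0, 60.0]

print(error_edf(space_a, thresholds))  # [0.0, 0.2, 0.6, 1.0]
print(error_edf(space_b, thresholds))  # [0.0, 0.8, 1.0, 1.0]
```

In this toy data, space B reaches 80% of its models below 45% error while space A reaches only 20%, so B would be judged the better space.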

RegNet narrows AnyNet with simple constraints: stage widths grow approximately linearly, bottleneck ratios stay fixed, and group widths are shared across stages. The result is a smaller search space in which a randomly sampled model is more likely to be good.
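
The linear width rule can be made concrete. In the paper’s parameterization, block j gets a target width u_j = w_0 + w_a · j, which is then snapped to the nearest power of a multiplier w_m so that consecutive blocks share a width and form stages. The sketch below follows that recipe under stated assumptions: the function names and the parameter values (w_0 = 32, w_a = 16, w_m = 2) are chosen for illustration and are not taken from any published RegNet model. Practical implementations may additionally round widths to hardware-friendly multiples; that step is omitted here.

```python
import math

def regnet_widths(depth, w_0, w_a, w_m):
    """Per-block widths from a quantized linear rule (illustrative sketch)."""
    widths = []
    for j in range(depth):
        u_j = w_0 + w_a * j                                # linear target width
        s_j = round(math.log(u_j / w_0) / math.log(w_m))   # nearest exponent of w_m
        widths.append(int(round(w_0 * w_m ** s_j)))        # quantized width
    return widths

def widths_to_stages(widths):
    """Merge consecutive blocks of equal width into (depth, width) stages."""
    stages = []
    for w in widths:
        if stages and stages[-1][1] == w:
            stages[-1] = (stages[-1][0] + 1, w)
        else:
            stages.append((1, w))
    return stages

widths = regnet_widths(depth=8, w_0=32, w_a=16, w_m=2)
print(widths)                    # [32, 64, 64, 64, 128, 128, 128, 128]
print(widths_to_stages(widths))  # [(1, 32), (3, 64), (4, 128)]
```

A handful of scalars thus determines every stage’s depth and width, which is what makes the constrained space so much smaller than AnyNet.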

A RegNetX-3.2GF instance

The paper’s empirical findings reduce to three rules: width grows linearly with stage, depth stays roughly constant, and blocks use ResNeXt-style grouped convolutions. A scaled-down version for Fashion-MNIST:

class RegNetX32(AnyNet):
    def __init__(self, lr=0.1, num_classes=10, in_channels=1, rngs=None):
        super().__init__(((4, 32, 16, 1), (6, 80, 16, 1)), 32,
                         lr, num_classes, in_channels, rngs)

RegNetX32().layer_summary((1, 96, 96, 1))

Sequential output shape:     (1, 48, 48, 32)
Sequential output shape:     (1, 24, 24, 32)
Sequential output shape:     (1, 12, 12, 80)
Sequential output shape:     (1, 10)
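
The shapes in this summary can be checked by hand. The sketch below traces them with plain arithmetic, assuming that the stem and the first block of every stage each apply one stride-2 3×3 convolution with padding 1 and that all other blocks preserve shape. `conv_out_size` and `trace_shapes` are helper names invented for this check, not part of d2l.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_shapes(input_shape, stem_channels, arch, num_classes):
    """Expected NHWC output shape after the stem, each stage, and the head."""
    n, h, w, _ = input_shape
    h, w = conv_out_size(h), conv_out_size(w)      # stem halves resolution
    shapes = [(n, h, w, stem_channels)]
    for _depth, channels, _groups, _bot_mul in arch:
        h, w = conv_out_size(h), conv_out_size(w)  # first block of the stage
        shapes.append((n, h, w, channels))
    shapes.append((n, num_classes))                # global pool + linear
    return shapes

arch = ((4, 32, 16, 1), (6, 80, 16, 1))
for shape in trace_shapes((1, 96, 96, 1), 32, arch, 10):
    print(shape)
# (1, 48, 48, 32)
# (1, 24, 24, 32)
# (1, 12, 12, 80)
# (1, 10)
```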

Training

model = RegNetX32(lr=0.05)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)

The architecture is competitive with hand-designed ResNets at similar parameter counts, and the discovery process parallelizes easily because every sampled network trains independently.

The big picture: ConvNets vs. Transformers

Vision Transformers overtook CNNs on large-scale classification around 2021.
How the question resolved at scale: with modern recipes and equal compute, ConvNets match ViTs (NFNets at JFT-4B scale; ConvNeXt).
The 2021 gap was mostly recipe + scale, not representational power.

The division of labor (2026)

Transformers: foundation-scale pretraining, multimodal stacks, billion-scale image-text corpora.
ConvNets: edge/latency, small data, dense prediction (nnU-Net still wins medical segmentation).
Conv stems persist inside Transformers (Whisper); in diffusion, the frontier moved from U-Net to DiT, while conv U-Nets remain widely deployed.
Inductive bias is a data-efficiency dial, not a ceiling.

Recap

Modern architecture design = search over a parametric design space, not heroic engineering.
AnyNet specifies the template (stem / 4 stages / head); the empirical search picks widths, depths, and groups.
Resulting networks (RegNet) match or beat hand-designed rivals with simpler, more interpretable rules.
The same philosophy (fit simple laws to populations of models) drives today’s scaling-law-guided design.