Residual Networks: ResNet, ResNeXt, and DenseNet

ResNet learns residuals

ResNet (He et al., 2015) is the architecture that finally made very deep networks trainable. The key:

\mathbf{y} = f(\mathbf{x}) + \mathbf{x}.

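The block only needs to learn the residual relative to the identity. Because the identity is always representable (set f to zero), a deeper network can in principle do at least as well as a shallower one, and in He et al.'s experiments going from 18 to 152 layers does improve accuracy. The skip path also gives gradients a direct route back to earlier layers, which makes deep networks much easier to optimize than plain stacks of the same depth.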
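
To see why the skip path helps gradients, here is a minimal sketch with scalar linear layers (a toy assumption, not the convolutional block used later). A plain layer y = w·x has local derivative w, while a residual layer y = w·x + x has local derivative w + 1. The name `chain_gradient` and the weight value 0.01 are hypothetical, chosen only for illustration.

```python
def chain_gradient(weights, residual):
    """d(output)/d(input) for a chain of scalar linear layers.

    Plain layer:    y = w * x      -> local derivative w
    Residual layer: y = w * x + x  -> local derivative w + 1
    """
    grad = 1.0
    for w in weights:
        grad *= (w + 1.0) if residual else w
    return grad

# Hypothetical setup: 50 layers, each with a small weight.
weights = [0.01] * 50
plain_grad = chain_gradient(weights, residual=False)
residual_grad = chain_gradient(weights, residual=True)
print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3e}")
```

With small weights the plain chain's gradient shrinks geometrically toward zero, while the residual chain's gradient stays at order one.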

Residual block

The two block variants: identity skip when shapes match, 1×1 projection on the skip path when channels or resolution change.

Block in code

A 2-conv block with a skip-add. Optional 1×1 conv on the skip path matches channel/stride changes:

from d2l import jax as d2l
from flax import nnx
from jax import numpy as jnp
import jax

class Residual(nnx.Module):
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=(1, 1),
                 in_channels=None, rngs=None):
        in_channels = num_channels if in_channels is None else in_channels
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        self.conv1 = nnx.Conv(in_channels, num_channels, kernel_size=(3, 3),
                              padding='same', strides=strides, rngs=rngs)
        self.conv2 = nnx.Conv(num_channels, num_channels, kernel_size=(3, 3),
                              padding='same', rngs=rngs)
        # Auto-enable 1x1 conv when downsampling so the residual shape matches.
        if use_1x1conv or any(s != 1 for s in strides):
            self.conv3 = nnx.Conv(in_channels, num_channels,
                                  kernel_size=(1, 1), strides=strides,
                                  rngs=rngs)
        else:
            self.conv3 = None
        self.bn1 = nnx.BatchNorm(num_channels, rngs=rngs)
        self.bn2 = nnx.BatchNorm(num_channels, rngs=rngs)

    def __call__(self, X):
        Y = nnx.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:  # project the skip path to match Y's shape
            X = self.conv3(X)
        Y += X
        return nnx.relu(Y)

Block variants

Same shape in, same shape out:

blk = Residual(3)
X = jax.random.normal(d2l.get_key(), (4, 6, 6, 3))
blk(X).shape

(4, 6, 6, 3)

Halve spatial dims and double channels (transition between stages):

blk = Residual(6, use_1x1conv=True, strides=(2, 2), in_channels=3)
blk(X).shape

(4, 3, 3, 6)
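
Both cases above follow one rule: the skip path needs a projection whenever the main path changes the tensor shape. A minimal sketch of that rule, assuming 'same' padding so that each spatial size n becomes ceil(n / stride). The helpers `residual_output_shape` and `needs_projection` are hypothetical and not part of the `Residual` class.

```python
import math

def residual_output_shape(in_shape, num_channels, strides=(1, 1)):
    """Output shape (N, H, W, C) of the main path, assuming 'same' padding."""
    n, h, w, _ = in_shape
    return (n, math.ceil(h / strides[0]), math.ceil(w / strides[1]),
            num_channels)

def needs_projection(in_shape, num_channels, strides=(1, 1)):
    """An identity skip only works if the main path preserves the shape."""
    return residual_output_shape(in_shape, num_channels, strides) != tuple(in_shape)

X_shape = (4, 6, 6, 3)
# Same channels, stride 1: identity skip is enough.
print(residual_output_shape(X_shape, 3), needs_projection(X_shape, 3))
# Double channels, stride 2: the skip path must be projected.
print(residual_output_shape(X_shape, 6, (2, 2)),
      needs_projection(X_shape, 6, (2, 2)))
```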

The ResNet model

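Stages of N residual blocks. Every stage after the first halves the resolution in its first block; the first stage keeps it, because the stem's max-pooling has already downsampled: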

ResNet-18: four stages of two residual blocks each, plus stem and head.
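
The number in the name counts layers with learnable weights, under the usual convention that the 1×1 projection convolutions on skip paths are not counted. A small sketch of that arithmetic: `count_weight_layers` is a hypothetical helper, and the (3, 4, 6, 3) stage sizes and three-conv bottleneck blocks come from He et al. rather than from this section.

```python
def count_weight_layers(blocks_per_stage, convs_per_block=2):
    """Stem conv + convs inside residual blocks + final linear layer."""
    stem_conv = 1
    classifier = 1
    return stem_conv + convs_per_block * sum(blocks_per_stage) + classifier

print(count_weight_layers((2, 2, 2, 2)))                     # 18
print(count_weight_layers((3, 4, 6, 3)))                     # 34
print(count_weight_layers((3, 4, 6, 3), convs_per_block=3))  # 50
```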

ResNet stem

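The stem does early feature extraction and spatial reduction with a large-kernel strided convolution, in the spirit of AlexNet and GoogLeNet: a 7×7 stride-2 convolution, batch normalization, ReLU, and a 3×3 stride-2 max pool, which together cut the resolution by 4×: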

class ResNet(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10, in_channels=1, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        self.net = self.create_net(in_channels, rngs)

    def b1(self, in_channels, rngs):
        return nnx.Sequential(
            nnx.Conv(in_channels, 64, kernel_size=(7, 7), strides=(2, 2),
                     padding='same', rngs=rngs),
            nnx.BatchNorm(64, rngs=rngs), nnx.relu,
            lambda x: nnx.max_pool(x, window_shape=(3, 3), strides=(2, 2),
                                   padding='same'))

Residual stages

A stage is a stack of residual blocks. The first block can downsample and project the skip path; later blocks keep shape.

@d2l.add_to_class(ResNet)  # attach as a ResNet method (d2l notebook convention)
def block(self, num_residuals, num_channels, in_channels,
          first_block=False, rngs=None):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True,
                                strides=(2, 2), in_channels=in_channels,
                                rngs=rngs))
        else:
            blk.append(Residual(num_channels, in_channels=in_channels,
                                rngs=rngs))
        in_channels = num_channels
    return nnx.Sequential(*blk)
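
The branching in `block` is easier to check as a table of per-block settings. The sketch below mirrors that logic in plain Python without building any layers; `stage_plan` is a hypothetical helper introduced only for illustration.

```python
def stage_plan(num_residuals, num_channels, in_channels, first_block=False):
    """Per-block settings that `block` would pass to `Residual`."""
    plan = []
    for i in range(num_residuals):
        # Only the first block of a non-first stage downsamples and projects.
        downsample = (i == 0 and not first_block)
        plan.append({
            "in_channels": in_channels,
            "out_channels": num_channels,
            "strides": (2, 2) if downsample else (1, 1),
            "use_1x1conv": downsample,
        })
        in_channels = num_channels  # later blocks see the new width
    return plan

for cfg in stage_plan(2, 128, in_channels=64):
    print(cfg)
```

Only the first entry changes shape; every later block maps 128 channels to 128 channels at stride 1, so an identity skip suffices.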

ResNet head

After the residual stages, global average pooling collapses the spatial map and the final linear layer predicts classes.

@d2l.add_to_class(ResNet)  # attach as a ResNet method (d2l notebook convention)
def create_net(self, in_channels, rngs):
    layers = [self.b1(in_channels, rngs)]
    stage_channels = 64
    for i, (num_residuals, num_channels) in enumerate(self.arch):
        layers.append(self.block(num_residuals, num_channels, stage_channels,
                                 first_block=(i == 0), rngs=rngs))
        stage_channels = num_channels
    layers.append(nnx.Sequential(
        lambda x: x.mean(axis=(1, 2)),  # global avg pooling over H, W (NHWC)
        nnx.Linear(stage_channels, self.num_classes, rngs=rngs)))
    return nnx.Sequential(*layers)
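
Global average pooling also keeps the classifier small. A rough parameter count, assuming a final feature map of 3×3×512 and 10 classes. The flattening alternative is a hypothetical comparison, not something the model above does, and `linear_params` is a helper made up for this sketch.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

h, w, channels, num_classes = 3, 3, 512, 10
pooled = linear_params(channels, num_classes)             # after global avg pool
flattened = linear_params(h * w * channels, num_classes)  # after flattening
print(pooled, flattened)
```

Pooling first makes the head independent of the input resolution, whereas a flattened head grows with H × W.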

ResNet-18 assembly

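Four stages × 2 residual blocks each. The same stage template scales up: ResNet-34 uses (3, 4, 6, 3) blocks per stage, while ResNet-50/101/152 keep the stage layout but replace the two-conv block with a three-conv bottleneck block. ResNet-18 in code: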

class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10, in_channels=1, rngs=None):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes, in_channels, rngs)

ResNet18().layer_summary((1, 96, 96, 1))

Sequential output shape:     (1, 24, 24, 64)
Sequential output shape:     (1, 24, 24, 64)
Sequential output shape:     (1, 12, 12, 128)
Sequential output shape:     (1, 6, 6, 256)
Sequential output shape:     (1, 3, 3, 512)
Sequential output shape:     (1, 10)
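
These shapes can be reproduced by hand. A minimal sketch, assuming 'same' padding everywhere so that a stride-s layer maps a spatial size n to ceil(n / s), and a 64-channel stem as in `b1`. `trace_shapes` is a hypothetical helper, not part of d2l.

```python
import math

def trace_shapes(arch, size=96, num_classes=10, stem_channels=64):
    """Output shape after the stem, each stage, and the head (batch of 1)."""
    # Stem: stride-2 conv followed by stride-2 max pool.
    size = math.ceil(math.ceil(size / 2) / 2)
    shapes = [(1, size, size, stem_channels)]
    for i, (_, num_channels) in enumerate(arch):
        if i > 0:  # every stage except the first halves the resolution
            size = math.ceil(size / 2)
        shapes.append((1, size, size, num_channels))
    shapes.append((1, num_classes))  # global average pool + linear layer
    return shapes

resnet18_arch = ((2, 64), (2, 128), (2, 256), (2, 512))
for shape in trace_shapes(resnet18_arch):
    print(shape)
```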

Training

model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)

The notebook trains a compact ResNet-18 variant on Fashion-MNIST; the point is to validate that the residual-stage template plugs into the same Trainer used by earlier CNNs.

ResNeXt: width via cardinality

ResNeXt (Xie et al., 2017) widens a block through cardinality: the 3×3 convolution is split into several parallel groups instead of one wide path. At a comparable parameter budget, the authors report better accuracy than the corresponding ResNet:

class ResNeXtBlock(nnx.Module):
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=(1, 1), in_channels=None, rngs=None):
        in_channels = num_channels if in_channels is None else in_channels
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nnx.Conv(in_channels, bot_channels, kernel_size=(1, 1),
                              strides=(1, 1), rngs=rngs)
        self.conv2 = nnx.Conv(bot_channels, bot_channels,
                              kernel_size=(3, 3), strides=strides,
                              padding='same', feature_group_count=groups,
                              rngs=rngs)
        self.conv3 = nnx.Conv(bot_channels, num_channels,
                              kernel_size=(1, 1), strides=(1, 1), rngs=rngs)
        self.bn1 = nnx.BatchNorm(bot_channels, rngs=rngs)
        self.bn2 = nnx.BatchNorm(bot_channels, rngs=rngs)
        self.bn3 = nnx.BatchNorm(num_channels, rngs=rngs)
        if use_1x1conv:
            self.conv4 = nnx.Conv(in_channels, num_channels,
                                  kernel_size=(1, 1), strides=strides,
                                  rngs=rngs)
            self.bn4 = nnx.BatchNorm(num_channels, rngs=rngs)
        else:
            self.conv4 = None

    def __call__(self, X):
        Y = nnx.relu(self.bn1(self.conv1(X)))
        Y = nnx.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4 is not None:  # project the skip path to match Y's shape
            X = self.bn4(self.conv4(X))
        return nnx.relu(Y + X)

Grouped-conv savings

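Grouped convolution cuts the parameters and FLOPs of the expensive 3×3 convolution by a factor of groups, because each output channel only sees the input channels of its own group. The surrounding 1×1 convolutions let information mix across groups before and after the grouped work.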

blk = ResNeXtBlock(32, 16, 1)
X = jnp.zeros((4, 96, 96, 32))
blk(X).shape

(4, 96, 96, 32)
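
The saving is easy to quantify. A sketch of the weight count for a k×k convolution with g groups (biases ignored), evaluated for the block above, whose 3×3 convolution has 32 input and 32 output channels. `conv_weights` is a hypothetical helper.

```python
def conv_weights(in_channels, out_channels, kernel_size=3, groups=1):
    """Weights in a grouped conv: each output channel sees in_channels / groups."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    return kernel_size * kernel_size * (in_channels // groups) * out_channels

full = conv_weights(32, 32, groups=1)
grouped = conv_weights(32, 32, groups=16)
print(full, grouped, full // grouped)
```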

DenseNet: concatenate instead of add

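DenseNet (Huang et al., 2017) extends the idea. If the residual form f(x) + x is read as a two-term expansion (a simple identity term plus a nonlinear correction), DenseNet keeps more than two terms: instead of adding a layer’s output to its input, it concatenates them along the channel dimension.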

\mathbf{x}_\ell = f_\ell\bigl(\left[\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{\ell-1}\right]\bigr).

Addition keeps channels fixed; concatenation grows them, and every layer sees all earlier features.

Dense blocks in code

A conv block (BN → ReLU → 3×3 conv) is the unit; a dense block stacks them, concatenating each output onto the running input:

class ConvBlock(nnx.Module):
    def __init__(self, in_channels, num_channels, rngs):
        self.bn = nnx.BatchNorm(in_channels, rngs=rngs)
        self.conv = nnx.Conv(in_channels, num_channels, kernel_size=(3, 3),
                             padding=(1, 1), rngs=rngs)

    def __call__(self, X):
        return self.conv(nnx.relu(self.bn(X)))

class DenseBlock(nnx.Module):
    def __init__(self, num_convs, num_channels, in_channels=3, rngs=None):
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        layers = []
        for i in range(num_convs):
            layers.append(ConvBlock(
                in_channels + i * num_channels, num_channels, rngs))
        self.layers = nnx.List(layers)

    def __call__(self, X):
        for layer in self.layers:
            Y = layer(X)
            # Concatenate input and output of each block along the channels
            X = jnp.concatenate((X, Y), axis=-1)
        return X
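
The channel bookkeeping in `DenseBlock` can be checked without running the network. A sketch assuming each conv block adds `num_channels` feature maps (often called the growth rate); `dense_block_channels` is a hypothetical helper and the sizes are example values.

```python
def dense_block_channels(num_convs, num_channels, in_channels=3):
    """Input width of each conv block and the block's final output width."""
    conv_inputs = [in_channels + i * num_channels for i in range(num_convs)]
    out_channels = in_channels + num_convs * num_channels
    return conv_inputs, out_channels

conv_inputs, out_channels = dense_block_channels(2, 10, in_channels=3)
print(conv_inputs, out_channels)
```

The i-th conv block reads everything produced before it, so its input width grows linearly with its position in the block.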

Transition layers, and why addition won

Each dense block adds num_convs * num_channels channels. A transition layer keeps the model size in check: a 1×1 convolution reduces the channel count, and a 2×2 average pool with stride 2 halves the height and width:

class TransitionBlock(nnx.Module):
    def __init__(self, in_channels, num_channels, rngs=None):
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        self.bn = nnx.BatchNorm(in_channels, rngs=rngs)
        self.conv = nnx.Conv(in_channels, num_channels,
                             kernel_size=(1, 1), rngs=rngs)

    def __call__(self, X):
        X = self.conv(nnx.relu(self.bn(X)))
        X = nnx.avg_pool(X, window_shape=(2, 2), strides=(2, 2))
        return X
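
A sketch of the resulting shape, assuming an NHWC input and unpadded pooling, so each spatial size n becomes floor((n - 2) / 2) + 1. `transition_output_shape` is a hypothetical helper and the example input shape is made up.

```python
def transition_output_shape(in_shape, num_channels, window=2, stride=2):
    """Shape after the 1x1 conv (sets channels) and the unpadded avg pool."""
    n, h, w, _ = in_shape

    def pooled(size):
        return (size - window) // stride + 1

    return (n, pooled(h), pooled(w), num_channels)

print(transition_output_shape((4, 8, 8, 23), num_channels=10))
```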

Feature reuse makes DenseNet parameter-efficient, but every concatenated map must stay in memory for later layers. That memory cost is a major reason additive skips became the norm at scale.
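
A back-of-the-envelope comparison of the running activation size, under heavily simplified assumptions: float32 activations, a single feature-map resolution, and hypothetical sizes (batch 32, 56×56 maps, 64 input channels, growth rate 32) that are not taken from this section. It counts only the tensor each layer reads, not gradients, optimizer state, or framework overhead.

```python
def activation_mib(batch, height, width, channels, bytes_per_value=4):
    """Size of one NHWC activation tensor in MiB."""
    return batch * height * width * channels * bytes_per_value / 2**20

# Hypothetical sizes, chosen only for illustration.
batch, height, width = 32, 56, 56
in_channels, growth, num_layers = 64, 32, 6

for layer in range(num_layers + 1):
    dense_channels = in_channels + layer * growth  # concatenation keeps growing
    residual_channels = in_channels                # addition stays fixed
    print(layer,
          activation_mib(batch, height, width, dense_channels),
          activation_mib(batch, height, width, residual_channels))
```

Under these assumptions the concatenated tensor is four times the size of the additive one after six layers, and the gap keeps widening with depth.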

Recap

Residual connection: \mathbf{y} = f(\mathbf{x}) + \mathbf{x}, which guarantees identity is always representable.
Makes very deep networks (152 layers, even 1000+) trainable by greatly easing optimization, though extra depth does not always improve test accuracy.
ResNeXt adds cardinality: grouped 3×3 conv between 1×1 mixers.
DenseNet concatenates instead of adds: maximal feature reuse, fewer parameters, but a memory bill that addition avoids.
The “residual block” + “stage” template travels well: it appears in vision (ResNet, ResNeXt), in language models (Transformers wrap each sublayer in a residual connection plus a normalization layer such as LayerNorm), and beyond.
ResNet-50 remains a common default ImageNet backbone for transfer learning roughly a decade later.