Block variants

Residual Networks (ResNet) and ResNeXt

ResNet learns residuals

ResNet (He et al., 2015) is the architecture that finally made very deep networks trainable. The key idea is a skip connection that adds a block's input to its output:

\mathbf{y} = f(\mathbf{x}) + \mathbf{x}.

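The function only needs to learn the residual relative to the identity. Because the identity is recovered by setting f(\mathbf{x}) = 0, a deeper network can in principle always match a shallower one, and in the original paper going from 18 to 152 layers did improve ImageNet accuracy. The skip path also gives gradients a direct route back to earlier layers, which makes deep networks far easier to optimize than plain stacks of the same depth.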
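The gradient argument can be illustrated with a toy scalar model. This is an assumption made purely for illustration, not the network in this section: every layer is linear with the same small weight w, so a plain layer computes y = w * x and a residual layer computes y = w * x + x. By the chain rule the end-to-end gradient is w^depth for the plain stack and (1 + w)^depth for the residual stack.

```python
# Toy scalar model: every layer shares one small weight w (illustrative assumption).
def plain_gradient(w, depth):
    """d(output)/d(input) for `depth` stacked layers y = w * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= w  # each plain layer scales the gradient by w
    return grad


def residual_gradient(w, depth):
    """d(output)/d(input) for `depth` stacked layers y = w * x + x."""
    grad = 1.0
    for _ in range(depth):
        grad *= (w + 1.0)  # the skip path contributes the extra +1
    return grad


w, depth = 0.01, 50
print(f"plain:    {plain_gradient(w, depth):.3e}")
print(f"residual: {residual_gradient(w, depth):.3e}")
```

With these values the plain gradient collapses toward zero while the residual one stays of order 1. Real networks are nonlinear and have learned weights, so this only shows the mechanism, not actual training behavior.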

Residual block

Plain block (left) vs residual block (right). Skip-add carries the input around the conv stack.

Block in code

A 2-conv block with a skip-add. Optional 1×1 conv on the skip path matches channel/stride changes:

import tensorflow as tf
from d2l import tensorflow as d2l
class Residual(tf.keras.Model):
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_channels, padding='same',
                                            kernel_size=3, strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                            padding='same')
        self.conv3 = None
        # Auto-enable 1x1 conv when downsampling so the residual shape matches.
        if use_1x1conv or strides != 1:
            self.conv3 = tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                                                strides=strides)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        Y += X
        return tf.keras.activations.relu(Y)

Same shape in, same shape out:

blk = Residual(3)
X = d2l.normal((4, 6, 6, 3))
Y = blk(X)
Y.shape
TensorShape([4, 6, 6, 3])

Halve spatial dims and double channels (transition between stages):

blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
TensorShape([4, 3, 3, 6])

The ResNet model

Stages of N residual blocks, with downsampling at the start of each stage:

ResNet-18: four stages of two residual blocks each, plus stem and head.
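The "18" in the name can be recovered by counting weighted layers. The sketch below assumes the usual counting convention: the stem convolution, the two 3×3 convolutions in each residual block, and the final dense layer are counted, while 1×1 convolutions on the skip path are not. The `arch` tuple holds the (num_residuals, num_channels) pairs for the four stages.

```python
# Count the weighted layers that give ResNet-18 its name.
arch = ((2, 64), (2, 128), (2, 256), (2, 512))  # (num_residuals, num_channels)

CONVS_PER_BLOCK = 2   # each residual block has two 3x3 convolutions
stem_convs = 1        # the 7x7 convolution in the stem
head_dense = 1        # the final fully connected layer

block_convs = sum(num_residuals * CONVS_PER_BLOCK for num_residuals, _ in arch)
depth = stem_convs + block_convs + head_dense
print(depth)  # 18
```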

ResNet stem

The stem does early feature extraction and spatial reduction, similar to AlexNet and GoogLeNet:

class ResNet(d2l.Classifier):
    def b1(self):
        return tf.keras.models.Sequential([
            tf.keras.layers.Conv2D(64, kernel_size=7, strides=2,
                                   padding='same'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Activation('relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2,
                                      padding='same')])

Residual stages

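A stage is a stack of residual blocks. In every stage except the first, the first block halves the spatial size and projects the skip path with a 1×1 convolution; the remaining blocks keep the shape. The first stage does not downsample, because the stem's max-pooling layer has already reduced the resolution.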

@d2l.add_to_class(ResNet)  # attach block() as a method of ResNet
def block(self, num_residuals, num_channels, first_block=False):
    blk = tf.keras.models.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk

ResNet head

After the residual stages, global average pooling collapses the spatial map and the final linear layer predicts classes.

@d2l.add_to_class(ResNet)  # attach __init__() as a method of ResNet
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = self.b1()
    for i, b in enumerate(arch):
        self.net.add(self.block(*b, first_block=(i==0)))
    self.net.add(tf.keras.models.Sequential([
        tf.keras.layers.GlobalAvgPool2D(),
        tf.keras.layers.Dense(units=num_classes)]))
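To make the pooling step in the head concrete, here is a minimal pure-Python sketch of global average pooling on a single H × W × C feature map stored as nested lists. The function name and the sample values are made up for illustration; this is not how Keras implements `GlobalAvgPool2D`.

```python
def global_avg_pool(feature_map):
    """Average an H x W x C nested list over H and W, returning C values."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value  # accumulate per-channel sums
    return [t / (height * width) for t in totals]


# A 2 x 2 map with 3 channels (made-up values for illustration).
fmap = [[[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
        [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]]]
pooled = global_avg_pool(fmap)
print(pooled)  # [4.0, 1.0, 2.0]
```

Each channel is reduced to one number regardless of the spatial size, which is why the dense layer that follows needs only `num_channels` inputs.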

ResNet-18 assembly

Four stages of two residual blocks each. Changing the number of blocks per stage gives ResNet-34; ResNet-50/101/152 keep the same stage layout but swap in bottleneck blocks:

class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                       lr, num_classes)
ResNet18().layer_summary((1, 96, 96, 1))
Conv2D output shape:     (1, 48, 48, 64)
BatchNormalization output shape:     (1, 48, 48, 64)
Activation output shape:     (1, 48, 48, 64)
MaxPooling2D output shape:   (1, 24, 24, 64)
Sequential output shape:     (1, 24, 24, 64)
Sequential output shape:     (1, 12, 12, 128)
Sequential output shape:     (1, 6, 6, 256)
Sequential output shape:     (1, 3, 3, 512)
Sequential output shape:     (1, 10)
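
The spatial sizes in this summary can be checked by hand. The sketch below assumes that every strided layer uses 'same' padding, so the output size is ceil(input / stride), and that only stages after the first downsample. The helper `same_out` and the stage labels are names chosen for this example.

```python
import math


def same_out(size, stride):
    """Spatial size after a 'same'-padded conv or pool: ceil(size / stride)."""
    return math.ceil(size / stride)


arch = ((2, 64), (2, 128), (2, 256), (2, 512))  # (num_residuals, num_channels)
size = 96
trace = []

# Stem: 7x7 conv with stride 2, then 3x3 max-pool with stride 2.
for name, stride in (("stem conv 7x7", 2), ("stem max-pool", 2)):
    size = same_out(size, stride)
    trace.append((name, size, 64))

# Stages: the first keeps resolution, each later stage halves it.
for i, (_, channels) in enumerate(arch):
    stride = 1 if i == 0 else 2
    size = same_out(size, stride)
    trace.append((f"stage {i + 1}", size, channels))

for name, s, c in trace:
    print(f"{name:<14} {s} x {s} x {c}")
```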

Training

trainer = d2l.Trainer(max_epochs=10)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
with d2l.try_gpu():
    model = ResNet18(lr=0.01)
    trainer.fit(model, data)

The notebook trains a compact ResNet-18 variant on Fashion-MNIST; the point is to validate that the residual-stage template plugs into the same Trainer used by earlier CNNs.

ResNeXt: width via cardinality

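ResNeXt (Xie et al., 2017) replaces the single wide path in each block with multiple parallel paths (cardinality C), implemented as a grouped 3×3 convolution inside a 1×1 → 3×3 → 1×1 bottleneck. At a comparable parameter budget, the authors report better accuracy than ResNet: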

class ResNeXtBlock(tf.keras.Model):
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = tf.keras.layers.Conv2D(bot_channels, 1, strides=1)
        self.conv2 = tf.keras.layers.Conv2D(bot_channels, 3, strides=strides,
                                            padding="same",
                                            groups=groups)
        self.conv3 = tf.keras.layers.Conv2D(num_channels, 1, strides=1)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.bn3 = tf.keras.layers.BatchNormalization()
        if use_1x1conv:
            self.conv4 = tf.keras.layers.Conv2D(num_channels, 1,
                                                strides=strides)
            self.bn4 = tf.keras.layers.BatchNormalization()
        else:
            self.conv4 = None

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = tf.keras.activations.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4 is not None:
            X = self.bn4(self.conv4(X))
        return tf.keras.activations.relu(Y + X)
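
One practical constraint on the arguments: Keras requires both the input channels and the filters of a grouped convolution to be divisible by `groups`, and in this block both equal `bot_channels`. A minimal sketch of a pre-check follows; `bottleneck_channels` is a hypothetical helper, not part of d2l or Keras.

```python
def bottleneck_channels(num_channels, bot_mul, groups):
    """Hypothetical helper: compute the bottleneck width and check it splits evenly."""
    bot_channels = int(round(num_channels * bot_mul))  # same formula as the block
    if bot_channels % groups != 0:
        raise ValueError(
            f"bot_channels={bot_channels} is not divisible by groups={groups}")
    return bot_channels


print(bottleneck_channels(32, 1, 16))     # 32 channels, 2 per group
print(bottleneck_channels(256, 0.5, 32))  # 128 channels, 4 per group
```

Running such a check before building the block turns a shape error deep inside the layer into a clear message about the arguments.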

Grouped-conv savings

Grouped convolution cuts the expensive 3×3 channel mixing by a factor of groups, while surrounding 1×1 convolutions let information mix before and after the grouped work.

blk = ResNeXtBlock(32, 16, 1)
X = d2l.normal((4, 96, 96, 32))
Y = blk(X)
Y.shape
TensorShape([4, 96, 96, 32])
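
The saving from grouping can be verified by counting weights. The sketch below ignores biases and uses a helper name, `conv_weights`, chosen for this example; the numbers correspond to the 3×3 convolution in `ResNeXtBlock(32, 16, 1)`.

```python
def conv_weights(c_in, c_out, kernel_size, groups=1):
    """Weight count of a 2-D convolution (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return kernel_size * kernel_size * (c_in // groups) * c_out


dense = conv_weights(32, 32, 3)               # ordinary 3x3 conv
grouped = conv_weights(32, 32, 3, groups=16)  # 3x3 conv with 16 groups
print(dense, grouped, dense // grouped)       # 9216 576 16
```

The ratio equals `groups`, matching the factor stated above. The 1×1 convolutions are not grouped, so their cost is unchanged.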

Recap

  • Residual connection: \mathbf{y} = f(\mathbf{x}) + \mathbf{x}. The identity mapping is recovered whenever f(\mathbf{x}) = 0, so it is always easy to represent.
  • Makes very deep networks (152 layers, and experimentally 1000+) trainable, largely avoiding the degradation seen in plain deep stacks.
  • The “residual block” + “stage” template is widely reused: in vision (ResNet, ResNeXt), in language (Transformers pair residual connections with LayerNorm), and beyond. DenseNet is a close relative that concatenates features instead of adding them.
  • ResNet-50 remains a common ImageNet backbone for transfer learning roughly a decade later.