import tensorflow as tf
from d2l import tensorflow as d2l

ResNet (He et al., 2015) is the architecture that made very deep networks practical to train. The key is the residual mapping:
\mathbf{y} = f(\mathbf{x}) + \mathbf{x}.
Each block only needs to learn the residual f(x) relative to the identity. Because the identity is always representable (set f to zero), a deeper network can in principle match a shallower one, and in practice going from 18 to 152 layers improves accuracy. The skip path also passes gradients through unattenuated, which makes deep networks much easier to optimize than plain stacks of the same depth.
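To make the gradient argument concrete, here is a minimal scalar sketch that is not part of the original notebook. It assumes every layer is a toy linear map f(x) = w * x with a small weight w = 0.1 (both chosen only for illustration) and compares the end-to-end derivative of a plain stack with that of a residual stack.

```python
# Toy scalar model: each "layer" is f(x) = w * x with a small weight.
# Plain stack:    x -> f(x)        derivative per layer: w
# Residual stack: x -> f(x) + x    derivative per layer: w + 1

def stack_gradient(depth, w, residual):
    """Derivative of the stack output with respect to its input."""
    per_layer = w + 1.0 if residual else w
    return per_layer ** depth

depth, w = 20, 0.1  # illustrative values, not taken from the notebook
plain_grad = stack_gradient(depth, w, residual=False)
residual_grad = stack_gradient(depth, w, residual=True)

print(f"plain:    {plain_grad:.3e}")     # shrinks geometrically with depth
print(f"residual: {residual_grad:.3e}")  # stays at or above 1 for w >= 0
```

With w = 0 the residual stack is exactly the identity, which is the sense in which extra layers need not hurt. Real networks are not scalar or linear, so this only illustrates the mechanism.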
Plain block (left) vs residual block (right). Skip-add carries the input around the conv stack.
A 2-conv block with a skip-add. Optional 1×1 conv on the skip path matches channel/stride changes:
class Residual(tf.keras.Model):
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(num_channels, padding='same',
                                            kernel_size=3, strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(num_channels, kernel_size=3,
                                            padding='same')
        self.conv3 = None
        # Auto-enable 1x1 conv when downsampling so the residual shape matches.
        if use_1x1conv or strides != 1:
            self.conv3 = tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                                                strides=strides)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, X):
        # Main path: conv -> BN -> ReLU -> conv -> BN.
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        # Skip path: project the input only when its shape must change.
        if self.conv3 is not None:
            X = self.conv3(X)
        Y += X
        return tf.keras.activations.relu(Y)

With the default stride of 1 and the channel count unchanged, the output shape equals the input shape:
TensorShape([4, 6, 6, 3])
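The shape bookkeeping behind the block can be checked without TensorFlow. The sketch below uses a hypothetical helper, `residual_output_shape`, and relies on the standard Keras rule that a strided convolution with `padding='same'` produces ceil(n / stride) outputs per spatial axis. A 1×1 convolution with the default `padding='valid'` yields the same count, which is why the main path and the projected skip path line up. The helper assumes the caller enables the 1×1 projection whenever the channel count changes.

```python
import math

def conv_out(n, kernel, stride, padding):
    """Spatial output size of a Keras-style convolution along one axis."""
    if padding == "same":
        return math.ceil(n / stride)
    # padding == "valid"
    return (n - kernel) // stride + 1

def residual_output_shape(shape, num_channels, strides=1):
    """Shape after a Residual block for an NHWC input (hypothetical helper)."""
    n, h, w, c = shape
    # Main path: 3x3 conv (strided, 'same') then 3x3 conv (stride 1, 'same').
    main = tuple(conv_out(conv_out(d, 3, strides, "same"), 3, 1, "same")
                 for d in (h, w))
    # Skip path: identity, or a strided 1x1 'valid' conv when shapes change.
    # Assumption: the projection is switched on if the channel count differs.
    needs_projection = strides != 1 or c != num_channels
    if needs_projection:
        skip = tuple(conv_out(d, 1, strides, "valid") for d in (h, w))
    else:
        skip = (h, w)
    assert main == skip, "main and skip paths must agree before the add"
    return (n, *main, num_channels)

print(residual_output_shape((4, 6, 6, 3), num_channels=3))             # (4, 6, 6, 3)
print(residual_output_shape((4, 6, 6, 3), num_channels=6, strides=2))  # (4, 3, 3, 6)
```

The second call shows the downsampling case: height and width are halved while the channel count doubles.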
The network stacks stages of N residual blocks each, downsampling at the start of every stage except the first:
ResNet-18: four stages of two residual blocks each, plus stem and head.
The stem does early feature extraction and spatial reduction, similar to AlexNet and GoogLeNet:
A stage is a stack of residual blocks. The first block can downsample and project the skip path; later blocks keep shape.
After the residual stages, global average pooling collapses the spatial map and the final linear layer predicts classes.
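The stage layout described above can be written out as a small planning function. This is a sketch, not the notebook's own stage builder: `stage_plan` and the `(num_residuals, num_channels)` layout of `arch` are assumptions made for illustration. It also assumes the first stage skips downsampling because the stem has already reduced the resolution.

```python
def stage_plan(arch):
    """List (stage, block, channels, strides, use_1x1conv) for every block.

    arch is assumed to be a sequence of (num_residuals, num_channels) pairs.
    """
    plan = []
    for stage, (num_residuals, num_channels) in enumerate(arch):
        for block in range(num_residuals):
            # Only the first block of a later stage halves the resolution
            # and therefore needs the 1x1 projection on the skip path.
            downsample = block == 0 and stage > 0
            plan.append((stage, block, num_channels,
                         2 if downsample else 1, downsample))
    return plan

resnet18_arch = ((2, 64), (2, 128), (2, 256), (2, 512))
plan = stage_plan(resnet18_arch)
for row in plan:
    print(row)

# Weighted layers: stem conv + two 3x3 convs per block + final dense layer.
num_layers = 1 + 2 * len(plan) + 1
print(num_layers)  # 18
```

The count of 18 ignores the 1×1 projection convolutions on the skip paths, which matches the usual way the name ResNet-18 is derived.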
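def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    # Stem first, then one residual stage per entry in arch.
    self.net = self.b1()
    for i, b in enumerate(arch):
        self.net.add(self.block(*b, first_block=(i==0)))
    # Head: global average pooling followed by the linear classifier.
    self.net.add(tf.keras.models.Sequential([
        tf.keras.layers.GlobalAvgPool2D(),
        tf.keras.layers.Dense(units=num_classes)]))

ResNet-18 uses four stages of two residual blocks each. The same stage template, with different block counts (and bottleneck blocks in the deeper variants), defines ResNet-34/50/101/152: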
Conv2D output shape: (1, 48, 48, 64)
BatchNormalization output shape: (1, 48, 48, 64)
Activation output shape: (1, 48, 48, 64)
MaxPooling2D output shape: (1, 24, 24, 64)
Sequential output shape: (1, 24, 24, 64)
Sequential output shape: (1, 12, 12, 128)
Sequential output shape: (1, 6, 6, 256)
Sequential output shape: (1, 3, 3, 512)
Sequential output shape: (1, 10)
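The spatial sizes printed above follow from halving arithmetic alone. The sketch below reproduces them under stated assumptions: a 96×96 input (inferred from the 48×48 stem output), a stride-2 stem convolution and a stride-2 max-pool with `'same'` padding, and stride-2 downsampling at the start of stages two through four. `trace_spatial` is a hypothetical helper, not part of the notebook.

```python
import math

def halve(n):
    """Output length of a stride-2 layer with 'same' padding."""
    return math.ceil(n / 2)

def trace_spatial(size, num_stages=4):
    """Side length after the stem conv, the max-pool, and each stage."""
    trace = []
    size = halve(size)
    trace.append(("stem conv", size))
    size = halve(size)
    trace.append(("max-pool", size))
    for stage in range(num_stages):
        if stage > 0:  # the first stage keeps the resolution
            size = halve(size)
        trace.append((f"stage {stage + 1}", size))
    return trace

trace = trace_spatial(96)
for name, side in trace:
    print(f"{name}: {side}x{side}")
```

The resulting sequence 48, 24, 24, 12, 6, 3 matches the Conv2D, MaxPooling2D and four Sequential stage rows in the summary. The final `(1, 10)` row comes from global average pooling plus the dense layer.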
The notebook trains a compact ResNet-18 variant on Fashion-MNIST; the point is to validate that the residual-stage template plugs into the same Trainer used by earlier CNNs.
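ResNeXt (Xie et al., 2017) refines the residual block: instead of one wide 3×3 convolution, each block runs multiple parallel paths (the cardinality C, set by `groups` below), implemented as a grouped convolution. At a similar parameter budget, this was reported to give better accuracy: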
class ResNeXtBlock(tf.keras.Model):
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        # Bottleneck width as a fraction of the block's output channels.
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = tf.keras.layers.Conv2D(bot_channels, 1, strides=1)
        self.conv2 = tf.keras.layers.Conv2D(bot_channels, 3, strides=strides,
                                            padding="same",
                                            groups=groups)
        self.conv3 = tf.keras.layers.Conv2D(num_channels, 1, strides=1)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.bn3 = tf.keras.layers.BatchNormalization()
        if use_1x1conv:
            self.conv4 = tf.keras.layers.Conv2D(num_channels, 1,
                                                strides=strides)
            self.bn4 = tf.keras.layers.BatchNormalization()
        else:
            self.conv4 = None

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = tf.keras.activations.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        # Compare against None explicitly rather than testing the layer's truthiness.
        if self.conv4 is not None:
            X = self.bn4(self.conv4(X))
        return tf.keras.activations.relu(Y + X)

Grouped convolution cuts the cost of the expensive 3×3 channel mixing by a factor of `groups`, while the surrounding 1×1 convolutions let information mix across groups before and after the grouped work.
TensorShape([4, 96, 96, 32])
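The saving from grouping can be counted directly. In the sketch below, `conv_weights` is a hypothetical helper that counts kernel weights (biases ignored) for a k×k convolution whose channels are split into `groups` independent groups. The 32-channel, 16-group setting is an illustrative choice, not a value confirmed by the text above.

```python
def conv_weights(c_in, c_out, kernel, groups=1):
    """Number of kernel weights in a (grouped) 2D convolution, without biases."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return kernel * kernel * (c_in // groups) * c_out

dense = conv_weights(32, 32, kernel=3)               # 9216
grouped = conv_weights(32, 32, kernel=3, groups=16)  # 576
print(dense, grouped, dense // grouped)              # 9216 576 16
```

The ratio equals `groups`, which is the factor the grouped 3×3 convolution saves relative to a dense one of the same width.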