import tensorflow as tf
from d2l import tensorflow as d2l

Network-in-Network (Lin et al., 2014) introduces two ideas that later architectures widely adopted:
the NiN block, a regular convolution followed by two 1×1 convolutions, and a global average pooling layer in place of a fully connected head.

The NiN block is a regular conv followed by two 1×1 convs, each followed by a ReLU. This is the “MLP within a conv layer”:
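def nin_block(out_channels, kernel_size, strides, padding):
    """One NiN block: a spatial conv followed by two 1x1 convs, each with ReLU."""
    return tf.keras.models.Sequential([
        # Regular convolution over a spatial neighbourhood
        tf.keras.layers.Conv2D(out_channels, kernel_size, strides=strides,
                               padding=padding),
        tf.keras.layers.Activation('relu'),
        # Two 1x1 convolutions: a per-pixel MLP across channels
        tf.keras.layers.Conv2D(out_channels, 1),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.Conv2D(out_channels, 1),
        tf.keras.layers.Activation('relu')])

The network stacks four NiN blocks with 96, 256, 384, and num_classes output channels. Max-pooling downsamples after each of the first three blocks, dropout precedes the last block, and global average pooling plus flatten produces the output. There are no FC layers.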
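As an aside on the block defined above: the “MLP within a conv layer” reading can be checked directly. The sketch below is a pure-Python illustration, not part of the original code. It compares a naive convolution with a 1×1 kernel against applying one shared weight matrix to the channel vector of every pixel. The helper names `conv2d_valid` and `dense_per_pixel` are hypothetical, and biases and activations are omitted for brevity.

```python
import random

def conv2d_valid(x, w):
    """Naive stride-1, no-padding convolution (cross-correlation).

    x: H x W x C_in nested lists; w: K x K x C_in x C_out nested lists.
    """
    H, W, C_in = len(x), len(x[0]), len(x[0][0])
    K, C_out = len(w), len(w[0][0][0])
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            pixel = [0.0] * C_out
            for di in range(K):
                for dj in range(K):
                    for c in range(C_in):
                        for o in range(C_out):
                            pixel[o] += x[i + di][j + dj][c] * w[di][dj][c][o]
            row.append(pixel)
        out.append(row)
    return out

def dense_per_pixel(x, m):
    """Apply the same C_in x C_out matrix m to every pixel's channel vector."""
    C_out = len(m[0])
    return [[[sum(px[c] * m[c][o] for c in range(len(px))) for o in range(C_out)]
             for px in row] for row in x]

random.seed(0)
H, W, C_in, C_out = 3, 4, 5, 2
x = [[[random.uniform(-1, 1) for _ in range(C_in)] for _ in range(W)]
     for _ in range(H)]
m = [[random.uniform(-1, 1) for _ in range(C_out)] for _ in range(C_in)]
w = [[m]]  # the same weights viewed as a 1 x 1 x C_in x C_out kernel

conv_out = conv2d_valid(x, w)
dense_out = dense_per_pixel(x, m)
max_diff = max(abs(a - b)
               for r1, r2 in zip(conv_out, dense_out)
               for p1, p2 in zip(r1, r2)
               for a, b in zip(p1, p2))
print(max_diff < 1e-12)
```

The two results agree, so each 1×1 convolution mixes channels at a single spatial position without looking at neighbouring pixels.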
class NiN(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = tf.keras.models.Sequential([
            nin_block(96, kernel_size=11, strides=4, padding='valid'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            nin_block(256, kernel_size=5, strides=1, padding='same'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            nin_block(384, kernel_size=3, strides=1, padding='same'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            tf.keras.layers.Dropout(0.5),
            # The last block maps to one channel per class
            nin_block(num_classes, kernel_size=3, strides=1, padding='same'),
            # Average each class channel over its spatial positions
            tf.keras.layers.GlobalAvgPool2D(),
            tf.keras.layers.Flatten()])

Walk a 1×224×224×1 input (batch, height, width, channels) through the network. The spatial dimensions shrink and the channel count grows until the final block produces num_classes channels:
Sequential output shape: (1, 54, 54, 96)
MaxPooling2D output shape: (1, 26, 26, 96)
Sequential output shape: (1, 26, 26, 256)
MaxPooling2D output shape: (1, 12, 12, 256)
Sequential output shape: (1, 12, 12, 384)
MaxPooling2D output shape: (1, 5, 5, 384)
Dropout output shape: (1, 5, 5, 384)
Sequential output shape: (1, 5, 5, 10)
GlobalAveragePooling2D output shape: (1, 10)
Flatten output shape: (1, 10)
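The shapes listed above follow from the usual output-size arithmetic for Keras 'valid' and 'same' padding. The sketch below reproduces them in plain Python, without TensorFlow. The helper names `conv_out_size` and `walk` are hypothetical. The 1×1 convolutions inside each block, the dropout layer, and the final flatten are left out because they do not change the shape.

```python
import math

def conv_out_size(n, kernel, stride, padding):
    """Spatial output size of a conv or pool layer under Keras padding rules."""
    if padding == 'same':
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1  # 'valid': no padding

# (name, kernel, stride, padding, out_channels); None keeps the channel count
layers = [
    ('nin_block', 11, 4, 'valid', 96),
    ('max_pool', 3, 2, 'valid', None),
    ('nin_block', 5, 1, 'same', 256),
    ('max_pool', 3, 2, 'valid', None),
    ('nin_block', 3, 1, 'same', 384),
    ('max_pool', 3, 2, 'valid', None),
    ('nin_block', 3, 1, 'same', 10),
]

def walk(size=224, channels=1):
    """Return (layer name, output shape) pairs for a batch of one square image."""
    shapes = []
    for name, kernel, stride, padding, out_ch in layers:
        size = conv_out_size(size, kernel, stride, padding)
        channels = out_ch if out_ch is not None else channels
        shapes.append((name, (1, size, size, channels)))
    # Global average pooling averages over both spatial axes
    shapes.append(('global_avg_pool', (1, channels)))
    return shapes

for name, shape in walk():
    print(name, shape)
```

For example, the first block gives (224 - 11) // 4 + 1 = 54, and each 3×3 max-pool with stride 2 maps 54 to 26, 26 to 12, and 12 to 5.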
Training uses the same Trainer as before, with a slightly higher learning rate than for the networks with fully connected heads (there is no dense layer to overfit on small batches).
The important comparison is parameter economy: accuracy comes from richer convolutional blocks, not a large fully connected head.
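To put a rough number on the parameter economy, the sketch below counts the weights and biases of the four NiN blocks for a single-channel input with 10 classes. For contrast it also counts one dense layer of 4096 units placed on the flattened 5×5×384 feature map. That dense layer is a hypothetical comparison chosen for illustration and is not part of NiN or of the original code. The helper names are also assumptions.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out

def nin_block_params(kernel, c_in, c_out):
    """One spatial conv followed by two 1x1 convs, as in nin_block."""
    return conv_params(kernel, c_in, c_out) + 2 * conv_params(1, c_out, c_out)

# (kernel_size, out_channels) for the four blocks, with num_classes = 10
blocks = [(11, 96), (5, 256), (3, 384), (3, 10)]

def nin_total(in_channels=1):
    """Total trainable parameters of the four NiN blocks."""
    total, c_in = 0, in_channels
    for kernel, c_out in blocks:
        total += nin_block_params(kernel, c_in, c_out)
        c_in = c_out
    return total

# Hypothetical alternative head (an assumption for comparison only):
# flatten the 5x5x384 feature map into one 4096-unit dense layer.
hypothetical_dense = 5 * 5 * 384 * 4096 + 4096

print(nin_total())         # all four NiN blocks
print(hypothetical_dense)  # the single dense layer alone
```

Under these assumptions the whole network has about 2.0 million parameters, while the single hypothetical dense layer alone would have about 39 million. Pooling and dropout layers contribute no parameters.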