Multi-Branch Networks (GoogLeNet)

GoogLeNet goes wide

GoogLeNet (Szegedy et al., 2014), winner of the ImageNet 2014 challenge, introduces a different design axis: width, not just depth.

Most of the network is built from Inception blocks. Each block runs several filter sizes in parallel (1×1, 3×3, 5×5, plus a pooling path) and concatenates their outputs, so training can decide, layer by layer, which filter scale is most useful.

Heavy use of 1×1 convs as bottleneck reductions keeps the parameter count manageable despite the multi-branch design.
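The saving from a 1×1 bottleneck can be checked with a little arithmetic. The sketch below counts convolution weights (biases ignored) for a 5×5 branch with and without the reduction. The channel numbers are borrowed from the first Inception block of stage 3 (192 input channels, a 16-channel bottleneck, 32 output channels). The helper `conv_weights` is a hypothetical name introduced only for this illustration.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a square 2D convolution, excluding biases."""
    return in_channels * out_channels * kernel_size * kernel_size

# Assumed example: 192 input channels, 16-channel bottleneck, 32 output channels
in_channels, bottleneck, out_channels = 192, 16, 32

# 5x5 convolution applied directly to the 192-channel input
direct = conv_weights(in_channels, out_channels, 5)

# 1x1 reduction to 16 channels, followed by the 5x5 convolution
reduced = (conv_weights(in_channels, bottleneck, 1)
           + conv_weights(bottleneck, out_channels, 5))

print(direct, reduced)  # 153600 15872
```

Under these assumptions the bottlenecked branch needs roughly a tenth of the weights of the direct 5×5 convolution.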

Inception block

Four parallel branches operate at the same spatial size, and their outputs are concatenated along the channel axis.

Figure: Inception block with four parallel branches, channel-concatenated.

The four branches

  • 1: 1×1 conv (small filter only)
  • 2: 1×1 conv → 3×3 conv (with bottleneck)
  • 3: 1×1 conv → 5×5 conv (with bottleneck)
  • 4: 3×3 max-pool → 1×1 conv
from d2l import mxnet as d2l
from mxnet import np, npx, init
from mxnet.gluon import nn
npx.set_np()

class Inception(nn.Block):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        # Branch 1
        self.b1_1 = nn.Conv2D(c1, kernel_size=1, activation='relu')
        # Branch 2
        self.b2_1 = nn.Conv2D(c2[0], kernel_size=1, activation='relu')
        self.b2_2 = nn.Conv2D(c2[1], kernel_size=3, padding=1,
                              activation='relu')
        # Branch 3
        self.b3_1 = nn.Conv2D(c3[0], kernel_size=1, activation='relu')
        self.b3_2 = nn.Conv2D(c3[1], kernel_size=5, padding=2,
                              activation='relu')
        # Branch 4
        self.b4_1 = nn.MaxPool2D(pool_size=3, strides=1, padding=1)
        self.b4_2 = nn.Conv2D(c4, kernel_size=1, activation='relu')

    def forward(self, x):
        b1 = self.b1_1(x)
        b2 = self.b2_2(self.b2_1(x))
        b3 = self.b3_2(self.b3_1(x))
        b4 = self.b4_2(self.b4_1(x))
        return np.concatenate((b1, b2, b3, b4), axis=1)
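Concatenating along the channel axis only works if every branch returns the same height and width. The padding values in the class are chosen so that this holds. A minimal pure-Python sketch, using the standard output-size formula for convolution and pooling, checks it. The names `out_size`, `branches`, and `branch_out_size` are hypothetical helpers for this illustration, not part of d2l or MXNet.

```python
def out_size(n, kernel_size, padding=0, stride=1):
    """Output height/width of a conv or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

# (kernel_size, padding) of each layer per branch; every layer uses stride 1
branches = {
    "branch 1": [(1, 0)],          # 1x1 conv
    "branch 2": [(1, 0), (3, 1)],  # 1x1 conv, then 3x3 conv
    "branch 3": [(1, 0), (5, 2)],  # 1x1 conv, then 5x5 conv
    "branch 4": [(3, 1), (1, 0)],  # 3x3 max-pool, then 1x1 conv
}

def branch_out_size(n, layers):
    """Pass an n x n input through a list of (kernel_size, padding) layers."""
    for kernel_size, padding in layers:
        n = out_size(n, kernel_size, padding)
    return n

sizes = {name: branch_out_size(12, layers) for name, layers in branches.items()}
print(sizes)  # every branch keeps the 12 x 12 input size
```

Because each kernel of size k is paired with padding (k - 1) / 2 and stride 1, all four outputs match the input size and can be concatenated.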

GoogLeNet stem and early stages

The network is built from five sequential stages, each a small stack of convolution, pooling, and Inception layers. The stem (stage 1) and stage 2 reduce resolution quickly before the Inception blocks take over:

class GoogleNet(d2l.Classifier):
    def b1(self):
        net = nn.Sequential()
        net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3,
                          activation='relu'),
                nn.MaxPool2D(pool_size=3, strides=2, padding=1))
        return net

# Register b2 as a method of GoogleNet after the class definition
@d2l.add_to_class(GoogleNet)
def b2(self):
    net = nn.Sequential()
    net.add(nn.Conv2D(64, kernel_size=1, activation='relu'),
            nn.Conv2D(192, kernel_size=3, padding=1, activation='relu'),
            nn.MaxPool2D(pool_size=3, strides=2, padding=1))
    return net

First Inception stack

Stage 3 introduces the repeating pattern: two Inception blocks, then pooling. Channel counts are split across branches, then concatenated back together.

@d2l.add_to_class(GoogleNet)
def b3(self):
    net = nn.Sequential()
    net.add(Inception(64, (96, 128), (16, 32), 32),
            Inception(128, (128, 192), (32, 96), 64),
            nn.MaxPool2D(pool_size=3, strides=2, padding=1))
    return net
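The output width of an Inception block is the sum of the final channel count of each branch: c1, the second entry of c2, the second entry of c3, and c4. A small sketch of that arithmetic for the two stage-3 blocks follows. `inception_out_channels` is a hypothetical helper introduced here for illustration.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four branch outputs."""
    # c2[0] and c3[0] are bottleneck widths and do not reach the output
    return c1 + c2[1] + c3[1] + c4

# Branch configurations of the two Inception blocks in stage 3
stage3 = [
    (64, (96, 128), (16, 32), 32),
    (128, (128, 192), (32, 96), 64),
]

outs = [inception_out_channels(*cfg) for cfg in stage3]
print(outs)  # [256, 480]
```

The bottleneck widths (96, 16, 128, 32) affect cost but not the output shape, which is why only the last layer of each branch enters the sum.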

Deep Inception stages

Stage 4 is the compute-heavy middle of the network: five Inception blocks before the next spatial downsample.

@d2l.add_to_class(GoogleNet)
def b4(self):
    net = nn.Sequential()
    net.add(Inception(192, (96, 208), (16, 48), 64),
            Inception(160, (112, 224), (24, 64), 64),
            Inception(128, (128, 256), (24, 64), 64),
            Inception(112, (144, 288), (32, 64), 64),
            Inception(256, (160, 320), (32, 128), 128),
            nn.MaxPool2D(pool_size=3, strides=2, padding=1))
    return net
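To get a rough feel for the size of stage 4, one can chain the channel counts through its five blocks and count convolution weights per block. This is a sketch only: biases are ignored, and the 480 input channels are an assumption derived from summing the branch outputs of the last stage-3 block. Both helper functions are hypothetical names used for this illustration.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four branch outputs."""
    return c1 + c2[1] + c3[1] + c4

def inception_weights(in_channels, c1, c2, c3, c4):
    """Convolution weights in one Inception block, biases ignored."""
    branch1 = in_channels * c1
    branch2 = in_channels * c2[0] + 3 * 3 * c2[0] * c2[1]
    branch3 = in_channels * c3[0] + 5 * 5 * c3[0] * c3[1]
    branch4 = in_channels * c4  # the max-pool itself has no weights
    return branch1 + branch2 + branch3 + branch4

# Branch configurations of the five Inception blocks in stage 4
stage4 = [
    (192, (96, 208), (16, 48), 64),
    (160, (112, 224), (24, 64), 64),
    (128, (128, 256), (24, 64), 64),
    (112, (144, 288), (32, 64), 64),
    (256, (160, 320), (32, 128), 128),
]

in_channels = 480  # assumed: output channels of stage 3
out_channels, weights = [], []
for cfg in stage4:
    weights.append(inception_weights(in_channels, *cfg))
    in_channels = inception_out_channels(*cfg)
    out_channels.append(in_channels)

print(out_channels)  # [512, 512, 512, 528, 832]
print(weights)       # per-block weight counts
```

The 3×3 branch accounts for the largest share of the weights in each block, even with its 1×1 reduction in front.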

Head and assembly

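Stage 5 runs two more Inception blocks and then applies global average pooling, which collapses each channel to a single value before the final classifier. __init__ then wires b1 through b5 together and appends the dense output layer.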

@d2l.add_to_class(GoogleNet)
def b5(self):
    net = nn.Sequential()
    net.add(Inception(256, (160, 320), (32, 128), 128),
            Inception(384, (192, 384), (48, 128), 128),
            nn.GlobalAvgPool2D())
    return net

@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential()
    self.net.add(self.b1(), self.b2(), self.b3(), self.b4(), self.b5(),
                 nn.Dense(num_classes))
    self.net.initialize(init.Xavier())

Shape inspection

The original network takes 224×224 inputs. For Fashion-MNIST we use 96×96 instead to keep training time reasonable. The layer summary on the smaller input:

# layer_summary prints each stage's output shape; it does not return a model
GoogleNet().layer_summary((1, 1, 96, 96))

Notice the pattern: spatial resolution halves at the strided stem convolution and at each max-pool, while channel depth grows as each Inception block concatenates its branches.
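The per-stage shapes can also be worked out by hand from the layer hyperparameters. The sketch below applies the standard output-size formula to the downsampling layers of each stage and pairs the result with that stage's output channel count. `out_size` and `trace` are hypothetical helpers written for this illustration, and the channel counts are taken from summing the branch outputs of the last Inception block in each stage.

```python
def out_size(n, kernel_size, padding=0, stride=1):
    """Output height/width of a conv or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

def trace(n):
    """Return (stage, channels, height/width) after each stage for an n x n input."""
    shapes = []
    n = out_size(n, 7, 3, 2)   # stem: 7x7 conv, stride 2
    n = out_size(n, 3, 1, 2)   # stem: 3x3 max-pool, stride 2
    shapes.append(("b1", 64, n))
    # Stages 2-4 each end in a 3x3 max-pool with stride 2; their other
    # layers use stride 1 with size-preserving padding.
    for name, channels in (("b2", 192), ("b3", 480), ("b4", 832)):
        n = out_size(n, 3, 1, 2)
        shapes.append((name, channels, n))
    # Stage 5 ends in global average pooling, leaving one value per channel
    shapes.append(("b5", 1024, 1))
    return shapes

for name, channels, size in trace(96):
    print(name, channels, size)
```

For a 96×96 input this gives spatial sizes 24, 12, 6, 3, and 1 after the five stages. Calling `trace(224)` shows the corresponding sizes for the original input resolution.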

Training

model = GoogleNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)

The original GoogLeNet has 22 weighted layers and roughly 7M parameters, far fewer than VGG-16's ~138M, yet it achieved better ImageNet accuracy.

Recap

  • Inception block = multi-branch, multi-scale, concatenated. The network learns which filter size matters per layer.
  • 1×1 bottlenecks keep parameter count low.
  • The “go wider, not just deeper” lesson carries over to many later multi-branch designs, including multi-head attention and feature pyramids.
  • GoogLeNet’s descendants (Inception-v3/v4, Xception) refined the block; the underlying multi-branch + bottleneck template endures.