from d2l import torch as d2l
import torch
from torch import nn
from torch.nn import functional as F

GoogLeNet (Szegedy et al., 2014), winner of ImageNet 2014, introduces a different design axis: width, not just depth.
Its core building block is the Inception block, which runs several filter sizes in parallel (1×1, 3×3, 5×5, plus a pooling path) and concatenates their outputs. Training can then learn, block by block, which filter scale is most useful.
Heavy use of 1×1 convs as bottleneck reductions keeps the parameter count manageable despite the multi-branch design.
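To make the bottleneck saving concrete, here is a small weight-count comparison. The channel sizes (192 in, 16 in the bottleneck, 32 out) are assumed for illustration, and `conv_weights` is a hypothetical helper rather than a library function; biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights (biases excluded) in a square 2-D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size

# Illustrative sizes (assumed): 192 input channels, 32 output channels,
# with a 16-channel 1x1 bottleneck placed in front of the 5x5 convolution.
in_ch, mid_ch, out_ch = 192, 16, 32

# A 5x5 convolution applied directly to all 192 input channels.
direct = conv_weights(in_ch, out_ch, 5)

# A 1x1 reduction to 16 channels, followed by the 5x5 convolution.
bottleneck = conv_weights(in_ch, mid_ch, 1) + conv_weights(mid_ch, out_ch, 5)

print(direct)      # 153600
print(bottleneck)  # 15872
```

Under these assumed sizes the bottlenecked path needs roughly a tenth of the weights of the direct 5×5 convolution.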
The Inception block has four parallel branches. Each preserves the input's spatial size, so their outputs can be concatenated along the channel axis:
class Inception(nn.Module):
    # c1--c4 are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1: 1x1 convolution
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4: 3x3 max pooling (stride 1), then 1x1 convolution
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat((b1, b2, b3, b4), dim=1)

The full network is built from five sequential stages, each a small stack of convolution, pooling, and Inception modules. The stem and second stage reduce resolution quickly before the Inception blocks take over:
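A minimal sketch of the first two stages follows. It is written as plain `nn.Sequential` objects rather than as methods of a model class. The output widths (64 and 192) match the layer summary later in this section; the kernel sizes, strides, and padding are assumed from the standard GoogLeNet configuration.

```python
import torch
from torch import nn

# Stage 1 (stem): 7x7 convolution with stride 2, then a stride-2 max pool.
b1 = nn.Sequential(
    nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

# Stage 2: 1x1 convolution, 3x3 convolution widening to 192 channels, pool.
b2 = nn.Sequential(
    nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
    nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

x = torch.randn(1, 1, 96, 96)  # one grayscale 96x96 image (assumed input)
y1 = b1(x)   # lazy layers infer their input channels on this first call
y2 = b2(y1)
print(y1.shape)  # torch.Size([1, 64, 24, 24])
print(y2.shape)  # torch.Size([1, 192, 12, 12])
```

The two stages together shrink a 96×96 input by a factor of eight per side before any Inception block runs.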
Stage 3 introduces the repeating pattern: two Inception blocks, then pooling. Channel counts are split across branches, then concatenated back together.
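The channel bookkeeping for stage 3 can be checked with simple arithmetic. The branch widths below are assumed from the original GoogLeNet configuration, and `inception_out_channels` is a hypothetical helper introduced only for this illustration.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenation: the final conv width of each branch."""
    return c1 + c2[1] + c3[1] + c4

# Branch widths (assumed) for the two Inception blocks of stage 3.
# Each tuple is (c1, (c2 reduce, c2 out), (c3 reduce, c3 out), c4).
block_3a = (64, (96, 128), (16, 32), 32)
block_3b = (128, (128, 192), (32, 96), 64)

out_3a = inception_out_channels(*block_3a)
out_3b = inception_out_channels(*block_3b)
print(out_3a, out_3b)  # 256 480
```

Only the last convolution of each branch contributes to the concatenated width. The 1×1 reduction widths (96, 16, and so on) affect cost but not the output channel count.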
Stage 4 is the compute-heavy middle of the network: five Inception blocks before the next spatial downsample.
Stage 5 uses global average pooling before the final classifier, then __init__ simply wires b1 through b5 together.
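The five stages can be assembled roughly as below. This is a sketch, not the source's exact class: it uses plain `nn.Sequential` objects instead of a model class with `b1` to `b5` methods, it restates the Inception block and the first two stages so that it runs on its own, and the branch widths are assumed from the original GoogLeNet configuration. The `pool` helper is hypothetical.

```python
import torch
from torch import nn
from torch.nn import functional as F


class Inception(nn.Module):
    """Four-branch block, restated so this sketch is self-contained."""
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)


def pool():
    # Stride-2 max pool that closes stages 1 through 4.
    return nn.MaxPool2d(kernel_size=3, stride=2, padding=1)


b1 = nn.Sequential(
    nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3), nn.ReLU(), pool())
b2 = nn.Sequential(
    nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
    nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(), pool())
# Stage 3: two Inception blocks, then pooling.
b3 = nn.Sequential(
    Inception(64, (96, 128), (16, 32), 32),
    Inception(128, (128, 192), (32, 96), 64), pool())
# Stage 4: five Inception blocks, then pooling.
b4 = nn.Sequential(
    Inception(192, (96, 208), (16, 48), 64),
    Inception(160, (112, 224), (24, 64), 64),
    Inception(128, (128, 256), (24, 64), 64),
    Inception(112, (144, 288), (32, 64), 64),
    Inception(256, (160, 320), (32, 128), 128), pool())
# Stage 5: two Inception blocks, global average pooling, flatten.
b5 = nn.Sequential(
    Inception(256, (160, 320), (32, 128), 128),
    Inception(384, (192, 384), (48, 128), 128),
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())

# Ten output classes are assumed here (as for Fashion-MNIST).
net = nn.Sequential(b1, b2, b3, b4, b5, nn.LazyLinear(10))

x = torch.randn(1, 1, 96, 96)  # one grayscale 96x96 image
shapes = []
for layer in net:
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(layer.__class__.__name__, 'output shape:', x.shape)
```

Global average pooling collapses each of the 1024 feature maps to a single value, so the classifier is a single linear layer on a 1024-dimensional vector regardless of input resolution.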
For Fashion-MNIST we shrink the input to 96×96 to keep training time reasonable; layer summary on the smaller input:
Sequential output shape: torch.Size([1, 64, 24, 24])
Sequential output shape: torch.Size([1, 192, 12, 12])
Sequential output shape: torch.Size([1, 480, 6, 6])
Sequential output shape: torch.Size([1, 832, 3, 3])
Sequential output shape: torch.Size([1, 1024])
Linear output shape: torch.Size([1, 10])
Notice the pattern: spatial resolution falls at pools, while channel depth grows after concatenating each Inception block’s branches.
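The spatial sizes in the summary follow from the usual output-size formula for convolutions and pools, floor((n + 2p - k) / s) + 1. The sketch below assumes a 7×7 stride-2 stem convolution with padding 3 and 3×3 stride-2 max pools with padding 1; `out_size` is a hypothetical helper.

```python
def out_size(n, kernel, stride, padding):
    """Output length along one spatial axis, using floor rounding."""
    return (n + 2 * padding - kernel) // stride + 1

n = 96
sizes = [n]

# Stem convolution (assumed: 7x7, stride 2, padding 3).
n = out_size(n, kernel=7, stride=2, padding=3)
sizes.append(n)

# Four stride-2 max pools (assumed: 3x3, padding 1), closing stages 1 to 4.
for _ in range(4):
    n = out_size(n, kernel=3, stride=2, padding=1)
    sizes.append(n)

print(sizes)  # [96, 48, 24, 12, 6, 3]
```

The stride-1 convolutions and pools inside each Inception block use padding that preserves the size, so they do not appear in this chain.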
The original GoogLeNet has 22 weighted layers and about 7M parameters, far fewer than VGG's roughly 138M, yet it achieves better ImageNet accuracy.