Densely Connected Networks (DenseNet)

DenseNet concatenates features

DenseNet (Huang et al., 2017) takes the residual idea one step further: instead of adding a layer's input to its output, as ResNet does, it concatenates the two along the channel dimension.

\mathbf{x}_\ell = f_\ell\bigl([\mathbf{x}_0, \mathbf{x}_1, \dots, \mathbf{x}_{\ell-1}]\bigr).

Every layer in a dense block sees the concatenation of all preceding outputs.
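
The connectivity pattern in the equation can be sketched in plain Python. This is only an illustration, not part of the original code: `dense_inputs` is a hypothetical helper, and features are represented by name rather than as tensors.

```python
def dense_inputs(num_layers):
    """Return, for each layer l = 1..num_layers, the names of the features it reads.

    Hypothetical helper for illustration: 'x0' is the block input and
    'x<l>' is the output of layer l.
    """
    produced = ["x0"]  # features available so far
    inputs_per_layer = []
    for layer in range(1, num_layers + 1):
        # Layer `layer` reads the concatenation of everything produced so far
        inputs_per_layer.append(list(produced))
        produced.append(f"x{layer}")
    return inputs_per_layer


inputs = dense_inputs(4)
for layer, names in enumerate(inputs, start=1):
    print(f"x{layer} = f{layer}([{', '.join(names)}])")

# Total number of direct connections: 1 + 2 + ... + L = L * (L + 1) / 2
num_connections = sum(len(names) for names in inputs)
print(num_connections)  # 10 for L = 4
```

With L layers the number of direct connections is L(L + 1)/2, compared with L in a plain chain.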

Dense block + transition

A dense block grows the channel count by concatenation; transition layers (1×1 conv + pooling) shrink the channel count and the spatial size between blocks.

Pros: maximum feature reuse, and fewer parameters than ResNet for similar accuracy. Cons: the channel count, and with it activation memory, grows linearly with depth within a block; transition layers keep this in check.
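
To make the trade-off concrete, the sketch below tallies how wide each layer's input becomes and how many convolution parameters that costs. It assumes every layer is a 3×3 convolution with bias that outputs `growth_rate` channels (the form of the conv block defined below) and ignores batch-norm parameters. The helper name and the example numbers (64 input channels, growth rate 32, 4 layers) are illustrative choices.

```python
def dense_block_cost(in_channels, growth_rate, num_layers, kernel_size=3):
    """Per-layer input widths and conv parameter counts for one dense block.

    Illustrative helper (not from the original text). Assumes every layer is
    a kernel_size x kernel_size convolution with a bias that outputs
    `growth_rate` channels; BatchNorm parameters are not counted.
    """
    rows = []
    channels = in_channels
    for layer in range(1, num_layers + 1):
        params = kernel_size * kernel_size * channels * growth_rate + growth_rate
        rows.append((layer, channels, params))
        channels += growth_rate  # concatenation widens the next layer's input
    return rows, channels


rows, out_channels = dense_block_cost(in_channels=64, growth_rate=32, num_layers=4)
for layer, width, params in rows:
    print(f"layer {layer}: reads {width} channels, {params} conv parameters")
print("output channels:", out_channels)                      # 192
print("total conv parameters:", sum(p for _, _, p in rows))  # 129152
```

Each layer produces only `growth_rate` new channels, which keeps individual layers small, but the input width rises by `growth_rate` at every step.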

Conv block

A small conv block (BN → ReLU → 3×3 conv) is the unit; a DenseBlock will reuse it repeatedly.

from d2l import mxnet as d2l
from mxnet import init, np, npx
from mxnet.gluon import nn
npx.set_np()

def conv_block(num_channels):
    blk = nn.Sequential()
    blk.add(nn.BatchNorm(),
            nn.Activation('relu'),
            nn.Conv2D(num_channels, kernel_size=3, padding=1))
    return blk

Now stack the conv blocks. After each block, concatenate its new features onto the running input, so later blocks see everything computed so far.

class DenseBlock(nn.Block):
    def __init__(self, num_convs, num_channels):
        super().__init__()
        self.net = nn.Sequential()
        for _ in range(num_convs):
            self.net.add(conv_block(num_channels))

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output of each block along the channels
            X = np.concatenate((X, Y), axis=1)
        return X

Channel growth

A DenseBlock(num_convs=2, num_channels=10) adds num_convs * num_channels = 20 channels to its input, so a 3-channel input yields 3 + 20 = 23 output channels while height and width are unchanged:

blk = DenseBlock(2, 10)
X = np.random.uniform(size=(4, 3, 8, 8))
blk.initialize()
Y = blk(X)
Y.shape
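
The channel arithmetic can be checked without building any layers. The helper below is a hypothetical illustration that assumes an NCHW input and the 3×3, padding-1 convolutions used in conv_block, which preserve height and width.

```python
def dense_block_output_shape(input_shape, num_convs, num_channels):
    """Shape produced by a dense block on an NCHW input (illustrative helper).

    Each conv block uses a 3x3 convolution with padding 1, so height and
    width are unchanged; only the channel count grows.
    """
    batch, channels, height, width = input_shape
    return (batch, channels + num_convs * num_channels, height, width)


shape = dense_block_output_shape((4, 3, 8, 8), num_convs=2, num_channels=10)
print(shape)  # (4, 23, 8, 8)
```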

Transition layer

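A transition layer stops the channel explosion between dense blocks: a 1×1 conv reduces the channel count to num_channels (the model below chooses half the incoming count), and a 2×2 average pool with stride 2 halves the height and width. Applied to the 23-channel output above with num_channels=10, it returns a tensor of shape (4, 10, 4, 4):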

def transition_block(num_channels):
    blk = nn.Sequential()
    blk.add(nn.BatchNorm(), nn.Activation('relu'),
            nn.Conv2D(num_channels, kernel_size=1),
            nn.AvgPool2D(pool_size=2, strides=2))
    return blk

blk = transition_block(10)
blk.initialize()
blk(Y).shape
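
The resulting shape follows from standard convolution and pooling arithmetic. The sketch below is a hypothetical helper that assumes an NCHW input and pooling without padding.

```python
def transition_output_shape(input_shape, num_channels, pool_size=2, stride=2):
    """Shape after a transition layer on an NCHW input (illustrative helper).

    The 1x1 convolution sets the channel count to `num_channels`; the average
    pooling (no padding) shrinks height and width.
    """
    batch, _, height, width = input_shape
    out_h = (height - pool_size) // stride + 1
    out_w = (width - pool_size) // stride + 1
    return (batch, num_channels, out_h, out_w)


shape = transition_output_shape((4, 23, 8, 8), num_channels=10)
print(shape)  # (4, 10, 4, 4)
```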

The DenseNet model

A standard “stem → dense block → transition → dense block → transition → … → global avg-pool → linear” pipeline:

class DenseNet(d2l.Classifier):
    def b1(self):
        net = nn.Sequential()
        net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
                nn.BatchNorm(), nn.Activation('relu'),
                nn.MaxPool2D(pool_size=3, strides=2, padding=1))
        return net

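# Register the function below as a method of DenseNet, since it is defined
# at module level rather than inside the class body
@d2l.add_to_class(DenseNet)
def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
             lr=0.1, num_classes=10):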
    super(DenseNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential()
    self.net.add(self.b1())
    for i, num_convs in enumerate(arch):
        self.net.add(DenseBlock(num_convs, growth_rate))
        # The number of output channels in the previous dense block
        num_channels += num_convs * growth_rate
        # A transition layer that halves the number of channels is added
        # between the dense blocks
        if i != len(arch) - 1:
            num_channels //= 2
            self.net.add(transition_block(num_channels))
    self.net.add(nn.BatchNorm(), nn.Activation('relu'),
                 nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    self.net.initialize(init.Xavier())
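
The constructor's bookkeeping can be traced in plain Python to see how channels and spatial size evolve. This is a sketch under stated assumptions: `conv_out` and `trace_densenet` are hypothetical helpers, the input is square, the default size of 96 matches the resize used for training below, and the stem is assumed to output `num_channels` channels (true for the default of 64, which b1 hardcodes).

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_densenet(input_size=96, num_channels=64, growth_rate=32,
                   arch=(4, 4, 4, 4)):
    """List (stage name, channels, spatial size) after each stage.

    Illustrative helper mirroring the constructor's bookkeeping; it builds
    no layers and assumes a square input.
    """
    # Stem: 7x7 conv (stride 2, padding 3), then 3x3 max-pool (stride 2, padding 1)
    size = conv_out(input_size, kernel=7, stride=2, padding=3)
    size = conv_out(size, kernel=3, stride=2, padding=1)
    stages = [("stem", num_channels, size)]
    for i, num_convs in enumerate(arch):
        # A dense block adds growth_rate channels per conv block
        num_channels += num_convs * growth_rate
        stages.append((f"dense{i + 1}", num_channels, size))
        if i != len(arch) - 1:
            # Transition: 1x1 conv halves channels, 2x2 avg-pool halves size
            num_channels //= 2
            size = conv_out(size, kernel=2, stride=2, padding=0)
            stages.append((f"transition{i + 1}", num_channels, size))
    return stages


stages = trace_densenet()
for name, channels, size in stages:
    print(f"{name:12s} channels={channels:4d} size={size}x{size}")
```

With the defaults, the channel count goes 64 → 192 → 96 → 224 → 112 → 240 → 120 → 248, and the spatial size goes from 24×24 after the stem down to 3×3 before global average pooling.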

Training

model = DenseNet(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
trainer.fit(model, data)

Huang et al. (2017) report ImageNet accuracy comparable to ResNets with substantially fewer parameters, which they attribute to feature reuse through concatenation.

Recap

ResNet adds a layer's input to its output; DenseNet concatenates them along the channel dimension.
Inside a dense block, layer \ell sees the outputs of all of layers 0, …, \ell-1, which maximizes feature reuse.
Transition layers between dense blocks rein in the channel-count explosion via 1×1 conv + pool.
Same parameter count → typically better accuracy than ResNet; same accuracy → fewer parameters.