Fully Convolutional Networks

A fully convolutional network (Long, Shelhamer, Darrell 2015) is the simplest path to per-pixel prediction:

Start with a pretrained classification CNN (ResNet).
Strip the global average pool + final dense layer.
Replace with a 1×1 conv mapping to num_classes.
Upsample back to input resolution via transposed conv.

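No FC layers anywhere, so the network works on any input size and outputs a class-score map at input resolution (exactly so when height and width are multiples of 32, as with the 320×480 crops used here).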

Architecture

FCN: pretrained CNN body + 1×1 conv → class scores → transposed conv to upsample.

Setup

%matplotlib inline
from d2l import jax as d2l
from d2l.nnx_resnet import ResNet50
from flax import nnx
import jax
from jax import numpy as jnp
import optax
import numpy as np
from PIL import Image

Pretrained backbone

ResNet-50 pretrained on ImageNet. Drop the head (avg pool + dense); keep the conv body that produces an H/32 × W/32 feature map with 2048 channels:

pretrained_net = ResNet50.from_pretrained()
dummy = jnp.ones((1, 320, 480, 3))
print('Feature extractor output shape:',
      pretrained_net.feature_map(dummy).shape)

Feature extractor output shape: (1, 10, 15, 2048)
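
The printed shape follows from the backbone's overall stride of 32. As a minimal sketch, assuming each of the five stride-2 stages rounds odd sizes up (the helper name `feature_map_size` is hypothetical and not part of d2l), the arithmetic is:

```python
import math

def feature_map_size(height, width, total_stride=32):
    """Spatial size after downsampling by `total_stride`, rounding up.

    Assumption: the backbone pads so that each stride-2 stage maps a
    size n to ceil(n / 2); five such stages give ceil(n / 32).
    """
    return math.ceil(height / total_stride), math.ceil(width / total_stride)

print(feature_map_size(320, 480))  # (10, 15)
```

The channel count (2048) is a property of the ResNet-50 body and does not depend on the input size.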

Building the FCN

After removing the classifier head, the backbone produces a low-resolution feature map. The new FCN head must restore the original spatial resolution while changing channels to class logits.

# `feature_map` omits the global average pooling and classification head.

X = jnp.ones((1, 320, 480, 3))
pretrained_net.feature_map(X).shape

(1, 10, 15, 2048)

The class & upsampling head

A 1×1 conv maps the 2048 backbone channels to num_classes (21 for VOC: 20 object classes plus background). Then a transposed conv with stride 32 upsamples by 32× to recover input resolution:

num_classes = 21

class FCN(nnx.Module):
    """Fully Convolutional Network for semantic segmentation."""
    def __init__(self, backbone, num_classes, *, rngs):
        self.backbone = backbone
        self.classifier = nnx.Conv(2048, num_classes, (1, 1), rngs=rngs)
        self.upsample = nnx.ConvTranspose(
            num_classes, num_classes, (64, 64), strides=32, padding='SAME',
            use_bias=False, rngs=rngs)

    def __call__(self, X):
        X = self.backbone.feature_map(X)
        return self.upsample(self.classifier(X))

net = FCN(pretrained_net, num_classes, rngs=nnx.Rngs(0))
print('FCN output shape:',
      net(jnp.ones((1, 320, 480, 3))).shape)

FCN output shape: (1, 320, 480, 21)
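
The 64×64 kernel with stride 32 can be checked against the usual transposed-convolution size formula. The sketch below uses the explicit-padding form, out = (in - 1) * stride - 2 * padding + kernel. This is an assumption made for illustration: the layer above uses padding='SAME' rather than an explicit padding of 16, and `transposed_conv_out` is a hypothetical helper.

```python
def transposed_conv_out(size, kernel, stride, padding):
    """Output length of a transposed conv with explicit padding
    (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# Kernel 64, stride 32, padding 16 scales each dimension by exactly 32.
height = transposed_conv_out(10, kernel=64, stride=32, padding=16)
width = transposed_conv_out(15, kernel=64, stride=32, padding=16)
print(height, width)  # 320 480
```

More generally, under this formula a kernel of 2 × stride with padding of stride / 2 gives an output of exactly stride × input.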

Bilinear init for transposed conv

A randomly initialized 32× upsampler is hard to train. Initialize it as bilinear interpolation — a sensible starting point that fine-tunes from there:

def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (np.arange(kernel_size).reshape(-1, 1),
          np.arange(kernel_size).reshape(1, -1))
    filt = (1 - np.abs(og[0] - center) / factor) * \
           (1 - np.abs(og[1] - center) / factor)
    # Flax uses HWIO format for ConvTranspose kernels
    weight = np.zeros((kernel_size, kernel_size, in_channels, out_channels))
    for i in range(min(in_channels, out_channels)):
        weight[:, :, i, i] = filt
    return jnp.array(weight)
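
To see what `bilinear_kernel` produces, it can help to look at its one-dimensional profile; the 2D filter is the outer product of this profile with itself. The sketch below repeats the same formula in plain Python (`bilinear_profile` is a hypothetical helper introduced only for this illustration):

```python
def bilinear_profile(kernel_size):
    """1D weights matching the formula in `bilinear_kernel`."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(kernel_size)]

profile = bilinear_profile(4)
print(profile)  # [0.25, 0.75, 0.75, 0.25]

# With stride 2, each output position receives taps {0, 2} or taps {1, 3}.
# Both pairs sum to 1, so a constant input stays constant in the interior.
print(profile[0] + profile[2], profile[1] + profile[3])  # 1.0 1.0
```

The same property holds for the 64-tap profile used with stride 32: taps that are 32 apart sum to 1.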

Upsampling sanity check

Apply the initialized transposed convolution to an image. The output should be larger but visually similar, because the kernel starts as bilinear interpolation rather than random noise:

class BilinearConvTranspose(nnx.Module):
    """A transposed conv layer initialized with bilinear interpolation."""
    def __init__(self, channels, kernel_size, strides, *, rngs):
        self.layer = nnx.ConvTranspose(
            channels, channels, (kernel_size, kernel_size), strides=strides,
            padding='SAME', use_bias=False, rngs=rngs)
        self.layer.kernel[...] = bilinear_kernel(
            channels, channels, kernel_size)

    def __call__(self, X):
        return self.layer(X)

conv_trans = BilinearConvTranspose(3, 4, (2, 2), rngs=nnx.Rngs(0))

img = np.array(Image.open('../img/catdog.jpg')).astype(np.float32) / 255
X = jnp.expand_dims(jnp.array(img), axis=0)  # NHWC
Y = conv_trans(X)
out_img = np.array(Y[0])

Bilinear init (cont.)

The printed shapes should confirm the spatial scale-up. Then the same bilinear kernel initializes the FCN’s final upsampling layer:

d2l.set_figsize()
print('input image shape:', img.shape)
d2l.plt.imshow(img);
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);

input image shape: (561, 728, 3)
output image shape: (1122, 1456, 3)
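
The same 2× scale-up can be traced by hand in one dimension. The following sketch applies the 4-tap bilinear profile with stride 2 to a short ramp. It assumes a symmetric crop of one sample per side, which may not match the exact alignment Flax uses for padding='SAME', and `upsample_1d` is a hypothetical helper:

```python
def upsample_1d(signal, kernel, stride, padding):
    """Transposed convolution in 1D: scatter each input sample times the
    kernel into the output, then crop `padding` samples from both ends."""
    full = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            full[i * stride + k] += value * weight
    return full[padding:len(full) - padding]

out = upsample_1d([0.0, 1.0, 2.0, 3.0], [0.25, 0.75, 0.75, 0.25],
                  stride=2, padding=1)
print(out)  # [0.0, 0.25, 0.75, 1.25, 1.75, 2.25, 2.75, 2.25]
```

The output has twice as many samples, and the interior values rise in steps of 0.5, half the input step, as linear interpolation would give. The last value drops because the signal is implicitly zero beyond its right edge.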

# Initialize the FCN's transposed conv layer with bilinear weights. The 1x1
# conv layer keeps the initialization it received when `FCN` was constructed.
W = bilinear_kernel(num_classes, num_classes, 64)
net.upsample.kernel[...] = W

Loading data

# The NNX ResNet-50 is deeper than the ResNet-18 used in the PyTorch tab, so
# use a smaller minibatch while keeping the same crops and number of epochs.
batch_size, crop_size = 4, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)

read 1114 examples
read 1078 examples

Training

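Pixel-level cross-entropy. Freezing the backbone and training only the new head is a common trick; the code here instead fine-tunes the backbone at the base learning rate and trains the head at 10× that rate, which gets reasonable results in a few epochs: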

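num_epochs, lr, wd = 5, 0.001, 1e-3

# Label every parameter 'head' (the new `classifier` and `upsample` layers)
# or 'backbone', so that each group can get its own learning rate.
def parameter_labels(params):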
    return jax.tree_util.tree_map_with_path(
        lambda path, _: ('head' if any(
            getattr(p, 'key', None) in ('classifier', 'upsample')
            for p in path) else 'backbone'), params)

optimizer = nnx.Optimizer(
    net, optax.multi_transform(
        {'head': optax.chain(optax.add_decayed_weights(wd),
                             optax.sgd(lr * 10)),
         'backbone': optax.chain(optax.add_decayed_weights(wd),
                                 optax.sgd(lr))},
        parameter_labels),
    wrt=nnx.Param)
eval_net = nnx.view(net, use_running_average=True, raise_if_not_found=False)

@nnx.jit
def train_step(model, optimizer, X, Y):
    def loss_fn(model):
        logits = model(X)
        loss = optax.softmax_cross_entropy_with_integer_labels(logits, Y)
        return loss.mean(), logits
    (loss, logits), grads = nnx.value_and_grad(loss_fn, has_aux=True)(model)
    optimizer.update(model, grads)
    correct = (jnp.argmax(logits, axis=-1) == Y).sum()
    return loss, correct

@nnx.jit
def eval_step(model, X, Y):
    logits = model(X)
    losses = optax.softmax_cross_entropy_with_integer_labels(logits, Y)
    correct = (jnp.argmax(logits, axis=-1) == Y).sum()
    return losses.sum(), correct

for epoch in range(num_epochs):
    loss_terms, correct_terms, num_pixels = [], [], 0
    for X, Y in train_iter:
        X = jnp.transpose(jnp.array(X), (0, 2, 3, 1))
        Y = jnp.array(Y)
        loss, correct = train_step(net, optimizer, X, Y)
        loss_terms.append(loss * Y.size)
        correct_terms.append(correct)
        num_pixels += Y.size
    val_loss_terms, val_correct_terms, val_pixels = [], [], 0
    for X, Y in test_iter:
        X = jnp.transpose(jnp.array(X), (0, 2, 3, 1))
        Y = jnp.array(Y)
        val_loss, val_correct = eval_step(eval_net, X, Y)
        val_loss_terms.append(val_loss)
        val_correct_terms.append(val_correct)
        val_pixels += Y.size
    loss_sum = float(jnp.stack(loss_terms).sum())
    correct = int(jnp.stack(correct_terms).sum())
    val_loss_sum = float(jnp.stack(val_loss_terms).sum())
    val_correct = int(jnp.stack(val_correct_terms).sum())
    print(f'epoch {epoch + 1}, loss {loss_sum / num_pixels:.3f}, '
          f'pixel acc {correct / num_pixels:.3f}, '
          f'val loss {val_loss_sum / val_pixels:.3f}, '
          f'val pixel acc {val_correct / val_pixels:.3f}')

epoch 1, loss 1.635, pixel acc 0.706, val loss 1.202, val pixel acc 0.731
epoch 2, loss 1.132, pixel acc 0.738, val loss 1.059, val pixel acc 0.747
epoch 3, loss 1.016, pixel acc 0.757, val loss 0.974, val pixel acc 0.765
epoch 4, loss 0.924, pixel acc 0.775, val loss 0.886, val pixel acc 0.781
epoch 5, loss 0.852, pixel acc 0.790, val loss 0.826, val pixel acc 0.790
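
The loss and pixel accuracy reported above are averages over individual pixels: `train_step` returns a mean loss, which the loop multiplies by `Y.size` before summing, so batches are weighted by their pixel count. As a rough illustration of the per-pixel computation, here is a plain-Python sketch on a tiny example (`pixel_cross_entropy` is a hypothetical helper, not the optax routine used in training):

```python
import math

def pixel_cross_entropy(logits, labels):
    """Mean softmax cross-entropy and accuracy over all pixels.

    logits: nested list [H][W][C]; labels: nested list [H][W] of class ids.
    """
    total_loss, correct, count = 0.0, 0, 0
    for logit_row, label_row in zip(logits, labels):
        for scores, label in zip(logit_row, label_row):
            peak = max(scores)  # subtract the max for numerical stability
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total_loss += log_norm - scores[label]
            correct += int(scores.index(peak) == label)
            count += 1
    return total_loss / count, correct / count

# A 1x2 "image" with 3 classes: the first pixel is predicted correctly,
# the second is not.
logits = [[[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]]
labels = [[0, 1]]
loss, acc = pixel_cross_entropy(logits, labels)
print(f'loss {loss:.3f}, pixel acc {acc:.3f}')
```

With uniform logits every pixel contributes log(num_classes) to the loss, which is a useful reference point when reading the first epoch's numbers.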

Predict

Run the network on test images, take argmax over the class dimension, map class indices back to RGB:

def predict(img):
    rgb_mean = np.array([0.485, 0.456, 0.406])
    rgb_std = np.array([0.229, 0.224, 0.225])
    X = (img.astype(np.float32) / 255 - rgb_mean) / rgb_std
    X = jnp.expand_dims(jnp.array(X), axis=0)  # NHWC
    pred = nnx.view(net, use_running_average=True,
                    raise_if_not_found=False)(X)
    return jnp.argmax(pred, axis=-1).reshape(pred.shape[1], pred.shape[2])

Visualize segmentation masks

The output grid is image, prediction, ground truth. Expect coarse boundaries: this plain FCN upsamples from a 32× downsampled feature map and has no skip connections.

def label2image(pred):
    colormap = jnp.array(d2l.VOC_COLORMAP, dtype=jnp.uint8)
    X = pred.astype(jnp.int32)
    return colormap[X, :]
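
The lookup in `label2image` relies on integer-array indexing: indexing a (num_classes, 3) palette with an (H, W) array of class ids yields an (H, W, 3) image. A small NumPy sketch, using a three-entry toy palette as a stand-in for `d2l.VOC_COLORMAP`:

```python
import numpy as np

# Toy palette: row i is the RGB colour for class i (illustrative stand-in
# for the full 21-entry VOC colormap).
palette = np.array([[0, 0, 0], [128, 0, 0], [0, 128, 0]], dtype=np.uint8)

# Predicted class ids for a 2x3 image.
pred = np.array([[0, 1, 1],
                 [2, 2, 0]])

rgb = palette[pred]          # integer-array indexing adds a trailing RGB axis
print(rgb.shape)             # (2, 3, 3)
print(rgb[0, 1].tolist())    # [128, 0, 0]
```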

voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    # Crop HWC arrays: top=0, left=0, height=320, width=480
    X = test_images[i][:320, :480, :]
    pred = label2image(predict(X))
    label_crop = test_labels[i][:320, :480, :]
    imgs += [X, np.array(pred), label_crop]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

Recap

FCN = pretrained classification CNN + 1×1 conv + transposed conv upsampler.
All-conv → input size doesn’t matter.
Bilinear-initialized transposed conv is the workable starting point; fine-tunes from there.
The blueprint behind U-Net (skip connections fix the blur), DeepLab (dilated convs avoid the heavy upsampling), and modern segmentation networks.