Fine-Tuning

You’ll rarely train a vision model from scratch. Transfer learning, which starts from weights pretrained on a large dataset such as ImageNet and adapts them to your smaller one, is the default recipe.

Fine-tuning: pretrained backbone + new task-specific head.

The standard recipe

Take a pretrained network (ResNet, ViT, etc.).
Replace the output layer with a head for your task.
Optionally freeze early layers; train the rest.
Small LR on the pretrained part, larger LR on the new head.

Setup

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, init, np, npx
from mxnet.gluon import nn
import os

npx.set_np()

The hot-dog dataset

A tiny binary classification dataset (hot dog / not hot dog) — too small to train a CNN from scratch, perfect for transfer learning:

d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + 'hotdog.zip', 
                         'fba480ffa8aa7e0febbb511d181409f899b9baa5')

data_dir = d2l.download_extract('hotdog')

train_imgs = gluon.data.vision.ImageFolderDataset(
    os.path.join(data_dir, 'train'))
test_imgs = gluon.data.vision.ImageFolderDataset(
    os.path.join(data_dir, 'test'))

hotdogs = [train_imgs[i][0] for i in range(8)]
not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4);

Augmentation pipelines

Standard ImageNet recipe — random resized crop + flip for training, center crop for eval. Match the preprocessing convention that the pretrained model expects:

# Specify the means and standard deviations of the three RGB channels to
# standardize each channel
normalize = gluon.data.vision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

train_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.RandomResizedCrop(224),
    gluon.data.vision.transforms.RandomFlipLeftRight(),
    gluon.data.vision.transforms.ToTensor(),
    normalize])

test_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.Resize(256),
    gluon.data.vision.transforms.CenterCrop(224),
    gluon.data.vision.transforms.ToTensor(),
    normalize])
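
To make the normalization concrete, the following plain-Python sketch applies the same per-channel standardization to a single RGB pixel and computes where a centered 224-pixel crop starts. It mirrors the arithmetic only, not the Gluon implementation; the sample pixel and the 256x256 input size are illustrative assumptions.

```python
MEAN = (0.485, 0.456, 0.406)   # per-channel means used in the pipelines above
STD = (0.229, 0.224, 0.225)    # per-channel standard deviations used above


def normalize_pixel(rgb_uint8):
    """Scale 0-255 values to [0, 1], then standardize each channel."""
    scaled = [v / 255 for v in rgb_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, MEAN, STD)]


def center_crop_origin(height, width, crop):
    """Top-left corner (row, col) of a centered crop-by-crop window."""
    return (height - crop) // 2, (width - crop) // 2


# Illustrative inputs (assumptions, not values from the dataset).
white = normalize_pixel((255, 255, 255))
origin = center_crop_origin(256, 256, 224)
```

A model pretrained with these statistics expects inputs on this standardized scale, which is why the same mean and standard deviation are reused for both training and evaluation.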

Inspect the pretrained head

The source model was trained for 1000 ImageNet classes. Its convolutional body is reusable; the final classifier is task-specific and will be replaced:

pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.output

Dense(512 -> 1000, linear)

Replace the task head

Create a target model with the same pretrained backbone and a randomly initialized 2-way classifier for hot dog vs. not hot dog:

finetune_net = gluon.model_zoo.vision.resnet18_v2(classes=2)
finetune_net.features = pretrained_net.features
finetune_net.output.initialize(init.Xavier())

Discriminative learning rates

Let \theta_b be the pretrained backbone parameters and \theta_h the new head parameters. Use a small step on \theta_b and a larger one on \theta_h:

\eta_b = \eta,\qquad \eta_h = 10\eta.

# Update the output-layer parameters with a learning rate ten times the
# base rate
for p in finetune_net.output.collect_params().values():
    p.lr_mult = 10
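
The effect of a learning-rate multiplier can be shown without any framework. The sketch below is a toy illustration, not Gluon's optimizer: the function name `sgd_step`, the group names, and all numbers are hypothetical. Each group takes a plain SGD step of size base rate times its multiplier.

```python
def sgd_step(params, grads, base_lr, lr_mult):
    """One plain SGD step; each group's step size is base_lr * lr_mult[group]."""
    updated = {}
    for group, values in params.items():
        step = base_lr * lr_mult.get(group, 1.0)  # default multiplier is 1
        updated[group] = [w - step * g for w, g in zip(values, grads[group])]
    return updated


# Toy parameters and gradients (hypothetical values).
params = {"backbone": [1.0, -2.0], "head": [0.5, 0.5]}
grads = {"backbone": [1.0, 1.0], "head": [1.0, -1.0]}

# The head moves ten times as far per unit of gradient as the backbone.
new_params = sgd_step(params, grads, base_lr=0.01, lr_mult={"head": 10.0})
```

With a base rate of 0.01, a unit gradient shifts a backbone weight by 0.01 and a head weight by 0.1, which matches \eta_b = \eta and \eta_h = 10\eta.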

Training helper

The helper, train_fine_tuning, hides framework details: parameter groups, optimizer construction, metric logging, and the scratch/fine-tune switch. The four-step pattern is:

build the pretrained backbone and new head;
assign a small learning rate to backbone parameters;
assign a larger learning rate to the randomly initialized head;
train and compare against a scratch baseline.
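
The helper is called below as train_fine_tuning(net, learning_rate), but its body is not shown. A minimal sketch of what such a helper might look like in Gluon follows. It assumes the train_imgs, test_imgs, train_augs, and test_augs objects defined earlier; the batch size of 128, the 5 epochs, and the weight decay of 0.001 are illustrative defaults that the text does not state. No separate parameter groups are built here, because the lr_mult attribute set on the head is what scales its step inside the Trainer.

```python
def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5):
    """Sketch of the training helper (defaults are assumptions).

    Relies on train_imgs, test_imgs, train_augs, and test_augs from the
    earlier cells.
    """
    # Imported inside the function so the sketch stands alone; in the
    # notebook these names already come from the Setup cell.
    from d2l import mxnet as d2l
    from mxnet import gluon

    # Apply the augmentation pipelines to the image only, not the label.
    train_iter = gluon.data.DataLoader(
        train_imgs.transform_first(train_augs), batch_size, shuffle=True)
    test_iter = gluon.data.DataLoader(
        test_imgs.transform_first(test_augs), batch_size)

    # Move the parameters to every available GPU (CPU if there is none).
    devices = d2l.try_all_gpus()
    net.collect_params().reset_ctx(devices)
    net.hybridize()

    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    # One optimizer for all parameters; per-parameter lr_mult scales the step.
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {
        'learning_rate': learning_rate, 'wd': 0.001})  # wd is an assumed value
    d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
                   devices)
```

The same function serves both the fine-tuned and the from-scratch runs: the only differences are the network passed in and the base learning rate.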

Run fine-tuning

With matched ImageNet preprocessing and a small base LR, the pretrained model should reach useful accuracy quickly. The gain is not only a better final score: fine-tuning also needs far less data and compute than training the same network from random initialization.

# lr divided by batch_size: gluon Trainer no longer rescales (issue 7 fix in d2l.train_batch_ch13)
train_fine_tuning(finetune_net, 7.8125e-5)

loss 0.279, train acc 0.915, test acc 0.921
282.7 examples/sec on [gpu(0)]
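
The base rate 7.8125e-5 follows from the rule in the code comment: a nominal learning rate divided by the batch size. The sketch below reproduces that arithmetic under the assumption of a nominal rate of 0.01 and a batch size of 128. Neither value is stated in the text; they are simply one pair consistent with the number used.

```python
nominal_lr = 0.01   # assumed nominal learning rate (not given in the text)
batch_size = 128    # assumed batch size (not given in the text)

base_lr = nominal_lr / batch_size   # rate passed to the helper for the backbone
head_lr = 10 * base_lr              # the head uses lr_mult = 10

print(f"backbone lr: {base_lr:.4e}")
print(f"head lr:     {head_lr:.4e}")
```

Under these assumptions the head trains at 7.8125e-4, the same value the from-scratch baseline uses for all of its parameters.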

From-scratch baseline

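Same architecture, no pretraining. The learning rate is ten times the fine-tuning base rate, since there are no pretrained weights to preserve. In the runs shown here, test accuracy reaches 0.828 compared with 0.921 for the fine-tuned model, which illustrates why transfer learning is the default: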

scratch_net = gluon.model_zoo.vision.resnet18_v2(classes=2)
scratch_net.initialize(init=init.Xavier())
# lr divided by batch_size: gluon Trainer no longer rescales (issue 7 fix in d2l.train_batch_ch13)
train_fine_tuning(scratch_net, 7.8125e-4)

loss 0.348, train acc 0.862, test acc 0.828
513.3 examples/sec on [gpu(0)]

What to vary

The natural ablations are: freeze more or fewer layers, change the backbone/head learning-rate ratio, and compare against the source ImageNet “hotdog” class weights.

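# Feature-extractor mode: stop gradient computation for the whole backbone
# so that only the output layer is trained.
for p in finetune_net.features.collect_params().values():
    p.grad_req = 'null'

# Pull the row of the 1000-way source classifier at index 713, the one
# treated here as the ImageNet "hotdog" class.
weight = pretrained_net.output.weight
hotdog_w = np.split(weight.data(), 1000, axis=0)[713]
hotdog_w.shape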

(1, 512)
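
The "freeze more or fewer layers" ablation amounts to choosing a cut point in the backbone. A minimal sketch of that bookkeeping, independent of any framework, is given here. The function name freeze_plan and the layer names are hypothetical placeholders; the 'null' and 'write' strings follow the grad_req convention used in the snippet above.

```python
def freeze_plan(layer_names, num_frozen):
    """Map each layer to 'null' (frozen) or 'write' (trainable).

    The first num_frozen layers are frozen; the rest keep their gradients.
    """
    if not 0 <= num_frozen <= len(layer_names):
        raise ValueError("num_frozen must be between 0 and the layer count")
    return {name: ('null' if i < num_frozen else 'write')
            for i, name in enumerate(layer_names)}


# Hypothetical layer names, ordered from input to output.
layers = ['stem', 'stage1', 'stage2', 'stage3', 'stage4', 'head']

feature_extractor = freeze_plan(layers, num_frozen=5)  # only the head trains
partial = freeze_plan(layers, num_frozen=2)            # early layers frozen
full_fine_tune = freeze_plan(layers, num_frozen=0)     # everything trains
```

Sweeping num_frozen from 0 to the number of backbone layers covers the range between full fine-tuning and pure feature extraction.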

Recap

Transfer learning: pretrained backbone + new head; almost always beats from-scratch on small / medium datasets.
Use small LR on the backbone (10×–100× smaller than the head LR) — pretrained features need only nudges.
Match input preprocessing (mean/std normalization, input size, or model-specific preprocess_input) to what the pretrained model expects.
Modern variants: feature-extractor mode (freeze everything but head), full fine-tune (everything trains), parameter-efficient methods (LoRA, adapters).