Fine-Tuning

You’ll rarely train a vision model from scratch. Transfer learning — start from weights pretrained on a big dataset (ImageNet) and adapt to your small one — is the default recipe.

Fine-tuning: pretrained backbone + new task-specific head.

The standard recipe

Take a pretrained network (ResNet, ViT, etc.).
Replace the output layer with a head for your task.
Optionally freeze early layers; train the rest.
Small LR on the pretrained part, larger LR on the new head.

Setup

%matplotlib inline
from d2l import torch as d2l
from torch import nn
import torch
import torchvision
import os

The hot-dog dataset

A tiny binary classification dataset (hot dog / not hot dog) — too small to train a CNN from scratch, perfect for transfer learning:

d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + 'hotdog.zip', 
                         'fba480ffa8aa7e0febbb511d181409f899b9baa5')

data_dir = d2l.download_extract('hotdog')

train_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'train'))
test_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'test'))

hotdogs = [train_imgs[i][0] for i in range(8)]
not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4);

Augmentation pipelines

Standard ImageNet recipe — random resized crop + flip for training, center crop for eval. Match the preprocessing convention that the pretrained model expects:

# Specify the means and standard deviations of the three RGB channels to
# standardize each channel
normalize = torchvision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    normalize])

test_augs = torchvision.transforms.Compose([
    torchvision.transforms.Resize([256, 256]),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    normalize])

Inspect the pretrained head

The source model was trained for 1000 ImageNet classes. Its convolutional body is reusable; the final classifier is task-specific and will be replaced:

pretrained_net = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.DEFAULT)

Replace the task head

Create a target model with the same pretrained backbone and a randomly initialized 2-way classifier for hot dog vs. not hot dog:

pretrained_net.fc

Linear(in_features=512, out_features=1000, bias=True)

Discriminative learning rates

Let \theta_b be pretrained backbone parameters and \theta_h the new head. Use a small step on \theta_b and a larger one on \theta_h:

\eta_b = \eta,\qquad \eta_h = 10\eta.

finetune_net = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.DEFAULT)
finetune_net.fc = nn.Linear(finetune_net.fc.in_features, 2)
nn.init.xavier_uniform_(finetune_net.fc.weight);

Training helper

The helper hides framework details: parameter groups, optimizer construction, metric logging, and the scratch/fine-tune switch. The four-step pattern is:

build the pretrained backbone and new head;
assign a small learning rate to backbone parameters;
assign a larger learning rate to the randomly initialized head;
train and compare against a scratch baseline.

Run fine-tuning

With matched ImageNet preprocessing and a small base LR, the pretrained model should reach useful accuracy quickly. The point is not just a better final score; it is much less data and compute than training the same network cold.

train_fine_tuning(finetune_net, 5e-5)

loss 0.163, train acc 0.940, test acc 0.943
1730.9 examples/sec on [device(type='cuda', index=0)]

From-scratch baseline

Same architecture, no pretraining. Much worse on this small dataset — illustrates why transfer learning is the default:

scratch_net = torchvision.models.resnet18()
scratch_net.fc = nn.Linear(scratch_net.fc.in_features, 2)
train_fine_tuning(scratch_net, 5e-4, param_group=False)

loss 0.379, train acc 0.837, test acc 0.836
1521.4 examples/sec on [device(type='cuda', index=0)]

What to vary

The natural ablations are: freeze more or fewer layers, change the backbone/head learning-rate ratio, and compare against the source ImageNet “hotdog” class weights.

for param in finetune_net.parameters():
    param.requires_grad = False

weight = pretrained_net.fc.weight
hotdog_w = torch.split(weight.data, 1, dim=0)[934]
hotdog_w.shape

torch.Size([1, 512])

Recap

Transfer learning: pretrained backbone + new head; almost always beats from-scratch on small / medium datasets.
Use small LR on the backbone (10×–100× smaller than the head LR) — pretrained features need only nudges.
Match input preprocessing (mean/std normalization, input size, or model-specific preprocess_input) to what the pretrained model expects.
Modern variants: feature-extractor mode (freeze everything but head), full fine-tune (everything trains), parameter-efficient methods (LoRA, adapters).