%matplotlib inline
from d2l import torch as d2l
from torch import nn
import torch
import torchvision
import os
You’ll rarely train a vision model from scratch. Transfer learning (start from weights pretrained on a big dataset such as ImageNet, then adapt them to your small one) is the default recipe.
Fine-tuning: pretrained backbone + new task-specific head.
A tiny binary classification dataset (hot dog / not hot dog) — too small to train a CNN from scratch, perfect for transfer learning:
Standard ImageNet recipe — random resized crop + flip for training, center crop for eval. Match the preprocessing convention that the pretrained model expects:
# Specify the means and standard deviations of the three RGB channels to
# standardize each channel
normalize = torchvision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    normalize])
test_augs = torchvision.transforms.Compose([
    torchvision.transforms.Resize([256, 256]),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    normalize])
The source model was trained for 1000 ImageNet classes. Its convolutional body is reusable; the final classifier is task-specific and will be replaced:
The source model's final fully connected layer is the 1000-way ImageNet classifier:
Linear(in_features=512, out_features=1000, bias=True)
Create a target model with the same pretrained backbone and a randomly initialized 2-way classifier for hot dog vs. not hot dog:
Let \theta_b be pretrained backbone parameters and \theta_h the new head. Use a small step on \theta_b and a larger one on \theta_h:
\eta_b = \eta,\qquad \eta_h = 10\eta.
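The two learning rates map directly onto optimizer parameter groups. The sketch below is one way to set that up with `torch.optim.SGD`; `TinyNet`, `make_optimizer`, the `param_group` flag, and the default values for `base_lr` and `weight_decay` are hypothetical names and illustrative numbers, not taken from the text. It assumes the head is registered under the attribute name `fc`.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained model whose head is named `fc`."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(self.features(x))


def make_optimizer(net, base_lr=5e-5, head_lr_mult=10.0, param_group=True,
                   weight_decay=1e-3):
    """SGD with lr=base_lr on the backbone and head_lr_mult * base_lr on `fc`."""
    if not param_group:
        # Scratch baseline: one learning rate for every parameter.
        return torch.optim.SGD(net.parameters(), lr=base_lr,
                               weight_decay=weight_decay)
    head = [p for n, p in net.named_parameters() if n.startswith("fc.")]
    backbone = [p for n, p in net.named_parameters()
                if not n.startswith("fc.")]
    return torch.optim.SGD(
        [{"params": backbone},                           # inherits lr=base_lr
         {"params": head, "lr": base_lr * head_lr_mult}],  # larger step
        lr=base_lr, weight_decay=weight_decay)


opt = make_optimizer(TinyNet(), base_lr=5e-5)
lrs = [g["lr"] for g in opt.param_groups]
print(lrs)  # backbone rate first, head rate second
```

Setting `param_group=False` gives the single-rate optimizer you would use when training the same architecture from scratch, so one function can serve both runs.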
The helper hides framework details: parameter groups, optimizer construction, metric logging, and the scratch/fine-tune switch. The four-step pattern is:
With matched ImageNet preprocessing and a small base LR, the pretrained model should reach useful accuracy quickly. The point is not just a better final score; it is much less data and compute than training the same network cold.
loss 0.238, train acc 0.912, test acc 0.880
1876.0 examples/sec on [device(type='cuda', index=0)]
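Same architecture, no pretraining, trained from scratch. On this small dataset it ends with higher loss and lower train and test accuracy than the fine-tuned run (test acc 0.849 vs. 0.880), which illustrates why transfer learning is the default: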
loss 0.400, train acc 0.820, test acc 0.849
2093.3 examples/sec on [device(type='cuda', index=0)]
The natural ablations are: freeze more or fewer layers, change the backbone/head learning-rate ratio, and compare against the source ImageNet “hotdog” class weights.
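For the freezing ablation, a minimal sketch: turn off gradients for everything except the chosen layers and hand only the remaining parameters to the optimizer. `TinyNet` and `freeze_except` are hypothetical names, and the sketch assumes the head lives under the attribute `fc`; adding more prefixes to `trainable_prefixes` unfreezes more layers.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained model whose head is named `fc`."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(self.features(x))


def freeze_except(net, trainable_prefixes=("fc.",)):
    """Disable gradients for every parameter whose name lacks a listed prefix."""
    for name, param in net.named_parameters():
        param.requires_grad = name.startswith(tuple(trainable_prefixes))
    # Return only what the optimizer should update.
    return [p for p in net.parameters() if p.requires_grad]


net = TinyNet()
trainable = freeze_except(net)
optimizer = torch.optim.SGD(trainable, lr=5e-4)

# One backward pass: frozen backbone parameters receive no gradient.
loss = net(torch.randn(2, 3, 16, 16)).sum()
loss.backward()
frozen_has_grad = any(p.grad is not None
                      for n, p in net.named_parameters()
                      if not n.startswith("fc."))
print(len(trainable), frozen_has_grad)
```

Note that `requires_grad = False` stops weight updates only. Batch normalization layers in the frozen part still update their running statistics in training mode unless those modules are switched to `eval()`.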
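Whatever the framework, match the input preprocessing (normalization statistics, or a helper such as Keras's preprocess_input) to what the pretrained model expects.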