Fine-Tuning

You’ll rarely train a vision model from scratch. The default recipe is transfer learning: start from weights pretrained on a large dataset such as ImageNet and adapt them to your smaller one.

Fine-tuning: pretrained backbone + new task-specific head.

The standard recipe

Take a pretrained network (ResNet, ViT, etc.).
Replace the output layer with a head for your task.
Optionally freeze early layers; train the rest.
Small LR on the pretrained part, larger LR on the new head.

Setup

%matplotlib inline
import os
from d2l import tensorflow as d2l
import tensorflow as tf
import keras

The hot-dog dataset

A tiny binary classification dataset (hot dog / not hot dog) — too small to train a CNN from scratch, perfect for transfer learning:

d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + 'hotdog.zip', 
                         'fba480ffa8aa7e0febbb511d181409f899b9baa5')

data_dir = d2l.download_extract('hotdog')

from PIL import Image as _PILImage
import pathlib

def _load_image_folder(path):
    """Load images from a directory with class subfolders, returning
    a list of (PIL.Image, class_index) tuples."""
    path = pathlib.Path(path)
    class_names = sorted([p.name for p in path.iterdir() if p.is_dir()])
    class_to_idx = {c: i for i, c in enumerate(class_names)}
    items = []
    for cls in class_names:
        for img_path in sorted((path / cls).iterdir()):
            try:
                img = _PILImage.open(str(img_path)).convert('RGB')
                items.append((img, class_to_idx[cls]))
            except Exception:
                continue
    return items

train_imgs = _load_image_folder(os.path.join(data_dir, 'train'))
test_imgs = _load_image_folder(os.path.join(data_dir, 'test'))

hotdogs = [train_imgs[i][0] for i in range(8)]
not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4);

Augmentation pipelines

Standard ImageNet-style recipe: random crop plus horizontal flip for training, center crop for evaluation. Match the preprocessing convention that the pretrained model expects:

# Plain tf.image / tf.data preprocessing for Keras ResNet50 (NHWC). Keras
# ResNet50 expects its own `preprocess_input` convention, not PyTorch-style
# RGB mean/std normalization.
IMG_SIZE = 224

def _normalize(x):
    return tf.keras.applications.resnet50.preprocess_input(
        tf.cast(x, tf.float32))

def train_augs(x, training=False):
    # Expects a (256, 256, 3) image, i.e. one already resized upstream (for
    # example by image_dataset_from_directory with image_size=(256, 256)).
    x = tf.image.random_crop(x, (IMG_SIZE, IMG_SIZE, 3))
    x = tf.image.random_flip_left_right(x)
    return _normalize(x)

def test_augs(x, training=False):
    # Expects a (256, 256, 3) image, i.e. one already resized upstream.
    # Center crop to IMG_SIZE x IMG_SIZE.
    off = (256 - IMG_SIZE) // 2
    x = x[off:off + IMG_SIZE, off:off + IMG_SIZE, :]
    return _normalize(x)
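
To make "match the preprocessing convention" concrete, here is a NumPy-only sketch of what Keras documents for the ResNet50 `preprocess_input`: flip RGB to BGR, then zero-center each channel with the ImageNet means, without scaling. The helper names `center_crop` and `caffe_style_preprocess` are hypothetical, and the mean constants are assumed to match the ones Keras uses; in real code, call the library function as above.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed to match Keras' "caffe" mode).
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def center_crop(img, size=224):
    """Crop the central size x size window from an (H, W, C) array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size, :]

def caffe_style_preprocess(img_rgb):
    """RGB -> BGR, then subtract the per-channel mean; no division by std."""
    img_bgr = img_rgb[..., ::-1].astype(np.float32)
    return img_bgr - IMAGENET_BGR_MEAN

# A 256x256 image filled with the mean colour should map to all zeros.
mean_rgb = IMAGENET_BGR_MEAN[::-1]
img = np.broadcast_to(mean_rgb, (256, 256, 3))
out = caffe_style_preprocess(center_crop(img))
print(out.shape, float(np.abs(out).max()))
```

Feeding a network inputs normalized with a different convention (for example PyTorch-style RGB mean/std) would shift every activation away from what the pretrained weights expect, which is why the section insists on the model's own function.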

Inspect the pretrained head

The source model was trained for 1000 ImageNet classes. Its convolutional body is reusable; the final classifier is task-specific and will be replaced:

# Load pretrained ResNet50 (full model with top) to inspect the output layer
pretrained_net = keras.applications.ResNet50(weights='imagenet')
pretrained_net.layers[-1]

<Dense name=predictions, built=True>

Replace the task head

Create a target model with the same pretrained backbone and a randomly initialized 2-way classifier for hot dog vs. not hot dog:

# Pretrained ResNet50 base (no top) + global average pool + fresh 2-class head
finetune_net = keras.Sequential([
    keras.applications.ResNet50(weights='imagenet', include_top=False,
                                pooling='avg',
                                input_shape=(IMG_SIZE, IMG_SIZE, 3)),
    keras.layers.Dense(2, kernel_initializer='glorot_uniform',
                       name='classifier'),
])

Discriminative learning rates

Let \theta_b be the pretrained backbone parameters and \theta_h the new head parameters. Use a small step on \theta_b and a larger one on \theta_h:

\eta_b = \eta,\qquad \eta_h = 10\eta.
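
The rule \eta_b = \eta, \eta_h = 10\eta can be checked on toy numbers. The sketch below is framework-free NumPy, not the code used in this section: `sgd_momentum_step` is a hypothetical helper that applies the usual SGD-with-momentum update, and both parameter groups are deliberately given the same gradient so that only the learning rate differs.

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update: v <- m*v - lr*g, then w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

eta = 5e-5                                  # base (backbone) learning rate
lrs = {"backbone": eta, "head": 10 * eta}   # eta_b = eta, eta_h = 10 * eta

# Toy parameters; both groups see the same gradient on purpose.
params = {"backbone": np.ones(3), "head": np.ones(3)}
grads = {"backbone": np.full(3, 2.0), "head": np.full(3, 2.0)}
velocity = {name: np.zeros(3) for name in params}

for name in params:
    params[name], velocity[name] = sgd_momentum_step(
        params[name], grads[name], velocity[name], lrs[name])

step_b = 1.0 - params["backbone"][0]
step_h = 1.0 - params["head"][0]
print(step_h / step_b)   # ratio of head step to backbone step
```

For identical gradients the head moves ten times further per step, so the randomly initialized classifier can adapt quickly while the pretrained features change only slightly.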

Training helper

The helper hides framework details: parameter groups, optimizer construction, metric logging, and the scratch/fine-tune switch. The four-step pattern is:

build the pretrained backbone and new head;
assign a small learning rate to backbone parameters;
assign a larger learning rate to the randomly initialized head;
train and compare against a scratch baseline.
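
The body of `train_fine_tuning` is not reproduced in this section. One way the first three steps could be wired up in TensorFlow is one optimizer per parameter group, as in the sketch below. This is an assumption about the approach, not the helper's actual code: `make_toy_net` and `split_lr_step` are hypothetical names, and a tiny dense network stands in for the pretrained ResNet50 so the example runs in seconds.

```python
import numpy as np
import tensorflow as tf

def make_toy_net():
    """Stand-in for 'pretrained backbone + fresh 2-class head'."""
    backbone = tf.keras.Sequential(
        [tf.keras.Input(shape=(4,)),
         tf.keras.layers.Dense(8, activation='relu')],
        name='backbone')
    head = tf.keras.layers.Dense(2, name='classifier')
    return tf.keras.Sequential([backbone, head])

def split_lr_step(net, x, y, loss_fn, opt_backbone, opt_head):
    """One training step with separate optimizers for backbone and head."""
    backbone_vars = net.layers[0].trainable_variables
    head_vars = net.layers[-1].trainable_variables
    with tf.GradientTape() as tape:
        loss = loss_fn(y, net(x, training=True))
    grads = tape.gradient(loss, backbone_vars + head_vars)
    n = len(backbone_vars)
    # Each optimizer only ever sees its own group of variables.
    opt_backbone.apply_gradients(zip(grads[:n], backbone_vars))
    opt_head.apply_gradients(zip(grads[n:], head_vars))
    return float(loss)

net = make_toy_net()
x = tf.constant(np.random.default_rng(0).normal(size=(16, 4)), dtype=tf.float32)
y = tf.constant(np.arange(16) % 2)
net(x)  # build the variables before collecting them

eta = 5e-5
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt_backbone = tf.keras.optimizers.SGD(learning_rate=eta, momentum=0.9)
opt_head = tf.keras.optimizers.SGD(learning_rate=10 * eta, momentum=0.9)
loss = split_lr_step(net, x, y, loss_fn, opt_backbone, opt_head)
print(f"loss {loss:.3f}")
```

With `finetune_net`, `net.layers[0]` would be the ResNet50 base and `net.layers[-1]` the `classifier` layer, so the same split applies unchanged.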

Run fine-tuning

With matched ImageNet preprocessing and a small base LR, the pretrained model should reach useful accuracy quickly. The point is not just a better final score: fine-tuning also needs much less data and compute than training the same network cold.

train_fine_tuning(finetune_net, 5e-5, momentum=0.9)

epoch 1, loss 0.820, train acc 0.517, test acc 0.751
epoch 2, loss 0.425, train acc 0.796, test acc 0.901
epoch 3, loss 0.279, train acc 0.895, test acc 0.915
epoch 4, loss 0.228, train acc 0.918, test acc 0.934
epoch 5, loss 0.212, train acc 0.923, test acc 0.941

From-scratch baseline

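Same architecture, no pretraining. On this small dataset it does much worse: after five epochs it reaches 0.733 test accuracy (0.833 at best), versus 0.941 for the fine-tuned model. This gap is why transfer learning is the default: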

# Train from scratch: same architecture but with random (no-pretrain) weights.
scratch_base = keras.applications.ResNet50(
    weights=None, include_top=False, pooling='avg',
    input_shape=(IMG_SIZE, IMG_SIZE, 3))
# Keras' default BatchNormalization momentum (0.99) means the moving
# mean/variance never catch up to the actual activation statistics within
# five epochs of ~15 batches each, so the from-scratch model would look
# like random noise at evaluation time (train acc rises, test acc stays
# ~0.5). Lowering momentum to 0.5 lets the running stats track the small
# dataset; the pretrained fine-tuning path keeps the default because its
# moving stats are already calibrated on ImageNet.
for layer in scratch_base.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.momentum = 0.5
scratch_net = keras.Sequential([
    scratch_base,
    keras.layers.Dense(2, kernel_initializer='glorot_uniform',
                       name='classifier'),
])
# Plain SGD (momentum=0) cannot train a randomly-initialised ResNet-50 on
# 2 000 images in five epochs, so we use SGD with momentum=0.9 and a single
# uniform learning rate (no head/backbone split) for the from-scratch run.
train_fine_tuning(scratch_net, 1e-3, param_group=False, momentum=0.9)

epoch 1, loss 0.748, train acc 0.531, test acc 0.505
epoch 2, loss 0.754, train acc 0.573, test acc 0.562
epoch 3, loss 0.555, train acc 0.705, test acc 0.822
epoch 4, loss 0.410, train acc 0.815, test acc 0.833
epoch 5, loss 0.428, train acc 0.822, test acc 0.733
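
The BatchNormalization comment in the cell above can be checked with a few lines of arithmetic. The sketch assumes the usual moving-average rule, moving = momentum * moving + (1 - momentum) * batch_stat, a moving mean that starts at 0, and a constant batch mean of 1.0; `moving_stat` is a hypothetical helper, not part of Keras.

```python
def moving_stat(batch_stat, init, momentum, steps):
    """Apply moving = momentum * moving + (1 - momentum) * batch_stat repeatedly."""
    moving = init
    for _ in range(steps):
        moving = momentum * moving + (1.0 - momentum) * batch_stat
    return moving

steps = 5 * 15       # five epochs of about 15 batches each
true_mean = 1.0      # pretend every batch has mean 1.0

results = {}
for momentum in (0.99, 0.5):
    results[momentum] = moving_stat(true_mean, init=0.0,
                                    momentum=momentum, steps=steps)
    print(f"momentum={momentum}: moving mean after {steps} steps "
          f"= {results[momentum]:.3f}")
```

With momentum 0.99 the moving mean has covered only about half the distance to the true value after 75 updates, so evaluation-time normalization is badly off; with 0.5 it has effectively converged.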

What to vary

The natural ablations are: freeze more or fewer layers, change the backbone/head learning-rate ratio, and compare against the source ImageNet “hotdog” class weights.

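# Freeze the ResNet50 backbone (layer 0 of the Sequential); only the head trains.
# If the model is trained through compile()/fit(), recompile after changing
# `trainable` so the change takes effect.
finetune_net.layers[0].trainable = False

# Kernel of the ImageNet classifier; column 934 is the "hotdog" class.
weight = pretrained_net.layers[-1].get_weights()[0]  # Shape: (2048, 1000)
hotdog_w = weight[:, 934]
hotdog_w.shape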

(2048,)
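
One way to run the "hotdog class weights" ablation is to start the new head from `hotdog_w` instead of random values. The sketch below only builds the arrays, in plain NumPy. It assumes class index 0 is the hot-dog class (the loader sorts folder names, so this depends on how the folders are named), uses a random stand-in for `hotdog_w`, and `head_from_class_vector` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for weight[:, 934]; in the notebook use the real hotdog_w.
hotdog_w = rng.normal(scale=0.01, size=2048).astype(np.float32)

def head_from_class_vector(class_w, positive_index=0, num_classes=2):
    """Build a (features, num_classes) kernel whose positive column is class_w."""
    kernel = np.zeros((class_w.shape[0], num_classes), dtype=class_w.dtype)
    kernel[:, positive_index] = class_w
    bias = np.zeros(num_classes, dtype=class_w.dtype)
    return kernel, bias

kernel, bias = head_from_class_vector(hotdog_w)
print(kernel.shape, bias.shape)

# With a built Keras head, these arrays could then be loaded with
# finetune_net.layers[-1].set_weights([kernel, bias]).
```

Comparing this initialization against the Glorot-initialized head shows how much of the task the source classifier already solves before any fine-tuning.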

Recap

Transfer learning: pretrained backbone + new head; almost always beats from-scratch on small / medium datasets.
Use small LR on the backbone (10×–100× smaller than the head LR) — pretrained features need only nudges.
Match input preprocessing (mean/std normalization, input size, or model-specific preprocess_input) to what the pretrained model expects.
Modern variants: feature-extractor mode (freeze everything but head), full fine-tune (everything trains), parameter-efficient methods (LoRA, adapters).