Objective: train + evaluate

What Is Hyperparameter Optimization?

Hyperparameters are the knobs you set outside gradient descent: learning rate, batch size, depth, dropout rate. A typical model has 5–20 of them. The validation loss as a function of these settings is non-convex, noisy, and expensive: each evaluation costs one full training run.

Hyperparameter optimization (HPO) automates the tuning. Simplest variant: random search — sample configurations from a prior, evaluate, keep the best.
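A minimal sketch of that loop, assuming a cheap stand-in objective in place of a real training run. The names `toy_validation_error`, `sample_learning_rate_config`, and `random_search` are hypothetical and exist only for illustration; the learning-rate range is likewise an assumption.

```python
import math
import random


def toy_validation_error(config):
    """Hypothetical stand-in for a training run: lowest error near lr = 0.1."""
    return (math.log10(config["learning_rate"]) + 1.0) ** 2


def sample_learning_rate_config(rng):
    """Draw one configuration from the prior (log-uniform on [1e-4, 1])."""
    return {"learning_rate": 10 ** rng.uniform(-4, 0)}


def random_search(objective, num_trials, seed=0):
    """Evaluate num_trials sampled configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(num_trials):
        config = sample_learning_rate_config(rng)
        error = objective(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error


best_config, best_error = random_search(toy_validation_error, num_trials=50)
print(best_config, best_error)
```

In a real run the objective would be a full train-and-validate call, so `num_trials` is the budget that matters: every trial costs one training run.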

The HPO workflow

Train multiple models with different hyperparameters; pick the best.

Random search usually beats grid search at the same budget and is competitive with much hand-tuning. Smarter algorithms (Bayesian optimization, Hyperband) come next.
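One common explanation for the gap is that only a few hyperparameters tend to matter for a given problem. The sketch below illustrates this under an assumed setup: two hyperparameters on the unit square, a budget of 9 runs, and only the first hyperparameter affecting the result. All names and values are illustrative.

```python
import itertools
import random

budget = 9  # total number of training runs we can afford

# Grid search: 3 values per hyperparameter, 3 x 3 = 9 configurations.
grid_values = [0.0, 0.5, 1.0]
grid_configs = list(itertools.product(grid_values, grid_values))

# Random search: 9 configurations drawn uniformly from the unit square.
rng = random.Random(0)
random_configs = [(rng.random(), rng.random()) for _ in range(budget)]

# Distinct values tried for the first (assumed important) hyperparameter.
grid_distinct = len({x for x, _ in grid_configs})
random_distinct = len({x for x, _ in random_configs})
print(grid_distinct, random_distinct)
```

With the same budget, the grid probes the important hyperparameter at only 3 distinct values, while independent random draws almost surely probe it at 9.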

Formalizing HPO

Find \mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}) where f is the validation error after training with hyperparameters \mathbf{x}, and \mathcal{X} is the configuration space — a structured product of discrete and continuous ranges.

import tensorflow as tf
# Hide all GPUs so the examples run on CPU
tf.config.set_visible_devices([], 'GPU')
from d2l import tensorflow as d2l
import keras
import numpy as np
from scipy import stats

The “function” we’re optimizing is “train a model with this config, return validation error”. Wrap that into a clean callable:

class HPOTrainer(d2l.Trainer):
    def validation_error(self):
        accuracy = 0
        val_batch_idx = 0
        for batch in self.val_dataloader:
            x, y = self.prepare_batch(batch)
            y_hat = self.model(x, training=False)
            accuracy += self.model.accuracy(y_hat, y)
            val_batch_idx += 1
        return 1 - accuracy / val_batch_idx


def hpo_objective_softmax_classification(config, max_epochs=8):
    """Train a softmax classifier with `config`; return the validation error."""
    learning_rate = config["learning_rate"]
    model = keras.Sequential([
        keras.layers.Flatten(),
        keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
        # Dense(10) outputs raw logits, so the loss must apply the softmax
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )
    data = d2l.FashionMNIST(batch_size=16)
    train_ds = data.get_dataloader(True)
    val_ds = data.get_dataloader(False)
    history = model.fit(train_ds, epochs=max_epochs, validation_data=val_ds,
                        verbose=0)
    val_acc = history.history['val_accuracy'][-1]
    return 1 - val_acc

Configuration space

A structured space — log-uniform for learning rate (spans orders of magnitude), uniform integer for layer counts, categorical for activations:

config_space = {"learning_rate": stats.loguniform(1e-4, 1)}
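The `config_space` entry above can be sampled with its `.rvs()` method. To show what each kind of range means, here is a hand-rolled sketch using only the standard library. The hyperparameter names `num_layers` and `activation`, their ranges, and the helpers `sample_loguniform` and `sample_mixed_config` are assumptions for illustration, not part of the example model.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Sample so that log(value) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_mixed_config(rng):
    """Draw one configuration; names and ranges are illustrative only."""
    return {
        "learning_rate": sample_loguniform(rng, 1e-4, 1.0),
        "num_layers": rng.randint(1, 4),             # uniform integer, inclusive
        "activation": rng.choice(["relu", "tanh"]),  # categorical
    }


rng = random.Random(0)
samples = [sample_mixed_config(rng) for _ in range(1000)]

# Under a log-uniform prior each decade gets roughly equal mass, so about
# a quarter of the learning rates should fall below 1e-3.
frac_small = sum(s["learning_rate"] < 1e-3 for s in samples) / len(samples)
print(samples[0], frac_small)
```

A plain uniform prior on [1e-4, 1] would put about 90% of the samples above 0.1, which is why learning rates are usually sampled on a log scale.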

Recap

  • HPO = optimize a noisy, expensive black-box function over a structured config space.
  • Random search ≫ grid search at modest budget: Bergstra & Bengio (2012) showed empirically that it is more efficient when only a few hyperparameters matter.
  • Random search is also the natural baseline every fancy HPO algorithm has to beat.
  • Coming up: API abstraction, asynchronous parallel search, multi-fidelity (Hyperband, ASHA).