What Is Hyperparameter Optimization?

What Is HPO?

Hyperparameters are the knobs you set outside gradient descent: learning rate, batch size, depth, dropout rate. There are usually 5–20 of them. The validation loss as a function of these settings is non-convex, noisy, and expensive to evaluate: each setting costs one full training run.

Hyperparameter optimization (HPO) automates the tuning. Simplest variant: random search — sample configurations from a prior, evaluate, keep the best.

The HPO workflow

Train multiple models with different hyperparameters; pick the best.

Random search often beats grid search at the same trial budget, and it is competitive with much hand-tuning. Smarter algorithms (Bayesian optimization, Hyperband) come next.
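One intuition for why random search can beat grid search: if only one of two hyperparameters really matters, a 3×3 grid spends nine trials but probes just three distinct values of the important one, while nine random draws probe nine. The sketch below counts this; the two-parameter unit square and the "only the first coordinate matters" setup are illustrative assumptions, not part of the experiment in these notes.

```python
import itertools
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Budget: 9 trials over two hyperparameters, each in [0, 1].
# Hypothetical setup: only the first coordinate affects the objective.
grid_axis = [0.0, 0.5, 1.0]
grid_trials = list(itertools.product(grid_axis, grid_axis))
random_trials = [(rng.random(), rng.random()) for _ in range(9)]

# Count the distinct values tried for the important (first) hyperparameter.
grid_distinct = len({x for x, _ in grid_trials})
random_distinct = len({x for x, _ in random_trials})

print(f"grid:   {grid_distinct} distinct values of the important hyperparameter")
print(f"random: {random_distinct} distinct values of the important hyperparameter")
```

With the same budget, the random design explores the important axis three times as densely.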

Formalizing HPO

Find $\mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x})$, where $f(\mathbf{x})$ is the validation error after training with hyperparameters $\mathbf{x}$, and $\mathcal{X}$ is the configuration space: a structured product of discrete and continuous ranges.
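To make the notation concrete, here is a minimal sketch that treats f as an ordinary Python callable and approximates the arg min over a finite sample from the configuration space. The quadratic `f` is a made-up stand-in for validation error (no model is trained), and the `dropout` parameter and its range are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)

def f(x):
    """Hypothetical stand-in for validation error; lowest at lr=1e-2, dropout=0.2."""
    return (math.log10(x["learning_rate"]) + 2) ** 2 + (x["dropout"] - 0.2) ** 2

# A finite sample of candidate configurations x from the space X.
candidates = [
    {"learning_rate": 10 ** rng.uniform(-4, 0), "dropout": rng.uniform(0.0, 0.5)}
    for _ in range(50)
]

# x* is approximated by the best candidate in the sample.
x_star = min(candidates, key=f)
print(x_star, f(x_star))
```

In real HPO every call to `f` is a full training run, which is why the number of candidates has to stay small.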

from d2l import jax as d2l
from flax import nnx
from jax import numpy as jnp
import numpy as np
from scipy import stats

Objective: train + evaluate

The “function” we’re optimizing is “train a model with this config, return validation error”. Wrap that into a clean callable:

@nnx.jit
def hpo_validation_batch(model, batch):
    _, batch_accuracy = model.validation_step(batch)
    num_examples = batch[-1].size
    return jnp.array([batch_accuracy * num_examples, num_examples])

class HPOTrainer(d2l.Trainer):
    def validation_error(self):
        metric = jnp.zeros(2)  # num_correct, num_examples
        for batch in self.val_dataloader:
            batch = self.prepare_batch(batch)
            metric += hpo_validation_batch(self.val_model, batch)
        return 1 - metric[0] / metric[1]

def hpo_objective_softmax_classification(config, max_epochs=8):
    """Train softmax regression with `config` and return its validation error."""
    learning_rate = config["learning_rate"]
    # Use the HPOTrainer defined above, not a copy looked up on the d2l module.
    trainer = HPOTrainer(max_epochs=max_epochs)
    data = d2l.FashionMNIST(batch_size=16)
    model = d2l.SoftmaxRegression(num_outputs=10, lr=learning_rate)
    trainer.fit(model=model, data=data)
    return float(trainer.validation_error())

Configuration space

A structured space: log-uniform for learning rate (it spans orders of magnitude), uniform integer for layer counts, categorical for activations. Here we search over the learning rate only:

config_space = {"learning_rate": stats.loguniform(1e-4, 1)}
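The space above has a single entry. As a rough sketch of what a mixed space could look like, the block below samples a log-uniform learning rate, an integer layer count, and a categorical activation using only the standard library. The names `num_layers` and `activation`, their ranges, and the helpers `sample_loguniform` and `sample_config` are hypothetical; `sample_loguniform` is a hand-rolled stand-in for `stats.loguniform(...).rvs()`.

```python
import math
import random

rng = random.Random(0)

def sample_loguniform(low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

# Hypothetical three-parameter space: each entry is a zero-argument sampler.
mixed_space = {
    "learning_rate": lambda: sample_loguniform(1e-4, 1.0),
    "num_layers": lambda: rng.randint(1, 4),  # inclusive on both ends
    "activation": lambda: rng.choice(["relu", "tanh", "gelu"]),
}

def sample_config(space):
    """Draw one configuration by sampling every hyperparameter independently."""
    return {name: draw() for name, draw in space.items()}

configs = [sample_config(mixed_space) for _ in range(1000)]

# Log-uniform sampling gives each decade of [1e-4, 1] about equal weight.
share_small = sum(c["learning_rate"] < 1e-3 for c in configs) / len(configs)
print(configs[0])
print(f"share of learning rates below 1e-3: {share_small:.2f}")
```

A plain uniform distribution on [1e-4, 1] would put only about 0.1% of draws below 1e-3, which is the reason for sampling the learning rate on a log scale.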

Random search

Iterate: draw random config, evaluate, log. Keep the best seen so far. Brutally simple, surprisingly effective:

errors, values = [], []
num_iterations = 5

for i in range(num_iterations):
    learning_rate = config_space["learning_rate"].rvs()
    print(f"Trial {i}: learning_rate = {learning_rate}")
    y = hpo_objective_softmax_classification({"learning_rate": learning_rate})
    print(f"    validation_error = {y}")
    values.append(learning_rate)
    errors.append(y)

    validation_error = 0.1923999786376953

best_idx = np.argmin(errors)
print(f"optimal learning rate = {values[best_idx]}")

optimal learning rate = 0.003136396628555046
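"Keep the best seen so far" can also be tracked per trial, which is how HPO methods are usually compared: plot the incumbent's error against the number of trials. A minimal sketch follows; the error values are invented for illustration and are not the results of the run above.

```python
from itertools import accumulate

# Hypothetical validation errors from five trials, in the order they ran.
trial_errors = [0.31, 0.24, 0.27, 0.19, 0.22]

# Running minimum: the error of the best configuration found after each trial.
best_so_far = list(accumulate(trial_errors, min))
print(best_so_far)  # [0.31, 0.24, 0.24, 0.19, 0.19]
```

The curve never increases, and it flattens whenever a trial fails to improve on the incumbent.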

Recap

HPO = optimize a noisy, expensive black-box function over a structured config space.
Random search is typically more efficient than grid search at modest budget; Bergstra & Bengio (2012) showed this empirically.
Random search is also the natural baseline every fancy HPO algorithm has to beat.
Coming up: API abstraction, asynchronous parallel search, multi-fidelity (Hyperband, ASHA).