Asynchronous Random Search

Random search is embarrassingly parallel: each trial is independent. With K machines, you would hope for a K× speedup. But synchronous parallelism wastes time on stragglers: every batch of K trials waits for the slowest one.

Asynchronous random search keeps every worker busy: as soon as a worker finishes, it receives a new configuration. No worker idles while a slow trial completes, so total wall-clock time comes much closer to the ideal K× speedup.

Sync vs async parallel HPO: async avoids idle workers when trials finish at different times.
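
The straggler effect can be illustrated with a small simulation. This is a minimal sketch, not part of Syne Tune: the trial durations are made-up random numbers, and the helper names `sync_makespan` and `async_makespan` are hypothetical. Both schedules run the same trials in the same order on `k` workers.

```python
import heapq
import random


def sync_makespan(durations, k):
    """Wall-clock time when trials run in batches of k and every batch
    waits for its slowest trial before the next batch starts."""
    return sum(max(durations[i:i + k]) for i in range(0, len(durations), k))


def async_makespan(durations, k):
    """Wall-clock time when each worker takes the next trial as soon as
    it becomes free."""
    free_at = [0.0] * k  # time at which each worker becomes free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)  # earliest-free worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


random.seed(0)
# Hypothetical trial runtimes (arbitrary time units)
durations = [random.uniform(1.0, 10.0) for _ in range(40)]
k = 4
t_sync = sync_makespan(durations, k)
t_async = async_makespan(durations, k)
print(f"sync:  {t_sync:.1f}")
print(f"async: {t_async:.1f}")
```

For the same trial order, the asynchronous schedule never finishes later than the synchronous one, and the gap grows with the spread of trial runtimes.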

This deck wires up an async scheduler around the API abstraction from the previous deck.

Setup

from d2l import torch as d2l
import logging
# Use INFO level so the periodic Syne Tune tuning-status table appears,
# but use a clean format that drops the "INFO:syne_tune.tuner:" prefix.
logging.basicConfig(level=logging.INFO, format="%(message)s", force=True)
# Silence Syne Tune's import-time chatter about optional AWS dependencies
# (sagemaker, s3fs) and Ray Tune. We use the local PythonBackend, so those
# are not needed. Suppress both print() and logging.info() during imports.
import contextlib, io
_root = logging.getLogger()
_prev_level = _root.level
_root.setLevel(logging.WARNING)
try:
    with contextlib.redirect_stdout(io.StringIO()):
        from syne_tune.config_space import loguniform, randint
        from syne_tune.backend.python_backend.python_backend import PythonBackend
        from syne_tune.optimizer.baselines import RandomSearch
        from syne_tune import Tuner, StoppingCriterion
        from syne_tune.experiments import load_experiment
finally:
    _root.setLevel(_prev_level)

# Silence the per-trial subprocess-command spam from local_backend and
# drop the per-trial scheduling / completion lines from the tuner logger.
# Keep the periodic "tuning status (last metric is reported)" updates so
# the reader can still see progress over time.
class _DropPerTrialNoise(logging.Filter):
    _DROP = (
        "results of trials will be saved",
        "scheduled ",
        "Trial trial_id ",
    )
    def filter(self, record):
        msg = record.getMessage()
        return not any(s in msg for s in self._DROP)

logging.getLogger("syne_tune.backend.local_backend").setLevel(logging.WARNING)
logging.getLogger("syne_tune.tuner").addFilter(_DropPerTrialNoise())

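Objective with variable wall time

The objective trains LeNet on Fashion-MNIST and reports the validation error after every epoch. Its runtime depends on the hyperparameters (for example, on the batch size), which exposes the straggler problem clearly: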

def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    # Imports live inside the function because PythonBackend runs it in a
    # separate process.
    from d2l import torch as d2l
    from syne_tune import Reporter

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # The first call to fit() initializes the trainer state
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = d2l.numpy(trainer.validation_error().cpu())
        # Send the per-epoch metric back to the tuner
        report(epoch=epoch, validation_error=float(validation_error))

Async scheduler

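Syne Tune maintains the worker pool for us: it dispatches a new trial whenever a worker is free and collects completed results asynchronously. We only choose the number of workers, the wall-clock budget, and the metric to optimize: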

# Each LeNet trial fits in well under 7 GB of GPU memory, so we can pack
# multiple trials per device. `PythonBackend(rotate_gpus=True)` (the
# default) round-robins trials across detected GPUs and falls back to
# sharing when `n_workers > num_gpus`. Allocate 7 GB per slot — this
# yields 3 slots on a 24 GB card and 4 slots on a 32 GB card after
# driver overhead, e.g. 4×24 GB → 12 slots; 2×32 GB → 8.
import torch
_GB = 1024 ** 3
n_workers = sum(
    torch.cuda.get_device_properties(i).total_memory // (7 * _GB)
    for i in range(torch.cuda.device_count())
) or 1

max_wallclock_time = 15 * 60  # 15 minutes

mode = "min"
metric = "validation_error"

Scheduler (cont.)

config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": 10,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}

trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)
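
To make the search space concrete, here is a minimal standard-library sketch of what drawing one configuration from it could look like. It does not use Syne Tune's own sampling code; the helpers `sample_loguniform` and `sample_randint` are hypothetical, and the sketch assumes that `randint` treats both bounds as inclusive.

```python
import math
import random


def sample_loguniform(low, high, rng):
    # Uniform in log space: each decade in [low, high] is equally likely.
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_randint(low, high, rng):
    # Assumption: both bounds are inclusive.
    return rng.randint(low, high)


rng = random.Random(0)
configs = [
    {
        "learning_rate": sample_loguniform(1e-2, 1.0, rng),
        "batch_size": sample_randint(32, 256, rng),
        "max_epochs": 10,  # constant, not sampled
    }
    for _ in range(5)
]
for config in configs:
    print(config)
```

Sampling the learning rate in log space spends as many draws between 0.01 and 0.1 as between 0.1 and 1, which a plain uniform draw would not.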

Wiring up the loop

scheduler = RandomSearch(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
)

stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)

tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler, 
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)
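
The `Tuner` hides the dispatch loop. As an illustration only, the dispatch-on-availability idea can be sketched with the standard library. This is not how Syne Tune is implemented: `sample_config`, `toy_objective`, and `async_random_search` are hypothetical names, and the objective is a stand-in that sleeps instead of training a model.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def sample_config(rng):
    # Hypothetical search space: one learning rate in [0.01, 1].
    return {"learning_rate": rng.uniform(0.01, 1.0)}


def toy_objective(config):
    # Stand-in for training: runtime varies from trial to trial.
    time.sleep(0.01 + 0.04 * config["learning_rate"])
    return (config["learning_rate"] - 0.3) ** 2


def async_random_search(n_trials, n_workers, seed=0):
    rng = random.Random(seed)
    results = []
    pending = {}  # future -> config of the trial it is running
    submitted = 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while submitted < n_trials or pending:
            # Dispatch new trials until every worker is busy.
            while submitted < n_trials and len(pending) < n_workers:
                config = sample_config(rng)
                pending[pool.submit(toy_objective, config)] = config
                submitted += 1
            # Block only until the first running trial finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                results.append((pending.pop(future), future.result()))
    return results


results = async_random_search(n_trials=12, n_workers=4)
best_config, best_value = min(results, key=lambda r: r[1])
print(best_config, best_value)
```

The key design choice is `return_when=FIRST_COMPLETED`: the loop wakes up as soon as any trial ends, rather than waiting for the whole batch.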

Loop (cont.)

Run the tuner, then load the experiment results. The raw Syne Tune logs contain local paths and backend commands, so the slide keeps the plot-producing analysis cell instead of the console transcript.

tuner.run()  # blocks until the stopping criterion is met

d2l.set_figsize()
tuning_experiment = load_experiment(tuner.name)
tuning_experiment.plot()

Wall-clock advantage

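Plot each trial's validation error against wall-clock time. Each curve is one trial and each marker is one reported epoch. A new curve starts as soon as a worker frees up, so the tuner keeps making progress instead of sitting idle while it waits for stragglers: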

d2l.set_figsize([6, 2.5])
results = tuning_experiment.results

for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )
    
d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")
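
The plot above shows per-trial learning curves. A best-seen-so-far curve can be derived from the same kind of data with a running minimum. The sketch below uses made-up `(time, error)` records rather than real tuning results, and `best_so_far` is a hypothetical helper name.

```python
from itertools import accumulate


def best_so_far(records):
    """records: iterable of (wall_clock_time, validation_error) pairs.
    Returns the times in ascending order and the running minimum error."""
    ordered = sorted(records, key=lambda r: r[0])
    times = [t for t, _ in ordered]
    best = list(accumulate((e for _, e in ordered), min))
    return times, best


# Made-up measurements, for illustration only
records = [(15.0, 0.55), (12.0, 0.40), (21.0, 0.31), (30.0, 0.35), (33.0, 0.28)]
times, best = best_so_far(records)
for t, b in zip(times, best):
    print(f"{t:5.1f}  {b:.2f}")
```

With real results, the pairs would come from the `st_tuner_time` and `validation_error` columns used in the plotting code.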

Recap

Async random search draws configurations from the same distribution as sync random search, so the two are statistically comparable, but async makes much better use of wall-clock time.
The skeleton (worker pool, dispatch on availability) generalizes to any HPO algorithm — not just random search.
Production HPO libraries (Syne Tune, Ray Tune) make async the default.