Schedules

An optimizer is three decisions: a direction, a step size over time, and a way of living with noise. This section is the second decision — the schedule t \mapsto \eta_t.

Two facts force it to exist:

Constant \eta: SGD parks on a noise floor \propto \eta. The rate must come down.
The target rate is often lethal at initialization. The rate must first come up.
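
The first fact can be checked on a toy problem. A minimal sketch, not part of the testbed: it assumes a 1-D quadratic with curvature h and gradient noise of variance sigma2 (both hypothetical names) and iterates the exact variance recursion of SGD, so no random numbers are needed.

```python
def noise_floor(eta, h=1.0, sigma2=1.0, steps=2000):
    """Stationary variance of SGD on f(x) = h * x**2 / 2 with noisy gradients.

    Update: x <- x - eta * (h * x + noise), where noise has variance sigma2.
    The variance v of x then obeys v <- (1 - eta*h)**2 * v + eta**2 * sigma2.
    """
    v = 1.0  # variance of the starting point (arbitrary; it is forgotten)
    for _ in range(steps):
        v = (1.0 - eta * h) ** 2 * v + eta ** 2 * sigma2
    return v

floor_hi = noise_floor(0.1)
floor_lo = noise_floor(0.05)
ratio = floor_hi / floor_lo  # close to 2: halving eta roughly halves the floor
```

The fixed point of the recursion is eta * sigma2 / (h * (2 - eta * h)), which is proportional to eta when eta * h is small.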

Theory ranks few shapes (the proofs that exist live in the math appendix), so we proceed empirically: one CNN, one dataset, every schedule.

A testbed

LeNet-style CNN on Fashion-MNIST, modernized: ReLU, max-pooling, BatchNorm after every hidden layer, Xavier init pinned in both frameworks (without the norm layers, survivable \eta is a seed-dependent coin flip).

A scheduler is any callable epoch -> learning rate.
The training loop consults it at the start of every epoch and writes the rate into the optimizer; nothing else changes.
Stateless by design: a pure function of t can be plotted, resumed, and branched — that pays off at the end.
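
How the loop consults a scheduler can be sketched as follows. This is an illustration under stated assumptions, not the testbed's training loop: ToyOptimizer, run_epochs, and halve_every_ten are hypothetical names, and the epoch body is left as a comment.

```python
class ToyOptimizer:
    """Hypothetical stand-in for a real optimizer: it only stores a rate."""
    def __init__(self, lr):
        self.lr = lr

def run_epochs(optimizer, scheduler, num_epochs):
    """Consult the scheduler once per epoch and record the rate used."""
    history = []
    for epoch in range(num_epochs):
        optimizer.lr = scheduler(epoch)  # the only schedule-related line
        # ... one epoch of minibatch updates at optimizer.lr would go here ...
        history.append(optimizer.lr)
    return history

def halve_every_ten(epoch, base_lr=0.3):
    """Any callable epoch -> rate qualifies; this one is a plain function."""
    return base_lr * 0.5 ** (epoch // 10)

history = run_epochs(ToyOptimizer(lr=0.0), halve_every_ten, num_epochs=30)
```

Because the rate depends only on the epoch index, resuming at epoch 20 needs no saved scheduler state: calling the scheduler with 20 gives the right rate.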

Baseline: constant \eta = 0.3

train loss 0.060, train acc 0.978, test acc 0.901

Both failure modes on one plot: loss noisy to the end (riding the noise floor), and test accuracy stalls while train accuracy climbs — overfitting.

Square-root decay

\eta_t = \eta_0 (t+1)^{-1/2} — the convex-optimal rate from 9.3. A scheduler is just a callable:

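class SquareRootScheduler:
    """Learning rate lr / sqrt(epoch + 1); a pure function of the epoch."""
    def __init__(self, lr=0.1):
        self.lr = lr  # initial rate \eta_0

    def __call__(self, epoch):
        # epoch is 0-based, so the first epoch trains at exactly self.lr
        return self.lr * pow(epoch + 1.0, -0.5)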

scheduler = SquareRootScheduler(lr=0.3)
d2l.plot(d2l.arange(num_epochs), [scheduler(t) for t in range(num_epochs)],
         xlabel='epoch', ylabel='learning rate')

Square-root decay: training

train loss 0.090, train acc 0.970, test acc 0.895

Smoother and quieter, but test accuracy lands below the constant baseline (0.895 vs. 0.901). Timid at both ends: the rate halves within 3 epochs (0.3 / \sqrt{4} = 0.15), yet the tail stays high, still near 0.055 after 30 epochs (0.3 / \sqrt{30}), well above where the multiplicative schedule ends. Shape matters.

Multiplicative and piecewise constant

\eta_t = \max(\eta_{\min}, \eta_0\, \alpha^t) — aggressive, floor as safety net:

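class FactorScheduler:
    """Multiply the rate by `factor` each epoch, never going below a floor."""
    def __init__(self, factor=1, stop_factor_lr=1e-7, base_lr=0.1):
        self.factor = factor                  # \alpha, the per-epoch multiplier
        self.stop_factor_lr = stop_factor_lr  # \eta_{\min}, the floor
        self.base_lr = base_lr                # \eta_0

    def __call__(self, epoch):
        # Closed form in epoch (no internal counter), so the scheduler stays stateless
        return max(self.stop_factor_lr, self.base_lr * self.factor ** epoch)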

scheduler = FactorScheduler(factor=0.9, stop_factor_lr=0.01, base_lr=0.3)
d2l.plot(d2l.arange(50), [scheduler(t) for t in range(50)],
         xlabel='epoch', ylabel='learning rate')

Piecewise constant: ride each noise floor until progress stalls, then cut the rate — the ImageNet-era staircase:

class MultiFactorScheduler:
    """Piecewise-constant rate: cut by `factor` at each milestone epoch."""
    def __init__(self, milestones, factor, base_lr):
        self.milestones = milestones
        self.factor = factor
        self.base_lr = base_lr

    def __call__(self, epoch):
        # Apply one cut for every milestone already reached
        lr = self.base_lr
        for milestone in self.milestones:
            if epoch >= milestone:
                lr *= self.factor
        return lr

scheduler = MultiFactorScheduler(milestones=[15, 25], factor=0.5, base_lr=0.3)
d2l.plot(d2l.arange(num_epochs), [scheduler(t) for t in range(num_epochs)],
         xlabel='epoch', ylabel='learning rate')

Cosine decay

\eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2}(1 + \cos(\pi t / T))

One parameter, no milestones, no kinks (Loshchilov & Hutter, 2016).
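
The formula translates directly into a stateless callable. A minimal sketch: the class name, the parameter names, and the endpoint values base_lr=0.3 and final_lr=0.01 are assumptions for illustration, not the testbed's exact settings.

```python
import math

class CosineScheduler:
    """Cosine decay from base_lr at epoch 0 to final_lr at epoch max_update."""
    def __init__(self, max_update, base_lr=0.3, final_lr=0.01):
        self.max_update = max_update  # the horizon T
        self.base_lr = base_lr        # \eta_0
        self.final_lr = final_lr      # \eta_T

    def __call__(self, epoch):
        # Clamp so that epochs past the horizon stay at final_lr
        t = min(epoch, self.max_update)
        return self.final_lr + 0.5 * (self.base_lr - self.final_lr) * (
            1 + math.cos(math.pi * t / self.max_update))

scheduler = CosineScheduler(max_update=30)
rates = [scheduler(t) for t in range(31)]
```

The curve is flat at both ends and steepest in the middle: it stays high early and decays hard late.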

train loss 0.080, train acc 0.975, test acc 0.910

The stay-high, decay-hard shapes (multiplicative, piecewise, cosine) beat the baseline and are too close to call from single runs. Cosine won on convenience, not measured superiority.

Warmup: the other end of the schedule

At initialization the loss surface is sharp; the target rate can kill the run in step one.

Standard fix since Goyal et al. (2017): linear ramp from \approx 0 over the first epochs.
Mechanism (Kalra & Barkeshli, 2024): early training at a growing rate reduces sharpness, raising the stability ceiling before the full rate arrives.
Adam has a second reason: an estimated preconditioner should not be trusted cold.
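
A linear ramp is a few lines. A sketch under stated assumptions: the function name and parameters are hypothetical, and the ramp is indexed by epoch to match the scheduler convention of this section (real recipes often ramp per step). Using epoch + 1 means the first epoch already trains at a small nonzero rate.

```python
def linear_warmup(epoch, base_lr=0.3, warmup_steps=5, warmup_begin_lr=0.0):
    """Ramp linearly from near warmup_begin_lr to base_lr, then hold base_lr."""
    if epoch < warmup_steps:
        frac = (epoch + 1) / warmup_steps  # 1/5, 2/5, ..., 1 for warmup_steps=5
        return warmup_begin_lr + (base_lr - warmup_begin_lr) * frac
    return base_lr

ramp = [linear_warmup(t) for t in range(8)]
```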

Warmup: dead vs. alive

Even a BatchNorm net has a ceiling. Cold start at \eta = 7.5 (25× the baseline): the run collapses within one epoch and ends near 20% accuracy, barely above the 10% chance level:

train loss 2.011, train acc 0.196, test acc 0.198

Same rate through a 5-epoch ramp: the run survives and reaches about 92% train and 88% test accuracy:

train loss 0.204, train acc 0.923, test acc 0.879

Warmup + cosine

The default recipe of the late 2010s, and still strong:

train loss 0.046, train acc 0.988, test acc 0.910
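
The combined recipe as one stateless callable. A hedged sketch: the class name and defaults (5 warmup epochs, base_lr=0.3, final_lr=0.01) are assumptions, and the cosine is stretched over the epochs that remain after warmup.

```python
import math

class WarmupCosineScheduler:
    """Linear warmup to base_lr, then cosine decay to final_lr at max_update."""
    def __init__(self, max_update, warmup_steps=5, base_lr=0.3, final_lr=0.01):
        self.max_update = max_update
        self.warmup_steps = warmup_steps
        self.base_lr = base_lr
        self.final_lr = final_lr

    def __call__(self, epoch):
        if epoch < self.warmup_steps:
            # Ramp: base_lr / warmup_steps at epoch 0, base_lr at the last warmup epoch
            return self.base_lr * (epoch + 1) / self.warmup_steps
        span = self.max_update - self.warmup_steps  # length of the cosine part
        t = min(epoch - self.warmup_steps, span)
        return self.final_lr + 0.5 * (self.base_lr - self.final_lr) * (
            1 + math.cos(math.pi * t / span))

scheduler = WarmupCosineScheduler(max_update=30)
rates = [scheduler(t) for t in range(31)]
```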

Cosine’s hidden defect

The horizon T is baked in from step one.

Mid-run checkpoints: rate never came down → not finished models.
Want 2× the budget after the fact? Retrain.
Scaling-law study at 5 budgets? 5 full runs.
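
The defect shows up in one comparison: the same epoch under two horizons. A small sketch, assuming endpoints 0.3 and 0.01 (illustrative values, as is the function name):

```python
import math

def cosine_lr(t, T, base_lr=0.3, final_lr=0.01):
    """Cosine decay evaluated at epoch t for a run of planned length T."""
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t / T))

rate_short = cosine_lr(15, T=30)  # halfway through a 30-epoch plan
rate_long = cosine_lr(15, T=60)   # a quarter of the way through a 60-epoch plan
```

At epoch 15 the 30-epoch plan is already at 0.155 while the 60-epoch plan is still near 0.26. The two runs have diverged, so the 30-epoch run's checkpoint is not a point on the 60-epoch trajectory.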

Warmup–stable–decay (MiniCPM; Hu et al., 2024): warm up, hold the peak constant for most of the run, decay in the last 10–20%. Every plateau checkpoint is horizon-free; the decay is a harvest step.
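
A WSD schedule in the same callable style. This is a sketch, not the WSDScheduler class used later in this section (whose definition is not shown here): the name WSDSketch, the warmup_steps parameter, and the linear decay shape are all assumptions. The other parameter names mirror the later call.

```python
class WSDSketch:
    """Warmup, hold base_lr, then decay linearly over the last decay_steps epochs."""
    def __init__(self, max_update, decay_steps, base_lr=0.3, final_lr=0.01,
                 warmup_steps=0):
        self.max_update = max_update
        self.decay_steps = decay_steps
        self.base_lr = base_lr
        self.final_lr = final_lr
        self.warmup_steps = warmup_steps

    def __call__(self, epoch):
        if epoch < self.warmup_steps:
            return self.base_lr * (epoch + 1) / self.warmup_steps  # warmup
        decay_start = self.max_update - self.decay_steps
        if epoch < decay_start:
            return self.base_lr                                    # stable
        # Decay: the last epoch of the run trains at final_lr
        frac = min(1.0, (epoch - decay_start + 1) / self.decay_steps)
        return self.base_lr + frac * (self.final_lr - self.base_lr)

scheduler = WSDSketch(max_update=30, decay_steps=6, warmup_steps=5)
rates = [scheduler(t) for t in range(30)]
```

Here the decay takes the last 6 of 30 epochs (20%). Until epoch 24 the rate does not depend on max_update at all, which is what makes the plateau horizon-free.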

WSD: the loss cliff

train loss 0.027, train acc 0.994, test acc 0.909

Plateau loss sits above cosine’s, then drops in a cliff when the decay begins. Same final range as cosine at the same budget.

The river valley

Why the cliff (Wen et al., 2024): the loss surface is a winding valley — steep walls, gently sloping floor.

High constant rate: the iterate bounces between walls while drifting fast along the floor. Measured loss is inflated; progress is real.
Decay quenches the bouncing → the iterate settles to the floor it already reached → cliff.

The plateau travels, the decay lands. The noise-ball story of 9.3, upgraded from a bowl to a valley.
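
The picture can be reproduced in a toy valley. A sketch under strong assumptions, not the analysis of Wen et al.: the loss is L(x, y) = -slope * x + h * y**2 / 2, the floor direction x sees no noise, and the wall direction y is tracked through its variance, so the expected loss is computed exactly without sampling. All names and constants are illustrative.

```python
def valley_run(schedule, steps, h=10.0, slope=0.1, sigma2=1.0):
    """Expected loss per step in a toy valley with steep walls and a sloping floor."""
    x, v = 0.0, 0.0  # position along the floor; variance across the walls
    losses = []
    for t in range(steps):
        eta = schedule(t)
        x += eta * slope                                # drift along the floor
        v = (1 - eta * h) ** 2 * v + eta ** 2 * sigma2  # bouncing between walls
        losses.append(-slope * x + 0.5 * h * v)
    return losses

def stable_then_decay(t, peak=0.15, low=0.001, decay_start=800):
    """Hold the peak rate, then drop to a small rate (a crude WSD tail)."""
    return peak if t < decay_start else low

losses = valley_run(stable_then_decay, steps=1000)
cliff = losses[799] - losses[-1]  # drop in expected loss during the decay phase
```

During the plateau the loss falls steadily but carries a constant wall term. Once the rate drops, progress along the floor nearly stops, yet the loss falls sharply because the wall term vanishes.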

Branching off the plateau

Train warmup + stable only (no horizon committed), keep going as long as you like, then clone and decay whenever you want a finished model:

model_branch = nnx.clone(model_plateau)
decay = WSDScheduler(max_update=6, decay_steps=6, base_lr=0.3, final_lr=0.01)
train(model_branch, train_iter, test_iter, 6, lr, decay, animator=board,
      epoch_offset=30)
board.fig

train loss 0.015, train acc 0.998, test acc 0.911

The branch matches the full cosine run’s range — with the horizon chosen after the fact. One long run + cheap branched decays = models at many budgets (Hägele et al., 2024).

Plain SGD carries no state, so cloning parameters sufficed. With momentum or Adam, branch the optimizer state too.
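
What "branch the optimizer state too" means, in a framework-free sketch. The dictionaries and momentum_step are hypothetical stand-ins for a real model and optimizer; the point is only that both copies are deep and independent.

```python
import copy

def momentum_step(params, state, grads, lr, beta=0.9):
    """One SGD-with-momentum update, in place, on dicts of floats."""
    for name in params:
        state[name] = beta * state[name] + grads[name]  # velocity buffer
        params[name] = params[name] - lr * state[name]

# Plateau run: parameters plus the momentum buffer built up so far
params_plateau = {"w": 1.0}
state_plateau = {"w": 0.5}

# Branch: copy both, so the decay phase starts with the same velocity
params_branch = copy.deepcopy(params_plateau)
state_branch = copy.deepcopy(state_plateau)

momentum_step(params_branch, state_branch, {"w": 2.0}, lr=0.01)
```

The plateau's parameters and buffer are untouched, so the long run can continue and be branched again later.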

State of play

Linear decay to zero matched or beat cosine and WSD in careful LLM sweeps (Bergsma et al., 2025) — the final rate matters most.
Schedule-free (Defazio et al., 2024): constant rate; gradients at an interpolation of iterate and average, evaluate the average — decay by averaging, horizon always “now”.
Not settled: GLM-4.5 ablated WSD vs. cosine and shipped cosine (Zeng et al., 2025). Differences at matched tuning are small.
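
The schedule-free update can be sketched on a 1-D problem. This follows the SGD form of the method as usually stated (gradient at the interpolation y, base step on z, uniform average x), with no warmup or weighting; the function name, the test objective, and the constants are assumptions for illustration.

```python
def schedule_free_sgd(grad, x0, lr=0.5, beta=0.9, steps=500):
    """Constant-rate SGD whose evaluated point is a running average."""
    z = x0  # base SGD iterate
    x = x0  # running average of the z iterates; this is the point to evaluate
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x  # gradient is taken at the interpolation
        z = z - lr * grad(y)           # plain SGD step, constant rate
        c = 1.0 / (t + 1)              # uniform averaging weight
        x = (1 - c) * x + c * z
    return x

# Minimize f(y) = (y - 3)**2 / 2, whose gradient is y - 3
x_final = schedule_free_sgd(lambda y: y - 3.0, x0=0.0)
```

No horizon appears anywhere: stopping after any number of steps returns a usable average. With beta = 0 this reduces to plain iterate averaging.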

What is settled: schedules are cheap, they matter, and plateau checkpoints you can decay on demand beat a horizon baked in at step one.