Learning Rate Scheduling

The optimizer matters; the learning rate schedule often matters more. With a constant \eta you trade off fast-but-unstable vs. slow-but-converged. A good schedule gets both: aggressive early, careful late.

What a schedule manages

Initial \eta — too large diverges, too small wastes time.
Decay over training — final fine-tuning needs small \eta for noise to settle.
Early instability — Transformers and adaptive optimizers can blow up in the first few hundred steps without warmup.
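The first item can be checked on a one-dimensional quadratic. This is a minimal sketch, assuming plain gradient descent on f(x) = a x^2 / 2 (a toy objective, not the LeNet setup used later). Each step multiplies x by (1 - \eta a), so the iterates diverge once \eta > 2/a and crawl when \eta is tiny. The function name and constants are illustrative.

```python
def gd_on_quadratic(lr, a=1.0, x0=1.0, steps=20):
    """Run gradient descent on f(x) = 0.5 * a * x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        grad = a * x      # f'(x) = a * x
        x -= lr * grad    # same as x *= (1 - lr * a)
    return abs(x)


too_large = gd_on_quadratic(lr=2.5)    # |1 - 2.5| = 1.5 > 1: iterates grow
too_small = gd_on_quadratic(lr=0.01)   # |1 - 0.01| = 0.99: barely moves
reasonable = gd_on_quadratic(lr=0.5)   # |1 - 0.5| = 0.5: shrinks quickly
```

On a real network the stable range is not known in advance and shifts during training, which is part of why the rate is scheduled rather than fixed.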

Common schedules

Step / multi-step decay — drop \eta by 10× at preset epochs.
Cosine annealing — smooth decay following a half cosine; popular in vision and Transformers.
Warmup — linearly grow \eta for the first ~T_w steps, then decay. Standard for Transformers.

Toy training loop

LeNet on Fashion-MNIST as the experimental harness:

same model and data for every schedule;
only the learning-rate policy changes;
compare training curves, not isolated final numbers.

The implementation is deliberately ordinary so that the schedule effect is the only teaching variable.
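The harness pattern can be sketched without any deep-learning framework. The version below is an assumption-laden stand-in: a single scalar weight and a noisy quadratic replace LeNet and Fashion-MNIST, and `run_experiment` is a hypothetical name. What it keeps from the setup above is the structure: the seed and starting point are fixed, and the schedule is the only argument that changes between runs.

```python
import random


def run_experiment(schedule, num_epochs=10, steps_per_epoch=20, seed=0):
    """Train a stand-in model (one scalar weight) under a per-epoch schedule.

    `schedule` is any callable mapping the epoch index to a learning rate.
    Returns a list of (epoch, lr, loss) tuples, one per epoch.
    """
    rng = random.Random(seed)  # fixed seed: identical noise for every schedule
    w = 5.0                    # identical starting point for every schedule
    curve = []
    for epoch in range(num_epochs):
        lr = schedule(epoch)   # the only thing that differs between runs
        for _ in range(steps_per_epoch):
            grad = w + rng.gauss(0.0, 1.0)  # noisy gradient of 0.5 * w**2
            w -= lr * grad
        curve.append((epoch, lr, 0.5 * w * w))
    return curve


constant_curve = run_experiment(lambda epoch: 0.3)
decayed_curve = run_experiment(lambda epoch: 0.3 / (epoch + 1) ** 0.5)
```

Because both runs share the seed and use the same rate in epoch 0, their first curve entries coincide; any later difference comes from the schedule alone.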

Toy baseline

Constant \eta=0.3 is the baseline. Watch for the usual pattern: fast early movement, then noisy late progress as the step size stays too large for fine tuning.

loss 0.209, train acc 0.922, test acc 0.896
21751.0 examples/sec on /GPU:0

Constant-LR baselines

Fixed learning rates expose the central tradeoff:

large \eta: quick early progress, noisy late convergence;
small \eta: stable late convergence, slow early progress.

learning rate is now 0.1

Square-root schedule

The first custom scheduler uses \eta_t = \eta_0(t+1)^{-1/2}.

It is simple and monotone, but modern practice usually prefers multi-step or cosine schedules.
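A minimal sketch of this scheduler as a callable object, so it can be handed to whatever training loop is in use. The class name and the default \eta_0 = 0.1 are illustrative assumptions, and `num_update` may count epochs or steps depending on how the loop calls it.

```python
class SquareRootScheduler:
    """Implements eta_t = eta_0 * (t + 1) ** -0.5."""

    def __init__(self, lr=0.1):
        self.lr = lr  # eta_0, the rate used at t = 0

    def __call__(self, num_update):
        return self.lr * (num_update + 1.0) ** -0.5


scheduler = SquareRootScheduler(lr=0.1)
rates = [scheduler(t) for t in range(10)]
```

The rate halves by t = 3 and then decays ever more slowly, which is why this schedule rarely reaches very small \eta within a short run.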

Square-root schedule training

Apply the same \eta_t = \eta_0(t+1)^{-1/2} policy during training. The curve should smooth out because late updates are smaller:

loss 0.386, train acc 0.858, test acc 0.848
32394.2 examples/sec on /GPU:0

Polynomial / factor decay

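\eta_t = \eta_0 \cdot (1 + \beta t)^{-\alpha} gives a gradual polynomial decay; setting \alpha = 1/2 and \beta = 1 recovers the square-root schedule. This was the classic choice before step decay became common.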
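The polynomial formula translates directly into a function. This is a sketch under assumed defaults: \eta_0 = 0.3 mirrors the constant baseline, while the \alpha and \beta values are illustrative rather than recommended settings.

```python
def polynomial_decay(t, base_lr=0.3, alpha=0.5, beta=1.0):
    """Return eta_t = base_lr * (1 + beta * t) ** (-alpha) for step or epoch t."""
    return base_lr * (1.0 + beta * t) ** (-alpha)


# alpha = 0.5, beta = 1 reproduces the square-root schedule;
# alpha = 1 gives the faster inverse-time decay base_lr / (1 + t).
sqrt_like = [polynomial_decay(t) for t in range(5)]
inverse_time = [polynomial_decay(t, alpha=1.0) for t in range(5)]
```

Larger \alpha pushes \eta down sooner, while \beta rescales time and so controls how quickly the decay sets in.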

Multi-step decay

Drop \eta by a fixed factor (commonly 10×) at preset epochs (e.g. 30, 60, 90). This is the standard recipe for ImageNet ResNet training.
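A minimal sketch of the multi-step rule, assuming the drop takes effect at the milestone epoch itself. The function name, \eta_0 = 0.1, and the factor 0.1 are illustrative; the milestones reuse the 30/60/90 example.

```python
import bisect


def multistep_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Piecewise-constant rate: multiply by gamma at every milestone reached."""
    # Count the milestones that are <= epoch (milestones must be sorted).
    num_drops = bisect.bisect_right(milestones, epoch)
    return base_lr * gamma ** num_drops


rates = {epoch: multistep_lr(epoch) for epoch in (0, 29, 30, 60, 95)}
```

Keeping the schedule as a pure function of the epoch makes it easy to plot before committing to a long run.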

Multi-step training

loss 0.238, train acc 0.912, test acc 0.884
38403.3 examples/sec on /GPU:0

Notice the loss curve changes slope after each scheduled drop: high \eta explores quickly, then lower \eta settles into a narrower basin.

Cosine annealing

\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T)) for 0 \le t \le T. The rate decays smoothly from \eta_{\max} to \eta_{\min} over T steps, and the small \eta at the end gives a clean fine-tuning phase. Often paired with warmup and warm restarts.
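The cosine formula as a function of the step index. This is a sketch: the name, the default \eta_{\max} = 0.3, and the choice to hold the rate at \eta_{\min} once t exceeds T are assumptions made for illustration.

```python
import math


def cosine_lr(t, total_steps, lr_max=0.3, lr_min=0.0):
    """Half-cosine decay from lr_max (t = 0) to lr_min (t = total_steps)."""
    t = min(t, total_steps)  # assumption: stay at lr_min after the horizon T
    cosine = math.cos(math.pi * t / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


T = 100
rates = [cosine_lr(t, T) for t in range(T + 1)]
```

The curve is flat near both ends and steepest at t = T/2, where the rate sits at the midpoint between \eta_{\max} and \eta_{\min}.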

Cosine training

loss 0.257, train acc 0.906, test acc 0.881
38242.6 examples/sec on /GPU:0

Cosine avoids abrupt jumps. The tail becomes increasingly conservative, which often improves final accuracy without manual milestone tuning.

Warmup

Adam-trained Transformers can diverge if \eta starts at its target value, because the second-moment preconditioner \hat{\mathbf{s}}_t has not stabilized yet. Linear warmup from 0 to \eta_0 over the first ~1k steps usually fixes this.
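A minimal sketch of linear warmup followed by cosine decay, written as a function of the global step. The function name, the peak rate 1e-3, and the total step count are illustrative assumptions; only the ~1k warmup steps come from the text.

```python
import math


def warmup_cosine_lr(t, total_steps, warmup_steps=1000, base_lr=1e-3, final_lr=0.0):
    """Linear ramp from 0 to base_lr, then half-cosine decay to final_lr."""
    if t < warmup_steps:
        return base_lr * t / warmup_steps  # warmup phase
    # Fraction of the decay phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((t - warmup_steps) / decay_steps, 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


total = 10_000
rates = [warmup_cosine_lr(t, total) for t in range(total + 1)]
```

The rate peaks exactly at the end of warmup, so the two phases join without a jump.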

Warmup training

loss 0.270, train acc 0.902, test acc 0.879
32601.9 examples/sec on /GPU:0

Warmup spends the early epochs ramping up instead of taking a full-size first step. This protects training while the optimizer's initial statistics are still unreliable, then hands off to cosine decay.

Recap

Schedule beats fixed \eta — aggressive early, gentle late.
Multi-step is the vision standard; cosine is smoother and often slightly better with the same budget.
Warmup is effectively standard for Transformer / large-Adam training: it prevents early divergence while the second-moment EMA stabilizes.
“Cosine + warmup” is the modern default. Most LLM training does exactly this.