Learning Rate Scheduling

The optimizer matters; the learning rate schedule often matters more. With a constant \eta you trade off fast-but-unstable vs. slow-but-converged. A good schedule gets both: aggressive early, careful late.

What a schedule manages

  • Initial \eta — too large diverges, too small wastes time.
  • Decay over training — final fine-tuning needs small \eta for noise to settle.
  • Early instability — Transformers and adaptive optimizers can blow up in the first few hundred steps without warmup.

Common schedules

  • Step / multi-step decay — drop \eta by 10× at preset epochs.
  • Cosine annealing — smooth decay following a half cosine; popular in vision and Transformers.
  • Warmup — linearly grow \eta for the first ~T_w steps, then decay. Standard for Transformers.

Toy training loop

LeNet on Fashion-MNIST as the experimental harness:

  • same model and data for every schedule;
  • only the learning-rate policy changes;
  • compare training curves, not isolated final numbers.

The implementation is deliberately ordinary so that the schedule effect is the only teaching variable.
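A minimal sketch of such a harness, assuming a 1-D noisy quadratic as a stand-in for LeNet on Fashion-MNIST. The names `train` and `constant` are hypothetical and do not come from the original code. The point is structural only: starting point and gradient noise are fixed, and the learning-rate policy is the single argument that varies.

```python
import random


def train(schedule, num_steps=200, seed=0):
    """Run noisy SGD on f(w) = 0.5 * w**2 and return the loss after each step.

    `schedule` is any callable mapping the step index t to a learning rate.
    (Toy stand-in for the real model; an assumption of this sketch.)
    """
    rng = random.Random(seed)  # same gradient noise for every schedule
    w = 5.0                    # same starting point for every schedule
    losses = []
    for t in range(num_steps):
        grad = w + rng.gauss(0.0, 1.0)  # true gradient w plus unit-variance noise
        w -= schedule(t) * grad         # the only line the schedule touches
        losses.append(0.5 * w * w)
    return losses


def constant(lr):
    """Constant policy: ignore the step index and always return lr."""
    return lambda t: lr


# Two constant baselines; compare the whole curves, not just the last entry.
large = train(constant(0.3))
small = train(constant(0.01))
```

Any schedule written as a function of the step index can be passed to `train` unchanged, which keeps the comparison between policies fair.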

Toy baseline

Constant \eta=0.3 is the baseline. Watch for the usual pattern: fast early movement, then noisy late progress as the step size stays too large for fine tuning.

train loss 0.071, train acc 0.993, test acc 0.902

Constant-LR baselines

Fixed learning rates expose the central tradeoff:

  • large \eta: quick early progress, noisy late convergence;
  • small \eta: stable late convergence, slow early progress.

learning rate is now 0.10

Square-root schedule

The first custom scheduler uses \eta_t = \eta_0(t+1)^{-1/2}.

It is simple and monotone, but modern practice usually prefers multi-step or cosine schedules.
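The rule \eta_t = \eta_0(t+1)^{-1/2} can be sketched as a small callable object. The class name `SquareRootScheduler` and the `base_lr` argument are illustrative assumptions, not necessarily the names used in the original code.

```python
class SquareRootScheduler:
    """Learning rate eta_t = base_lr * (t + 1) ** -0.5 (hypothetical helper)."""

    def __init__(self, base_lr=0.1):
        self.base_lr = base_lr

    def __call__(self, num_update):
        # num_update is the 0-based update (or epoch) counter t.
        return self.base_lr * (num_update + 1) ** -0.5


scheduler = SquareRootScheduler(base_lr=0.1)
rates = [scheduler(t) for t in range(10)]  # starts at base_lr, then decays
```

Whether `num_update` counts minibatches or epochs is a design choice; counting minibatches makes the decay much faster in wall-clock terms.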

Square-root schedule training

Apply the same \eta_t = \eta_0(t+1)^{-1/2} policy during training. The curve should smooth out because late updates are smaller:

train loss 0.222, train acc 0.932, test acc 0.893

Polynomial / factor decay

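\eta_t = \eta_0 \cdot (1 + \beta t)^{-\alpha} decays gradually: \beta controls how soon the decay becomes noticeable and \alpha how fast it proceeds. Setting \alpha = 1/2 and \beta = 1 recovers the square-root schedule \eta_0(t+1)^{-1/2}. This was the classic choice before step decay took over.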
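A short sketch of the polynomial rule, written as a plain function of the step index. The function name and the default values for `base_lr`, `beta`, and `alpha` are assumptions chosen only for illustration.

```python
def polynomial_decay(t, base_lr=0.1, beta=0.1, alpha=0.5):
    """Return base_lr * (1 + beta * t) ** (-alpha) for step index t >= 0."""
    return base_lr * (1.0 + beta * t) ** (-alpha)


# With beta = 1 and alpha = 0.5 this is the square-root schedule.
sqrt_rates = [polynomial_decay(t, beta=1.0, alpha=0.5) for t in range(5)]
```

Larger `alpha` shrinks the rate faster at late steps, while smaller `beta` keeps the rate near `base_lr` for longer at the start.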

Multi-step decay

Drop \eta by a fixed factor (commonly 10×) at preset epochs (e.g. 30, 60, 90), so that \eta is piecewise constant between milestones. This is the standard recipe for ImageNet ResNet training.
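A minimal sketch of multi-step decay as a pure function of the epoch. The name `multistep_lr` and the parameter `gamma` (the multiplicative drop factor) are assumptions of this example; the milestones 30, 60, 90 follow the text.

```python
import bisect


def multistep_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Return base_lr * gamma ** k, where k is the number of milestones <= epoch."""
    # bisect_right counts how many (sorted) milestones have been reached.
    num_drops = bisect.bisect_right(milestones, epoch)
    return base_lr * gamma ** num_drops


# One rate per epoch for a 100-epoch run.
rates = [multistep_lr(epoch) for epoch in range(100)]
```

Here the drop takes effect at the milestone epoch itself (epoch 30 already uses the reduced rate); shifting that by one epoch is an equally reasonable convention.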

Multi-step training

train loss 2.303, train acc 0.106, test acc 0.100

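In a run that trains normally, the loss curve changes slope after each scheduled drop: high \eta explores quickly, then lower \eta settles into a narrower basin. The figures logged above, however, sit at chance level for ten classes (loss 2.303 ≈ ln 10, accuracy ≈ 0.10), which indicates that this particular run did not train; they should not be read as a verdict on multi-step decay.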

Cosine annealing

\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T)). The rate decays smoothly from \eta_{\max} at t = 0 to \eta_{\min} at t = T, and the small \eta near the end yields clean fine-tuning. Cosine annealing is often paired with warmup and warm restarts.
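The cosine formula translates directly into a few lines of code. This is a sketch under stated assumptions: the function name, the default `max_lr` and `min_lr` values, and the choice to hold the rate at `min_lr` once t exceeds T are all illustrative.

```python
import math


def cosine_lr(t, total_steps, max_lr=0.3, min_lr=0.01):
    """Cosine annealing from max_lr (t = 0) down to min_lr (t = total_steps)."""
    t = min(t, total_steps)  # assumption: stay at min_lr after the schedule ends
    cosine = 1.0 + math.cos(math.pi * t / total_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * cosine


# One rate per step for a 20-step schedule, endpoints included.
rates = [cosine_lr(t, total_steps=20) for t in range(21)]
```

Halfway through, the rate is the midpoint of `max_lr` and `min_lr`; most of the decay happens in the middle of training, with flat stretches at both ends.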

Cosine training

train loss 0.109, train acc 0.965, test acc 0.911

Cosine avoids abrupt jumps. The tail becomes increasingly conservative, which often improves final accuracy without manual milestone tuning.

Warmup

Adam-trained Transformers often diverge if \eta starts at the target value, because the preconditioner \hat{\mathbf{s}}_t (Adam's second-moment estimate) has not stabilized yet. Linear warmup from 0 to \eta_0 over the first ~1k steps usually fixes it.
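A sketch of linear warmup followed by cosine decay, assuming the warmup starts from exactly 0 as described above and that the cosine phase spans the remaining steps. The function name and all default values are hypothetical.

```python
import math


def warmup_cosine_lr(t, total_steps, warmup_steps, max_lr=0.3, min_lr=0.0):
    """Linear warmup from 0 to max_lr, then cosine decay down to min_lr."""
    if t < warmup_steps:
        # Warmup phase: the rate grows linearly with the step index.
        return max_lr * t / warmup_steps
    # Decay phase: progress runs from 0 (end of warmup) to 1 (end of training).
    decay_steps = max(1, total_steps - warmup_steps)
    progress = (min(t, total_steps) - warmup_steps) / decay_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Rates for a 100-step run with a 10-step warmup.
rates = [warmup_cosine_lr(t, total_steps=100, warmup_steps=10) for t in range(101)]
```

Starting from exactly 0 means the very first update does nothing; some implementations start from a small positive rate instead, which is a matter of convention.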

Warmup training

train loss 0.118, train acc 0.961, test acc 0.911

Warmup spends the early epochs ramping \eta up instead of taking a full-size first step. This keeps updates small while the early training statistics are still unreliable, then hands off to the cosine decay.

Recap

  • Schedule beats fixed \eta — aggressive early, gentle late.
  • Multi-step is the vision standard; cosine is smoother and often slightly better with the same budget.
  • Warmup is effectively standard for Transformer / large-Adam training: it prevents early divergence while the second-moment EMA stabilizes.
  • “Cosine + warmup” is the modern default. Much LLM training uses some variant of it.