The optimizer matters; the learning rate schedule often matters more. With a constant \eta you trade off fast-but-unstable vs. slow-but-converged. A good schedule gets both: aggressive early, careful late.
LeNet on Fashion-MNIST as the experimental harness:
The implementation is deliberately ordinary so that the schedule is the only variable that changes between experiments.
Constant \eta=0.3 is the baseline. Watch for the usual pattern: fast early movement, then noisy late progress as the step size stays too large for fine tuning.
Fixed learning rates expose the central tradeoff:
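The tradeoff can be reproduced on a toy problem without the full LeNet run. The sketch below is an illustration only, not the experiment from these notes: it assumes a one-dimensional quadratic f(x) = x^2/2 with Gaussian gradient noise, and the names `run_sgd` and `late_mse` are hypothetical helpers introduced here.

```python
import random

def run_sgd(eta, steps=2000, x0=10.0, noise_std=1.0, seed=0):
    """SGD on f(x) = x^2 / 2 with noisy gradients; returns every iterate."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    x, trace = x0, []
    for _ in range(steps):
        grad = x + rng.gauss(0.0, noise_std)  # true gradient x plus noise
        x -= eta * grad
        trace.append(x)
    return trace

def late_mse(trace, tail=500):
    """Mean squared distance from the optimum x = 0 over the last `tail` steps."""
    return sum(v * v for v in trace[-tail:]) / tail

large = run_sgd(eta=0.3)   # the baseline value used in these notes
small = run_sgd(eta=0.01)  # assumed "careful" value for comparison

print("after 20 steps :", abs(large[19]), abs(small[19]))
print("late noise floor:", late_mse(large), late_mse(small))
```

Under these assumptions the large step size gets close to the optimum within a few dozen updates but keeps bouncing around it, while the small step size is still far away early on and ends with a much lower noise floor.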
The first custom scheduler uses \eta_t = \eta_0(t+1)^{-1/2}.
It is simple and monotone, but modern practice usually prefers multi-step or cosine schedules.
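A minimal sketch of the square-root schedule as a callable object follows. The class name `SquareRootScheduler` and the argument name `num_update` are assumptions made for illustration; the original implementation is not shown in this section.

```python
class SquareRootScheduler:
    """Returns eta_t = eta_0 * (t + 1) ** -0.5 for update index t."""

    def __init__(self, lr=0.3):
        self.lr = lr  # eta_0, the initial learning rate

    def __call__(self, num_update):
        # num_update is the zero-based update (or epoch) counter t
        return self.lr * (num_update + 1.0) ** -0.5

scheduler = SquareRootScheduler(lr=0.3)
for t in (0, 3, 8, 15):
    print(t, scheduler(t))
```

At t = 3 the rate has already halved, since 0.3 \cdot 4^{-1/2} = 0.15; after that the decay slows considerably.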
Apply the same \eta_t = \eta_0(t+1)^{-1/2} policy during training. The curve should smooth out because late updates are smaller:
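Polynomial decay, \eta_t = \eta_0 \cdot (1 + \beta t)^{-\alpha}, shrinks the rate gradually; the square-root schedule above is the special case \beta = 1, \alpha = 1/2. It was a common default before step decay took over: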
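The polynomial formula translates directly into a small function. This is a sketch; the function name `polynomial_decay` and the default values for `beta` and `alpha` are assumptions chosen for illustration.

```python
def polynomial_decay(t, eta_0=0.3, beta=1.0, alpha=0.5):
    """eta_t = eta_0 * (1 + beta * t) ** -alpha."""
    return eta_0 * (1.0 + beta * t) ** -alpha

# beta = 1, alpha = 0.5 reproduces eta_0 * (t + 1) ** -0.5
for t in (0, 3, 99):
    print(t, polynomial_decay(t))

# A smaller beta stretches the decay over more updates
print(polynomial_decay(99, beta=0.01))
```

Here `beta` controls how quickly the decay sets in and `alpha` controls how steep it eventually becomes.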
Drop \eta by a fixed factor at preset epochs (e.g. 30, 60, 90). Standard for ImageNet ResNet training:
Notice the loss curve changes slope after each scheduled drop: high \eta explores quickly, then lower \eta settles into a narrower basin.
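A piecewise-constant schedule of this kind only needs to count how many milestones have been passed. The sketch below assumes a base rate of 0.1 and a drop factor of 0.1 purely for illustration; `multistep_lr` is a hypothetical helper, not a library function.

```python
from bisect import bisect_right

def multistep_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Multiply base_lr by gamma once for every milestone already reached."""
    drops = bisect_right(milestones, epoch)  # milestones <= epoch
    return base_lr * gamma ** drops

for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, multistep_lr(epoch))
```

The rate stays flat between milestones and changes only at epochs 30, 60, and 90, which is where the slope changes in the loss curve come from.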
\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T)). Smooth decay over T steps; small \eta at the end yields clean fine-tuning. Often paired with warmup and warm restarts:
Cosine avoids abrupt jumps. The tail becomes increasingly conservative, which often improves final accuracy without manual milestone tuning.
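The cosine formula can be sketched as a plain function of the step index. The name `cosine_lr` and the default values of `eta_max`, `eta_min`, and `T` are assumptions for illustration; holding the rate at `eta_min` after step T is also a choice made here, not something the formula prescribes.

```python
import math

def cosine_lr(t, eta_max=0.3, eta_min=0.01, T=100):
    """eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    t = min(t, T)  # assumption: stay at eta_min once the schedule ends
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))

for t in (0, 25, 50, 75, 100):
    print(t, cosine_lr(t))
```

The schedule starts at `eta_max`, passes through the midpoint of the two rates at t = T/2, and flattens out as it approaches `eta_min`.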
Adam-trained Transformers can diverge if \eta starts at its target value, because the preconditioner \hat{\mathbf{s}}_t has not stabilized yet. A linear warmup from 0 to \eta_0 over the first ~1k steps usually prevents this:
Warmup spends the early updates ramping \eta up instead of taking a full-size first step. This keeps updates small while the initial gradient statistics are still unreliable, then hands off to the cosine decay.
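Linear warmup followed by cosine decay can be combined in one function. This is a sketch under stated assumptions: the name `warmup_cosine_lr`, the 10k-step total, and the value of `eta_min` are illustrative, and only the 1k-step linear ramp from 0 to \eta_0 comes from the text above.

```python
import math

def warmup_cosine_lr(t, eta_0=0.3, eta_min=0.0, warmup_steps=1000, total_steps=10000):
    """Linear ramp from 0 to eta_0, then cosine decay from eta_0 to eta_min."""
    if t < warmup_steps:
        return eta_0 * t / warmup_steps  # warmup phase
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (t - warmup_steps) / (total_steps - warmup_steps))
    return eta_min + 0.5 * (eta_0 - eta_min) * (1.0 + math.cos(math.pi * progress))

for t in (0, 500, 1000, 5500, 10000):
    print(t, warmup_cosine_lr(t))
```

The two phases meet at `warmup_steps`, where both branches give exactly `eta_0`, so the schedule has no jump at the handoff.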