The optimizer matters; the learning rate schedule often matters more. A constant \eta forces a choice: a large value makes fast early progress but stays noisy near a minimum, while a small value settles cleanly but takes long to get there. A good schedule gets both: aggressive early, careful late.
LeNet on Fashion-MNIST serves as the experimental harness. The implementation is deliberately ordinary, so that the schedule is the only thing that changes between runs.
Constant \eta=0.3 is the baseline. Watch for the usual pattern: fast early movement, then noisy late progress as the step size stays too large for fine tuning.
train loss 0.147, train acc 0.944, test acc 0.888
Fixed learning rates expose the central tradeoff:
learning rate is now 0.10
The first custom scheduler uses \eta_t = \eta_0(t+1)^{-1/2}.
It is simple and monotone, but modern practice usually prefers multi-step or cosine schedules.
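As a minimal sketch of the rule \eta_t = \eta_0(t+1)^{-1/2}, the scheduler can be written as a small callable object. The class name `SquareRootScheduler`, the argument name `num_update`, and the value \eta_0 = 0.1 are illustrative assumptions, not necessarily what the original harness uses.

```python
class SquareRootScheduler:
    """Callable schedule: eta_t = eta_0 * (t + 1) ** -0.5."""

    def __init__(self, lr=0.1):
        # lr plays the role of eta_0 (assumed value, for illustration only)
        self.lr = lr

    def __call__(self, num_update):
        # num_update is the 0-based update (or epoch) counter t
        return self.lr * (num_update + 1.0) ** -0.5


scheduler = SquareRootScheduler(lr=0.1)
# Learning rates for the first few updates; each is smaller than the last.
rates = [scheduler(t) for t in range(4)]
```

The rate starts at \eta_0 and has halved by t = 3, since (3+1)^{-1/2} = 1/2. After that the decay slows considerably, which is part of why this schedule can leave late updates larger than one would like.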
Apply the same \eta_t = \eta_0(t+1)^{-1/2} policy during training. The curve should smooth out because late updates are smaller:
train loss 0.271, train acc 0.902, test acc 0.877
Polynomial decay, \eta_t = \eta_0 \cdot (1 + \beta t)^{-\alpha}, shrinks the rate gradually rather than in jumps. It was the common choice before step decay took over.
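The polynomial rule translates directly into a one-line function. This is only a sketch: the function name `polynomial_decay` and the default values for \eta_0, \alpha, and \beta are assumptions chosen for illustration.

```python
def polynomial_decay(t, eta0=0.3, alpha=0.5, beta=1.0):
    """eta_t = eta_0 * (1 + beta * t) ** -alpha.

    eta0, alpha and beta are illustrative defaults, not tuned values.
    """
    return eta0 * (1.0 + beta * t) ** -alpha


# alpha controls how fast the rate falls; beta stretches the time axis.
slow = [polynomial_decay(t, alpha=0.5) for t in range(5)]
fast = [polynomial_decay(t, alpha=1.0) for t in range(5)]
```

With \alpha = 1/2 and \beta = 1 this reduces to the square-root schedule \eta_0(t+1)^{-1/2}, so that schedule is one member of this family.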
Drop \eta by a fixed factor at preset epochs (e.g. 30, 60, 90). Standard for ImageNet ResNet training:
train loss 0.195, train acc 0.926, test acc 0.884
Notice the loss curve changes slope after each scheduled drop: high \eta explores quickly, then lower \eta settles into a narrower basin.
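The piecewise-constant rule can be sketched as a pure function of the epoch index. The milestones 30, 60, 90 come from the example above; the function name `multistep_lr`, the starting rate 0.1, and the decay factor 0.1 are assumptions made for illustration and are not the settings behind the numbers reported here.

```python
import bisect


def multistep_lr(epoch, eta0=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Return eta0 multiplied by gamma once for every milestone reached.

    eta0 and gamma are illustrative assumptions; milestones must be sorted.
    """
    # Count the milestones that are <= epoch, so the drop takes effect
    # at the milestone epoch itself.
    drops = bisect.bisect_right(milestones, epoch)
    return eta0 * gamma ** drops


# One learning rate per epoch for a hypothetical 100-epoch run.
schedule = [multistep_lr(e) for e in range(100)]
```

The rate stays flat between milestones, so each plateau in the schedule corresponds to one of the slope changes visible in the loss curve.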
\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T)). Smooth decay over T steps; small \eta at the end yields clean fine-tuning. Often paired with warmup and warm restarts:
train loss 0.192, train acc 0.930, test acc 0.901
Cosine avoids abrupt jumps. The tail becomes increasingly conservative, which often improves final accuracy without manual milestone tuning.
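A minimal sketch of the cosine formula follows. The function name `cosine_lr`, the default values of \eta_{\max} and \eta_{\min}, and the choice to hold the rate at \eta_{\min} once t exceeds T are assumptions, not details taken from the original experiment.

```python
import math


def cosine_lr(t, T, eta_max=0.3, eta_min=0.0):
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    # Assumption: after the schedule ends, keep the rate at eta_min.
    t = min(t, T)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


# Rates over a hypothetical 20-epoch schedule.
rates = [cosine_lr(t, T=20) for t in range(21)]
```

The rate equals \eta_{\max} at t = 0, passes through the midpoint of the two bounds at t = T/2, and reaches \eta_{\min} at t = T. The decay is slowest at both ends, which is where the conservative tail comes from.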
Adam-trained Transformers can diverge if \eta starts at its target value, because the second-moment estimate \hat{\mathbf{s}}_t that preconditions each update has not stabilized yet. Linear warmup from 0 to \eta_0 over roughly the first 1k steps usually prevents this:
train loss 0.202, train acc 0.926, test acc 0.902
Warmup spends the first updates ramping up instead of taking a full-size first step. This guards against the unreliable statistics of early training, then hands off to the cosine decay.
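Linear warmup followed by cosine decay can be sketched as a single function of the step index. The name `warmup_cosine_lr`, the default of 1000 warmup steps, and the values of \eta_0 and \eta_{\min} are illustrative assumptions; the original experiment may count warmup in epochs rather than steps.

```python
import math


def warmup_cosine_lr(t, total_steps, warmup_steps=1000, eta0=0.3, eta_min=0.0):
    """Linear ramp from 0 to eta0, then cosine decay from eta0 to eta_min."""
    if t < warmup_steps:
        # Warmup phase: the rate grows linearly with the step index.
        return eta0 * t / warmup_steps
    # Decay phase: fraction of the post-warmup budget already used, capped at 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(t - warmup_steps, decay_steps) / decay_steps
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * progress))


# Rates sampled every 500 steps of a hypothetical 10k-step run.
rates = [warmup_cosine_lr(t, total_steps=10_000) for t in range(0, 10_001, 500)]
```

The two phases meet at t = `warmup_steps`, where the ramp reaches \eta_0 and the cosine term starts from the same value, so the schedule has no jump at the handoff.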