The optimizer matters; the learning rate schedule often matters more. A constant \eta forces a choice: large steps that move fast but stay noisy near the minimum, or small steps that converge cleanly but slowly. A good schedule gets both: aggressive early, careful late.
The experimental harness is LeNet on Fashion-MNIST.
The implementation is deliberately ordinary so that the schedule is the only thing that changes between runs.
Constant \eta=0.3 is the baseline. Watch for the usual pattern: fast early movement, then noisy late progress as the step size stays too large for fine tuning.
loss 0.209, train acc 0.922, test acc 0.896
21751.0 examples/sec on /GPU:0
Fixed learning rates expose the central tradeoff:
learning rate is now 0.1
The first custom scheduler uses \eta_t = \eta_0(t+1)^{-1/2}.
It is simple and monotone, but modern practice usually prefers multi-step or cosine schedules.
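A minimal sketch of the square-root rule as a callable, assuming `t` counts updates (or epochs) from zero. The class name `SquareRootScheduler` and the default `lr=0.1` are illustrative choices, not taken from the harness above.

```python
class SquareRootScheduler:
    """Returns eta_t = eta_0 * (t + 1) ** -0.5 for update index t >= 0."""

    def __init__(self, lr=0.1):
        self.lr = lr  # eta_0, the rate used at t = 0

    def __call__(self, num_update):
        return self.lr * (num_update + 1) ** -0.5


scheduler = SquareRootScheduler(lr=0.1)
# The rate halves by t = 3 and keeps shrinking, but ever more slowly.
rates = [scheduler(t) for t in range(10)]
```

Because the object only maps a step index to a rate, it can be passed to whatever hook the training framework provides for per-step or per-epoch learning rates.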
Apply the same \eta_t = \eta_0(t+1)^{-1/2} policy during training. The curve should smooth out because late updates are smaller:
loss 0.386, train acc 0.858, test acc 0.848
32394.2 examples/sec on /GPU:0
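Polynomial decay: \eta_t = \eta_0 \cdot (1 + \beta t)^{-\alpha} with \alpha, \beta > 0. The rate shrinks gradually, and \alpha = 1/2, \beta = 1 recovers the square-root rule above. It was a common default before step decay became popular.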
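The polynomial rule can be sketched in a few lines. The function name `polynomial_decay` and the default values for `eta0`, `beta`, and `alpha` are hypothetical and chosen only for illustration.

```python
def polynomial_decay(t, eta0=0.1, beta=0.01, alpha=0.5):
    """eta_t = eta0 * (1 + beta * t) ** (-alpha) for step index t >= 0."""
    return eta0 * (1.0 + beta * t) ** (-alpha)


# A smaller beta stretches the decay over more steps;
# a larger alpha makes the tail fall faster.
slow = [polynomial_decay(t, beta=0.01) for t in (0, 100, 1000)]
fast = [polynomial_decay(t, beta=0.1) for t in (0, 100, 1000)]
```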
Step decay: multiply \eta by a fixed factor (commonly 0.1) at preset epochs (e.g. 30, 60, 90). This is the standard recipe for ResNet training on ImageNet:
loss 0.238, train acc 0.912, test acc 0.884
38403.3 examples/sec on /GPU:0
Notice the loss curve changes slope after each scheduled drop: high \eta explores quickly, then lower \eta settles into a narrower basin.
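A sketch of the drop-at-milestones rule, assuming the drop takes effect at the milestone epoch itself. The name `multistep_lr`, the base rate, and the decay factor `gamma=0.1` are assumptions for illustration. Only the example milestones 30, 60, 90 come from the text.

```python
import bisect


def multistep_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Piecewise-constant rate: multiplied by gamma at each milestone epoch.

    `milestones` must be sorted in increasing order.
    """
    # Count how many milestones have been reached at this epoch.
    drops = bisect.bisect_right(milestones, epoch)
    return base_lr * gamma ** drops


# The rate is flat between milestones and drops by 10x at each one.
schedule = [multistep_lr(e) for e in (0, 29, 30, 59, 60, 90)]
```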
\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T)). Smooth decay over T steps; small \eta at the end yields clean fine-tuning. Often paired with warmup and warm restarts:
loss 0.257, train acc 0.906, test acc 0.881
38242.6 examples/sec on /GPU:0
Cosine avoids abrupt jumps. The tail becomes increasingly conservative, which often improves final accuracy without manual milestone tuning.
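The cosine formula translates directly into code. In this sketch the name `cosine_lr` and the default values of `eta_max` and `eta_min` are hypothetical. Holding the rate at `eta_min` once `t` exceeds `T` is one possible convention, not something the formula itself prescribes.

```python
import math


def cosine_lr(t, T, eta_max=0.3, eta_min=0.01):
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    t = min(t, T)  # assumption: stay at eta_min after the schedule ends
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))


# Starts at eta_max, passes the midpoint at t = T / 2, ends at eta_min.
curve = [cosine_lr(t, T=20) for t in range(0, 31)]
```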
Adam-trained Transformers often diverge if \eta starts at its target value, because the preconditioner \hat{\mathbf{s}}_t hasn’t stabilized yet. Linear warmup from 0 to \eta_0 over the first ~1k steps usually fixes it:
loss 0.270, train acc 0.902, test acc 0.879
32601.9 examples/sec on /GPU:0
Warmup spends the early epochs ramping up instead of taking a full-size first step. This guards against unstable initial statistics, then hands off to cosine decay.
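A sketch of linear warmup followed by cosine decay, assuming the cosine phase runs over the steps that remain after warmup. The name `warmup_cosine_lr` and all default values are hypothetical, and the function requires `T > warmup_steps`.

```python
import math


def warmup_cosine_lr(t, warmup_steps=5, T=20, eta_max=0.3, eta_min=0.01):
    """Linear ramp from 0 to eta_max, then cosine decay down to eta_min at step T."""
    if t < warmup_steps:
        # Linear warmup: 0 at t = 0, reaching eta_max at t = warmup_steps.
        return eta_max * t / warmup_steps
    # Fraction of the decay phase completed, clipped to 1 after step T.
    decay_steps = T - warmup_steps
    progress = min(t - warmup_steps, decay_steps) / decay_steps
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))


# The two phases meet at eta_max, so the schedule has no jump at the handoff.
curve = [warmup_cosine_lr(t) for t in range(0, 26)]
```

With ~1k warmup steps, as in the Transformer setting mentioned above, `t` would count optimizer updates rather than epochs.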