%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx
npx.set_np()

What if different parameters need different learning rates? A rare feature gets updated once per million steps; a common one gets updated at every step. Sharing a single \eta forces a compromise: too small for the rare feature, too large for the common one.
Adagrad (Duchi, Hazan, Singer 2011) gives each parameter its own learning rate: \eta divided by the square root of the running sum of that parameter's past squared gradients:
\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2,\quad \mathbf{x}_t = \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.
Coordinates with large gradients shrink their effective step; rarely updated coordinates keep theirs. This per-coordinate scaling is the seed of later adaptive optimizers such as RMSProp and Adam.
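
To make the per-coordinate effect concrete, here is a minimal sketch in plain Python (no MXNet) that computes the effective learning rate \eta / \sqrt{\mathbf{s}_t + \epsilon} after a stream of gradients. The helper name `adagrad_effective_rates`, the value \eta = 0.1, and the gradient stream are all made up for illustration: coordinate 0 receives a unit gradient at every step, coordinate 1 only at the first of 100 steps.

```python
import math

def adagrad_effective_rates(grad_history, eta=0.1, eps=1e-6):
    """Return eta / sqrt(s + eps) per coordinate after consuming grad_history.

    Hypothetical helper for illustration; it only tracks the accumulator s
    and does not update any parameters.
    """
    dim = len(grad_history[0])
    s = [0.0] * dim
    for g in grad_history:
        for i in range(dim):
            s[i] += g[i] ** 2  # accumulate squared gradients per coordinate
    return [eta / math.sqrt(s_i + eps) for s_i in s]

# Assumed toy stream: coordinate 0 is "common" (gradient 1.0 at every step),
# coordinate 1 is "rare" (gradient 1.0 at step 0 only, zero afterwards).
history = [(1.0, 1.0 if t == 0 else 0.0) for t in range(100)]
common_rate, rare_rate = adagrad_effective_rates(history)
print(common_rate, rare_rate)
```

Under these assumptions the common coordinate has accumulated 100 in its accumulator and the rare one only 1, so the rare coordinate's effective rate is about ten times larger.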
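We return to the same anisotropic quadratic, f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2. Adagrad adapts the step size of each coordinate on its own, and a larger learning rate becomes safe because the \sqrt{\mathbf{s}_t} divisor absorbs the difference in gradient scale between coordinates: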
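def adagrad_2d(x1, x2, s1, s2):
    eps = 1e-6
    # Gradients of f_2d with respect to x1 and x2
    g1, g2 = 0.2 * x1, 4 * x2
    # Accumulate squared gradients, one accumulator per coordinate
    s1 += g1 ** 2
    s2 += g2 ** 2
    # Scale each coordinate's step by its own accumulator
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))

Carry one accumulator \mathbf{s} per parameter. Add \epsilon inside the square root to avoid division by zero while an accumulator is still zero, as can happen on the first step: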
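
As a rough, framework-free sketch of that recipe, the snippet below keeps one accumulator per parameter in a plain Python list and applies the update in place. The names `adagrad_step`, `grad_f_2d`, and `run` are hypothetical helpers, and the starting point (-5, -2) and 20 steps are assumptions chosen for illustration, not values stated in the text. It runs on the same quadratic with \eta = 0.4 and with a larger \eta = 2.

```python
import math

def adagrad_step(params, grads, state, eta, eps=1e-6):
    """One in-place Adagrad update; state holds one accumulator per parameter."""
    for i, g in enumerate(grads):
        state[i] += g ** 2                                 # s_t = s_{t-1} + g_t^2
        params[i] -= eta / math.sqrt(state[i] + eps) * g   # eps guards against s = 0

def f_2d(x):
    return 0.1 * x[0] ** 2 + 2 * x[1] ** 2

def grad_f_2d(x):
    # Gradient of f(x1, x2) = 0.1 * x1**2 + 2 * x2**2
    return [0.2 * x[0], 4.0 * x[1]]

def run(eta, steps=20, start=(-5.0, -2.0)):
    """Run Adagrad from an assumed starting point and return the final iterate."""
    x = list(start)
    state = [0.0] * len(x)  # accumulators start at zero
    for _ in range(steps):
        adagrad_step(x, grad_f_2d(x), state, eta)
    return x

slow = run(eta=0.4)
fast = run(eta=2.0)
print("eta=0.4:", slow, "f =", f_2d(slow))
print("eta=2.0:", fast, "f =", f_2d(fast))
```

In this setup the larger rate does not diverge, and after the same number of steps it ends much closer to the minimum at the origin than \eta = 0.4 does.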