%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx
npx.set_np()
Adagrad’s accumulator \mathbf{s}_t = \sum_{\tau \le t} \mathbf{g}_\tau^2 never decreases and keeps growing as long as the gradients stay nonzero. The effective learning rate \eta / \sqrt{\mathbf{s}_t} therefore decays toward zero. That is fine for convex or sparse problems, but problematic for deep non-convex training, where the model still needs meaningful updates late in the run.
RMSProp (Hinton, 2012) replaces the running sum with an exponentially weighted average:
\mathbf{s}_t = \gamma \mathbf{s}_{t-1} + (1-\gamma) \mathbf{g}_t^2,\quad \mathbf{x}_t = \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.
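The difference between the two accumulators is easiest to see on a toy input. The following is a minimal sketch, assuming a constant unit gradient (an illustrative choice, not taken from the text) and hypothetical helper names `effective_lr_adagrad` and `effective_lr_rmsprop`. It tracks only the scalar factor \eta / \sqrt{\mathbf{s}_t + \epsilon} for each method.

```python
import math

def effective_lr_adagrad(grads, eta=0.4, eps=1e-6):
    """Effective learning rate per step when s_t is the running sum of g^2."""
    s, rates = 0.0, []
    for g in grads:
        s += g ** 2
        rates.append(eta / math.sqrt(s + eps))
    return rates

def effective_lr_rmsprop(grads, eta=0.4, gamma=0.9, eps=1e-6):
    """Effective learning rate per step when s_t is a leaky average of g^2."""
    s, rates = 0.0, []
    for g in grads:
        s = gamma * s + (1 - gamma) * g ** 2
        rates.append(eta / math.sqrt(s + eps))
    return rates

# Assumption for illustration: the gradient is constant and equal to 1.
grads = [1.0] * 200
ada = effective_lr_adagrad(grads)
rms = effective_lr_rmsprop(grads)
for t in (1, 10, 100, 200):
    print(f"t={t:3d}  adagrad={ada[t - 1]:.4f}  rmsprop={rms[t - 1]:.4f}")
```

Under this assumption Adagrad's factor falls like \eta / \sqrt{t}, while RMSProp's settles near \eta once the leaky average has warmed up. The RMSProp factor starts out larger than \eta because \mathbf{s}_0 = 0 and no bias correction is applied.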
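The average has a finite memory of roughly 1/(1-\gamma) steps; the typical \gamma = 0.9 gives about 10 steps. Because old gradient magnitudes are forgotten, \mathbf{s}_t no longer grows monotonically and the effective learning rate stops decaying toward zero.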
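The horizon estimate can be checked by unrolling the recursion. The short derivation below assumes the initialization \mathbf{s}_0 = 0, which the text does not state explicitly:

```latex
\mathbf{s}_t
  = (1-\gamma) \sum_{k=0}^{t-1} \gamma^{k}\, \mathbf{g}_{t-k}^2,
\qquad
(1-\gamma) \sum_{k=0}^{t-1} \gamma^{k} = 1 - \gamma^{t} \xrightarrow{t \to \infty} 1,
\qquad
\gamma^{1/(1-\gamma)} \approx e^{-1} \ \text{for } \gamma \text{ close to } 1.
```

The squared gradient from k steps ago carries weight (1-\gamma)\gamma^k. The weights sum to nearly 1, and after about 1/(1-\gamma) steps a gradient's weight has dropped to roughly 1/e of its initial value.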
Visualize \gamma^t for several \gamma: choosing \gamma amounts to choosing an effective time horizon.
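As a stand-in for a plot, the sketch below tabulates \gamma^t at a few time steps using only the standard library. The particular \gamma values, the sample steps, and the helper names `decay_profile` and `horizon` are illustrative assumptions rather than part of the original material.

```python
def decay_profile(gamma, steps):
    """Return gamma**t for each t in `steps`."""
    return [gamma ** t for t in steps]

def horizon(gamma):
    """Approximate memory length 1 / (1 - gamma) of the leaky average."""
    return 1.0 / (1.0 - gamma)

sample_steps = (0, 5, 10, 20, 40)     # illustrative time steps
gammas = (0.95, 0.9, 0.8, 0.7)        # illustrative decay rates

print("gamma  horizon  " + "  ".join(f"t={t:<5d}" for t in sample_steps))
for gamma in gammas:
    row = "  ".join(f"{w:7.4f}" for w in decay_profile(gamma, sample_steps))
    print(f"{gamma:5.2f}  {horizon(gamma):7.1f}  {row}")
```

Each row decays faster as \gamma shrinks, and the column nearest to a row's horizon is where \gamma^t has fallen to roughly a third of its starting value.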
Same skeleton as Adagrad, but the accumulator update is now an EMA. One extra hyperparameter (\gamma):
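def rmsprop_2d(x1, x2, s1, s2):
    # Gradients of f_2d; eps keeps the denominator away from zero
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    # Leaky average of squared gradients (replaces Adagrad's running sum)
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    # Per-coordinate step scaled by 1 / sqrt(s + eps)
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

# eta and gamma are read as globals inside rmsprop_2d
eta, gamma = 0.4, 0.9
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))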
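To inspect the trajectory without the `d2l` plotting helpers, the same update can be run in a plain loop. This is a minimal sketch: the starting point (-5, -2) and the 20 steps are assumptions chosen to mirror what `d2l.train_2d` is believed to use, and `rmsprop_step` and `run_rmsprop` are hypothetical names that take `eta` and `gamma` as arguments instead of globals.

```python
import math

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def rmsprop_step(x1, x2, s1, s2, eta, gamma, eps=1e-6):
    """One RMSProp update on f_2d, with hyperparameters passed explicitly."""
    g1, g2 = 0.2 * x1, 4 * x2
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def run_rmsprop(eta=0.4, gamma=0.9, steps=20, start=(-5.0, -2.0)):
    """Return the list of visited points, including the start (assumed values)."""
    x1, x2 = start
    s1 = s2 = 0.0
    trace = [(x1, x2)]
    for _ in range(steps):
        x1, x2, s1, s2 = rmsprop_step(x1, x2, s1, s2, eta, gamma)
        trace.append((x1, x2))
    return trace

trace = run_rmsprop()
x1, x2 = trace[-1]
print(f"after {len(trace) - 1} steps: x1={x1:.6f}, x2={x2:.6f}, f={f_2d(x1, x2):.8f}")
```

Passing `eta` and `gamma` as arguments makes it easy to rerun the loop with other settings, for example a smaller `gamma` to shorten the memory of the accumulator.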