RMSProp

From Adagrad to RMSProp

Adagrad’s accumulator \mathbf{s}_t = \sum_{\tau \le t} \mathbf{g}_\tau^2 grows without bound, so the effective learning rate \eta / \sqrt{\mathbf{s}_t} decays toward zero (like O(t^{-1/2}) when gradient magnitudes stay roughly constant). That schedule is fine for convex or sparse problems, but harmful in deep non-convex training, where the model never stops needing meaningful updates.
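To make the decay concrete, here is a minimal pure-Python sketch. It assumes a hypothetical scalar gradient whose magnitude stays at 1 forever; the helper name adagrad_effective_lr and the value eta = 0.4 are illustrative choices, not part of the original material. Under that assumption s_t = t, so the effective learning rate is about eta / sqrt(t).

```python
import math

def adagrad_effective_lr(eta, grads, eps=1e-6):
    """Return eta / sqrt(s_t + eps) after each step of Adagrad's running sum."""
    s, rates = 0.0, []
    for g in grads:
        s += g ** 2                      # Adagrad: the sum only ever grows
        rates.append(eta / math.sqrt(s + eps))
    return rates

# Assumption: the gradient magnitude stays at 1 for 10,000 steps.
rates = adagrad_effective_lr(eta=0.4, grads=[1.0] * 10_000)
for t in (1, 100, 10_000):
    print(f"step {t:>6}: effective lr = {rates[t - 1]:.4f}")
```

Every 100-fold increase in the step count costs a factor of 10 in step size, even though the gradient itself never shrank.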

The Algorithm

RMSProp (Hinton, 2012) replaces the running sum with an exponentially weighted average:

\mathbf{s}_t = \gamma \mathbf{s}_{t-1} + (1-\gamma) \mathbf{g}_t^2,\quad \mathbf{x}_t = \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.

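Finite memory: the squared gradient from k steps ago carries weight (1-\gamma)\gamma^k, and these weights sum to 1 in the limit. The effective window is \sim 1/(1-\gamma) steps, so the typical \gamma = 0.9 gives roughly 10 steps. Old gradient magnitudes are forgotten, and the effective learning rate stops decaying toward zero.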
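A small sketch of why the accumulator stays bounded, again assuming a hypothetical constant scalar gradient g (the function name rmsprop_accumulator and the values below are illustrative). Starting from s_0 = 0, the recursion has the closed form s_t = (1 - gamma^t) g^2, which saturates at g^2 instead of growing with t.

```python
def rmsprop_accumulator(gamma, grads):
    """Return the EMA s_t of squared gradients after each step, with s_0 = 0."""
    s, history = 0.0, []
    for g in grads:
        s = gamma * s + (1 - gamma) * g ** 2
        history.append(s)
    return history

gamma, g = 0.9, 2.0                      # illustrative values
history = rmsprop_accumulator(gamma, [g] * 100)
for t in (1, 10, 100):
    closed_form = (1 - gamma ** t) * g ** 2
    print(f"t={t:>3}: s_t={history[t - 1]:.4f}  closed form={closed_form:.4f}")
```

Once s_t has settled near g^2, the effective learning rate sits near eta / |g| for as long as the gradient scale stays the same.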

Decay coefficients

Visualize the weights (1-\gamma)\gamma^t that the average places on the squared gradient from t steps ago, for several \gamma. Choosing \gamma is choosing an effective time horizon:

%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx

npx.set_np()

Plot the weights over the past 40 time steps:

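d2l.set_figsize()
gammas = [0.95, 0.9, 0.8, 0.7]
x = d2l.numpy(d2l.arange(40))  # time steps 0..39
for gamma in gammas:
    # Weight the EMA places on the squared gradient from x steps ago
    d2l.plt.plot(x, (1-gamma) * gamma ** x, label=f'gamma = {gamma:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();  # show which curve belongs to which gamma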
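The horizon 1/(1-\gamma) is a rule of thumb rather than a hard cutoff. As a rough check, the sketch below sums the weights that fall inside that window; the helper name weight_mass is an illustrative assumption. The sum over the n most recent steps equals 1 - \gamma^n.

```python
def weight_mass(gamma, n):
    """Total EMA weight on the n most recent squared gradients."""
    return sum((1 - gamma) * gamma ** k for k in range(n))

for gamma in (0.95, 0.9, 0.8, 0.7):
    horizon = round(1 / (1 - gamma))     # rule-of-thumb window length
    print(f"gamma={gamma:.2f}: horizon ~{horizon} steps, "
          f"mass inside horizon = {weight_mass(gamma, horizon):.2f}")
```

For these values roughly two thirds of the total weight lies inside the horizon; the rest is spread over an exponentially decaying tail of older gradients.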

From-scratch RMSProp

We test on the anisotropic quadratic f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2. The skeleton is the same as for Adagrad, but the accumulator update is now an EMA, which adds one hyperparameter (\gamma):

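def rmsprop_2d(x1, x2, s1, s2):
    # Gradients of f_2d; eta and gamma are globals set below
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    # EMA of squared gradients, one accumulator per coordinate
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    # Per-coordinate step, scaled by 1 / sqrt(s + eps)
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2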

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta, gamma = 0.4, 0.9
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))
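For readers without d2l installed, the same experiment can be sketched in plain Python. The start point (-5, -2), the zero initial state, and the 20 steps are assumptions chosen to mirror what the d2l training helper is believed to do; the function name run_rmsprop_2d is hypothetical.

```python
import math

def run_rmsprop_2d(eta=0.4, gamma=0.9, steps=20, eps=1e-6):
    """Run RMSProp on f(x1, x2) = 0.1*x1**2 + 2*x2**2 and return all iterates."""
    x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0    # assumed start point, zero state
    trace = [(x1, x2)]
    for _ in range(steps):
        g1, g2 = 0.2 * x1, 4 * x2            # gradient of f
        s1 = gamma * s1 + (1 - gamma) * g1 ** 2
        s2 = gamma * s2 + (1 - gamma) * g2 ** 2
        x1 -= eta / math.sqrt(s1 + eps) * g1
        x2 -= eta / math.sqrt(s2 + eps) * g2
        trace.append((x1, x2))
    return trace

trace = run_rmsprop_2d()
for step in (0, 1, 5, 20):
    x1, x2 = trace[step]
    f = 0.1 * x1 ** 2 + 2 * x2 ** 2
    print(f"step {step:>2}: x1={x1:+.4f}, x2={x2:+.4f}, f={f:.6f}")
```

Under these assumptions the steep x2 direction is driven to zero within a few steps, while the shallow x1 direction keeps making steady progress because its accumulator forgets the early, larger gradients.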

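For a real model, keep one accumulator per parameter tensor and apply the same update in place:

def init_rmsprop_states(feature_dim):
    # One accumulator per parameter, initialized to zero
    s_w = d2l.zeros((feature_dim, 1))
    s_b = d2l.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        # In-place updates so the state persists across calls
        s[:] = gamma * s + (1 - gamma) * np.square(p.grad)
        p[:] -= hyperparams['lr'] * p.grad / np.sqrt(s + eps)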

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),
               {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);
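One consequence of starting the accumulator at zero is worth checking numerically. After the first step s = (1-\gamma) g^2, so the first update has magnitude about lr / \sqrt{1-\gamma} whatever the gradient scale is. The sketch below uses a hypothetical helper first_rmsprop_step with the same lr, gamma, and eps as above.

```python
import math

def first_rmsprop_step(grad, lr=0.01, gamma=0.9, eps=1e-6):
    """Signed size of the very first update for one coordinate, from s = 0."""
    s = (1 - gamma) * grad ** 2          # EMA after a single step
    return lr * grad / math.sqrt(s + eps)

for grad in (0.5, 5.0, 50.0):
    print(f"grad={grad:>5}: first step = {first_rmsprop_step(grad):.4f}")
print(f"lr / sqrt(1 - gamma) = {0.01 / math.sqrt(1 - 0.9):.4f}")
```

With \gamma = 0.9 the first step is therefore about 3.2 times lr. This start-up bias fades as the average fills up, and it is the kind of effect that a bias correction addresses.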

Concise: framework RMSProp

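RMSProp is also available as a built-in optimizer; only the hyperparameter names change:

# Assumes MXNet 1.x, which names the decay rate gamma1 (Keras calls it rho)
d2l.train_concise_ch11('rmsprop', {'learning_rate': 0.01, 'gamma1': 0.9},
                       data_iter)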

Recap

RMSProp = Adagrad with the accumulator replaced by an EMA: \mathbf{s}_t = \gamma \mathbf{s}_{t-1} + (1-\gamma) \mathbf{g}_t^2.
Standard \gamma = 0.9 → ~10-step effective window.
Effective learning rate doesn’t collapse, so usable in deep non-convex training.
Adam = RMSProp + momentum on the gradient (with bias correction). Coming up.