Training the airfoil model

Adadelta

Adadelta (Zeiler, 2012) takes RMSProp further: it adapts per-parameter step magnitudes and removes the global learning rate entirely.

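It keeps two exponential moving averages (EMAs): one over squared gradients, one over squared updates. The ratio of their square roots rescales the gradient so that the update comes out in the same units as the parameters. The step is dimensionally consistent, so no separate \eta is needed (in principle).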

The update rule

\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1-\rho) \mathbf{g}_t^2,

\mathbf{u}_t = \rho \mathbf{u}_{t-1} + (1-\rho)(\Delta\mathbf{x}_t)^2,

where \mathbf{s}_t tracks squared gradients and \mathbf{u}_t tracks squared parameter updates. The actual step uses the previous update scale:

\Delta\mathbf{x}_t = -\frac{\sqrt{\mathbf{u}_{t-1}+\epsilon}} {\sqrt{\mathbf{s}_t+\epsilon}} \odot \mathbf{g}_t,

\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} + \Delta\mathbf{x}_t.
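As a sanity check on the rule above, consider the very first step for a single coordinate. This is a sketch under two assumptions: both accumulators start at zero (s_0 = u_0 = 0, matching the zero-initialized state buffers in the from-scratch code), and (1-\rho) g_1^2 is much larger than \epsilon.

s_1 = (1-\rho)\, g_1^2,

\Delta x_1 = -\frac{\sqrt{\epsilon}}{\sqrt{(1-\rho)\, g_1^2 + \epsilon}}\, g_1 \approx -\sqrt{\frac{\epsilon}{1-\rho}}\; \operatorname{sign}(g_1).

With \rho = 0.9 and \epsilon = 10^{-5} this gives |\Delta x_1| \approx \sqrt{10^{-4}} = 0.01, whatever the magnitude of g_1. The size of the first step is set by \epsilon and \rho rather than by the scale of the gradient.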

In practice, frameworks still expose a learning-rate hyperparameter for fine-tuning.
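One way such a learning-rate knob can sit on top of the update rule is sketched here for a single scalar parameter. The function name `adadelta_step` and the `lr` argument are hypothetical and chosen for illustration only; this is not any particular framework's API. Setting `lr=1.0` recovers the rule exactly as written above.

```python
import math

def adadelta_step(x, g, s, u, rho=0.9, eps=1e-5, lr=1.0):
    """One scalar Adadelta step; returns the new (x, s, u).

    `lr` is a hypothetical extra multiplier on the step (an assumption,
    not part of the original rule); lr=1.0 gives plain Adadelta.
    """
    s = rho * s + (1 - rho) * g * g                     # EMA of squared gradients
    dx = -math.sqrt(u + eps) / math.sqrt(s + eps) * g   # step from the previous u
    u = rho * u + (1 - rho) * dx * dx                   # EMA of squared updates
    return x + lr * dx, s, u

# Minimize f(x) = x^2 (gradient 2x) from x = 1 with zero-initialized state.
x, s, u = 1.0, 0.0, 0.0
for _ in range(3):
    x, s, u = adadelta_step(x, 2 * x, s, u)
print(x)
```

In this sketch the multiplier scales only the applied step, while the accumulator `u` tracks the unscaled `dx`. That is one possible design choice; an implementation could equally track the scaled step.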

From-scratch implementation

Two state buffers per parameter: s (EMA of squared gradients) and delta (EMA of squared updates, \mathbf{u}_t in the equations):

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()

def init_adadelta_states(feature_dim):
    s_w, s_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    delta_w, delta_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

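def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        # In-place updates via [:] so the state buffers persist across calls
        # EMA of squared gradients (s_t)
        s[:] = rho * s + (1 - rho) * np.square(p.grad)
        # Rescaled gradient (the negative of the update); uses the previous delta (u_{t-1})
        g = (np.sqrt(delta + eps) / np.sqrt(s + eps)) * p.grad
        p[:] -= g
        # EMA of squared updates (u_t)
        delta[:] = rho * delta + (1 - rho) * g * g

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)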
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim);

The loss curve should be read as a scale-adaptation demo: Adadelta trains without the hand-picked global learning rate that earlier optimizers needed.
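The scale-adaptation claim can be checked on a toy problem. The sketch below is a pure-Python illustration, not the airfoil experiment: it minimizes f(x) = scale * x^2 for two very different values of `scale` and compares where Adadelta ends up. The helper `run_adadelta`, the quadratic objective, and the step count are all assumptions made for this example. A single fixed-rate SGD step is included for contrast.

```python
import math

def run_adadelta(scale, steps=50, rho=0.9, eps=1e-5):
    """Minimize f(x) = scale * x**2 from x = 1 with scalar Adadelta."""
    x, s, u = 1.0, 0.0, 0.0
    for _ in range(steps):
        g = 2 * scale * x                                   # gradient of scale * x^2
        s = rho * s + (1 - rho) * g * g                     # EMA of squared gradients
        dx = -math.sqrt(u + eps) / math.sqrt(s + eps) * g   # Adadelta step
        u = rho * u + (1 - rho) * dx * dx                   # EMA of squared updates
        x += dx
    return x

def sgd_one_step(scale, lr=0.1):
    """One plain SGD step on the same objective, starting from x = 1."""
    x = 1.0
    return x - lr * 2 * scale * x

x_small = run_adadelta(scale=1.0)
x_large = run_adadelta(scale=1000.0)
print("Adadelta:", x_small, x_large)   # nearly the same trajectory for both scales

sgd_small = sgd_one_step(scale=1.0)
sgd_large = sgd_one_step(scale=1000.0)
print("SGD, lr=0.1:", sgd_small, sgd_large)  # the large-scale case overshoots badly
```

While the gradients stay large compared with \epsilon, rescaling the objective rescales \mathbf{g}_t and \sqrt{\mathbf{s}_t} by the same factor, so the Adadelta step barely changes. The fixed-rate SGD step grows in proportion to the scale. Close to the minimum, where squared gradients become comparable to \epsilon, the two Adadelta runs can start to differ.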

Concise: framework Adadelta

d2l.train_concise_ch11('adadelta', {'rho': 0.9}, data_iter)

Recap

  • Two EMAs: squared gradients \mathbf{s}_t and squared updates \mathbf{u}_t.
  • The per-parameter step scale is the ratio \sqrt{\mathbf{u}_{t-1}+\epsilon}/\sqrt{\mathbf{s}_t+\epsilon}. Multiplying the gradient by it gives an update in the units of the parameters, so the explicit learning rate can be dropped.
  • Less popular today than Adam, but a good case study in scale-invariant optimization design.