Training the airfoil model

Adadelta

Adadelta (Zeiler, 2012) takes RMSProp a step further: it adapts the step magnitude per parameter and, in its original form, removes the global learning rate.

It keeps two exponential moving averages (EMAs): one over squared gradients and one over squared updates. The ratio of their square roots rescales the gradient so that the resulting update carries the same units as the parameter, which is why no separate \eta is needed (in principle).
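A rough units check, following the argument in Zeiler (2012), may make the dimensional claim more concrete. It assumes the loss f is unitless and writes [\mathbf{x}] for the units of a parameter; this is a sketch of the reasoning, not a formal derivation.

```latex
% Assumption: f is unitless, so a gradient carries inverse parameter units.
\mathbf{g} = \frac{\partial f}{\partial \mathbf{x}} \;\propto\; \frac{1}{[\mathbf{x}]}

% Plain SGD: the step inherits the gradient's units, so \eta has to make up the difference.
\Delta\mathbf{x}_{\mathrm{SGD}} = -\eta\,\mathbf{g} \;\propto\; \frac{[\eta]}{[\mathbf{x}]}

% Adadelta: (RMS of past updates) / (RMS of past gradients), times the gradient.
\Delta\mathbf{x}_{\mathrm{Adadelta}} \;\propto\; \frac{[\mathbf{x}]}{1/[\mathbf{x}]} \cdot \frac{1}{[\mathbf{x}]} = [\mathbf{x}]
```

Under this reading, the numerator built from past updates is what restores the parameter's units, which is the role \eta would otherwise play.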

The update rule

\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1-\rho) \mathbf{g}_t^2,

\mathbf{u}_t = \rho \mathbf{u}_{t-1} + (1-\rho)(\Delta\mathbf{x}_t)^2,

where \mathbf{s}_t tracks squared gradients and \mathbf{u}_t tracks squared parameter updates. The actual step uses the previous update scale:

\Delta\mathbf{x}_t = -\frac{\sqrt{\mathbf{u}_{t-1}+\epsilon}} {\sqrt{\mathbf{s}_t+\epsilon}} \odot \mathbf{g}_t,

\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} + \Delta\mathbf{x}_t.

In practice, frameworks still expose a learning-rate hyperparameter for fine-tuning; PyTorch's torch.optim.Adadelta, for example, takes lr with a default of 1.0.
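The recursion in this section can be sketched in plain Python for a single scalar parameter. The names adadelta_step, run, and loss_scale, the toy objective, and the value of eps are illustrative assumptions, not part of the d2l code. The sketch also probes the scale-adaptation idea: multiplying the loss by 100 multiplies every gradient by 100, yet the two runs should end at nearly the same point, differing only through \epsilon.

```python
import math


def adadelta_step(x, g, s, u, rho=0.9, eps=1e-6):
    """One scalar Adadelta step; returns the new (x, s, u)."""
    s = rho * s + (1 - rho) * g * g                    # EMA of squared gradients
    dx = -math.sqrt(u + eps) / math.sqrt(s + eps) * g  # uses u from the previous step
    u = rho * u + (1 - rho) * dx * dx                  # EMA of squared updates
    return x + dx, s, u


def run(loss_scale, steps=50, x0=1.0):
    """Minimize the toy objective loss_scale * 0.5 * x**2 starting from x0."""
    x, s, u = x0, 0.0, 0.0
    for _ in range(steps):
        g = loss_scale * x  # gradient of loss_scale * 0.5 * x**2
        x, s, u = adadelta_step(x, g, s, u)
    return x


x_plain = run(loss_scale=1.0)
x_scaled = run(loss_scale=100.0)
print(x_plain, x_scaled)
```

With plain SGD and a fixed \eta, the second run would take steps 100 times larger; here the factor cancels between \mathbf{g}_t and \sqrt{\mathbf{s}_t}. Early steps are small because \mathbf{u} starts at zero, so only \epsilon feeds the numerator at first.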

From-scratch implementation

The implementation keeps two state buffers per parameter: s for \mathbf{s}_t and delta for \mathbf{u}_t.

%matplotlib inline
from d2l import torch as d2l
import torch

def init_adadelta_states(feature_dim):
    s_w, s_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    delta_w, delta_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # Update the buffers in place via [:] so the state persists across calls
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            # Rescaled gradient: delta still holds the value from the previous step
            g = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * p.grad
            p[:] -= g
            # EMA of squared updates, consumed by the next step
            delta[:] = rho * delta + (1 - rho) * g * g
        p.grad.zero_()

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim);

loss: 0.243, 0.100 sec/epoch

Read the loss curve as a demonstration of scale adaptation: Adadelta trains here without the hand-tuned global learning rate that the earlier optimizers required.

Concise: framework Adadelta

trainer = torch.optim.Adadelta
d2l.train_concise_ch11(trainer, {'rho': 0.9}, data_iter)

loss: 0.243, 0.079 sec/epoch
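To relate the framework optimizer to the update rule, the sketch below runs torch.optim.Adadelta and a hand-written recursion side by side on a toy quadratic and reports the largest difference between them. The objective, the variable names (x_opt, x_man, max_diff), and the choice of float64 are assumptions made for illustration. It passes lr=1.0 and eps=1e-6 explicitly, which are PyTorch's defaults, so eps differs from the 1e-5 used in the from-scratch code above.

```python
import torch

rho, eps, lr = 0.9, 1e-6, 1.0
torch.manual_seed(0)
x0 = torch.randn(5, dtype=torch.float64)

# Framework optimizer on the toy objective f(x) = 0.5 * ||x||^2.
x_opt = x0.clone().requires_grad_(True)
opt = torch.optim.Adadelta([x_opt], lr=lr, rho=rho, eps=eps)

# Hand-written recursion with its own state buffers.
x_man = x0.clone()
s = torch.zeros_like(x_man)
u = torch.zeros_like(x_man)

for _ in range(20):
    opt.zero_grad()
    loss = 0.5 * (x_opt ** 2).sum()
    loss.backward()
    opt.step()

    g = x_man.clone()                                    # gradient of 0.5 * ||x||^2 is x
    s = rho * s + (1 - rho) * g ** 2                     # EMA of squared gradients
    dx = -torch.sqrt(u + eps) / torch.sqrt(s + eps) * g  # step from the previous u
    u = rho * u + (1 - rho) * dx ** 2                    # EMA of squared updates
    x_man = x_man + lr * dx                              # lr scales the applied step

max_diff = (x_opt.detach() - x_man).abs().max().item()
print(max_diff)
```

With lr=1.0 the two trajectories should agree up to floating-point rounding, which suggests the exposed learning rate acts as an extra multiplier on the Adadelta step rather than replacing the rule.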

Recap

  • Two EMAs: squared gradients \mathbf{s}_t and squared updates \mathbf{u}_t.
  • The per-parameter step scales the gradient by \sqrt{\mathbf{u}_{t-1}+\epsilon}/\sqrt{\mathbf{s}_t+\epsilon}, so the update carries the units of the parameter and the explicit learning rate can be dropped.
  • Less popular today than Adam, but a good case study in scale-invariant optimization design.