Training the airfoil model

Adadelta

Adadelta (Zeiler, 2012) takes the idea behind RMSProp a step further: it adapts the step magnitude per parameter and, in its original form, removes the global learning rate entirely.

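It keeps two exponential moving averages (EMAs): one over squared gradients and one over squared updates. The ratio of their square roots has the units of an update divided by a gradient, so multiplying it by the gradient gives a step in the same units as the parameters. In principle, no separate \eta is needed.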

The update rule

\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1-\rho) \mathbf{g}_t^2,

\mathbf{u}_t = \rho \mathbf{u}_{t-1} + (1-\rho)(\Delta\mathbf{x}_t)^2,

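where \mathbf{s}_t tracks squared gradients and \mathbf{u}_t tracks squared parameter updates (all squares are elementwise). The step itself uses the update scale from the previous iteration, \mathbf{u}_{t-1}, because \Delta\mathbf{x}_t is not yet known when the step is computed: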

\Delta\mathbf{x}_t = -\frac{\sqrt{\mathbf{u}_{t-1}+\epsilon}} {\sqrt{\mathbf{s}_t+\epsilon}} \odot \mathbf{g}_t,

\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} + \Delta\mathbf{x}_t.
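
As a quick sanity check of the rule above, consider the very first step for a single coordinate, assuming zero-initialized state s_0 = u_0 = 0 (an assumption here, though it matches the zero buffers used in the implementation later in this section):

s_1 = (1-\rho)\, g_1^2,

\Delta x_1 = -\frac{\sqrt{\epsilon}}{\sqrt{(1-\rho)\, g_1^2 + \epsilon}}\; g_1 \approx -\operatorname{sign}(g_1)\,\sqrt{\frac{\epsilon}{1-\rho}} \quad \text{when } g_1^2 \gg \epsilon.

With \rho = 0.9 and \epsilon = 10^{-5} this gives |\Delta x_1| \approx 0.01 regardless of the gradient's magnitude, so \epsilon effectively sets the size of the initial step.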

In practice, frameworks still expose a learning-rate hyperparameter that scales the step, which is useful for fine-tuning.
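
To make the role of that extra scale concrete, here is a minimal scalar sketch in plain Python (no JAX, to keep it dependency-free). The function name `adadelta_step` and the `lr` argument are hypothetical. The sketch assumes the learning rate multiplies only the final step while \mathbf{u}_t accumulates the unscaled update, which is how PyTorch and optax appear to apply it. With `lr=1.0` it reduces to the update rule above.

```python
import math

def adadelta_step(x, g, s, u, rho=0.9, eps=1e-5, lr=1.0):
    """One scalar Adadelta step (illustrative sketch); returns the new (x, s, u)."""
    s = rho * s + (1 - rho) * g * g                       # EMA of squared gradients
    delta = math.sqrt(u + eps) / math.sqrt(s + eps) * g   # unscaled step, uses u_{t-1}
    u = rho * u + (1 - rho) * delta * delta               # EMA of squared unscaled steps
    return x - lr * delta, s, u                           # lr only scales the applied step

# Same starting point and gradient, two learning rates (assumed values)
x1, s1, u1 = adadelta_step(1.0, 2.0, 0.0, 0.0)            # lr = 1.0: the rule as written
x2, s2, u2 = adadelta_step(1.0, 2.0, 0.0, 0.0, lr=0.5)    # half-size step, same state
print(1.0 - x1, 1.0 - x2)
```

Under this assumption the optimizer state is identical for both calls; only the applied step shrinks with `lr`.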

From-scratch implementation

Each parameter needs two state buffers: s for the squared-gradient average \mathbf{s}_t and delta for the squared-update average \mathbf{u}_t:

%matplotlib inline
from d2l import jax as d2l
import jax
from jax import numpy as jnp
import numpy as np

def init_adadelta_states(feature_dim):
    s_w, s_b = jnp.zeros((feature_dim, 1)), jnp.zeros(1)
    delta_w, delta_b = jnp.zeros((feature_dim, 1)), jnp.zeros(1)
    return [(s_w, delta_w), (s_b, delta_b)]

def adadelta(params, grads, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for i, (p, (s, delta), grad) in enumerate(zip(params, states, grads)):
        # EMA of squared gradients (s_t)
        s = rho * s + (1 - rho) * jnp.square(grad)
        # Rescaled gradient; `delta` still holds u_{t-1} at this point
        g = (jnp.sqrt(delta + eps) / jnp.sqrt(s + eps)) * grad
        params[i] = p - g
        # EMA of squared updates (u_t), stored for the next step
        states[i] = (s, rho * delta + (1 - rho) * g * g)
    return params[0], params[1]

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim);

loss: 0.243, 1.190 sec/epoch

The loss curve should be read as a scale-adaptation demo: Adadelta trains without the hand-picked global learning rate that earlier optimizers needed.
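
One way to probe the scale-adaptation claim without the airfoil data is a toy experiment: minimize f(x) = scale * x^2 for two very different values of `scale` and compare where the iterate ends up. The sketch below uses plain Python floats rather than `jnp` arrays. The helper names `run_adadelta` and `run_sgd`, the 20-step budget, and the fixed SGD learning rate of 0.01 are illustrative assumptions, not part of the original experiment.

```python
import math

def run_adadelta(scale, steps=20, rho=0.9, eps=1e-5):
    """Minimize f(x) = scale * x**2 from x = 1 with scalar Adadelta."""
    x, s, u = 1.0, 0.0, 0.0
    for _ in range(steps):
        g = 2.0 * scale * x                                   # gradient of scale * x^2
        s = rho * s + (1 - rho) * g * g                       # EMA of squared gradients
        delta = math.sqrt(u + eps) / math.sqrt(s + eps) * g   # uses previous u
        u = rho * u + (1 - rho) * delta * delta               # EMA of squared updates
        x -= delta
    return x

def run_sgd(scale, steps=20, lr=0.01):
    """Same problem with plain gradient descent and a fixed learning rate."""
    x = 1.0
    for _ in range(steps):
        x -= lr * 2.0 * scale * x
    return x

for scale in (1.0, 1000.0):
    print(scale, run_adadelta(scale), run_sgd(scale))
```

In this sketch the two Adadelta runs should end at nearly the same point, because rescaling the loss rescales \mathbf{g}_t and \sqrt{\mathbf{s}_t} by the same factor (up to \epsilon). Gradient descent with the same fixed learning rate converges for `scale=1.0` and diverges for `scale=1000.0`.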

Concise: framework Adadelta

import optax
# optax.adadelta uses rho=0.9 by default; learning_rate is an extra scale on the update
trainer = optax.adadelta
d2l.train_concise_ch11(trainer, {'learning_rate': 0.9}, data_iter)

loss: 0.310, 0.719 sec/epoch

Recap

  • Two EMAs: squared gradients \mathbf{s}_t and squared updates \mathbf{u}_t.
  • Per-parameter step size is the ratio \sqrt{\mathbf{u}_{t-1}}/\sqrt{\mathbf{s}_t}, which keeps the update in the same units as the parameters and, in principle, removes the need for an explicit learning rate.
  • Less popular today than Adam, but a good case study in scale-invariant optimization design.