Adam

Dive into Deep Learning · §9.6

Per-coordinate learning rates and the modern default
AdaGrad → RMSProp → Adam · a tiny transformer testbed · where Adam wins

One learning rate is not enough

Motivation

Momentum fixed the direction; the step size per coordinate is still one global \eta.

Rare features (rare words, rare users) get meaningful gradients rarely; by the time they arrive, a decayed \eta is too small to use them.
The ill-conditioned valley: the ideal fix is a preconditioner, one step size per direction. Curvature is unaffordable; the gradient’s own history is not.
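The preconditioner idea can be sketched on the quadratic valley used later in this section, f(x1, x2) = 0.1 x1^2 + 2 x2^2. The helper names (`gd_step`, `precond_step`), the start point, and the step sizes are illustrative assumptions, not part of the original material. With one global \eta, stability is dictated by the steep direction; with one step size per direction (the inverse curvature), a single step reaches the minimum.

```python
# Curvatures (second derivatives) of f(x1, x2) = 0.1 * x1**2 + 2 * x2**2.
H = (0.2, 4.0)

def grad(x):
    return tuple(h * xi for h, xi in zip(H, x))

def gd_step(x, eta):
    """Plain gradient descent: one global step size for every coordinate."""
    return tuple(xi - eta * gi for xi, gi in zip(x, grad(x)))

def precond_step(x):
    """Ideal diagonal preconditioner: step size 1 / curvature per coordinate."""
    return tuple(xi - gi / h for xi, gi, h in zip(x, grad(x), H))

x0 = (-5.0, -2.0)  # assumed start point
x_gd = x0
for _ in range(20):
    x_gd = gd_step(x_gd, eta=0.4)  # eta must stay below 2 / 4 = 0.5
x_pre = precond_step(x0)
print('GD, 20 steps:          ', x_gd)
print('preconditioned, 1 step:', x_pre)
```

After 20 steps the global-\eta run is still far from zero in the flat x1 direction, while the preconditioned step lands on the minimum at once. For a real network the curvatures in `H` are exactly what is unaffordable.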

AdaGrad (Duchi, Hazan & Singer, 2011): shrink each coordinate's step size according to its own accumulated activity,

\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2,\qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t.
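A minimal sketch of what this rule does for rare features, assuming two scalar coordinates whose gradients are either 1 or 0. The function name `adagrad_effective_lr` and the 1% activity rate are hypothetical choices for illustration.

```python
import math

def adagrad_effective_lr(grads, eta=0.1, eps=1e-6):
    """Return eta / sqrt(s_t + eps) after accumulating the given gradients."""
    s = 0.0
    for g in grads:
        s += g ** 2
    return eta / math.sqrt(s + eps)

steps = 1000
frequent = [1.0] * steps                                     # active every step
rare = [1.0 if t % 100 == 0 else 0.0 for t in range(steps)]  # active 1% of steps

lr_frequent = adagrad_effective_lr(frequent)
lr_rare = adagrad_effective_lr(rare)
print(f'frequent: {lr_frequent:.4f}, rare: {lr_rare:.4f}, '
      f'ratio: {lr_rare / lr_frequent:.1f}')
```

The rare coordinate has accumulated 100 times less squared gradient, so its step size is about 10 times larger. A single globally decayed \eta would have shrunk both equally.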

AdaGrad on the valley

Same quadratic as the momentum section, \eta = 0.4 — smooth, no oscillation, but the accumulated \mathbf{s}_t grinds the steps down:

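import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

def adagrad_2d(x1, x2, s1, s2):
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2  # gradient of f_2d
    s1 += g1 ** 2  # per-coordinate running sum of squared gradients
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2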

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))

epoch 20, x1: -2.382563, x2: -0.158591

The scaling is adaptive, so \eta = 2 is safe here, although plain gradient descent on this quadratic diverges for any \eta > 0.5:

eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))

epoch 20, x1: -0.002295, x2: -0.000000
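The two runs above can be reproduced without the `d2l` helpers. The sketch below assumes that `d2l.train_2d` starts from (-5, -2) and takes 20 steps; `run_adagrad` is a hypothetical stand-in that also records the effective per-coordinate step sizes.

```python
import math

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def run_adagrad(eta, steps=20, start=(-5.0, -2.0), eps=1e-6):
    """Run AdaGrad on f_2d; return the final point and the step sizes used."""
    x1, x2 = start
    s1 = s2 = 0.0
    step_sizes = []
    for _ in range(steps):
        g1, g2 = 0.2 * x1, 4 * x2
        s1 += g1 ** 2
        s2 += g2 ** 2
        lr1, lr2 = eta / math.sqrt(s1 + eps), eta / math.sqrt(s2 + eps)
        x1 -= lr1 * g1
        x2 -= lr2 * g2
        step_sizes.append((lr1, lr2))
    return (x1, x2), step_sizes

for eta in (0.4, 2):
    (x1, x2), sizes = run_adagrad(eta)
    print(f'eta={eta}: x1={x1:.6f}, x2={x2:.6f}, '
          f'first lr1={sizes[0][0]:.3f}, last lr1={sizes[-1][0]:.3f}')
```

The recorded `lr1` values only ever decrease, which is the grinding-down effect: a larger \eta helps here because it offsets the shrinking denominator.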

RMSProp: forgetting on purpose

AdaGrad never forgets: \mathbf{s}_t grows forever, steps decay like t^{-1/2} by construction — right for convex problems, wrong for deep nets that need to move late in training.
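The t^{-1/2} decay can be checked with a one-coordinate sketch, assuming a constant gradient of magnitude 1 (an idealization chosen for illustration; real gradients vary).

```python
import math

eta, g, eps = 0.4, 1.0, 1e-6
s = 0.0
lrs = {}
for t in range(1, 10001):
    s += g ** 2                      # AdaGrad: the sum only ever grows
    lrs[t] = eta / math.sqrt(s + eps)

for t in (1, 100, 10000):
    print(f't={t:>5}: effective lr {lrs[t]:.4f}, '
          f'lr * sqrt(t) = {lrs[t] * math.sqrt(t):.4f}')
```

The product of the effective learning rate and sqrt(t) stays at \eta, so the step itself falls by a factor of 100 between t = 1 and t = 10,000 even though the gradient never changed.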

RMSProp (Hinton, 2012): same rule, leaky average instead of sum,

\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\, \mathbf{g}_t^2,

with memory \approx 1/(1-\beta_2) steps. The learning rate becomes an independent knob (→ schedules).
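A short sketch of where the memory estimate comes from, using \beta_2 = 0.9 and a constant unit gradient as illustrative assumptions. Unrolling the recursion shows the weight placed on each past squared gradient.

```python
beta2 = 0.9
horizon = round(1 / (1 - beta2))   # nominal memory: 10 steps

# Unrolled recursion: g_{t-k}^2 enters v_t with weight (1 - beta2) * beta2**k.
weights = [(1 - beta2) * beta2 ** k for k in range(200)]
recent = sum(weights[:horizon])    # weight on the most recent `horizon` gradients
print(f'total weight {sum(weights):.4f}, '
      f'share of last {horizon} steps {recent:.4f}')

# Under a constant gradient the average settles instead of growing.
v, g = 0.0, 1.0
for _ in range(200):
    v = beta2 * v + (1 - beta2) * g ** 2
print(f'v after 200 steps: {v:.4f}')
```

Roughly two thirds of the weight sits on the last 1/(1-\beta_2) gradients, and `v` converges to g^2 rather than growing with t, so the step size no longer decays on its own.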

def rmsprop_2d(x1, x2, v1, v2):
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    v1 = beta2 * v1 + (1 - beta2) * g1 ** 2
    v2 = beta2 * v2 + (1 - beta2) * g2 ** 2
    x1 -= eta / math.sqrt(v1 + eps) * g1
    x2 -= eta / math.sqrt(v2 + eps) * g2
    return x1, x2, v1, v2

eta, beta2 = 0.4, 0.9
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))

epoch 20, x1: -0.010599, x2: 0.000000

Adam = both moments, debiased

\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\,\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\,\mathbf{g}_t^2,

\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t},\quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t},\qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\, \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}.

Defaults \beta_1 = 0.9, \beta_2 = 0.999: direction averaged over ~10 steps, scale over ~1000.
Bias correction is exact when the second moment \mathbb{E}[\mathbf{g}_t^2] is stationary: \mathbb{E}[\mathbf{v}_t] then carries the fraction 1-\beta_2^t of the true scale, and dividing by that fraction cancels the bias at every t. After 10 steps, \mathbf{v}_t holds only ~1% of its stationary value, so uncorrected Adam takes its biggest steps on its worst estimates.
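The size of the startup bias can be worked through numerically. This sketch assumes a constant scalar gradient g = 1 and omits \epsilon, both simplifications for illustration.

```python
import math

beta1, beta2, g = 0.9, 0.999, 1.0   # constant gradient (assumed)
m = v = 0.0
ratios = {}
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step length in units of eta, without and with bias correction.
    ratios[t] = (m / math.sqrt(v), m_hat / math.sqrt(v_hat))

print(f'v_10 / g^2 = {v:.5f}  (1 - beta2**10 = {1 - beta2 ** 10:.5f})')
for t in (1, 10):
    raw, corrected = ratios[t]
    print(f't={t:>2}: uncorrected step {raw:.2f} * eta, '
          f'corrected step {corrected:.2f} * eta')
```

With correction the step is \eta at every t. Without it, the under-estimated \mathbf{v}_t inflates the step to several times \eta during the first iterations.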

From scratch

Two buffers per parameter plus a step counter:

def init_adam_states(feature_dim):
    m_w, m_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    v_w, v_b = d2l.zeros((feature_dim, 1)), d2l.zeros(1)
    return ((m_w, v_w), (m_b, v_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (m, v) in zip(params, states):
        with torch.no_grad():
            # Leaky averages of the gradient and its square, updated in place
            m[:] = beta1 * m + (1 - beta1) * p.grad
            v[:] = beta2 * v + (1 - beta2) * torch.square(p.grad)
            # Undo the zero-initialization bias; t counts steps from 1
            m_hat = m / (1 - beta1 ** hyperparams['t'])
            v_hat = v / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * m_hat / (torch.sqrt(v_hat) + eps)
        p.grad.zero_()
    hyperparams['t'] += 1  # one shared step counter for all parameters

d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.243, 0.100 sec/epoch
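For comparison with the AdaGrad and RMSProp traces, the same update can be run on the 2D valley in plain Python. This is a sketch under assumptions: `adam_2d` is a hypothetical helper, and the start point (-5, -2), \eta = 0.4, and 20 steps are chosen to mirror the earlier runs.

```python
import math

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def adam_2d(x, m, v, t, eta=0.4, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam step on f_2d; x, m, v are 2-tuples, t is the 1-based step."""
    g = (0.2 * x[0], 4 * x[1])
    m = tuple(beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g))
    v = tuple(beta2 * vi + (1 - beta2) * gi ** 2 for vi, gi in zip(v, g))
    m_hat = tuple(mi / (1 - beta1 ** t) for mi in m)
    v_hat = tuple(vi / (1 - beta2 ** t) for vi in v)
    x = tuple(xi - eta * mh / (math.sqrt(vh) + eps)
              for xi, mh, vh in zip(x, m_hat, v_hat))
    return x, m, v

x, m, v = (-5.0, -2.0), (0.0, 0.0), (0.0, 0.0)
first_step = None
for t in range(1, 21):
    x_new, m, v = adam_2d(x, m, v, t)
    if t == 1:
        first_step = tuple(abs(a - b) for a, b in zip(x_new, x))
    x = x_new
print('first step per coordinate:', first_step)
print('f after 20 steps:', f_2d(*x))
```

The first step has length close to \eta in both coordinates, although the two gradients differ by a factor of 8 at the start point: the scaling removes the gradient magnitude and keeps the direction.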

A tiny language model

The differences that matter show on language models. TinyLM: a decoder-only transformer (subject of ch. 10 — here, a black box), on the character-level Time Machine from ch. 8.

class TinyLM(nn.Module):
    """A small decoder-only transformer language model."""
    def __init__(self, vocab_size, d_model=128, num_heads=2, num_blks=2,
                 max_len=64):
        super().__init__()
        self.num_heads = num_heads
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blks = nn.ModuleList([nn.ModuleDict(dict(
            norm1=nn.LayerNorm(d_model),
            qkv=nn.Linear(d_model, 3 * d_model),
            proj=nn.Linear(d_model, d_model),
            norm2=nn.LayerNorm(d_model),
            mlp=nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))))
            for _ in range(num_blks)])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def attention(self, blk, X):
        B, T, D = X.shape
        q, k, v = blk['qkv'](X).chunk(3, dim=-1)
        q, k, v = (u.reshape(B, T, self.num_heads, -1).transpose(1, 2)
                   for u in (q, k, v))
        Y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return blk['proj'](Y.transpose(1, 2).reshape(B, T, D))

    def forward(self, X):
        H = self.token_emb(X) + self.pos_emb(torch.arange(X.shape[1],
                                                          device=X.device))
        for blk in self.blks:
            H = H + self.attention(blk, blk['norm1'](H))
            H = H + blk['mlp'](blk['norm2'](H))
        return self.head(self.norm(H))

A differentiable function with a particular census of parameters — the census, not the mechanism, is what optimization sees.

The parameter census

Three populations — later sections treat them differently (decay exclusions, matrix vs. non-matrix preconditioning):

parameter              shape            count
token_emb.weight       (28, 128)         3584
pos_emb.weight         (64, 128)         8192
blks.0.norm1.weight    (128,)             128
blks.0.norm1.bias      (128,)             128
blks.0.qkv.weight      (384, 128)       49152
...
head.weight            (28, 128)         3584
head.bias              (28,)               28
                           embeddings   11776
                             matrices  396800
                              vectors    3612
                                total  412188

embeddings: rows update only when their token occurs
matrices: ~96% of all parameters
vectors: LayerNorm scales and shifts, plus all Linear biases
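The three totals follow from the layer shapes alone. The sketch below recomputes them without instantiating the model; `tinylm_census` is a hypothetical helper, and `vocab_size=28` is read off the table.

```python
def tinylm_census(vocab_size=28, d_model=128, num_blks=2, max_len=64):
    """Count TinyLM parameters by population from the layer shapes alone."""
    d, h = d_model, 4 * d_model
    embeddings = vocab_size * d + max_len * d            # token_emb + pos_emb
    block_matrices = 3 * d * d + d * d + d * h + h * d   # qkv, proj, two MLP weights
    block_vectors = (2 * d + 2 * d    # norm1 and norm2: weight and bias each
                     + 3 * d + d      # qkv and proj biases
                     + h + d)         # MLP biases
    matrices = num_blks * block_matrices + vocab_size * d     # plus head.weight
    vectors = num_blks * block_vectors + 2 * d + vocab_size   # plus final norm, head.bias
    return {'embeddings': embeddings, 'matrices': matrices,
            'vectors': vectors,
            'total': embeddings + matrices + vectors}

census = tinylm_census()
for name, count in census.items():
    print(f'{name:>10}  {count:>7}')
print(f"matrix share: {census['matrices'] / census['total']:.1%}")
```

On an instantiated model the same numbers should come out of summing `p.numel()` over `named_parameters()`, grouped by name and dimensionality.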

The race: tuned SGD vs. tuned Adam

Symmetric protocol: same model, same init, same 2,000 steps, constant learning rate, four-point grid each, best final training loss wins.
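The shape of such a protocol can be sketched on a toy problem. Everything here is an assumption for illustration: the 2D quadratic stands in for the model, the grids are hypothetical, and `race`, `run_sgd`, and `run_adam` are not the harness behind the reported numbers.

```python
import math

def loss(x):
    return 0.1 * x[0] * x[0] + 2 * x[1] * x[1]

def grad(x):
    return (0.2 * x[0], 4 * x[1])

def run_sgd(lr, x, steps):
    for _ in range(steps):
        x = tuple(xi - lr * gi for xi, gi in zip(x, grad(x)))
    return loss(x)

def run_adam(lr, x, steps, beta1=0.9, beta2=0.999, eps=1e-6):
    m, v = (0.0, 0.0), (0.0, 0.0)
    for t in range(1, steps + 1):
        g = grad(x)
        m = tuple(beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g))
        v = tuple(beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g))
        x = tuple(xi - lr * (mi / (1 - beta1 ** t))
                  / (math.sqrt(vi / (1 - beta2 ** t)) + eps)
                  for xi, mi, vi in zip(x, m, v))
    return loss(x)

def race(runners, grids, init, steps):
    """Same init and step budget for all; keep each optimizer's best constant lr."""
    results = {}
    for name, run in runners.items():
        finals = {}
        for lr in grids[name]:
            final = run(lr, init, steps)
            # A diverged run (inf or nan) can never win.
            finals[lr] = final if math.isfinite(final) else math.inf
        best_lr = min(finals, key=finals.get)
        results[name] = (best_lr, finals[best_lr], finals)
    return results

# Hypothetical four-point grids, one per optimizer.
grids = {'sgd': [0.03, 0.1, 0.3, 1.0], 'adam': [0.01, 0.03, 0.1, 0.3]}
results = race({'sgd': run_sgd, 'adam': run_adam}, grids,
               init=(-5.0, -2.0), steps=100)
for name, (best_lr, best_loss, _) in results.items():
    print(f'{name}: best lr {best_lr}, final loss {best_loss:.3e}')
```

The symmetry lies in `race`: every optimizer gets the same start point, the same step budget, and the same number of grid points, and is judged only by its own best run. In this toy grid the rate 1.0 exceeds the stability limit 0.5 of plain gradient descent, so SGD's best rate is the grid point just below it.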

On the language model:

final perplexity: SGD 2.92, Adam 2.49

SGD’s best lr sits one grid point below divergence; Adam’s optimum is interior.
The gap opens early and no rate in our grid closes it.

Same race, same harness, on a CNN

test accuracy: SGD 0.904, Adam 0.919

Curves nearly coincide; test accuracy within a point or two, either way.
This is why SGD carried computer vision for a decade — and why “which optimizer wins” depends on the model, not just the tuning.

Why: heterogeneity

Live research, but the threads agree (Kunstner et al. 2023, 2024; Zhang et al. 2024):

Not minibatch noise: the gap persists full-batch; Adam tracks sign descent (\sqrt{\hat{\mathbf{v}}} \approx |\hat{\mathbf{m}}| → step \approx \eta\,\mathrm{sign}(\hat{\mathbf{m}})).
Language is heavy-tailed: GD stalls on rare tokens; Adam keeps moving on all of them.
Transformer blocks have wildly different curvature; CNN blocks look alike. One global \eta must serve the stiffest block and starve the rest.
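The sign-descent reading can be checked on Adam's first update, where \hat{\mathbf{m}} = \mathbf{g} and \hat{\mathbf{v}} = \mathbf{g}^2 hold exactly. The sketch assumes scalar gradients, \eta = 0.01, and \epsilon = 1e-8; `adam_first_step` is a hypothetical helper.

```python
import math

def adam_first_step(g, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam's first update for a scalar gradient g, starting from m = v = 0."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return eta * m_hat / (math.sqrt(v_hat) + eps)

steps = {g: adam_first_step(g) for g in (1e-3, 1.0, 1e3, -1e3)}
for g, step in steps.items():
    print(f'g = {g:>8}: Adam step {step:+.5f}, SGD step {0.01 * g:+.5f}')
```

Across six orders of magnitude in the gradient, the Adam update keeps the same size \eta and only the sign changes, while the SGD update scales with the gradient. That is the sense in which a coordinate with small gradients keeps moving.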

Recap

Adam = per-coordinate scaling (AdaGrad) + forgetting (RMSProp) + momentum + exact bias correction.
Cost: two state buffers per parameter — 3× parameter memory.
Wins big on transformers (heterogeneity), little on CNNs — at matched tuning.
\epsilon caps the per-coordinate multiplier on \hat{\mathbf{m}}_t at \eta/\epsilon, so it is a step-size ceiling and not just a numerical guard; AMSGrad/Yogi patch the second-moment estimate (exercises).
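A quick numerical look at the ceiling, assuming \eta = 1e-3 and \epsilon = 1e-8 as illustrative values; `effective_lr` is a hypothetical helper for the factor that multiplies \hat{\mathbf{m}}_t in the update.

```python
import math

eta, eps = 1e-3, 1e-8   # illustrative values (assumed)

def effective_lr(v_hat):
    """Per-coordinate multiplier applied to m_hat in the Adam update."""
    return eta / (math.sqrt(v_hat) + eps)

ceiling = eta / eps
for v_hat in (1.0, 1e-8, 1e-16, 1e-24, 0.0):
    print(f'v_hat = {v_hat:.0e}: effective lr {effective_lr(v_hat):.3e}')
print(f'ceiling eta / eps = {ceiling:.3e}')
```

As \hat{\mathbf{v}}_t shrinks, the multiplier grows but never passes \eta/\epsilon. A larger \epsilon therefore lowers the ceiling and limits how aggressively near-silent coordinates are amplified.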