Stochastic Gradient Descent

From Full Gradients to SGD

The deep-learning training loss is an average over n examples, with f_i the loss on example i:

f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}).

A full gradient \nabla f costs \mathcal{O}(n) per step. A million-example dataset → a million forward and backward passes per parameter update. Untenable.

Stochastic gradient descent

Pick an example i uniformly at random and step with \nabla f_i: \mathcal{O}(1) per step, and an unbiased estimator of the full gradient (\mathbb{E}_i \nabla f_i = \nabla f):

\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}).

The price: noisy gradients. On average the step points the right way; any single step may point almost anywhere.
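A quick numerical check of the unbiasedness claim, as a minimal sketch on a made-up least-squares problem. The data `A`, `c`, the point `x`, and the helper `f_i` are hypothetical and not part of this section's setup. Averaging the n per-example gradients should reproduce the full gradient, while individual gradients scatter widely around it.

```python
import jax
from jax import numpy as jnp

# Hypothetical toy problem: f_i(x) = 0.5 * (a_i . x - c_i)^2 with n = 1000
n, d = 1000, 3
k_a, k_c, k_x = jax.random.split(jax.random.PRNGKey(0), 3)
A = jax.random.normal(k_a, (n, d))
c = jax.random.normal(k_c, (n,))
x = jax.random.normal(k_x, (d,))

def f_i(x, a_i, c_i):  # Loss on a single example
    return 0.5 * (jnp.dot(a_i, x) - c_i) ** 2

def f(x):  # Full objective: the average of the per-example losses
    return jnp.mean(jax.vmap(f_i, in_axes=(None, 0, 0))(x, A, c))

# One gradient per example, shape (n, d)
per_example = jax.vmap(jax.grad(f_i), in_axes=(None, 0, 0))(x, A, c)
full = jax.grad(f)(x)

# Under a uniform draw of i, the expectation is the plain average over i
bias = jnp.abs(per_example.mean(axis=0) - full).max()
# Typical distance of a single-example gradient from the full gradient
spread = jnp.linalg.norm(per_example - full, axis=1).mean()
print(f'max |E_i grad f_i - grad f| = {float(bias):.2e}')
print(f'mean ||grad f_i - grad f||  = {float(spread):.2f}')
```

The first number should sit at floating-point round-off, the second should not: the estimator is right on average and noisy per draw.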

Setup

%matplotlib inline
from d2l import jax as d2l
import jax
from jax import numpy as jnp
import math

Simulating SGD

We don’t actually need a dataset. Take the same anisotropic f(x_1, x_2) = x_1^2 + 2x_2^2 from the GD section, add \mathcal{N}(0, 1) noise to each gradient component, and watch how the trajectory differs:

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    global key
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient: split off a fresh subkey per step, the JAX
    # idiom for drawing a stream of random numbers
    key, subkey = jax.random.split(key)
    n1, n2 = jax.random.normal(subkey, (2,))
    eta_t = eta * lr()
    return (x1 - eta_t * (g1 + n1), x2 - eta_t * (g2 + n2), 0, 0)

key = jax.random.PRNGKey(42)
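The `global key` in `sgd` keeps the call signature that `d2l.train_2d` uses here. Outside that harness, a common JAX pattern is to split all subkeys up front and thread the iterate through `jax.lax.scan`. The following is only a sketch of that alternative: `noisy_sgd_path` and its arguments are hypothetical names, not part of `d2l`, and the start point (-5, -2) is an arbitrary choice.

```python
import jax
from jax import numpy as jnp

def noisy_sgd_path(key, x0, eta, steps):
    """Run noisy SGD on f(x1, x2) = x1^2 + 2 * x2^2 and return all iterates."""
    def step(x, subkey):
        grad = jnp.array([2.0, 4.0]) * x         # Exact gradient of f
        noise = jax.random.normal(subkey, (2,))  # N(0, 1) per component
        x_new = x - eta * (grad + noise)
        return x_new, x_new                      # (carry, recorded output)
    subkeys = jax.random.split(key, steps)       # One independent subkey per step
    _, path = jax.lax.scan(step, x0, subkeys)
    return path

path = noisy_sgd_path(jax.random.PRNGKey(42), jnp.array([-5.0, -2.0]),
                      eta=0.1, steps=50)
print(path.shape, path[-1])
```

Because the key is an explicit argument, the same key always reproduces the same trajectory, and no state leaks between runs.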

SGD trajectory

With a constant learning rate, SGD never settles — near the minimum the true gradient vanishes but the noise doesn’t, so the iterates random-walk around the optimum:

def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: -0.141533, x2: -0.069805

The noise ball

Model one coordinate with curvature \lambda: x_{t+1} = x_t - \eta(\lambda x_t + \xi_t), where \xi_t is zero-mean gradient noise with variance \sigma^2. Contraction and noise injection balance at

\mathbb{E}\big[x_\infty^2\big] \approx \frac{\eta\,\sigma^2}{2\lambda}.
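
Where the balance comes from, as a short derivation sketch. It assumes \xi_t has mean zero and is independent of x_t, and that 0 < \eta\lambda < 2 so the recursion is stable. Squaring the update and taking expectations removes the cross term:

\mathbb{E}\big[x_{t+1}^2\big] = (1 - \eta\lambda)^2\, \mathbb{E}\big[x_t^2\big] + \eta^2 \sigma^2.

At stationarity \mathbb{E}[x_{t+1}^2] = \mathbb{E}[x_t^2] = \mathbb{E}[x_\infty^2], so

\mathbb{E}\big[x_\infty^2\big] = \frac{\eta^2 \sigma^2}{1 - (1 - \eta\lambda)^2} = \frac{\eta\,\sigma^2}{\lambda\,(2 - \eta\lambda)} \approx \frac{\eta\,\sigma^2}{2\lambda} \quad \textrm{for } \eta\lambda \ll 1.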

Constant \eta → stall at a noise floor proportional to \eta.
Halving \eta halves the floor — and the speed of approach.
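
The noise floor can be checked by simulating the one-coordinate model directly. This is a minimal sketch: `second_moment`, the number of chains, and the step count are illustrative choices. All chains start at the optimum, so whatever spread remains is pure noise.

```python
import jax
from jax import numpy as jnp

def second_moment(key, eta, lam=2.0, sigma=1.0, steps=400, chains=20000):
    """Estimate E[x_t^2] after `steps` updates, averaged over `chains` runs."""
    def step(x, subkey):
        xi = sigma * jax.random.normal(subkey, (chains,))  # Gradient noise
        return x - eta * (lam * x + xi), None
    x, _ = jax.lax.scan(step, jnp.zeros(chains), jax.random.split(key, steps))
    return float(jnp.mean(x ** 2))

lam, sigma = 2.0, 1.0  # lam = 2 matches the x_1 direction of f
measured = {}
for eta in (0.1, 0.05, 0.025):
    measured[eta] = second_moment(jax.random.PRNGKey(0), eta, lam, sigma)
    predicted = eta * sigma ** 2 / (2 * lam)
    print(f'eta={eta}: measured {measured[eta]:.5f}, '
          f'eta*sigma^2/(2*lam) = {predicted:.5f}')
```

Each halving of \eta should roughly halve the measured value. The measurements should sit slightly above the approximation, since the exact stationary value of this recursion is \eta\sigma^2 / (\lambda(2 - \eta\lambda)).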

Escape: make \eta time-dependent — large early to travel, decaying to quench the noise.

Decay schedules

\begin{aligned} \eta(t) & = \eta_i \textrm{ if } t_i \leq t \leq t_{i+1} && \textrm{piecewise constant} \\ \eta(t) & = \eta_0 \cdot e^{-\lambda t} && \textrm{exponential decay} \\ \eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \textrm{polynomial decay} \end{aligned}

Robbins–Monro (1951): convergence needs \sum_t \eta(t) = \infty (can travel anywhere) and \sum_t \eta(t)^2 < \infty (noise quenched).
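
The two conditions can be made concrete with partial sums. The sketch below is illustrative only: the function names and the parameter values (\eta_0 = 1, \lambda = 0.1, \beta = 0.1, \alpha = 1) are assumptions chosen for the demonstration. The exponential sum is bounded by 1/(e^{0.1} - 1) \approx 9.51 however long we run, whereas the polynomial sum with \alpha = 1 keeps growing while its sum of squares levels off.

```python
import math

def exponential(t, eta0=1.0, lam=0.1):
    return eta0 * math.exp(-lam * t)

def polynomial(t, eta0=1.0, beta=0.1, alpha=1.0):
    return eta0 * (beta * t + 1) ** (-alpha)

def partial_sums(schedule, T):
    """Return (sum of eta(t), sum of eta(t)^2) for t = 1..T."""
    etas = [schedule(t) for t in range(1, T + 1)]
    return sum(etas), sum(e * e for e in etas)

sums = {}
for T in (10**2, 10**3, 10**5):
    sums[T] = (partial_sums(exponential, T), partial_sums(polynomial, T))
    (s_exp, _), (s_pol, q_pol) = sums[T]
    print(f'T={T:>6}: exp sum={s_exp:.3f}  poly sum={s_pol:.2f}  '
          f'poly sum of squares={q_pol:.3f}')
```

The first column stops changing after a few hundred steps (the travel budget is spent), the second grows like \log T, and the third converges.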

Exponential decay: too eager

\sum_t \eta(t) < \infty — a finite travel budget. The iterate stops short of the optimum, out of learning rate:

def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))

epoch 1000, x1: -0.810909, x2: -0.028332

Polynomial decay

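\eta(t) \propto t^{-1/2} keeps exploring long enough: after only 50 steps the iterate is far closer to the optimum than the exponential schedule managed in 1000. Caveat: with \alpha = 1/2, \sum_t \eta(t)^2 still diverges (logarithmically), so this choice sits on the edge of the Robbins–Monro conditions; any \alpha \in (1/2, 1] satisfies both.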

def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: 0.037837, x2: 0.002780

Gradient variance vs batch size

In real training the noise comes from which examples land in the minibatch. Averaging b independent example gradients divides the variance by b. Measured on a real network (2-layer MLP, airfoil data), it holds from batch size 1 to 512:

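data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
# Materialize the batches once so X and y come from the same pass (a second
# pass over a shuffling or one-shot iterator would not line up with the first)
batches = [(jnp.asarray(Xb), jnp.asarray(yb)) for Xb, yb in data_iter]
X = jnp.concatenate([Xb for Xb, _ in batches])
y = jnp.concatenate([yb for _, yb in batches])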

k1, k2 = jax.random.split(jax.random.PRNGKey(1))
params = dict(W1=0.1 * jax.random.normal(k1, (feature_dim, 64)),
              b1=jnp.zeros(64),
              W2=0.1 * jax.random.normal(k2, (64, 1)), b2=jnp.zeros(1))

def batch_loss(params, idx):
    h = jax.nn.relu(X[idx] @ params['W1'] + params['b1'])
    return jnp.mean(((h @ params['W2'] + params['b2']).squeeze()
                     - y[idx]) ** 2) / 2

def batch_grad(idx):  # Flattened loss gradient on the minibatch X[idx]
    grads = jax.grad(batch_loss)(params, idx)
    return jnp.concatenate([g.reshape(-1) for g in jax.tree.leaves(grads)])

g_full = batch_grad(jnp.arange(len(y)))
batch_sizes = [1, 8, 64, 512]
var, key = [], jax.random.PRNGKey(0)
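for b in batch_sizes:
    key, subkey = jax.random.split(key)
    # 200 minibatches per size; indices are drawn with replacement, so the b
    # examples in a minibatch are independent
    idx = jax.random.randint(subkey, (200, b), 0, len(y))
    # Squared distance to the full gradient, summed over all parameters
    err = jax.vmap(lambda i: ((batch_grad(i) - g_full) ** 2).sum())(idx)
    var.append(float(err.mean()))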
d2l.plot(batch_sizes, [var, [var[0] / b for b in batch_sizes]],
         'batch size', 'gradient variance', xscale='log', yscale='log',
         legend=['measured', '1/b'])

Batch size is a dial with diminishing returns

Variance \propto 1/b → noise amplitude \propto 1/\sqrt{b}.
100\times more compute per step → only 10\times quieter gradients.
How to spend compute between \eta and b: next section. What happens at LM scale: the batch-size section later in the chapter.
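
The square-root law is easy to reproduce in isolation. In this minimal sketch, i.i.d. \mathcal{N}(0, 1) draws stand in for per-example gradients, which is an idealization rather than a model of the network above. The trial count and the names are illustrative.

```python
import jax
from jax import numpy as jnp

key = jax.random.PRNGKey(0)
trials = 10000
noise_std = {}
for b in (1, 100):
    key, subkey = jax.random.split(key)
    # Each row is one minibatch of b i.i.d. N(0, 1) stand-in gradients
    samples = jax.random.normal(subkey, (trials, b))
    noise_std[b] = float(samples.mean(axis=1).std())
    print(f'b={b:>3}: std of minibatch mean = {noise_std[b]:.3f}')
ratio = noise_std[1] / noise_std[100]
print(f'ratio = {ratio:.1f}')
```

The ratio should come out close to 10: a hundredfold larger batch buys one order of magnitude in noise amplitude.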

Recap

SGD: \mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}) with random i. Unbiased; \mathcal{O}(1)/step instead of \mathcal{O}(n).
Constant \eta: noise ball with squared radius \approx \eta\sigma^2/(2\lambda).
Decay: \sum \eta_t = \infty, \sum \eta_t^2 < \infty (Robbins–Monro) → convergence.
Minibatch of size b: variance \propto 1/b — verified on a real network.
Proofs: convex rates and nonconvex Ghadimi–Lan, in the math appendix.