Stochastic Gradient Descent

From Full Gradients to SGD

The training loss in deep learning is an average over n examples, where f_i is the loss on example i:

f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}).

A full gradient \nabla f costs \mathcal{O}(n) per step: with a million examples, every parameter update needs a million forward and backward passes. Untenable.

Stochastic gradient descent

Pick an example i uniformly at random and step with \nabla f_i: \mathcal{O}(1) per step, and an unbiased estimator of the full gradient (\mathbb{E}_i[\nabla f_i(\mathbf{x})] = \nabla f(\mathbf{x})):

\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}).

The price: noisy gradients. On average the step points the right way; any single step may point almost anywhere.
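
To make the unbiasedness claim concrete, here is a minimal sketch on a hypothetical one-dimensional least-squares problem. The data `a`, `b` and the helpers `grad_i`, `full_grad` are illustrative assumptions, not part of the setup used in the rest of this section. With i drawn uniformly, the sample mean of single-example gradients approaches the full gradient, while any single gradient can land far from it:

```python
import random

# Hypothetical toy problem: f_i(x) = 0.5 * (a_i * x - b_i)^2
random.seed(0)
n = 1000
a = [random.uniform(0.5, 1.5) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]

def grad_i(x, i):
    """Gradient of the i-th example loss at x: O(1) work."""
    return a[i] * (a[i] * x - b[i])

def full_grad(x):
    """Gradient of the average loss: O(n) work."""
    return sum(grad_i(x, i) for i in range(n)) / n

x = 2.0
g = full_grad(x)
# Monte Carlo estimate of E_i[grad f_i(x)] with i sampled uniformly
samples = [grad_i(x, random.randrange(n)) for _ in range(20000)]
mc_mean = sum(samples) / len(samples)
# Spread of a single stochastic gradient around the full gradient
spread = (sum((s - g) ** 2 for s in samples) / len(samples)) ** 0.5
print(f'full gradient {g:.3f}, mean of samples {mc_mean:.3f}, '
      f'std of a single sample {spread:.3f}')
```

The mean agrees with the full gradient up to Monte Carlo error, but the standard deviation of a single sample is of the same order as the gradient itself, which is the noise that the rest of this section deals with.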

Setup

%matplotlib inline
from d2l import torch as d2l
import math
import random
import torch

Simulating SGD

We don’t actually need a dataset. Take the same anisotropic f(x_1, x_2) = x_1^2 + 2x_2^2 from the GD section, add \mathcal{N}(0, 1) noise to each gradient component, and watch how the trajectory differs:

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    # s1, s2 are unused state slots expected by d2l.train_2d
    g1, g2 = f_grad(x1, x2)
    # Simulate a noisy gradient with N(0, 1) noise per component. Python's
    # random.gauss keeps the scalar noise on the CPU, avoiding the GPU sync
    # that calling .item() on a framework tensor would force at every step.
    g1 += random.gauss(0, 1)
    g2 += random.gauss(0, 1)
    # eta (base rate) and lr (schedule) are globals set before each run
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)

SGD trajectory

With a constant learning rate, SGD never settles — near the minimum the true gradient vanishes but the noise doesn’t, so the iterates random-walk around the optimum:

def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: -0.250367, x2: 0.000626

The noise ball

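Model one coordinate: x_{t+1} = x_t - \eta(\lambda x_t + \xi_t), where \xi_t is independent zero-mean noise with variance \sigma^2. In steady state \mathbb{E}[x^2] = (1 - \eta\lambda)^2\,\mathbb{E}[x^2] + \eta^2\sigma^2, so for small \eta\lambda contraction and noise injection balance at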

\mathbb{E}\big[x_\infty^2\big] \approx \frac{\eta\,\sigma^2}{2\lambda}.

Constant \eta → stall at a noise floor proportional to \eta.
Halving \eta halves the floor — and the speed of approach.
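
The noise-floor formula can be checked numerically. The sketch below simulates the one-coordinate model directly; the helper `noise_floor` and the values of `eta`, `lam`, and `sigma` are arbitrary choices for illustration. It compares the long-run average of x_t^2 with the approximation \eta\sigma^2/(2\lambda) and with the exact stationary value \eta\sigma^2/(2\lambda - \eta\lambda^2) of this linear recursion:

```python
import random

def noise_floor(eta, lam=1.0, sigma=1.0, steps=200_000, burn_in=1_000, seed=0):
    """Time-average of x_t^2 for the update x <- x - eta * (lam * x + xi)."""
    rng = random.Random(seed)
    x, total = 1.0, 0.0
    for t in range(steps + burn_in):
        x -= eta * (lam * x + rng.gauss(0, sigma))
        if t >= burn_in:  # Skip the initial transient
            total += x * x
    return total / steps

for eta in (0.1, 0.05):
    measured = noise_floor(eta)
    approx = eta / 2         # eta * sigma^2 / (2 * lam) with sigma = lam = 1
    exact = eta / (2 - eta)  # eta * sigma^2 / (2 * lam - eta * lam^2)
    print(f'eta={eta}: measured {measured:.4f}, '
          f'approx {approx:.4f}, exact {exact:.4f}')
```

Halving `eta` should roughly halve the measured floor, and the gap between the approximate and exact values shrinks as \eta\lambda gets smaller.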

Escape: make \eta time-dependent — large early to travel, decaying to quench the noise.

Decay schedules

\begin{aligned} \eta(t) & = \eta_i \textrm{ if } t_i \leq t \leq t_{i+1} && \textrm{piecewise constant} \\ \eta(t) & = \eta_0 \cdot e^{-\lambda t} && \textrm{exponential decay} \\ \eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \textrm{polynomial decay} \end{aligned}

In these schedules \lambda is a decay rate, unrelated to the curvature \lambda of the noise-ball model.

Robbins–Monro (1951): convergence needs \sum_t \eta(t) = \infty (can travel anywhere) and \sum_t \eta(t)^2 < \infty (noise quenched).
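
A quick numerical look at the two Robbins–Monro sums makes the conditions tangible. The sketch below uses illustrative constants (\eta_0 = 1, decay rate 0.1, \beta = 1; all assumptions for this demo) and prints partial sums up to a horizon T. A convergent series shows up as partial sums that stop growing when T increases. This is a finite-horizon illustration, not a proof:

```python
import math

def exponential(t, eta0=1.0, lam=0.1):
    return eta0 * math.exp(-lam * t)

def polynomial(t, eta0=1.0, beta=1.0, alpha=1.0):
    return eta0 * (beta * t + 1) ** (-alpha)

def partial_sums(schedule, T):
    """Return (sum of eta(t), sum of eta(t)^2) for t = 1..T."""
    s1 = sum(schedule(t) for t in range(1, T + 1))
    s2 = sum(schedule(t) ** 2 for t in range(1, T + 1))
    return s1, s2

for name, sched in [('exponential', exponential),
                    ('polynomial, alpha=1', polynomial),
                    ('polynomial, alpha=0.5',
                     lambda t: polynomial(t, alpha=0.5))]:
    for T in (1_000, 100_000):
        s1, s2 = partial_sums(sched, T)
        print(f'{name:22s} T={T:>7}: sum eta = {s1:9.3f}, '
              f'sum eta^2 = {s2:9.3f}')
```

With these constants the exponential schedule has both sums saturating, so the first condition fails. Polynomial decay with \alpha = 1 keeps \sum \eta growing while \sum \eta^2 levels off, satisfying both conditions. With \alpha = 0.5 the squared sum still grows slowly (logarithmically), so that choice sits on the boundary of the second condition.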

Exponential decay: too eager

\sum_t \eta(t) < \infty — a finite travel budget. The iterate stops short of the optimum, out of learning rate:

def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))

epoch 1000, x1: -0.769816, x2: -0.084568
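
The stall is not caused by the noise. A minimal noise-free sketch of the same update in plain Python shows the iterate freezing well short of the optimum. Two assumptions are made here: the start point (-5, -2) is assumed to match what `d2l.train_2d` uses, and the counter follows the convention of `exponential_lr` above (`t` starts at 1 and is incremented before use). The helper name `run_exponential_gd` is hypothetical:

```python
import math

def run_exponential_gd(steps, eta=0.1, x1=-5.0, x2=-2.0):
    """Noise-free descent on f = x1^2 + 2 * x2^2 with exponential decay."""
    t = 1
    for _ in range(steps):
        t += 1  # Same convention as exponential_lr: increment, then use
        eta_t = eta * math.exp(-0.1 * t)
        # Gradient of f is (2 * x1, 4 * x2)
        x1, x2 = x1 - eta_t * 2 * x1, x2 - eta_t * 4 * x2
    return x1, x2

for steps in (50, 200, 1000):
    x1, x2 = run_exponential_gd(steps)
    print(f'steps {steps:4d}: x1 = {x1:.6f}, x2 = {x2:.6f}')
```

After a couple of hundred steps the learning rate is numerically negligible, so the remaining steps up to 1000 change essentially nothing. The leftover distance in x1 is of the same order as in the noisy run reported above: the budget \sum_t \eta(t) ran out, with or without noise.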

Polynomial decay

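\eta(t) \propto t^{-1/2} decays slowly enough to keep making progress: after only 50 steps the iterate is far closer to the optimum than the exponential schedule was after 1000. (Strictly, \alpha = 1/2 sits on the boundary of the Robbins–Monro conditions, since \sum_t \eta(t)^2 then diverges logarithmically; it is nonetheless a standard choice in convex analysis.)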

def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

epoch 50, x1: -0.020206, x2: -0.012148
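
A single trajectory is itself a random draw, so one run proves little. The sketch below repeats the 50-step experiment over many seeds in plain Python and compares the mean squared distance to the optimum under the constant and the polynomial schedule. The start point (-5, -2), the number of runs, and the helper `run_sgd` are assumptions for illustration:

```python
import random

def run_sgd(schedule, steps=50, eta=0.1, seed=0):
    """Noisy descent on f = x1^2 + 2 * x2^2; returns squared distance to 0."""
    rng = random.Random(seed)
    x1, x2 = -5.0, -2.0  # Assumed start point
    for k in range(steps):
        t = k + 2  # Counter starts at 1 and is incremented before use
        eta_t = eta * schedule(t)
        g1 = 2 * x1 + rng.gauss(0, 1)
        g2 = 4 * x2 + rng.gauss(0, 1)
        x1, x2 = x1 - eta_t * g1, x2 - eta_t * g2
    return x1 ** 2 + x2 ** 2

def constant(t):
    return 1.0

def polynomial(t):
    return (1 + 0.1 * t) ** (-0.5)

runs = 2000
for name, sched in [('constant', constant), ('polynomial', polynomial)]:
    mean_d2 = sum(run_sgd(sched, seed=s) for s in range(runs)) / runs
    print(f'{name:10s}: mean squared distance after 50 steps = {mean_d2:.4f}')
```

The decaying schedule ends inside a smaller noise ball because its final learning rate is smaller, while its early steps are still large enough to cover the distance from the start point.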

Gradient variance vs batch size

In real training the noise comes from which examples land in the minibatch. Averaging b independent example gradients divides the variance by b. Measured on a real network (2-layer MLP, airfoil data), the 1/b scaling holds from batch size 1 to 512:

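data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
# Collect features and labels in a single pass: if the loader reshuffles on
# every pass, two separate passes would misalign X and y.
batches = list(data_iter)
X = torch.cat([Xb for Xb, _ in batches])
y = torch.cat([yb for _, yb in batches])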

torch.manual_seed(1)
W1, b1 = torch.randn(feature_dim, 64) * 0.1, torch.zeros(64)
W2, b2 = torch.randn(64, 1) * 0.1, torch.zeros(1)
params = [W1, b1, W2, b2]
for p in params:
    p.requires_grad_(True)

def batch_grad(idx):  # Flattened loss gradient on the minibatch X[idx]
    h = torch.relu(X[idx] @ W1 + b1)
    loss = ((h @ W2 + b2).squeeze(-1) - y[idx]).pow(2).mean() / 2
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss, params)])

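g_full = batch_grad(torch.arange(len(y)))  # Full-batch reference gradient
batch_sizes = [1, 8, 64, 512]
var = []
for b in batch_sizes:
    # 200 minibatches per size, sampled with replacement (independent draws)
    idx = torch.randint(0, len(y), (200, b))
    # Mean squared deviation from the full gradient, summed over parameters
    var.append(torch.stack([((batch_grad(i) - g_full) ** 2).sum()
                            for i in idx]).mean().item())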
d2l.plot(batch_sizes, [var, [var[0] / b for b in batch_sizes]],
         'batch size', 'gradient variance', xscale='log', yscale='log',
         legend=['measured', '1/b'])

Batch size is a dial with diminishing returns

Variance \propto 1/b → noise amplitude \propto 1/\sqrt{b}.
100\times more compute per step → only 10\times quieter gradients.
How to spend compute between \eta and b: next section. What happens at LM scale: the batch-size section later in the chapter.

Recap

SGD: \mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}) with random i. Unbiased; \mathcal{O}(1)/step instead of \mathcal{O}(n).
Constant \eta: noise ball with squared radius \approx \eta\sigma^2/(2\lambda).
Decay: \sum \eta_t = \infty, \sum \eta_t^2 < \infty (Robbins–Monro) → convergence.
Minibatch of size b: variance \propto 1/b — verified on a real network.
Proofs: convex rates and nonconvex Ghadimi–Lan, in the math appendix.