Simulating SGD

Stochastic Gradient Descent

From Full Gradients to SGD

The deep-learning loss is an average:

f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}).

A full gradient \nabla f costs \mathcal{O}(n) per step: with a million examples, every parameter update needs a million per-example gradient evaluations. Untenable.

Stochastic gradient descent

Pick an example i uniformly at random and step with \nabla f_i. Each step costs \mathcal{O}(1), and the estimate is unbiased (\mathbb{E}_i[\nabla f_i(\mathbf{x})] = \nabla f(\mathbf{x})):

\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}).
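The unbiasedness claim can be checked numerically. The sketch below is a minimal illustration, not part of the original notebook: the toy per-example losses f_i(x) = 0.5 (a_i x - b_i)^2, the data in `A` and `B`, and the helper names are all assumptions made for the example. It compares the full gradient with the average of many single-example gradients drawn uniformly at random.

```python
import random

# Hypothetical toy data: f_i(x) = 0.5 * (a_i * x - b_i) ** 2 for a scalar x.
A = [1.0, 2.0, 3.0, 4.0]
B = [2.0, 1.0, 0.0, -1.0]

def grad_i(x, i):
    """Gradient of the i-th per-example loss at x."""
    return A[i] * (A[i] * x - B[i])

def full_grad(x):
    """Gradient of the average loss: touches all n examples, O(n) work."""
    n = len(A)
    return sum(grad_i(x, i) for i in range(n)) / n

def stochastic_grad(x, rng):
    """Gradient of one uniformly sampled example: O(1) work."""
    return grad_i(x, rng.randrange(len(A)))

rng = random.Random(0)  # seeded so the run is repeatable
x = 0.5
num_samples = 100_000
estimate = sum(stochastic_grad(x, rng) for _ in range(num_samples)) / num_samples
print(full_grad(x), estimate)
```

Any single stochastic gradient can be far from the full gradient; only the average over many draws matches it, which is exactly what "unbiased" promises.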

The price: noisy gradients. They blur the trajectory, but also help escape narrow local minima — a double-edged property this chapter unpacks.

Setup

%matplotlib inline
from d2l import mxnet as d2l
import math
import random
from mxnet import np, npx
npx.set_np()

We don’t actually need a dataset. Take the same anisotropic f(x_1, x_2) = x_1^2 + 2x_2^2 from the GD section, add \mathcal{N}(0, 1) noise to each gradient component, and watch how the trajectory differs:

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient (Python's random.gauss avoids a GPU sync per
    # step that a framework-tensor .item() would force in this 1000-step
    # demo; the noise is scalar so a framework tensor buys nothing).
    g1 += random.gauss(0, 1)
    g2 += random.gauss(0, 1)
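    # eta (base rate) and lr (schedule function) are module-level globals
    # assigned below, before sgd is first called.
    eta_t = eta * lr()
    # The trailing zeros fill the unused state slots s1, s2.
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)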

SGD trajectory

With a constant learning rate, SGD never settles: it keeps oscillating around the minimum, and the variance of the gradient noise sets a floor on how close it gets:

def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))

Why decay the learning rate

Constant \eta → the iterates hover at an \mathcal{O}(\eta) noise floor. Decay \eta over time, but not so fast that progress stalls → convergence to the minimum.
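The noise-floor claim can be probed with a small experiment. The sketch below is illustrative only: it reruns the same noisy quadratic in plain Python, without d2l, and the helper name `average_late_loss`, the starting point (-5, -2), the step count, and the seed are assumptions. It averages f over the second half of a long constant-rate run for two values of \eta.

```python
import random

def f(x1, x2):
    """Same anisotropic objective as in the text."""
    return x1 ** 2 + 2 * x2 ** 2

def average_late_loss(eta, steps=20_000, seed=0):
    """Run constant-rate SGD and average f over the second half of the run."""
    rng = random.Random(seed)
    x1, x2 = -5.0, -2.0  # assumed starting point
    total, count = 0.0, 0
    for step in range(steps):
        # Exact gradient plus N(0, 1) noise on each component.
        g1 = 2 * x1 + rng.gauss(0, 1)
        g2 = 4 * x2 + rng.gauss(0, 1)
        x1, x2 = x1 - eta * g1, x2 - eta * g2
        if step >= steps // 2:  # skip the initial transient
            total += f(x1, x2)
            count += 1
    return total / count

floor_large = average_late_loss(eta=0.1)
floor_small = average_late_loss(eta=0.01)
print(floor_large, floor_small)
```

Under these assumptions, shrinking \eta by a factor of ten should lower the late-run average loss by roughly the same factor, consistent with an \mathcal{O}(\eta) floor.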

Common schedules:

Inverse: \eta_t = \eta_0 / (1 + \beta t)
Polynomial: \eta_t = \eta_0 (1 + \beta t)^{-\alpha}, typically \alpha \in [0.5, 1] (\alpha = 1 recovers the inverse schedule)
Exponential: \eta_t = \eta_0 \cdot \alpha^t, 0 < \alpha < 1
Piecewise constant: drop by 10× every K epochs
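The four schedules listed above can be sketched as pure functions of the step (or epoch) index. This is a minimal illustration: the function names and the default values for `eta0`, `beta`, `alpha`, `drop_every`, and `factor` are assumptions chosen for the example, not values prescribed by the text.

```python
def inverse_schedule(t, eta0=0.1, beta=0.1):
    """Inverse decay: eta0 / (1 + beta * t)."""
    return eta0 / (1 + beta * t)

def polynomial_schedule(t, eta0=0.1, beta=0.1, alpha=0.5):
    """Polynomial decay: eta0 * (1 + beta * t) ** (-alpha)."""
    return eta0 * (1 + beta * t) ** (-alpha)

def exponential_schedule(t, eta0=0.1, alpha=0.9):
    """Exponential decay: eta0 * alpha ** t with 0 < alpha < 1."""
    return eta0 * alpha ** t

def piecewise_schedule(epoch, eta0=0.1, drop_every=10, factor=0.1):
    """Piecewise constant: multiply by `factor` every `drop_every` epochs."""
    return eta0 * factor ** (epoch // drop_every)

for t in (0, 10, 100):
    print(t, inverse_schedule(t), polynomial_schedule(t),
          exponential_schedule(t), piecewise_schedule(t))
```

Writing each schedule as a function of t, rather than a counter hidden in global state, makes it easy to tabulate or plot them side by side before committing to one.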

A decay schedule in code

def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
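The schedule above depends on a global counter t that must be reset by hand before every run. One possible alternative, shown here as a sketch, is to let each schedule carry its own counter in a closure. The factory name `make_exponential_lr` and its parameters `rate` and `t_start` are hypothetical; the returned function takes no arguments, so it could be assigned to lr in the same way.

```python
import math

def make_exponential_lr(rate=0.1, t_start=1):
    """Return a zero-argument schedule that keeps its own step counter."""
    t = t_start

    def schedule():
        nonlocal t  # counter lives in the closure, not in module globals
        t += 1
        return math.exp(-rate * t)

    return schedule

first_run = make_exponential_lr()
values = [first_run() for _ in range(3)]

# A second schedule starts from scratch; no manual reset of t is needed.
second_run = make_exponential_lr()
restart_value = second_run()
print(values, restart_value)
```

With the defaults, the sequence matches the global-variable version started at t = 1, since both increment the counter before evaluating the exponential.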

Decay schedule comparison

Exponential decay reduces variance quickly, but the step size shrinks so fast that the iterates can stall short of the minimum. Polynomial decay with \alpha = 0.5 (inverse square root) keeps the steps useful for longer and converges better in this example:

def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
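One way to see why the exponential schedule can stall is to remove the noise entirely and look at how far the step sizes can carry the iterate. The sketch below is a simplified illustration under stated assumptions: it follows only the x_1 coordinate (gradient 2 x_1), starts from x_1 = -5, uses \eta = 0.1, and mirrors the counter convention of the cells above (t starts at 1 and is incremented before use). The helper `run_noiseless` is hypothetical.

```python
import math

def run_noiseless(schedule, eta=0.1, steps=1000, x1=-5.0):
    """Noise-free gradient descent on x1 ** 2.

    Returns the final x1 and the sum of all step sizes used.
    """
    t = 1
    budget = 0.0
    for _ in range(steps):
        t += 1
        eta_t = eta * schedule(t)
        x1 -= eta_t * 2 * x1  # gradient of x1 ** 2 is 2 * x1
        budget += eta_t
    return x1, budget

x_exp, budget_exp = run_noiseless(lambda t: math.exp(-0.1 * t))
x_poly, budget_poly = run_noiseless(lambda t: (1 + 0.1 * t) ** (-0.5))
print("exponential:", x_exp, budget_exp)
print("polynomial: ", x_poly, budget_poly)
```

The exponential step sizes form a geometric series, so their sum is bounded and the iterate stops well short of zero even without noise. The inverse-square-root step sizes keep adding up, so the noiseless iterate reaches the minimum to within floating-point precision.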

Recap

SGD: \mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}) with random i. Unbiased; \mathcal{O}(1)/step instead of \mathcal{O}(n).
Constant \eta: bounces around the minimum forever.
Decay schedules (1/t, polynomial, exponential, step) shrink the noise floor over time; convergence also requires that \eta not decay too fast, so the right schedule depends on the problem.
Noise is sometimes a feature: knocks parameters out of narrow local basins. Minibatch SGD (next) tames the variance with a bit of averaging.