Minibatches

GD: \mathcal{O}(n) per step, exact. SGD: \mathcal{O}(1) per step, noisy. Everything we ever trained sat in between — a minibatch of b examples:

\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{b} \sum_{i \in \mathcal{B}} \nabla f_i(\mathbf{w}).

Statistics (last section): variance \propto 1/b.
This section: the mechanics — why b at once costs far less than b one at a time.
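
To make the update rule concrete, here is a minimal pure-Python sketch for a one-parameter least-squares model with f_i(w) = (w x_i - y_i)^2 / 2. The synthetic data, the helper name `minibatch_step`, the batch size, and the learning rate are illustrative assumptions, not part of the chapter's code.

```python
import random

def minibatch_step(w, xs, ys, batch_idx, lr):
    """One minibatch SGD step for f_i(w) = 0.5 * (w * x_i - y_i) ** 2."""
    # Per-example gradient: d f_i / d w = (w * x_i - y_i) * x_i
    grad_sum = sum((w * xs[i] - ys[i]) * xs[i] for i in batch_idx)
    # Average over the b examples in the batch, then step downhill.
    return w - lr / len(batch_idx) * grad_sum

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [3.0 * x for x in xs]   # noiseless targets, so the true weight is 3.0

w = 0.0
for _ in range(500):
    batch = random.sample(range(len(xs)), 10)   # b = 10 examples per step
    w = minibatch_step(w, xs, ys, batch, lr=0.5)
print(w)   # should end up very close to 3.0
```

Setting the batch to a single index gives plain SGD; passing all indices gives gradient descent.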

Arithmetic outruns memory

Server CPU: 10^{12}–10^{13} FLOP/s vs a few 100 GB/s of bandwidth.
GPU: 10^{14+} FLOP/s vs a few TB/s.
Ratio: one to two orders of magnitude, so each byte loaded must feed tens to hundreds of operations.

Caches bridge the gap only if the algorithm reuses resident data — blocked matmul, not elementwise loops.
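
A back-of-the-envelope sketch of why reuse matters: FLOPs per byte moved for a matrix-vector product versus a matrix-matrix product. It assumes float32 and an idealized cache in which every operand crosses the memory bus exactly once; the function names are hypothetical.

```python
def matvec_intensity(n, bytes_per_elem=4):
    """FLOPs per byte for an n x n matrix times a length-n vector."""
    flops = 2 * n * n                            # n dot products of length n
    traffic = (n * n + 2 * n) * bytes_per_elem   # read matrix and vector, write result
    return flops / traffic

def matmul_intensity(n, bytes_per_elem=4):
    """FLOPs per byte for an n x n matrix times an n x n matrix."""
    flops = 2 * n ** 3
    traffic = 3 * n * n * bytes_per_elem         # read B and C, write A
    return flops / traffic

for n in (256, 4096):
    print(n, round(matvec_intensity(n), 2), round(matmul_intensity(n), 1))
```

Under these assumptions the matrix-vector product stays near 0.5 FLOP per byte at any size, while the matrix-matrix product grows as n/6, which is what lets a blocked matmul keep the arithmetic units busy.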

Plus dispatch overhead: every op launched from Python costs microseconds; the arithmetic inside a tiny op costs nanoseconds.

Setup

%matplotlib inline
from d2l import jax as d2l
from flax import nnx
import jax
from jax import numpy as jnp
import numpy as np
import optax
import time

A = jnp.zeros((256, 256))
B = jnp.array(np.random.normal(0, 1, (256, 256)))
C = jnp.array(np.random.normal(0, 1, (256, 256)))

class Timer:
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()

timer = Timer()

Three loops, three speeds

Compute \mathbf{A} = \mathbf{B}\mathbf{C} on 256 \times 256 matrices, exposing more work per call each time:

# Compute A = BC one element at a time. JAX is functionally pure, so a
# literal `A.at[i, j].set(...)` would copy the full matrix on every write
# (O(n^2) memory traffic), turning a demo into a multi-minute run. We
# instead use a NumPy buffer to mirror the eager semantics PyTorch gets
# for free; the *point* of this cell is that the elementwise dispatch
# is much slower than vectorized matmul.
A = np.zeros((256, 256), dtype=np.float32)
B_np = np.array(B)
C_np = np.array(C)
timer.start()
for i in range(256):
    for j in range(256):
        A[i, j] = np.dot(B_np[i, :], C_np[:, j])
timer.stop()

0.1705946922302246

# Compute A = BC one column at a time. B and C stay on device, so the cost
# is the Python loop, one dispatch per column, and the copy of A that each
# functional `.at[...].set(...)` makes outside of jit.
A = jnp.zeros((256, 256))
timer.start()
for j in range(256):
    A = A.at[:, j].set(jnp.dot(B, C[:, j]))
A.block_until_ready()
timer.stop()

1.7103769779205322

# Compute A = BC in one go
timer.start()
A = jnp.dot(B, C)
A.block_until_ready()
timer.stop()

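# One 256 x 256 matmul costs 2 * 256**3, roughly 0.03e9, floating-point
# operations, so 0.03 / seconds is the rate in GFLOP/s.
gigaflops = [0.03 / i for i in timer.times]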
print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '
      f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')

performance in Gigaflops: element 0.176, column 0.018, full 0.016

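Same arithmetic, very different speeds. In the recorded run the full product (0.016) does not come out ahead: that timing is a cold first call and probably includes one-time compilation, so it understates steady-state throughput. Warmed up, the single vectorized call is expected to win by orders of magnitude. The loop is overhead; the cache and vector units do the work.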
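
One way to keep one-time costs out of such measurements is to warm the call up before timing it. The helper below is a hedged sketch: `time_call` is a hypothetical name, and it only assumes that results which expose `block_until_ready()` (as JAX arrays do) should be waited on before the clock stops.

```python
import time

def time_call(fn, *args, warmup=1, repeats=5):
    """Return the best wall-clock time in seconds of fn(*args) after warm-up."""
    def run():
        out = fn(*args)
        # JAX arrays expose block_until_ready(); other results need no waiting.
        if hasattr(out, "block_until_ready"):
            out.block_until_ready()
        return out

    for _ in range(warmup):        # absorbs one-time costs such as compilation
        run()
    best = float("inf")
    for _ in range(repeats):
        tik = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - tik)
    return best

# Stand-alone demo with a pure-Python workload.
elapsed = time_call(sum, range(100_000))
print(f"{elapsed:.6f} sec")
```

In the notebook the same helper could be applied as `time_call(jnp.dot, B, C)`; the numbers it reports will differ from the cold timings recorded above.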

Batching in blocks

64 columns at a time — a “minibatch” of the matmul:

timer.start()
for j in range(0, 256, 64):
    A = A.at[:, j:j+64].set(jnp.dot(B, C[:, j:j+64]))
A.block_until_ready()
timer.stop()
print(f'performance in Gigaflops: block {0.03 / timer.times[3]:.3f}')

performance in Gigaflops: block 0.019

Already as fast as the full multiplication: modest batches amortize essentially all the overhead.
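
A toy cost model shows why a modest block already suffices: total time is the number of launches times a fixed per-call overhead, plus arithmetic at a fixed rate. The 10 microsecond overhead and the 1e11 FLOP/s throughput are assumed round numbers for illustration, not measurements, and `modeled_time` is a hypothetical helper.

```python
def modeled_time(n, block, overhead_s=10e-6, flops_per_s=1e11):
    """Modeled seconds for an n x n matmul done in column blocks of width `block`."""
    calls = -(-n // block)                # ceil(n / block) launches
    compute = 2 * n ** 3 / flops_per_s    # same arithmetic for every blocking
    return calls * overhead_s + compute

for block in (1, 64, 256):
    print(block, f"{modeled_time(256, block) * 1e6:.1f} us")
```

With these assumed constants, column-at-a-time costs about 2.9 ms, 64 columns at a time about 0.38 ms, and the single call about 0.35 ms: almost all of the overhead is gone after the first factor of 64.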

Two reasons to batch, kept apart: hardware (this section) and variance (last section). Both saturate.
How large is too large → critical batch size, later in this chapter.
Batch exceeds memory? Gradient accumulation (ch. on performance).
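
The idea behind gradient accumulation can be sketched in a few lines: sum gradients over micro-batches that fit in memory and apply a single update at the end. This is a pure-Python illustration on a one-parameter least-squares model with made-up data; the helper names are hypothetical and it is not the implementation from the performance chapter.

```python
def grad_sum(w, xs, ys):
    """Sum of per-example gradients of 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

def accumulated_step(w, xs, ys, micro_batch, lr):
    """One update on the whole batch, processed `micro_batch` examples at a time."""
    total = 0.0
    for start in range(0, len(xs), micro_batch):
        stop = start + micro_batch
        # Accumulate only; the parameter is not touched inside the loop.
        total += grad_sum(w, xs[start:stop], ys[start:stop])
    return w - lr / len(xs) * total       # single update at the end

xs = [0.5, -1.0, 2.0, 1.5, -0.5, 1.0, 0.25, -2.0]
ys = [2.0 * x for x in xs]
full = accumulated_step(0.0, xs, ys, micro_batch=len(xs), lr=0.1)
accum = accumulated_step(0.0, xs, ys, micro_batch=2, lr=0.1)
print(abs(full - accum) < 1e-12)
```

The update matches the large-batch one up to floating-point summation order, at the price of one extra gradient-sized buffer and more launches.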

Airfoil dataset

Real regression data for the whole chapter: the first 1500 examples, 5 features, every column standardized to zero mean and unit variance; every run takes seconds:

d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',
                           '76e5be1548fd8222e5074cf0faae75edff8cf93f')


def get_data_ch11(batch_size=10, n=1500):
    data = np.genfromtxt(d2l.download('airfoil'),
                         dtype=np.float32, delimiter='\t')
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    data_iter = d2l.load_array(
        (jnp.array(data[:n, :-1]), jnp.array(data[:n, -1])),
        batch_size, is_train=True)
    return data_iter, data.shape[1]-1
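
`d2l.load_array` is used as a black box here. Conceptually, a shuffled minibatch iterator can be sketched as follows; `minibatches` is a hypothetical helper written for illustration, not the d2l implementation.

```python
import random

def minibatches(features, labels, batch_size, shuffle=True, seed=None):
    """Yield (features, labels) minibatches that cover the data once per pass."""
    indices = list(range(len(features)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]   # last batch may be smaller
        yield [features[i] for i in batch], [labels[i] for i in batch]

X = [[float(i), float(-i)] for i in range(25)]
y = [float(i) for i in range(25)]
sizes = [len(yb) for _, yb in minibatches(X, y, batch_size=10, seed=0)]
print(sizes)  # [10, 10, 5]
```

Every example appears exactly once per pass, so one pass over the iterator is one epoch regardless of the batch size.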

The optimizer interface

Every optimizer in this chapter takes (params, grads, states, hyperparams) and returns the updated params; states carries the algorithm’s memory (unused by SGD):

def sgd(params, grads, states, hyperparams):
    updated = []
    for param, grad in zip(params, grads):
        updated.append(param - hyperparams['lr'] * grad)
    return updated
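
To show what `states` is for, here is a hedged sketch of a stateful rule written against the same four-argument interface. Heavy-ball momentum is used purely as an example; the name `sgd_momentum`, the `'momentum'` hyperparameter key, and the toy quadratic objective are assumptions for illustration.

```python
def sgd_momentum(params, grads, states, hyperparams):
    """Heavy-ball momentum; `states` holds one velocity per parameter."""
    updated = []
    for k, (param, grad) in enumerate(zip(params, grads)):
        # Mutating the list in place is what lets the memory persist across calls.
        states[k] = hyperparams['momentum'] * states[k] + grad
        updated.append(param - hyperparams['lr'] * states[k])
    return updated

params, states = [1.0, -2.0], [0.0, 0.0]
hyper = {'lr': 0.1, 'momentum': 0.9}
for _ in range(3):
    grads = list(params)          # gradient of 0.5 * p ** 2 is p
    params = sgd_momentum(params, grads, states, hyper)
print(params, states)
```

Because the function only uses arithmetic on each parameter, the same body should also work when the entries are arrays rather than floats, with `states` initialized to zeros of matching shapes.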

Generic training harness

Linear regression + any update rule; loss recorded against wall-clock time:

def train_ch11(trainer_fn, states, hyperparams, data_iter,
               feature_dim, num_epochs=2):
    # Initialization
    w = jnp.array(np.random.normal(scale=0.01, size=(feature_dim, 1)),
                  dtype=jnp.float32)
    b = jnp.zeros(1)
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    # Train
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    # JIT only the grad computation; the optimizer update runs eagerly so
    # that stateful optimizers can mutate `states` without triggering JAX
    # tracer-leak errors from closure side-effects inside jit.
    @jax.jit
    def compute_grads(w, b, X, y):
        def loss_fn(w, b):
            return d2l.squared_loss(d2l.linreg(X, w, b), y).mean()
        return jax.grad(loss_fn, argnums=(0, 1))(w, b)
    # Pre-stack the full dataset on device so the periodic evaluate_loss
    # stays inside one compiled call instead of looping in Python.
    eval_batches = [(jnp.array(X), jnp.array(y)) for X, y in data_iter]
    Xs = jnp.concatenate([X for X, _ in eval_batches], axis=0)
    ys = jnp.concatenate([y for _, y in eval_batches], axis=0)
    @jax.jit
    def full_eval(w, b):
        out = d2l.linreg(Xs, w, b)
        y_r = ys.reshape(out.shape)
        return ((out - y_r) ** 2 / 2).mean()
    for _ in range(num_epochs):
        for X, y in data_iter:
            X, y = jnp.array(X), jnp.array(y)
            grads = compute_grads(w, b, X, y)
            w, b = trainer_fn([w, b], list(grads), states, hyperparams)
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (float(full_eval(w, b)),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.sum()/num_epochs:.3f} sec/epoch')
    return timer.cumsum(), animator.Y[0]

The race: full batch vs single example

b = 1500: one well-aimed update per epoch — stalls after ~6 steps:

def train_sgd(lr, batch_size, num_epochs=2):
    data_iter, feature_dim = get_data_ch11(batch_size)
    return train_ch11(
        sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)

gd_res = train_sgd(1, 1500, 10)

loss: 0.245, 0.090 sec/epoch

b = 1: 1500 updates per epoch, but 1500 tiny dispatches — more clock time per epoch than GD:

sgd_res = train_sgd(0.005, 1)

loss: 0.244, 7.417 sec/epoch

The middle wins

mini1_res = train_sgd(.4, 100)

loss: 0.245, 0.514 sec/epoch

mini2_res = train_sgd(.05, 10)

loss: 0.245, 1.153 sec/epoch

d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res, mini1_res, mini2_res))),
         'time (sec)', 'loss', xlim=[1e-2, 10], xscale='log',
         legend=['gd', 'sgd', 'batch size=100', 'batch size=10'])

Read the x-axis as elapsed seconds: b=100 beats both extremes.
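
A plot comparison can be backed by a number: the first recorded time at which each run reaches a target loss. The helper below is a small sketch; `time_to_target` is a hypothetical name, and the two curves are made-up values for illustration only, not results from the runs above.

```python
def time_to_target(times, losses, target):
    """Return the first recorded time at which loss <= target, or None."""
    for t, l in zip(times, losses):
        if l <= target:
            return t
    return None

# Made-up curves for illustration only.
fast = ([0.1, 0.2, 0.3], [0.30, 0.25, 0.245])
slow = ([1.0, 2.0, 3.0], [0.26, 0.25, 0.245])
print(time_to_target(*fast, target=0.25), time_to_target(*slow, target=0.25))
```

Since `train_sgd` returns cumulative times and losses as a pair, a call such as `time_to_target(*mini1_res, target=0.25)` would apply it to a real run.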

Concise: framework optimizer

Same experiment through the built-in optimizer — the harness the rest of the chapter reuses:

def train_concise_ch11(trainer_fn, hyperparams, data_iter, num_epochs=2):
    # Initialization
    net = nnx.Linear(5, 1, rngs=nnx.Rngs(0))
    optimizer = nnx.Optimizer(
        net, trainer_fn(**hyperparams), wrt=nnx.Param)

    loss = lambda pred, y: jnp.mean((pred - y) ** 2) / 2
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    # JIT-fuse the per-batch optimizer update so per-step Python overhead
    # stays out of the inner loop.
    @nnx.jit
    def step(model, optimizer, X, y):
        def loss_fn(model):
            out = model(X)
            y_reshaped = y.reshape(out.shape)
            return jnp.mean((out - y_reshaped) ** 2) / 2
        l, grads = nnx.value_and_grad(loss_fn)(model)
        optimizer.update(model, grads)
        return l

    # Pre-stack the full dataset on device so the periodic full-loss
    # evaluation is a single compiled call.
    eval_batches = [(jnp.array(X), jnp.array(y)) for X, y in data_iter]
    Xs = jnp.concatenate([X for X, _ in eval_batches], axis=0)
    ys = jnp.concatenate([y for _, y in eval_batches], axis=0)
    @nnx.jit
    def full_eval(model):
        out = model(Xs)
        y_r = ys.reshape(out.shape)
        return jnp.mean((out - y_r) ** 2) / 2
    for _ in range(num_epochs):
        for X, y in data_iter:
            X, y = jnp.array(X), jnp.array(y)
            step(net, optimizer, X, y)
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (float(full_eval(net)),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.sum()/num_epochs:.3f} sec/epoch')

data_iter, _ = get_data_ch11(10)
trainer = optax.sgd
train_concise_ch11(trainer, {'learning_rate': 0.05}, data_iter)

loss: 0.244, 0.804 sec/epoch

Recap

Minibatch SGD interpolates between GD and SGD and beats both on the wall clock.
The win is mechanical: dispatch amortization, cache reuse, vector units — plus the 1/b variance cut from last section.
Pick b to fill the accelerator within memory; the statistical ceiling (critical batch size) comes later in the chapter.