Multi-GPU from First Principles

The Next Rung: Another GPU

More GPUs buy more compute and more memory. The catch: communication is not free, and on a box with no NVLink it is loud enough to hear.

Plan: build data parallelism by hand, derive the collective the professionals use, then measure what a second GPU costs — and predict, before running, whether it pays.

Three Ways to Split

Data parallel is our subject: simplest, one sync per step, works for any model that fits. Pipeline and tensor parallel wait for the Language Models part.

Data Parallelism by Hand

Split batch → forward/backward per replica → allreduce gradients → identical update. One process; tensors moved explicitly.

def train_batch(X, y, device_params, devices, lr):
    X_shards, y_shards = split_batch(X, y, devices)
    ls = [loss(lenet(Xs, dev_W), ys).sum()
          for Xs, ys, dev_W in zip(X_shards, y_shards, device_params)]
    for l in ls:
        l.backward()
    with torch.no_grad():
        for i in range(len(device_params[0])):
            allreduce([device_params[c][i].grad for c in range(len(devices))])
        for param in device_params:
            d2l.sgd(param, lr, X.shape[0])
            for p in param:
                p.grad = None

Two GPUs, No Speedup

train(num_gpus=min(2, d2l.num_gpus()), batch_size=256, lr=0.2)

test acc 0.79, 2.31 sec/epoch on 2 GPU(s)

Not a bug — the syllabus. LeNet is too small: halving a small batch underutilizes each GPU. Not a bandwidth problem — the whole gradient set is about half a megabyte. Wrong regime, not wrong technique.

Ring Allreduce

Star: hub moves (k-1)N. Ring (reduce-scatter + all-gather): \frac{2(k-1)}{k}N per device — nearly constant, bounded by 2N for any k. The identity that becomes FSDP.

The Accounting

t_{\text{step}}(k) \approx t_{\text{compute}}(B/k) + 2N/\beta

if d2l.num_gpus() >= 2:
    N = 64 * 1024 * 1024  # 64M floats = 256 MB per replica
    data = [torch.randn(N, device=d2l.try_gpu(i)) for i in range(2)]
    t = d2l.Benchmark(lambda: allreduce(data), warmup=2, repeats=5).time
    # Per-device traffic is ~2N bytes; report effective bandwidth
    print(f'allreduce {2 * N * 4 / t / 1e9:.2f} GB/s effective '
          f'over {1000 * t:.1f} ms')
else:
    print('needs 2 GPUs')

allreduce 17.28 GB/s effective over 31.1 ms

Raw copies sustain tens of GB/s (PCIe-limited); NCCL’s fallback transport lands lower — one stage of it is the ceiling; §13.6 measures the env-switch workaround, and its limits. So LeNet’s no-speedup isn’t communication — it’s t_{\text{compute}}(B/k) not shrinking when a small batch is halved. Big model + big batch → the second GPU pays (next section).

Lineage

Parameter servers (push/pull): the asynchronous, multi-machine era; alive in recsys embeddings and other sparse, asynchronous state.
Synchronous collectives (ring allreduce): won for dense training; what DDP runs.

Production map → the Tools appendix. Next: let the library run the ring for us.