Multi-GPU from First Principles

The Next Rung: Another GPU

More GPUs buy more compute and more memory. The catch: communication is not free, and on a box with no NVLink it is loud enough to hear.

Plan: build data parallelism by hand, derive the collective the professionals use, then measure what a second GPU costs — and predict, before running, whether it pays.

Three Ways to Split

Data parallel is our subject: simplest, one sync per step, works for any model that fits. Pipeline and tensor parallel wait for the Language Models part.

Data Parallelism by Hand

Split batch → forward/backward per replica → allreduce gradients → identical update. One process; tensors moved explicitly.

@partial(jax.jit, static_argnames=('lr', 'mesh'))
def train_step(params, X, y, lr, mesh):
    """One data-parallel step: shard_map makes the pmean collective explicit.
    `X`, `y` arrive with the batch sharded across devices (P('data')) and
    `params` replicated (P()) -- see `train` below; shard_map hands each device
    the full parameter replica and its own batch shard. pcast marks the replica
    as this device's own local copy, so the gradient below is the shard's own;
    pmean then averages the shard gradients across devices."""
    P = jax.sharding.PartitionSpec

    def per_device(params, X, y):
        def loss_fn(p):
            logits = lenet(p, X[0])   # X[0]: strip the size-1 sharded axis
            return optax.softmax_cross_entropy_with_integer_labels(
                logits, y[0]).mean()
        local = jax.lax.pcast(params, 'data', to='varying')  # my own copy
        grads = jax.grad(loss_fn)(local)        # my shard's mean-loss gradient
        grads = jax.lax.pmean(grads, 'data')    # The allreduce, in one line
        return jax.tree.map(lambda p, g: p - lr * g, params, grads)

    step = jax.shard_map(per_device, mesh=mesh,
                         in_specs=(P(), P('data'), P('data')),
                         out_specs=P())
    return step(params, X, y)

Two GPUs, No Speedup

train(num_gpus=min(2, jax.local_device_count()), batch_size=256, lr=0.2)

test acc 0.81, 1.00 sec/epoch on 2 GPU(s)

Not a bug — the syllabus. LeNet is too small: halving a small batch underutilizes each GPU. Not a bandwidth problem — the whole gradient set is about half a megabyte. Wrong regime, not wrong technique.

Ring Allreduce

Star: hub moves (k-1)N. Ring (reduce-scatter + all-gather): \frac{2(k-1)}{k}N per device — nearly constant, bounded by 2N for any k. The identity that becomes FSDP.

The Accounting

t_{\text{step}}(k) \approx t_{\text{compute}}(B/k) + 2N/\beta

if jax.local_device_count() >= 2:
    mesh = jax.make_mesh((2,), ('data',))
    P = jax.sharding.PartitionSpec
    N = 64 * 1024 * 1024
    x = jax.device_put(jnp.ones((2, N)),
                       jax.sharding.NamedSharding(mesh, P('data')))
    psum = jax.jit(jax.shard_map(
        lambda a: jax.lax.psum(a, 'data'), mesh=mesh,
        in_specs=P('data'), out_specs=P('data')))
    t = d2l.Benchmark(lambda: psum(x), warmup=2, repeats=5).time
    print(f'psum {2 * N * 4 / t / 1e9:.2f} GB/s effective over {1000*t:.1f} ms')
else:
    print('needs 2 GPUs')

psum 6.53 GB/s effective over 82.3 ms

Raw copies sustain tens of GB/s (PCIe-limited); NCCL’s fallback transport lands lower — one stage of it is the ceiling; §13.6 measures the env-switch workaround, and its limits. So LeNet’s no-speedup isn’t communication — it’s t_{\text{compute}}(B/k) not shrinking when a small batch is halved. Big model + big batch → the second GPU pays (next section).

Lineage

Parameter servers (push/pull): the asynchronous, multi-machine era; alive in recsys embeddings and other sparse, asynchronous state.
Synchronous collectives (ring allreduce): won for dense training; what DDP runs.

Production map → the Tools appendix. Next: let the library run the ring for us.