Two GPUs, two independent jobs

Automatic Parallelism

Once the framework runs asynchronously and tracks dependencies, two kinds of parallelism happen for free:

Independent ops on different devices: if neither op A nor op B depends on the other, the scheduler can run them in parallel on GPU 0 and GPU 1.
Computation overlapped with communication: while the GPUs reduce already-computed gradients across the network, the backward pass for the remaining layers can keep running.

Two-layer MLP scheduled across CPU and 2 GPUs — independent branches run in parallel.

You don’t write any threads. The dependency tracker does it for you. This deck quantifies the speedup.
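As a rough illustration of what a dependency tracker does, here is a minimal sketch in plain Python. It is not MXNet's scheduler: `run_graph`, the toy `ops` and `deps` dictionaries, and the thread pool standing in for devices are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_graph(ops, deps, workers=2):
    """Run each op as soon as its inputs exist; return {name: result}.

    ops:  name -> callable that receives its dependencies' results in order
    deps: name -> list of names it depends on (a missing key means no inputs)
    """
    results, pending, running = {}, set(ops), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending or running:
            # Launch every op whose inputs are all available.
            ready = [n for n in pending
                     if all(d in results for d in deps.get(n, []))]
            for name in ready:
                pending.remove(name)
                args = [results[d] for d in deps.get(name, [])]
                running[pool.submit(ops[name], *args)] = name
            if not running:
                raise ValueError("cycle or unknown dependency in graph")
            # Block until at least one op finishes, then record its result.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


# Toy graph: "a" and "b" both need only "x", so they may run concurrently;
# "c" needs both and therefore runs last.
ops = {
    "x": lambda: 3,
    "a": lambda x: x * 2,
    "b": lambda x: x + 4,
    "c": lambda a, b: a + b,
}
deps = {"a": ["x"], "b": ["x"], "c": ["a", "b"]}
results = run_graph(ops, deps)
print(results["c"])  # 13
```

The caller only declares what depends on what. The order of execution, and which ops overlap, falls out of the graph.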

from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()

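Run the same 50 matrix multiplications on each GPU separately, then issue both jobs back to back without waiting in between (the code labels devices[0] and devices[1] as GPU1 and GPU2):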

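devices = d2l.try_all_gpus()  # Available GPU contexts; this demo needs at least two
def run(x):
    # Queue 50 independent matrix products on the device that holds x
    return [x.dot(x) for _ in range(50)]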

x_gpu1 = np.random.uniform(size=(4000, 4000), ctx=devices[0])
x_gpu2 = np.random.uniform(size=(4000, 4000), ctx=devices[1])

run(x_gpu1)  # Warm-up both devices
run(x_gpu2)
npx.waitall()

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    npx.waitall()

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    npx.waitall()

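When both jobs are issued together, the total should land well below the sum of the two separate timings and close to the time of a single GPU, because the scheduler runs the two jobs in parallel.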
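To put a number on that comparison, the three timings can be turned into a speedup and an overlap efficiency. The helper below is a hypothetical convenience, not part of d2l, and the timings in the usage line are placeholders rather than measured results.

```python
def parallel_speedup(t_a, t_b, t_both):
    """Compare a combined run against running the two jobs one after another.

    Returns (speedup, efficiency):
      speedup    = (t_a + t_b) / t_both, where 2.0 is the best case for equal jobs
      efficiency = max(t_a, t_b) / t_both, where 1.0 means perfect overlap
    """
    serial = t_a + t_b      # time if the jobs ran back to back
    ideal = max(t_a, t_b)   # time if the jobs overlapped completely
    return serial / t_both, ideal / t_both


# Placeholder timings in seconds; substitute the values printed by d2l.Benchmark.
speedup, efficiency = parallel_speedup(0.5, 0.5, 0.5)
print(speedup, efficiency)  # 2.0 1.0
```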

Computation + communication

Compute on the first GPU and copy the results to the CPU, first as two separate steps and then overlapped:

def copy_to_cpu(x):
    return [y.copyto(npx.cpu()) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    npx.waitall()

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    npx.waitall()

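The combined run should take less than the sum of the two separate steps: each y[i] can be copied to the CPU as soon as it is computed, while the GPU is already working on y[i+1]. The same idea scales to multi-GPU training: overlap the all-reduce of gradients that are already computed with the backward pass of the remaining layers.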
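A small timing model shows where the saving comes from. This is a sketch under simplifying assumptions: every matmul takes the same `compute` time, every copy takes the same `copy` time, and there is one compute stream and one copy link. The function names and the unit costs are made up for illustration.

```python
def sequential_time(n, compute, copy):
    """Finish all n computations first, then do all n copies."""
    return n * compute + n * copy


def overlapped_time(n, compute, copy):
    """Copy result i while result i+1 is still being computed."""
    compute_done = 0.0
    copy_done = 0.0
    for _ in range(n):
        # The compute stream works back to back.
        compute_done += compute
        # A copy starts once its input is ready and the link is free.
        copy_done = max(copy_done, compute_done) + copy
    return copy_done


# Arbitrary units: 50 results, compute costs 2, copy costs 1.
print(sequential_time(50, 2, 1))  # 150
print(overlapped_time(50, 2, 1))  # 101.0
```

In this model the overlapped total approaches n times the slower of the two stages, so the gain is largest when compute and copy costs are similar.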

Recap

Async backend + dependency tracker = automatic parallelism across devices.
Independent ops run in parallel; communication overlaps with computation.
No explicit thread management — write straight-line code, the scheduler finds the parallelism.
Libraries such as NCCL (collective communication), Horovod (data-parallel training), and DeepSpeed (pipeline and sharded parallelism) take this further with explicit strategies for very large models.