with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    npx.waitall()
Running both workloads together takes roughly as long as running one of them alone: the two calls are independent, so the scheduler dispatched them to both GPUs in parallel.
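The timing effect can be reproduced without GPUs. The sketch below is a stand-in, not the MXNet scheduler: `fake_device_work`, `run_sequential`, and `run_concurrent` are hypothetical names, `time.sleep` simulates the work done by `run(x)`, and a thread pool plays the role of the backend that dispatches independent tasks.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_device_work(seconds):
    """Hypothetical stand-in for run(x) on one device: block, then return."""
    time.sleep(seconds)
    return seconds

def timed(fn):
    """Return (result, elapsed wall-clock seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def run_sequential(work):
    # One task after another, as if only a single device were used.
    return [fake_device_work(w) for w in work]

def run_concurrent(work):
    # One worker per simulated device, so independent tasks can overlap.
    with ThreadPoolExecutor(max_workers=len(work)) as pool:
        return list(pool.map(fake_device_work, work))

work = [0.2, 0.2]  # two simulated devices, 0.2 s of work each (assumed)
_, t_seq = timed(lambda: run_sequential(work))
_, t_par = timed(lambda: run_concurrent(work))
print(f"sequential: {t_seq:.2f} s, concurrent: {t_par:.2f} s")
```

With two equal tasks, the concurrent time should land near the time of a single task, mirroring the GPU benchmark.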
Computation + communication
Compute on one GPU and copy the result to the CPU, first as two separate steps and then overlapped:
def copy_to_cpu(x):
    return [y.copyto(npx.cpu()) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    npx.waitall()

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    npx.waitall()
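Overlapping reduces the total wall-clock time: each element of y can start copying as soon as it is computed, while later elements are still being computed. The same idea scales to multi-GPU training: overlap the all_reduce of one layer's gradients with the backward computation of the remaining layers.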
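The overlap can be illustrated with a small pipeline. This is a simulation under stated assumptions, not MXNet code: `compute`, `copy`, `sequential`, and `overlapped` are hypothetical names, the per-chunk costs `COMPUTE_S` and `COPY_S` are made up, and a single background worker models one transfer channel.

```python
import time
from concurrent.futures import ThreadPoolExecutor

COMPUTE_S = 0.05  # assumed cost of computing one chunk
COPY_S = 0.05     # assumed cost of copying one chunk

def compute(i):
    """Hypothetical stand-in for one device-side computation."""
    time.sleep(COMPUTE_S)
    return i * i

def copy(value):
    """Hypothetical stand-in for one device-to-host transfer."""
    time.sleep(COPY_S)
    return value

def sequential(n):
    results = [compute(i) for i in range(n)]  # finish all computation first
    return [copy(r) for r in results]         # then copy everything

def overlapped(n):
    # A single copy worker models one transfer channel.
    with ThreadPoolExecutor(max_workers=1) as copier:
        futures = []
        for i in range(n):
            r = compute(i)                          # compute chunk i
            futures.append(copier.submit(copy, r))  # copy it while chunk i+1 computes
        return [f.result() for f in futures]

def timed(fn, *args):
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

out_seq, t_seq = timed(sequential, 4)
out_ovl, t_ovl = timed(overlapped, 4)
print(f"sequential: {t_seq:.2f} s, overlapped: {t_ovl:.2f} s")
```

In this model the sequential version pays for every compute and every copy in turn, while the overlapped version hides all but the last copy behind computation.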
Recap
Async backend + dependency tracker = automatic parallelism across devices.
Independent ops run in parallel; communication overlaps with computation.
No explicit thread management — write straight-line code, the scheduler finds the parallelism.
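Dedicated tools take this further for very large models: NCCL provides the collective communication primitives, Horovod builds data-parallel training on top of all_reduce, and DeepSpeed adds explicit pipeline and sharded parallelism.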
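The recap's first three points can be sketched as a toy dependency tracker. This is an illustrative model only and does not reflect how the MXNet backend is implemented: `run_graph`, `ops`, and `deps` are hypothetical names. Each op is launched as soon as all of its inputs are available, so independent ops run in parallel without the caller managing threads.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_graph(ops, deps, max_workers=4):
    """Run a small op graph.

    ops:  name -> callable taking its dependencies' results, in order.
    deps: name -> list of dependency names (missing key means no inputs).
    Returns name -> result.
    """
    results = {}
    pending = {}           # future -> op name
    remaining = set(ops)   # ops not yet launched
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or pending:
            # Launch every op whose inputs are all available.
            ready = [n for n in remaining
                     if all(d in results for d in deps.get(n, []))]
            for name in ready:
                args = [results[d] for d in deps.get(name, [])]
                pending[pool.submit(ops[name], *args)] = name
                remaining.discard(name)
            if not pending:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Block until at least one running op finishes, then record it.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results

# "a" and "b" are independent and may run in parallel; "c" waits for both.
ops = {
    "a": lambda: 2,
    "b": lambda: 3,
    "c": lambda a, b: a * b,
}
deps = {"c": ["a", "b"]}
results = run_graph(ops, deps)
print(results["c"])  # 6
```

The caller writes a plain description of ops and their inputs; the ordering and the parallelism fall out of the dependency check.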