with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    torch.cuda.synchronize()
GPU1 & GPU2: 0.1216 sec
Running both workloads together takes roughly as long as running one of them on a single GPU, so the scheduler executed them on the two devices in parallel.
Computation + communication
Compute on GPU1 and copy the result back to the CPU, first sequentially and then overlapped:
def copy_to_cpu(x, non_blocking=False):
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()
Run on GPU1: 0.1232 sec
Copy to CPU: 2.1620 sec
with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, True)
    torch.cuda.synchronize()
Run on GPU1 and copy to CPU: 3.1346 sec
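Overlapping should bring the total below the sum of the two separate steps (0.1232 + 2.1620 ≈ 2.29 sec), because parts of y that are already computed can be copied while the rest is still being computed. The 3.1346 sec recorded above does not show that saving, so this measurement is worth re-running. The same idea scales to multi-GPU training: start the all_reduce for gradients that are already available while backpropagation continues through the remaining layers.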
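To make the expected benefit concrete, here is a minimal timing model of the overlap, not a measurement. It assumes the work splits into equal chunks and that chunk i can be copied while chunk i+1 is computed. The function names and the chunk count of 50 are hypothetical choices for illustration, and the model ignores launch overhead and bandwidth contention, so it gives a best-case bound only.

```python
def sequential_time(n_chunks, compute_per_chunk, copy_per_chunk):
    """Total time when every chunk is computed and then copied, with no overlap."""
    return n_chunks * (compute_per_chunk + copy_per_chunk)


def overlapped_time(n_chunks, compute_per_chunk, copy_per_chunk):
    """Best-case total for a two-stage pipeline (compute, then copy).

    The first chunk pays for compute plus copy; every later chunk only
    adds the cost of the slower of the two stages.
    """
    if n_chunks == 0:
        return 0.0
    slower = max(compute_per_chunk, copy_per_chunk)
    return compute_per_chunk + copy_per_chunk + (n_chunks - 1) * slower


# Per-chunk costs derived from the totals reported above.
# n = 50 is an assumed chunk count, not taken from the benchmark.
n = 50
compute = 0.1232 / n
copy = 2.1620 / n
print(f"sequential: {sequential_time(n, compute, copy):.4f} sec")
print(f"overlapped: {overlapped_time(n, compute, copy):.4f} sec")
```

Under this model the saving is at most the cost of the faster stage, so when copying dominates, as it does here, overlap can hide the compute time but cannot make the copy itself faster.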
Recap
Async backend + dependency tracker = automatic parallelism across devices.
Independent ops run in parallel; communication overlaps with computation.
No explicit thread management — write straight-line code, the scheduler finds the parallelism.
Libraries build on this for very large models: NCCL provides the collective communication primitives, Horovod adds data-parallel training on top of them, and DeepSpeed adds explicit pipeline and sharded parallelism.
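The dependency-tracking idea from the recap can be sketched in plain Python. This is a toy illustration, not how PyTorch's backend is implemented: `parallel_waves` and the op names in `graph` are hypothetical, and a real scheduler launches each op as soon as its inputs are ready rather than in lockstep waves.

```python
def parallel_waves(deps):
    """Group ops into waves; all ops in one wave are mutually independent.

    deps maps each op name to the list of ops it must wait for.
    """
    remaining = {op: set(d) for op, d in deps.items()}
    done = set()
    waves = []
    while remaining:
        # An op is ready once every op it depends on has finished.
        ready = sorted(op for op, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for op in ready:
            del remaining[op]
    return waves


# Hypothetical op graph: two GPU workloads, their copies, and a final CPU step.
graph = {
    "run_gpu1": [],
    "run_gpu2": [],
    "copy_gpu1_to_cpu": ["run_gpu1"],
    "copy_gpu2_to_cpu": ["run_gpu2"],
    "combine_on_cpu": ["copy_gpu1_to_cpu", "copy_gpu2_to_cpu"],
}
for i, wave in enumerate(parallel_waves(graph)):
    print(f"wave {i}: {wave}")
```

The straight-line program never says which ops may run together; that information falls out of the dependencies alone, which is why no explicit thread management is needed.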