# Warm-up: trigger tf.function tracing and CUDA kernel loading on both GPUs
run(x_gpu1)[-1].numpy()
run(x_gpu2)[-1].numpy()

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)[-1].numpy()

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)[-1].numpy()
GPU1 time: 0.0629 sec
GPU2 time: 0.0668 sec
with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)[-1].numpy()
GPU1 & GPU2: 0.0869 sec
The concurrent run takes 0.0869 sec, well under the 0.1297 sec sum of the two separate runs (0.0629 + 0.0668): the scheduler kept both GPUs busy at the same time.
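The dispatch pattern behind this measurement can be imitated without any GPU. The following is a minimal sketch, not TensorFlow's scheduler: the two "devices" are single-worker thread queues, `fake_kernel` is a hypothetical stand-in that just sleeps, and the 0.2 sec duration is made up. It only shows why enqueueing work on both devices before blocking on a result is faster than waiting after each call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel(seconds):
    """Hypothetical stand-in for a GPU workload (sleeps instead of computing)."""
    time.sleep(seconds)
    return seconds

# One single-worker queue per simulated device (names are illustrative only).
gpu1 = ThreadPoolExecutor(max_workers=1)
gpu2 = ThreadPoolExecutor(max_workers=1)

# Sequential: block on device 1 before device 2 receives any work.
start = time.perf_counter()
gpu1.submit(fake_kernel, 0.2).result()
gpu2.submit(fake_kernel, 0.2).result()
t_seq = time.perf_counter() - start

# Concurrent: enqueue on both devices first, then block on the results.
start = time.perf_counter()
f1 = gpu1.submit(fake_kernel, 0.2)
f2 = gpu2.submit(fake_kernel, 0.2)
f1.result()
f2.result()
t_par = time.perf_counter() - start

gpu1.shutdown()
gpu2.shutdown()
print(f'sequential: {t_seq:.2f} sec, concurrent: {t_par:.2f} sec')
```

In the benchmark above, `run(...)` plays the role of `submit` and `.numpy()` plays the role of `.result()`.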
Computation + communication
Compute on GPU1 and copy the result to the CPU, sequentially vs overlapped:
def copy_to_cpu(x):
    with tf.device('/CPU:0'):
        return [tf.identity(t) for t in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    y[-1].numpy()  # Sync GPU

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    y_cpu[-1].numpy()  # Sync transfer
Run on GPU1: 0.0557 sec
Copy to CPU: 2.1474 sec
with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    y_cpu[-1].numpy()
Run on GPU1 and copy to CPU: 2.5692 sec
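Overlapping saves wall-clock time: 2.5692 sec combined versus 2.7031 sec (0.0557 + 2.1474) for the two steps run back to back. The same idea scales to multi-GPU training: overlap the all_reduce of gradients that are already computed with the backward pass of the remaining layers.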
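Why overlapping helps can be seen in a small simulation. This is a sketch under stated assumptions, not a measurement: `compute` and `copy` are hypothetical stand-ins that sleep, the per-layer costs are invented, and a single worker thread plays the role of a copy engine that runs alongside the compute loop.

```python
import time
from concurrent.futures import ThreadPoolExecutor

COMPUTE, COPY, LAYERS = 0.05, 0.05, 4  # made-up per-layer costs in seconds

def compute(i):
    time.sleep(COMPUTE)  # stand-in for one layer's work on the GPU
    return i

def copy(i):
    time.sleep(COPY)     # stand-in for a device-to-host transfer
    return i

# Sequential: finish all computation, then copy every result.
start = time.perf_counter()
outputs = [compute(i) for i in range(LAYERS)]
copied = [copy(y) for y in outputs]
t_seq = time.perf_counter() - start

# Overlapped: hand each result to the copy worker as soon as it exists,
# so copying layer i runs while layer i + 1 is being computed.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as copy_engine:
    futures = [copy_engine.submit(copy, compute(i)) for i in range(LAYERS)]
    copied_overlap = [f.result() for f in futures]
t_overlap = time.perf_counter() - start

print(f'sequential: {t_seq:.2f} sec, overlapped: {t_overlap:.2f} sec')
```

With these invented costs the sequential version pays for every compute and every copy, while the overlapped version pays for the compute steps plus roughly one trailing copy.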
Recap
Async backend + dependency tracker = automatic parallelism across devices.
Independent ops run in parallel; communication overlaps with computation.
No explicit thread management — write straight-line code, the scheduler finds the parallelism.
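Libraries such as Horovod and DeepSpeed, built on communication backends like NCCL, take this further: Horovod with all_reduce-based data parallelism, DeepSpeed with explicit pipeline and sharded parallelism for very large models.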
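The "dependency tracker" from the recap can be illustrated with a toy scheduler. This is a simplified sketch and not how TensorFlow is implemented: the `graph` dictionary, the op names, and the durations are all hypothetical, and ops are released in waves (everything whose inputs are finished runs together) rather than one by one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical op graph: name -> (duration in seconds, names it depends on).
graph = {
    'a': (0.1, []),
    'b': (0.1, []),          # independent of 'a', so it can run alongside it
    'c': (0.1, ['a', 'b']),  # must wait for both inputs
}

def run_graph(graph, workers=2):
    """Run ops in waves; each wave holds every op whose inputs are finished."""
    done, waves = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(graph):
            ready = [name for name, (_, deps) in graph.items()
                     if name not in done and all(d in done for d in deps)]
            if not ready:
                raise ValueError('dependency cycle')
            # Execute the whole ready set in parallel and wait for it.
            list(pool.map(lambda name: time.sleep(graph[name][0]), ready))
            done.update(ready)
            waves.append(ready)
    return waves

start = time.perf_counter()
waves = run_graph(graph)
elapsed = time.perf_counter() - start
print(waves, f'{elapsed:.2f} sec')
```

The caller only describes the ops and their inputs; the scheduler finds that `a` and `b` can share a wave, which is the sense in which straight-line code yields parallelism without explicit thread management.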