Two GPUs, two independent jobs

Automatic Parallelism

Once the framework runs asynchronously and tracks dependencies, two kinds of parallelism happen for free:

Independent ops on different devices — if op A doesn’t depend on op B, the scheduler can run them in parallel on GPU 0 and GPU 1.
Computation overlapped with communication — while the GPUs reduce gradients across the network, the next layer’s forward pass can start running.

Two-layer MLP scheduled across CPU and 2 GPUs — independent branches run in parallel.

You don’t write any threads. The dependency tracker does it for you. This deck quantifies the speedup.

from d2l import torch as d2l
import torch

Run the same matmul on GPU 0 and GPU 1 separately, then run both at the same time:

devices = d2l.try_all_gpus()
def run(x):
    return [x.mm(x) for _ in range(50)]

x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])

run(x_gpu1)
run(x_gpu2)  # Warm-up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    torch.cuda.synchronize(devices[1])

GPU1 time: 0.1195 sec
GPU2 time: 0.1189 sec

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    torch.cuda.synchronize()

GPU1 & GPU2: 0.1216 sec

Concurrent run is roughly the time of one GPU — the scheduler used both in parallel.

Computation + communication

Compute on GPU 0 and copy the result to GPU 1 — sequential vs overlapped:

def copy_to_cpu(x, non_blocking=False):
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()

Run on GPU1: 0.1232 sec
Copy to CPU: 2.1620 sec

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, True)
    torch.cuda.synchronize()

Run on GPU1 and copy to CPU: 3.1346 sec

Overlapping shaves real time. Same idea scales to multi-GPU training: fuse all_reduce with the next layer’s forward.

Recap

Async backend + dependency tracker = automatic parallelism across devices.
Independent ops run in parallel; communication overlaps with computation.
No explicit thread management — write straight-line code, the scheduler finds the parallelism.
Frameworks like NCCL, Horovod, DeepSpeed take this further with explicit pipeline / sharded parallelism for very large models.