Two GPUs, two independent jobs

Automatic Parallelism

Once the framework runs asynchronously and tracks dependencies, two kinds of parallelism happen for free:

Independent ops on different devices — if op A doesn’t depend on op B, the scheduler can run them in parallel on GPU 0 and GPU 1.
Computation overlapped with communication — while the GPUs reduce gradients across the network, the next layer’s forward pass can start running.

Two-layer MLP scheduled across CPU and 2 GPUs — independent branches run in parallel.

You don’t write any threads. The dependency tracker does it for you. This deck quantifies the speedup.

from d2l import jax as d2l
import jax
from jax import numpy as jnp

Run the same matmul on GPU 0 and GPU 1 separately, then run both at the same time:

devices = jax.devices('gpu')
def run(x):
    return [jnp.dot(x, x) for _ in range(50)]

x_gpu1 = jax.device_put(jax.random.normal(jax.random.PRNGKey(0), (4000, 4000)),
                         devices[0])
x_gpu2 = jax.device_put(jax.random.normal(jax.random.PRNGKey(1), (4000, 4000)),
                         devices[1])

run(x_gpu1)[-1].block_until_ready()  # Warm-up both devices
run(x_gpu2)[-1].block_until_ready()

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)[-1].block_until_ready()

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)[-1].block_until_ready()

GPU1 time: 0.0792 sec
GPU2 time: 0.0829 sec

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)[-1].block_until_ready()

GPU1 & GPU2: 0.1077 sec

Concurrent run is roughly the time of one GPU — the scheduler used both in parallel.

Computation + communication

Compute on GPU 0 and copy the result to GPU 1 — sequential vs overlapped:

def copy_to_cpu(x):
    return [jax.device_put(y, jax.devices('cpu')[0]) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    y[-1].block_until_ready()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    y_cpu[-1].block_until_ready()

Run on GPU1: 0.0785 sec
Copy to CPU: 0.9835 sec

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    y_cpu[-1].block_until_ready()

Run on GPU1 and copy to CPU: 1.2879 sec

Overlapping shaves real time. Same idea scales to multi-GPU training: fuse all_reduce with the next layer’s forward.

Recap

Async backend + dependency tracker = automatic parallelism across devices.
Independent ops run in parallel; communication overlaps with computation.
No explicit thread management — write straight-line code, the scheduler finds the parallelism.
Frameworks like NCCL, Horovod, DeepSpeed take this further with explicit pipeline / sharded parallelism for very large models.