Two GPUs, two independent jobs

Automatic Parallelism

Once the framework runs asynchronously and tracks dependencies, two kinds of parallelism happen for free:

Independent ops on different devices — if op A doesn’t depend on op B, the scheduler can run them in parallel on GPU 0 and GPU 1.
Computation overlapped with communication — while the GPUs reduce gradients across the network, the next layer’s forward pass can start running.

Two-layer MLP scheduled across CPU and 2 GPUs — independent branches run in parallel.

You don’t write any threads. The dependency tracker does it for you. This deck quantifies the speedup.

from d2l import tensorflow as d2l
import tensorflow as tf

Run the same matmul on GPU 0 and GPU 1 separately, then run both at the same time:

devices = tf.config.list_logical_devices('GPU')

@tf.function
def run(x):
    return [tf.linalg.matmul(x, x) for _ in range(50)]

with tf.device(devices[0].name):
    x_gpu1 = tf.random.normal(shape=(4000, 4000))
with tf.device(devices[1].name):
    x_gpu2 = tf.random.normal(shape=(4000, 4000))

# Warm-up: trigger tf.function tracing and CUDA kernel loading on both GPUs
run(x_gpu1)[-1].numpy()
run(x_gpu2)[-1].numpy()

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)[-1].numpy()

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)[-1].numpy()

GPU1 time: 0.0629 sec
GPU2 time: 0.0668 sec

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)[-1].numpy()

GPU1 & GPU2: 0.0869 sec

Concurrent run is roughly the time of one GPU — the scheduler used both in parallel.

Computation + communication

Compute on GPU 0 and copy the result to GPU 1 — sequential vs overlapped:

def copy_to_cpu(x):
    with tf.device('/CPU:0'):
        return [tf.identity(t) for t in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    y[-1].numpy()  # Sync GPU

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    y_cpu[-1].numpy()  # Sync transfer

Run on GPU1: 0.0557 sec
Copy to CPU: 2.1474 sec

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    y_cpu[-1].numpy()

Run on GPU1 and copy to CPU: 2.5692 sec

Overlapping shaves real time. Same idea scales to multi-GPU training: fuse all_reduce with the next layer’s forward.

Recap

Async backend + dependency tracker = automatic parallelism across devices.
Independent ops run in parallel; communication overlaps with computation.
No explicit thread management — write straight-line code, the scheduler finds the parallelism.
Frameworks like NCCL, Horovod, DeepSpeed take this further with explicit pipeline / sharded parallelism for very large models.