from d2l import torch as d2l
import numpy, os, subprocess
import torch
from torch import nn

GPUs are fast; Python is slow. If Python had to wait for the GPU to finish every tensor op before moving on to the next line, GPU utilization would be terrible.
The fix: deep-learning frameworks run asynchronously. The Python frontend dispatches an op and returns immediately, while the C++/CUDA backend queues it on a stream. Subsequent ops join the queue. The CPU and GPU work in parallel, and synchronization happens implicitly whenever you actually need a value (e.g. printing, .item(), or copying to the CPU for .numpy()).
Programming language frontends and DL framework backends.
This deck shows how to measure async behavior, where it backfires, and how to write code that benefits from it.
Time the same computation in pure NumPy vs the framework. NumPy is synchronous; the framework op returns immediately and the GPU runs in the background:
# Warmup for GPU computation
device = d2l.try_gpu()
a = torch.randn(size=(1000, 1000), device=device)
b = torch.mm(a, a)

with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('torch'):
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)

numpy: 0.3747 sec
torch: 0.0007 sec
Done: 0.0011 sec
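The 'torch' timing above mostly measures how long it takes to enqueue the ops, not how long they take to run. A minimal sketch of a fairer measurement follows. It assumes a hypothetical helper, timed_matmuls (not part of d2l), and uses torch.cuda.synchronize() to wait for the queue to drain before stopping the clock. On a CPU-only machine the ops run synchronously, so the two numbers should be close.

```python
import time
import torch

def timed_matmuls(device, n=10, size=1000, sync=False):
    """Time n matmuls; optionally wait for the GPU before stopping the clock.

    Hypothetical helper for illustration only, not part of d2l.
    """
    start = time.perf_counter()
    for _ in range(n):
        a = torch.randn(size=(size, size), device=device)
        b = torch.mm(a, a)
    if sync and device.type == 'cuda':
        # Block until every kernel queued on this device has finished.
        torch.cuda.synchronize(device)
    return time.perf_counter() - start

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Warmup, then make sure the warmup itself is not counted.
timed_matmuls(device, n=1)
if device.type == 'cuda':
    torch.cuda.synchronize(device)

dispatch_only = timed_matmuls(device, sync=False)
with_sync = timed_matmuls(device, sync=True)
print(f'dispatch only: {dispatch_only:.4f} sec')
print(f'with sync:     {with_sync:.4f} sec')
```

On a GPU, the gap between the two timings is roughly the work that was still sitting in the queue when Python moved on.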
The backend tracks dependencies between queued ops; ops without dependencies can run in parallel on different streams:
Backend tracks dependencies between graph nodes.
Python frontend hands ops to a C++/CUDA backend; CUDA stream runs them asynchronously.
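As a rough sketch of the dependency idea, the snippet below computes two products that do not depend on each other, then a sum that depends on both. It assumes PyTorch's CUDA stream API (torch.cuda.Stream, torch.cuda.stream, wait_stream); the variable names are made up for illustration. Whether the two products actually overlap depends on the hardware and on how busy the GPU is. Without a GPU the code falls back to plain sequential execution.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1000, 1000, device=device)

if device.type == 'cuda':
    main = torch.cuda.current_stream()
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # Both side streams must wait until x has been produced on the main stream.
    s1.wait_stream(main)
    s2.wait_stream(main)
    with torch.cuda.stream(s1):
        u = torch.mm(x, x)      # independent of v
    with torch.cuda.stream(s2):
        v = torch.mm(x, x.T)    # independent of u
    # w depends on both u and v, so the main stream waits for both side streams.
    main.wait_stream(s1)
    main.wait_stream(s2)
else:
    # CPU fallback: the ops simply run one after another.
    u = torch.mm(x, x)
    v = torch.mm(x, x.T)

w = u + v
print(w.shape)
```

The explicit wait_stream calls are the dependencies written out by hand; on the default stream the backend orders ops for you.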
Anything that needs a value forces a barrier: Python waits until the GPU catches up. Common offenders are printing intermediate values, .item(), .numpy(), and control flow based on a tensor value.
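A small sketch of these barriers, assuming nothing beyond stock PyTorch (the variable names are illustrative). The comments mark which lines only enqueue work and which ones make Python wait for a value.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1000, 1000, device=device)

y = torch.mm(x, x)          # queued; on a GPU this returns before the result is ready
total = y.sum()             # still a tensor on the device, no barrier yet

as_float = total.item()     # barrier: a Python float has to reach the host
as_array = y.cpu().numpy()  # barrier: the whole tensor is copied to host memory
if total > 0:               # barrier: branching needs the tensor's actual value
    sign = 'positive'
else:
    sign = 'non-positive'
print(sign, as_float)
```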
Avoid unnecessary barriers. Don't print(loss) inside the training loop unless you need it. Don't call .cpu().numpy() mid-batch. Instead, append metrics to a list of tensors and reduce them once at the end.
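One way this could look in practice is sketched below. The toy model, random data, and hyperparameters are assumptions made purely for illustration. The point is that the loop only appends detached device tensors, and the single .item() call comes after the loop.

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(0)

# Toy regression setup, made up for this example.
model = nn.Linear(20, 1).to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []  # device tensors; nothing is copied to the host inside the loop
for _ in range(50):
    X = torch.randn(64, 20, device=device)
    y = torch.randn(64, 1, device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.detach())  # no .item() here, so no barrier per step

# One barrier at the end instead of one per training step.
mean_loss = torch.stack(losses).mean().item()
print(f'mean loss: {mean_loss:.4f}')
```

Detaching before appending keeps the list from holding on to each step's autograd graph.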
Barriers include .item(), .numpy() (conversion to NumPy), and printing.
TensorFlow needs @tf.function to actually be async; PyTorch and MXNet are async by default.