from d2l import mxnet as d2l
import numpy, os, subprocess
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn
npx.set_np()

GPUs are fast; Python is slow. If every tensor op had to wait for the GPU to finish before Python moved on to the next line, GPU utilization would be terrible.
The fix: deep-learning frameworks run asynchronously. The Python frontend dispatches an op (returns immediately) and the C++/CUDA backend queues it on a stream. Subsequent ops join the queue. The CPU and GPU work in parallel, and synchronization happens implicitly when you actually need a value (e.g. .numpy(), printing, conversion).
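The dispatch-then-synchronize pattern can be imitated with the standard library alone. The sketch below is a toy model, not framework code: a single-worker `ThreadPoolExecutor` stands in for an in-order stream, and the names `backend`, `slow_op`, and `gpu_may_finish` are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for the backend: one worker thread acts as one in-order "stream".
backend = ThreadPoolExecutor(max_workers=1)
gpu_may_finish = threading.Event()

def slow_op(x):
    # Held back until the frontend releases it (5 s safety timeout).
    gpu_may_finish.wait(timeout=5)
    return x * 2

# Dispatch returns at once with a handle; nothing has been computed yet.
a = backend.submit(slow_op, 20)
# A second op joins the queue behind the first and consumes its result.
b = backend.submit(lambda: a.result() + 1)

still_running = not a.done()  # the frontend is ahead of the backend
gpu_may_finish.set()          # let the "device" finish its work
value = b.result()            # sync point: blocks until the queue has drained
backend.shutdown()

print(still_running, value)
```

Only the call to `b.result()` makes the frontend wait, which is the role played by reading a value out of a tensor in a real framework.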
Programming language frontends and DL framework backends.
This deck shows how to measure async behavior, where it backfires, and how to write code that benefits from it.
Time the same computation in pure NumPy vs the framework. NumPy is synchronous; the framework op returns immediately and the GPU runs in the background:
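One way to run this comparison is sketched below. The helpers `timed` and `matmul_loop` are illustrative names, not part of any library, and the matrix size is kept small so the NumPy half finishes quickly. The MXNet half runs only if `mxnet` is installed and uses `npx.waitall()` as the explicit barrier.

```python
import time
import numpy

def timed(label, work, sync=None):
    """Run work(); report when the call returned and when sync() completed."""
    start = time.perf_counter()
    work()
    returned = time.perf_counter() - start
    if sync is not None:
        sync()  # barrier: wait for everything still queued on the backend
    finished = time.perf_counter() - start
    print(f"{label}: returned after {returned:.4f}s, finished after {finished:.4f}s")
    return returned, finished

def matmul_loop(xp, n=10, size=500):
    # xp is an array module exposing random.normal and dot (numpy or mxnet.np).
    for _ in range(n):
        a = xp.random.normal(size=(size, size))
        b = xp.dot(a, a)

# NumPy is synchronous: both numbers are essentially the same.
timed("numpy", lambda: matmul_loop(numpy))

try:
    from mxnet import np as mx_np, npx
    npx.set_np()
    # The loop only enqueues work; npx.waitall() is where the wait happens.
    timed("mxnet.np", lambda: matmul_loop(mx_np), sync=npx.waitall)
except ImportError:
    print("mxnet not installed; skipping the asynchronous half")
```

For the framework run, the gap between "returned" and "finished" is the work that was still queued when Python got control back. Timing without the barrier would measure only dispatch cost.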
The backend tracks dependencies between queued ops; ops without dependencies can run in parallel on different streams:
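As a rough illustration of dependency tracking, the sketch below groups the ops of `z = x * y + 2` into waves that could run concurrently. It uses Python's `graphlib` as a stand-in scheduler; real backends are far more elaborate, and the node names and the `schedule_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each op maps to the set of ops whose results it needs.
deps = {
    "x": set(),       # x = ones((1, 2))
    "y": set(),       # y = ones((1, 2))
    "mul": {"x", "y"},  # x * y
    "z": {"mul"},     # (x * y) + 2
}

def schedule_waves(deps):
    """Group ops into waves; ops in the same wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every op whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished, unlocking successors
    return waves

print(schedule_waves(deps))  # [['x', 'y'], ['mul'], ['z']]
```

Here `x` and `y` share a wave because neither reads the other, while `mul` and `z` must each wait for their inputs.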
Backend tracks dependencies between graph nodes.
Python frontend hands ops to a C++/CUDA backend; CUDA stream runs them asynchronously.
Anything that needs a value forces a barrier — Python waits until the GPU catches up. Common offenders: printing intermediate values, .item(), .numpy(), control flow based on a tensor value:
Avoid unnecessary barriers. Don’t print(loss) inside the training loop unless you need it. Don’t .cpu().numpy() mid-batch. Save metrics to a list of tensors and reduce later:
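The difference between the two loop styles can be sketched with a toy backend. This is a standard-library simulation, not framework code: the single-worker executor plays the device, `time.sleep` stands in for kernel time and Python overhead, and all function names are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backend_loss(step):
    time.sleep(0.005)  # pretend kernel time on the device
    return 1.0 / (step + 1)

def frontend_overhead():
    time.sleep(0.005)  # pretend Python-side work (data loading, bookkeeping)

def loop_with_barriers(backend, steps):
    """Reads every loss immediately, so each step waits for the backend."""
    start = time.perf_counter()
    total = 0.0
    for step in range(steps):
        frontend_overhead()
        loss = backend.submit(backend_loss, step)
        total += loss.result()  # barrier inside the loop
    return total / steps, time.perf_counter() - start

def loop_deferred(backend, steps):
    """Keeps handles in a list and reduces them once after the loop."""
    start = time.perf_counter()
    pending = []
    for step in range(steps):
        frontend_overhead()
        pending.append(backend.submit(backend_loss, step))  # no wait here
    total = 0.0
    for loss in pending:  # single synchronization phase at the end
        total += loss.result()
    return total / steps, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=1) as backend:
    mean_a, t_a = loop_with_barriers(backend, 20)
    mean_b, t_b = loop_deferred(backend, 20)

print(f"barrier per step: mean={mean_a:.4f}, {t_a:.3f}s")
print(f"deferred reduce:  mean={mean_b:.4f}, {t_b:.3f}s")
```

Both loops compute the same mean. The deferred version is typically faster in this toy because frontend overhead overlaps with backend work instead of alternating with it.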
Implicit synchronization points: .item(), .numpy(), printing, conversion to NumPy.
TensorFlow needs @tf.function to actually be async; PyTorch and MXNet are async by default.