from d2l import tensorflow as d2l
import numpy
import tensorflow as tf

GPUs are fast; Python is slow. If Python had to wait for the GPU to finish each tensor op before moving on to the next line, GPU utilization would be terrible.
The fix: deep-learning frameworks run asynchronously. The Python frontend dispatches an op (returns immediately) and the C++/CUDA backend queues it on a stream. Subsequent ops join the queue. The CPU and GPU work in parallel, and synchronization happens implicitly when you actually need a value (e.g. .numpy(), printing, conversion).
Programming language frontends and DL framework backends.
This deck shows how to measure async behavior, where it backfires, and how to write code that benefits from it.
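The dispatch-and-queue idea can be imitated with the standard library alone. The sketch below is an analogy, not how TensorFlow is implemented: a single worker thread stands in for one in-order backend stream, `slow_op` is a hypothetical 10 ms "kernel", and calling `.result()` plays the role of the implicit synchronization.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_op(x):
    """Hypothetical stand-in for a GPU kernel: takes ~10 ms, returns x + 1."""
    time.sleep(0.01)
    return x + 1


# One worker thread plays the role of a single in-order backend stream.
with ThreadPoolExecutor(max_workers=1) as stream:
    start = time.perf_counter()
    # submit() only enqueues the work and returns a future right away.
    futures = [stream.submit(slow_op, i) for i in range(10)]
    dispatch_time = time.perf_counter() - start

    # Asking for the values is the barrier: the frontend blocks here.
    results = [f.result() for f in futures]
    total_time = time.perf_counter() - start

print(f'dispatch: {dispatch_time:.4f} sec, total: {total_time:.4f} sec')
```

Dispatching the ten ops takes far less time than running them, which is the same pattern the framework timings in this deck are meant to expose.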
Time the same computation in pure NumPy vs the framework. NumPy is synchronous; the framework op returns immediately and the GPU runs in the background:
# Warmup for GPU computation
with tf.device('/GPU:0'):
    a = tf.random.normal(shape=(1000, 1000))
    b = tf.linalg.matmul(a, a)
    _ = b.numpy()  # Force synchronization for warmup

# NumPy: every call blocks until the result is ready
with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

# TensorFlow: no value is requested inside the timed block
with d2l.Benchmark('tensorflow'):
    for _ in range(10):
        with tf.device('/GPU:0'):
            a = tf.random.normal(shape=(1000, 1000))
            b = tf.linalg.matmul(a, a)

numpy: 0.3730 sec
tensorflow: 0.0061 sec
Done: 0.0132 sec
The backend tracks dependencies between queued ops; ops without dependencies can run in parallel on different streams:
Backend tracks dependencies between graph nodes.
Python frontend hands ops to a C++/CUDA backend; CUDA stream runs them asynchronously.
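Dependency tracking can be illustrated with a toy scheduler. This is a minimal sketch, assuming a hypothetical four-op graph for `z = x * y + 2`; it uses Python's `graphlib` (3.9+) and says nothing about how the real backend schedules streams. Ops that become ready in the same round have no dependencies on each other, so a backend would be free to run them in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical op graph for z = x * y + 2.
# Each key maps to the set of ops whose outputs it needs.
deps = {
    'x': set(),
    'y': set(),
    'mul': {'x', 'y'},
    'add': {'mul'},
}

sorter = TopologicalSorter(deps)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # ops whose inputs are all available
    waves.append(ready)
    sorter.done(*ready)                 # mark them finished, unblocking successors

print(waves)  # [['x', 'y'], ['mul'], ['add']]
```

Here `x` and `y` land in the same round, while `mul` and `add` must wait for their inputs.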
Anything that needs a value forces a barrier: Python waits until the GPU catches up. Common offenders are printing intermediate values, .item(), .numpy(), and control flow based on a tensor value:
numpy conversion: 0.0065 sec
scalar conversion: 0.0116 sec
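One way to time such barriers is sketched below. It is an illustration only, not necessarily the code that produced the timings above: it uses `time.perf_counter` instead of `d2l.Benchmark`, runs on whatever device TensorFlow picks by default, and the variable names are made up for this example.

```python
import time

import tensorflow as tf

a = tf.random.normal(shape=(1000, 1000))

# Barrier 1: converting the whole result to a NumPy array.
start = time.perf_counter()
b = tf.linalg.matmul(a, a)
b_np = b.numpy()                 # Python blocks until b has been computed
numpy_time = time.perf_counter() - start

# Barrier 2: converting a reduced result to a Python scalar.
start = time.perf_counter()
b = tf.linalg.matmul(a, a)
total = float(tf.reduce_sum(b))  # Python blocks until the sum is available
scalar_time = time.perf_counter() - start

print(f'numpy conversion: {numpy_time:.4f} sec')
print(f'scalar conversion: {scalar_time:.4f} sec')
```

In both cases the timed block cannot finish before the matrix product does, because the host asks for a concrete value.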
Avoid unnecessary barriers. Don’t print(loss) inside the training loop unless you need it. Don’t .cpu().numpy() mid-batch. Save metrics to a list of tensors and reduce later:
x = tf.ones((1, 2))

with d2l.Benchmark('synchronous (eager + .numpy() barrier)'):
    for _ in range(10000):
        y = x + 1
        _ = y.numpy()  # Forces synchronization after every op

with d2l.Benchmark('asynchronous (eager, single sync at end)'):
    for _ in range(10000):
        y = x + 1
    _ = y.numpy()  # Single sync at the end

@tf.function
def add_loop(x, n):
    for _ in tf.range(n):
        x = x + 1
    return x

# Warm up the tf.function trace
_ = add_loop(x, tf.constant(1))
with d2l.Benchmark('tf.function (compiled graph)'):
    y = add_loop(x, tf.constant(10000))
    _ = y.numpy()

synchronous (eager + .numpy() barrier): 1.6383 sec
asynchronous (eager, single sync at end): 0.6561 sec
tf.function (compiled graph): 0.9407 sec
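The advice to save metrics to a list of tensors and reduce later can be sketched as follows. This is a minimal illustration with assumptions throughout: the weights `w`, the synthetic `batches`, and the squared-output "loss" are hypothetical placeholders, not a real training loop, and how much time this saves depends on the device and execution mode.

```python
import tensorflow as tf

tf.random.set_seed(0)
w = tf.random.normal((8, 1))                               # hypothetical fixed weights
batches = [tf.random.normal((32, 8)) for _ in range(100)]  # synthetic data

losses = []  # holds tf.Tensor objects, not Python floats
for x in batches:
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    losses.append(loss)  # no .numpy() or print() here, so no per-step barrier

# One reduction and one host conversion after the loop.
mean_loss = float(tf.reduce_mean(tf.stack(losses)).numpy())
print(f'mean loss over {len(losses)} batches: {mean_loss:.4f}')
```

The only point where Python asks for a concrete value is the final `.numpy()` call, so the loop itself contains no explicit barrier.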
Common barriers: .item(), .numpy(), printing, conversion to NumPy.
TensorFlow needs @tf.function to actually be async; PyTorch and MXNet are async by default.