from d2l import tensorflow as d2l
import tensorflow as tf

GPUs are the reason modern deep learning works at scale. A single RTX 4090 delivers roughly 80 TFLOPS of FP16 throughput, about a thousand times faster than a CPU on the matmul-heavy operations that convolutions and attention rely on.
The cost: every tensor and every parameter has a device. Mix devices in one operation and you crash.
Try to add two tensors that live on different devices and the operation fails: implicit copies are forbidden, so you must copy explicitly. PyTorch, for example, reports:
RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!
(<tensorflow.python.eager.context._EagerDeviceContext at 0x792d00e68c80>,
<tensorflow.python.eager.context._EagerDeviceContext at 0x792d00e0bf40>,
<tensorflow.python.eager.context._EagerDeviceContext at 0x792d00c34180>)
try_gpu(i) returns GPU i if it exists and falls back to the CPU otherwise. The same code then runs on a laptop, a workstation, or a multi-GPU box; only the device object changes:
(<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35d40>,
<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35b00>,
[<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35ac0>,
<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35a80>,
<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35a40>,
<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35a00>])
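
The device contexts printed above are the kind of object that tf.device returns in eager mode. The following is a minimal sketch of how helpers such as try_gpu could be written in TensorFlow. It is an assumption about their shape, not necessarily the library's exact code; the function bodies here are illustrative.

```python
import tensorflow as tf

def cpu():
    """Device context for the CPU."""
    return tf.device('/CPU:0')

def gpu(i=0):
    """Device context for GPU i."""
    return tf.device(f'/GPU:{i}')

def num_gpus():
    """Number of GPUs visible to TensorFlow."""
    return len(tf.config.list_physical_devices('GPU'))

def try_gpu(i=0):
    """Return gpu(i) if that GPU exists, otherwise cpu()."""
    return gpu(i) if num_gpus() >= i + 1 else cpu()

def try_all_gpus():
    """Return one context per visible GPU (an empty list if there are none)."""
    return [gpu(i) for i in range(num_gpus())]

# The same lines run with or without a GPU; only the context differs.
with try_gpu():
    x = tf.ones((2, 3))
print(x.device)
```

On a machine without a GPU, the tensor simply ends up on the CPU and nothing else in the script has to change.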
Every tensor has a .device attribute:
'/job:localhost/replica:0/task:0/device:GPU:0'
Create a tensor directly on the target device to avoid an unnecessary CPU → GPU copy:
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 1., 1.],
[1., 1., 1.]], dtype=float32)>
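
As a hedged illustration of the two points above, the sketch below creates a tensor inside a device context and then reads its .device attribute. The name device_name and the CPU fallback are assumptions added so that the snippet also runs without a GPU.

```python
import tensorflow as tf

# Assumption: use the first GPU when one is visible, otherwise the CPU.
device_name = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

# Ops executed inside the context are placed on that device,
# so the tensor is allocated there from the start.
with tf.device(device_name):
    X = tf.ones((2, 3))

# .device reports the full device string of the tensor.
print(X.device)
```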
Tensors on different devices can’t be combined directly. The fix is an explicit copy: .cuda(i) or .to(device) in PyTorch, or tf.identity inside the target device's context in the TensorFlow code used here:
tf.Tensor(
[[1. 1. 1.]
[1. 1. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[1. 1. 1.]
[1. 1. 1.]], shape=(2, 3), dtype=float32)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1.1338689, 1.1451421, 1.2139267],
[1.2060641, 1.1670239, 1.2071997]], dtype=float32)>
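
One way to write such an explicit copy in TensorFlow is sketched below. The names src and dst are hypothetical, and dst falls back to the CPU when no GPU is visible, in which case the copy is trivial but the code path is the same.

```python
import tensorflow as tf

# Assumption: X starts on the CPU; the work happens on GPU 0 if it exists.
src = '/CPU:0'
dst = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

with tf.device(src):
    X = tf.ones((2, 3))

with tf.device(dst):
    Y = tf.random.uniform((2, 3))  # created directly on dst
    Z = tf.identity(X)             # explicit copy of X onto dst
    total = Y + Z                  # both operands now share a device

print(X.device)
print(Z.device)
print(total)
```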
.cuda(i) on a tensor already on GPU i is a no-op — the framework checks first:
True
Why this matters: a .to(device) in your training inner loop adds a cudaMemcpy round trip that can dwarf the actual computation. Copy at the boundary; keep everything inside the loop on one device.
The model is a tree of Parameter tensors. Move them all in one shot with .to(device):
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[0.07944381],
[0.07944381]], dtype=float32)>
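
In TensorFlow, a comparable effect can be sketched by building and calling the model inside a device context, so that the layer's variables are created under that context. This is a sketch under that assumption, not the only way to place a model; device_name and net are illustrative names.

```python
import tensorflow as tf

# Assumption: use the first GPU when one is visible, otherwise the CPU.
device_name = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

with tf.device(device_name):
    # Dense creates its weights on the first call, which happens
    # inside this context.
    net = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    X = tf.ones((2, 3))
    out = net(X)

print(out)
print(out.device)
```

Both input rows are identical, so both output rows are identical too, which matches the shape of the output shown above.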
The training-loop sweet spot:
device = try_gpu(0)
model = MyModel().to(device)        # once, before training
opt = SGD(model.parameters(), …)    # picks up device
for batch in loader:
    X, y = batch[0].to(device), batch[1].to(device)
    # ... forward, loss, backward, step ...
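
The same pattern can be sketched in TensorFlow as a small runnable loop. Everything here is an assumption made for illustration: the toy data (y = 2x + 1), the hand-written SGD step, and the names device_name, loader, w, b and lr. The point is only where the copies happen: parameters are placed once, and each batch is copied once at the loop boundary.

```python
import tensorflow as tf

# Assumption: use the first GPU when one is visible, otherwise the CPU.
device_name = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

# Hypothetical toy regression data: y = 2x + 1, batched on the CPU.
X_all = tf.random.normal((64, 1))
y_all = 2.0 * X_all + 1.0
loader = tf.data.Dataset.from_tensor_slices((X_all, y_all)).batch(16)

# Parameters are created once, on the target device, before training.
with tf.device(device_name):
    w = tf.Variable(tf.zeros((1, 1)))
    b = tf.Variable(tf.zeros((1,)))

lr = 0.1
for epoch in range(30):
    for X, y in loader:
        with tf.device(device_name):
            # Copy the batch at the boundary; everything below stays on-device.
            X, y = tf.identity(X), tf.identity(y)
            with tf.GradientTape() as tape:
                loss = tf.reduce_mean((tf.matmul(X, w) + b - y) ** 2)
            dw, db = tape.gradient(loss, [w, b])
            w.assign_sub(lr * dw)  # plain SGD update
            b.assign_sub(lr * db)

# Convert to Python values once, after training.
print(float(w[0, 0]), float(b[0]), float(loss))
```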
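The Trainer baseline does exactly this: patch prepare_batch so that each batch is copied onto the device (here with tf.identity inside the GPU's device context, the counterpart of .to(device)), and prepare_model to move parameters once: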
@d2l.add_to_class(d2l.Trainer)
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer)
def prepare_batch(self, batch):
    if self.gpus:
        # tf.data.Dataset emits batches on CPU. Re-wrap them inside the
        # GPU device context so subsequent ops keep their inputs on-device
        # rather than incurring an implicit copy each step.
        with self.gpus[0]:
            batch = [tf.identity(a) for a in batch]
    return batch

Use register_buffer for non-trainable state so that it moves with .to(device).
torch.zeros((10,)) mid-forward defaults to CPU. Use torch.zeros((10,), device=x.device) to follow the input.
Create the optimizer after .to(device); otherwise its state lives on the wrong side.
.numpy() mid-loop forces a sync to CPU, and the asynchronous CUDA stream stalls. Defer all conversions to the end of the epoch.

try_gpu(i) keeps code portable across hardware.
Use register_buffer so non-trainable state moves alongside parameters.
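
The advice about deferring .numpy() can be sketched as follows. The per-step "loss" here is a stand-in computed from constant tensors, purely an assumption for illustration; what matters is that values stay tensors inside the loop and are converted once at the end.

```python
import tensorflow as tf

losses = []
for step in range(5):
    # Hypothetical stand-in for a real per-step loss.
    x = tf.fill((4,), float(step))
    # Keep the result as a tensor; no .numpy() inside the loop.
    losses.append(tf.reduce_mean(x))

# One conversion after the loop instead of one per step.
epoch_losses = tf.stack(losses).numpy()
print(epoch_losses)
```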