GPUs

Working on GPUs

GPUs are the reason modern deep learning works at scale. A single RTX 4090 is rated at roughly 80 TFLOPS of FP16 throughput, far more than a typical CPU delivers on the matrix multiplications that convolutions and attention depend on.

The cost: every tensor and every parameter has a device. Mix devices in one operation and you crash. PyTorch, for example, reports:

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!

Adding tensors from different devices fails because implicit copies are forbidden; you must copy explicitly.

The two-and-a-half rules

Tensors live on a device. Cross-device operations require an explicit copy.
Model parameters live on a device. Move the model to the GPU before training; the optimizer’s state follows.
Cross-device copies are slow. Avoid them in the inner loop: copy at the boundary, keep the loop on one device.

What hardware do we have?

from d2l import tensorflow as d2l
import tensorflow as tf

def cpu():
    """Get the CPU device."""
    return tf.device('/CPU:0')

def gpu(i=0):
    """Get a GPU device."""
    return tf.device(f'/GPU:{i}')

cpu(), gpu(), gpu(1)

(<tensorflow.python.eager.context._EagerDeviceContext at 0x792d00e68c80>,
 <tensorflow.python.eager.context._EagerDeviceContext at 0x792d00e0bf40>,
 <tensorflow.python.eager.context._EagerDeviceContext at 0x792d00c34180>)

def num_gpus():
    """Get the number of available GPUs."""
    return len(tf.config.list_physical_devices('GPU'))

num_gpus()

Portable device handle

try_gpu(i) returns GPU i if it exists, else CPU. Same code runs on a laptop, a workstation, or a multi-GPU box — the device object swaps but the code stays the same:

def try_gpu(i=0):
    """Return gpu(i) if exists, otherwise return cpu()."""
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

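def try_all_gpus():
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    devices = [gpu(i) for i in range(num_gpus())]
    # Fall back to the CPU so the returned list is never empty.
    return devices if devices else [cpu()]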

try_gpu(), try_gpu(10), try_all_gpus()

(<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35d40>,
 <tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35b00>,
 [<tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35ac0>,
  <tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35a80>,
  <tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35a40>,
  <tensorflow.python.eager.context._EagerDeviceContext at 0x792cfbf35a00>])
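
The device contexts above print as opaque object addresses, which makes logs hard to read. Below is a minimal sketch of the same fallback logic using plain device-name strings, which tf.device also accepts. The helper name device_name and its n_gpus parameter are hypothetical, introduced here only for illustration:

```python
def device_name(i=0, n_gpus=0):
    """Return '/GPU:i' if GPU i exists among n_gpus devices, else '/CPU:0'.

    Hypothetical helper: n_gpus would normally come from num_gpus().
    """
    if 0 <= i < n_gpus:
        return f'/GPU:{i}'
    return '/CPU:0'

# Pretend the machine has 2 GPUs (an assumed count for illustration).
names = [device_name(0, 2), device_name(1, 2), device_name(10, 2)]
print(names)  # ['/GPU:0', '/GPU:1', '/CPU:0']
```

A string from this helper can be passed to tf.device(...) in a with statement, in the same way as the context objects returned by try_gpu.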

Tensors carry a device

Every tensor has a .device attribute:

x = tf.constant([1, 2, 3])
x.device

'/job:localhost/replica:0/task:0/device:GPU:0'
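
The device string is verbose. Below is a small sketch, in plain Python, that pulls out the device type and index so code can branch on them. parse_device is a hypothetical helper, and it assumes the string ends in TYPE:INDEX as in the output above:

```python
def parse_device(device_str):
    """Split a device string ending in 'TYPE:INDEX' into (type, index)."""
    kind, index = device_str.rsplit(':', 2)[-2:]
    # Short forms such as '/GPU:1' keep a leading slash on the type; drop it.
    return kind.lstrip('/').upper(), int(index)

full = '/job:localhost/replica:0/task:0/device:GPU:0'
print(parse_device(full))      # ('GPU', 0)
print(parse_device('/CPU:0'))  # ('CPU', 0)
```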

Create directly on a device — avoids an unnecessary CPU → GPU copy:

with try_gpu():
    X = tf.ones((2, 3))
X

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float32)>

with try_gpu(1):
    Y = tf.random.uniform((2, 3))
Y

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.13386893, 0.14514208, 0.21392667],
       [0.2060641 , 0.1670239 , 0.2071997 ]], dtype=float32)>

Cross-device math: copy, then operate

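Tensors on different devices can’t be combined directly. The fix is to bring both operands onto one device first. PyTorch does this with .cuda(i) or .to(device); the TensorFlow code here does the work inside the target device's scope: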

with try_gpu(1):
    Z = X
print(X)
print(Z)

tf.Tensor(
[[1. 1. 1.]
 [1. 1. 1.]], shape=(2, 3), dtype=float32)
tf.Tensor(
[[1. 1. 1.]
 [1. 1. 1.]], shape=(2, 3), dtype=float32)

Y + Z

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[1.1338689, 1.1451421, 1.2139267],
       [1.2060641, 1.1670239, 1.2071997]], dtype=float32)>
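
Note that Z = X inside a with block only binds a second Python name to the same tensor object. One way to force an actual copy onto a chosen device is tf.identity inside the device scope, the same call the Trainer's prepare_batch relies on. Below is a minimal sketch, assuming the script should fall back to the CPU when no GPU is visible; the variable names are illustrative:

```python
import tensorflow as tf

# Pick a target device: the first GPU if TensorFlow sees one, else the CPU.
target = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

with tf.device('/CPU:0'):
    a = tf.ones((2, 3))           # created on the CPU

with tf.device(target):
    a_on_target = tf.identity(a)  # explicit copy onto the target device
    b = tf.ones((2, 3))           # created directly on the target device
    c = a_on_target + b           # both operands now share one device

print(a.device)
print(c.device)
```

On a CPU-only machine both lines print the CPU device; with a GPU present the second line reports GPU:0.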

Asking for a tensor that already lives on the target device is a no-op. No copy is made and no new memory is allocated:

with try_gpu(1):
    Z2 = Z
Z2 is Z

True

Why this matters: an unnecessary cross-device copy in your training inner loop adds a cudaMemcpy round trip that can dwarf the actual computation. Copy at the boundary; keep everything inside the loop on one device.
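
A back-of-envelope estimate makes the cost concrete. The sketch below computes how long one batch takes to cross the host-to-device link. The batch shape and the 32 GB/s bandwidth (roughly the theoretical peak of PCIe 4.0 x16) are assumptions chosen for illustration, and real transfers are usually slower:

```python
import math

def copy_time_ms(shape, bytes_per_elem=4, bandwidth_gb_s=32.0):
    """Estimate the time to copy one tensor across the bus, in milliseconds."""
    n_bytes = math.prod(shape) * bytes_per_elem
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e3

batch = (256, 3, 224, 224)      # assumed float32 image batch
ms = copy_time_ms(batch)
print(f'{ms:.2f} ms per copy')  # 4.82 ms per copy
```

Paid once per batch, this is tolerable. Paid for every intermediate tensor inside the loop, it quickly adds up.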

Models on the GPU

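The model is a tree of parameter tensors, and all of them need to live on the device. PyTorch moves them in one shot with .to(device); in TensorFlow, build the model inside a tf.distribute.MirroredStrategy scope so that its variables are created on the available GPUs: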

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    net = tf.keras.models.Sequential([
        tf.keras.layers.Dense(1)])

net(X)

<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[0.07944381],
       [0.07944381]], dtype=float32)>

After this, every input batch must also be on the same device before the forward pass. To confirm that the model parameters are stored on the GPU:

tf.identity(net.layers[0].weights[0]).device, tf.identity(net.layers[0].weights[1]).device

('/job:localhost/replica:0/task:0/device:GPU:0',
 '/job:localhost/replica:0/task:0/device:GPU:0')

Where to put the device move

The training-loop sweet spot, in PyTorch-style pseudocode:

device = try_gpu(0)
model = MyModel().to(device)         # once, before training
opt = SGD(model.parameters(), …)     # picks up device

for batch in loader:
    X, y = batch[0].to(device), batch[1].to(device)
    # ... forward, loss, backward, step ...
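
The loop above uses PyTorch-style calls. A rough TensorFlow counterpart is sketched below, assuming a toy linear-regression dataset, a single Dense layer, and a fallback to the CPU when no GPU is visible. None of these choices come from the original Trainer:

```python
import tensorflow as tf

device = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

# Toy data: y = 2x, batched by tf.data (batches start out on the CPU).
xs = tf.reshape(tf.range(8, dtype=tf.float32), (8, 1))
loader = tf.data.Dataset.from_tensor_slices((xs, 2.0 * xs)).batch(4)

with tf.device(device):
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.build((None, 1))                     # create variables once, on `device`
    opt = tf.keras.optimizers.SGD(learning_rate=0.01)

for X, y in loader:
    with tf.device(device):
        X, y = tf.identity(X), tf.identity(y)  # one copy per batch, at the boundary
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean((model(X) - y) ** 2)
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
```

The only cross-device traffic is the per-batch tf.identity; the forward pass, loss, gradients, and update all run inside the same device scope.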

The Trainer baseline follows the same pattern: __init__ records which GPUs are available, and prepare_batch copies each batch onto the first GPU by re-wrapping it with tf.identity inside that device's context:

@d2l.add_to_class(d2l.Trainer)
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

@d2l.add_to_class(d2l.Trainer)
def prepare_batch(self, batch):
    if self.gpus:
        # tf.data.Dataset emits batches on CPU. Re-wrap them inside the
        # GPU device context so subsequent ops keep their inputs on-device
        # rather than incurring an implicit copy each step.
        with self.gpus[0]:
            batch = [tf.identity(a) for a in batch]
    return batch

Common mistakes

Forgetting one tensor. Every tensor in the forward pass has to be on the same device. Custom buffers are the usual culprit; in PyTorch, use register_buffer so they move with .to(device).
Creating tensors mid-forward. In PyTorch, torch.zeros((10,)) defaults to the CPU. Use torch.zeros((10,), device=x.device) to follow the input.
Setting up the optimizer before the move. Construct the optimizer after .to(device); otherwise its state lives on the wrong side.
Calling .numpy() mid-loop. It forces a sync to the CPU and stalls the asynchronous CUDA stream. Defer all conversions to the end of the epoch.
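
The last point can be sketched in TensorFlow: keep per-step losses as tensors and convert once after the loop. The toy loss values below are stand-ins for a real training step:

```python
import tensorflow as tf

losses = []  # per-step losses, kept as tensors
for step in range(5):
    x = tf.fill((4,), float(step))         # stand-in for a real batch
    losses.append(tf.reduce_mean(x ** 2))  # no .numpy() inside the loop

# One conversion at the end instead of one per step.
epoch_loss = tf.reduce_mean(tf.stack(losses)).numpy()
print(epoch_loss)  # 6.0
```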

Recap

Tensors and parameters carry a device; cross-device operations require an explicit copy.
Move the model to the GPU once, before training; the optimizer follows its parameters.
try_gpu(i) keeps code portable across hardware.
Cross-device copies are expensive — keep the inner loop device-clean.
Use register_buffer so non-trainable state moves alongside parameters.