A bigger model

Concise Implementation for Multiple GPUs

The previous section implemented data-parallel training by hand, with explicit all_reduce calls and manual replica management. In practice, the major frameworks reduce this to a line or two of setup:

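  • PyTorch: nn.DataParallel(net) (one process driving several GPUs on one host) or nn.parallel.DistributedDataParallel (one process per GPU; works on a single host or across hosts, and is the option the PyTorch documentation recommends).
  • MXNet: initialize the network on a list of devices and use gluon.Trainer(..., kvstore='device'), which aggregates gradients on the GPUs. The code in this section uses this route.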
  • TensorFlow: tf.distribute.MirroredStrategy().

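The result matches the from-scratch version up to floating-point rounding, with far less boilerplate. On NVIDIA GPUs, PyTorch's DistributedDataParallel and TensorFlow's MirroredStrategy typically use NCCL all-reduce for the gradient exchange.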
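To make the all-reduce step concrete, here is a framework-free sketch of its semantics: every device contributes a gradient vector, and every device ends up holding the same reduced result. The function names all_reduce_sum and all_reduce_mean are illustrative, not part of any library, and the sketch says nothing about how NCCL actually moves the data.

```python
def all_reduce_sum(per_device):
    """Return what each device holds after a sum all-reduce.

    per_device is a list with one gradient vector (a list of floats) per
    device. Every device receives the same elementwise sum.
    """
    total = [sum(values) for values in zip(*per_device)]
    return [list(total) for _ in per_device]


def all_reduce_mean(per_device):
    """Sum all-reduce followed by division by the number of devices."""
    n = len(per_device)
    return [[v / n for v in vec] for vec in all_reduce_sum(per_device)]


# Two simulated devices, each holding a local gradient of length 3.
local_grads = [[1.0, -2.0, 0.5],
               [3.0, 0.0, -1.5]]

summed = all_reduce_sum(local_grads)
averaged = all_reduce_mean(local_grads)

print(summed[0], summed[1])    # both devices hold the same sum
print(averaged[0])             # the averaged gradient used for the update
```

Because every replica sees the same averaged gradient, every replica takes the same parameter step, which is what keeps the copies synchronized.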

from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()

We use a slightly modified ResNet-18 for these experiments, because data parallelism only pays off once the per-GPU compute is large enough to outweigh the cost of synchronizing gradients:

def resnet18(num_classes):
    """A slightly modified ResNet-18 model."""
    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(d2l.Residual(
                    num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(d2l.Residual(num_channels))
        return blk

    net = nn.Sequential()
    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the max-pooling layer
    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
            nn.BatchNorm(), nn.Activation('relu'))
    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))
    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    return net

Multi-GPU initialization

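In Gluon there is no separate wrapper object. Passing the list of devices as ctx to net.initialize tells the network where its parameters should live, and each replica is initialized when data first passes through the network on that device. The code below assumes at least two GPUs. Because the parameters live only on the GPUs, calling weight.data() without a device argument raises a RuntimeError: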

net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# Initialize all the parameters of the network
net.initialize(init=init.Normal(sigma=0.01), ctx=devices)
x = np.random.uniform(size=(4, 1, 28, 28))
x_shards = gluon.utils.split_and_load(x, devices)
net(x_shards[0]), net(x_shards[1])
weight = net[0].params.get('weight')

try:
    weight.data()
except RuntimeError:
    print('not initialized on cpu')
weight.data(devices[0])[0], weight.data(devices[1])[0]
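The sharding of x can be pictured with a small stand-in. The sketch below is not the implementation of gluon.utils.split_and_load; split_evenly is a hypothetical helper that only shows how a batch of 4 samples becomes two contiguous shards of 2. Rejecting batches that do not divide evenly is a simplifying choice made here.

```python
def split_evenly(batch, num_devices):
    """Split a list of samples into num_devices equal, contiguous shards."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    if len(batch) % num_devices != 0:
        raise ValueError("batch size must be divisible by num_devices")
    shard_size = len(batch) // num_devices
    return [batch[i * shard_size:(i + 1) * shard_size]
            for i in range(num_devices)]


# Four placeholder samples stand in for the 4 images in x.
batch = ["x0", "x1", "x2", "x3"]
shards = split_evenly(batch, 2)
print(shards)  # [['x0', 'x1'], ['x2', 'x3']]
```

In the real code each shard is also copied to its device, so the forward pass on a shard runs on the GPU that holds it.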

Parallel evaluation

Evaluation follows the same pattern: split each minibatch across the devices, run the forward pass on every shard, and add up the number of correct predictions across shards:

def evaluate_accuracy_gpus(net, data_iter, split_f=d2l.split_batch):
    """Compute the accuracy for a model on a dataset using multiple GPUs."""
    # Query the list of devices
    devices = list(net.collect_params().values())[0].list_ctx()
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)
    for features, labels in data_iter:
        X_shards, y_shards = split_f(features, labels, devices)
        # Run in parallel
        pred_shards = [net(X_shard) for X_shard in X_shards]
        metric.add(sum(float(d2l.accuracy(pred_shard, y_shard)) for
                       pred_shard, y_shard in zip(
                           pred_shards, y_shards)), labels.size)
    return metric[0] / metric[1]
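The function accumulates counts (correct predictions and total predictions) rather than averaging per-shard accuracies. A small framework-free sketch, with hypothetical helper names and made-up predictions, shows why this matters when the chunks differ in size, for example a smaller final minibatch:

```python
def count_correct(preds, labels):
    """Number of positions where the predicted class equals the label."""
    return sum(int(p == y) for p, y in zip(preds, labels))


def accuracy_from_counts(pred_chunks, label_chunks):
    """Sum correct predictions over all chunks, then divide by the total."""
    correct = sum(count_correct(p, y)
                  for p, y in zip(pred_chunks, label_chunks))
    total = sum(len(y) for y in label_chunks)
    return correct / total


def mean_of_chunk_accuracies(pred_chunks, label_chunks):
    """Average the per-chunk accuracies (mis-weights unequal chunks)."""
    accs = [count_correct(p, y) / len(y)
            for p, y in zip(pred_chunks, label_chunks)]
    return sum(accs) / len(accs)


# Chunks of size 3 and 1: 2 of 4 predictions are correct overall.
pred_chunks = [[0, 1, 1], [2]]
label_chunks = [[0, 1, 2], [0]]

acc = accuracy_from_counts(pred_chunks, label_chunks)
naive = mean_of_chunk_accuracies(pred_chunks, label_chunks)
print(acc, naive)
```

Counting gives the true accuracy of 2/4, while averaging the two chunk accuracies gives the single-sample chunk as much weight as the three-sample one.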

Training loop

The loop looks much like ordinary single-GPU training because the framework does most of the distributed work. Each iteration performs four steps:

  • scatter each minibatch across devices;
  • run the same model replica on each shard;
  • average gradients across replicas;
  • step one synchronized set of parameters.

The important lesson is the interface: after wrapping the model, most training code should not need to know how many GPUs are present.
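The four steps can be sketched without any framework. This is a toy under stated assumptions: a one-parameter linear model, simulated devices, and hypothetical names (data_parallel_step, train_toy); it is not the train function used in this section. It illustrates that the outer loop does not change with the device count and that the result agrees up to rounding.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split_evenly(items, num_devices):
    """Contiguous equal shards; assumes len(items) divides evenly."""
    size = len(items) // num_devices
    return [items[i * size:(i + 1) * size] for i in range(num_devices)]


def data_parallel_step(w, xs, ys, lr, num_devices):
    # 1. Scatter the minibatch across the simulated devices.
    x_shards = split_evenly(xs, num_devices)
    y_shards = split_evenly(ys, num_devices)
    # 2. Run the same model (the same w) on each shard.
    grads = [grad_mse(w, xs_i, ys_i)
             for xs_i, ys_i in zip(x_shards, y_shards)]
    # 3. Average the gradients across replicas.
    avg_grad = sum(grads) / num_devices
    # 4. Take one step on the single synchronized parameter.
    return w - lr * avg_grad


def train_toy(num_devices, steps=50, lr=0.02):
    """Fit y = 3x; the loop body is identical for any device count."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0 * x for x in xs]
    w = 0.0
    for _ in range(steps):
        w = data_parallel_step(w, xs, ys, lr, num_devices)
    return w


w_one = train_toy(num_devices=1)
w_two = train_toy(num_devices=2)
print(w_one, w_two)
```

Averaging equal-sized shard gradients reproduces the full-minibatch gradient, so both runs approach w = 3 along essentially the same trajectory.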

Single-GPU baseline

train(num_gpus=1, batch_size=256, lr=0.1)

Use this as the throughput baseline before the data-parallel wrapper adds replication and gradient averaging.

Two GPUs

train(num_gpus=2, batch_size=512, lr=0.2)

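The training loop is unchanged; the framework splits the minibatch and synchronizes gradients under the hood. The global batch size doubles to 512 so that each GPU still processes 256 examples per step, and the learning rate is doubled along with it.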
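The hyperparameters in the two runs follow a simple pattern: the per-GPU batch stays at 256 and the learning rate grows in proportion to the number of GPUs. The helper below, scale_for_devices, is a hypothetical name that only encodes this arithmetic. Linear learning-rate scaling is a common heuristic rather than a guarantee, so the scaled value may still need tuning.

```python
def scale_for_devices(per_device_batch, base_lr, num_devices):
    """Return (global_batch_size, lr) for a given number of devices.

    Keeps the per-device batch fixed and scales the learning rate
    linearly with the device count (a heuristic, not a rule).
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return per_device_batch * num_devices, base_lr * num_devices


print(scale_for_devices(256, 0.1, 1))  # (256, 0.1)
print(scale_for_devices(256, 0.1, 2))  # (512, 0.2)
```

These are the values passed to train in the single-GPU and two-GPU runs above.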

Recap

  • Framework wrappers (DataParallel, MirroredStrategy) reduce data-parallel SGD to one line of setup.
  • Same numerical recipe as the from-scratch version: replicate, split, all-reduce, identical step.
  • For multi-host distributed training, use DistributedDataParallel (PyTorch, with the NCCL or Gloo backend) or MultiWorkerMirroredStrategy (TensorFlow). The idea is the same, with the all-reduce running across the network.