from d2l import torch as d2l
import torch
from torch import nnThe previous section did data-parallel training the hard way — manual all_reduce, manual replica management. In practice, every framework wraps it in a one-liner:
nn.DataParallel(net) (multi-GPU on one host) or nn.parallel.DistributedDataParallel (multi-host).gluon.Trainer(..., kvstore='device').tf.distribute.MirroredStrategy().Same numerical result; orders of magnitude less boilerplate; NCCL all-reduce under the hood.
We use a small ResNet for these experiments — the speedup from data parallelism only matters once the per-GPU compute is non-trivial:
def resnet18(num_classes, in_channels=1):
"""A slightly modified ResNet-18 model."""
def resnet_block(in_channels, out_channels, num_residuals,
first_block=False):
blk = []
for i in range(num_residuals):
if i == 0 and not first_block:
blk.append(d2l.Residual(out_channels, use_1x1conv=True,
strides=2))
else:
blk.append(d2l.Residual(out_channels))
return nn.Sequential(*blk)
# This model uses a smaller convolution kernel, stride, and padding and
# removes the max-pooling layer
net = nn.Sequential(
nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
nn.BatchNorm2d(64),
nn.ReLU())
net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))
net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1,1)))
net.add_module("fc", nn.Sequential(nn.Flatten(),
nn.Linear(512, num_classes)))
return netWrap the model in the framework’s data-parallel container. Parameters are replicated to each GPU automatically:
The wrapper also handles inference — splits the input minibatch across replicas, gathers outputs:
The loop looks like ordinary single-GPU training because the wrapper owns the distributed work:
The important lesson is the interface: after wrapping the model, most training code should not need to know how many GPUs are present.
test acc: 0.89, 29.4 sec/epoch on [device(type='cuda', index=0)]
Use this as the throughput baseline before the data-parallel wrapper adds replication and gradient averaging.
test acc: 0.87, 17.1 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]
The training loop is unchanged; the wrapper splits the minibatch and synchronizes gradients under the hood.
DataParallel, MirroredStrategy) reduce data-parallel SGD to one line of setup.DistributedDataParallel / MultiWorkerMirroredStrategy — same idea, NCCL/Gloo across the network.