The Image Classification Dataset

Dive into Deep Learning · §3.2

Fashion-MNIST, the dataset we will classify for the rest of this chapter.

MNIST is solved; Fashion-MNIST is not

Motivation

On MNIST, even simple models exceed 95% accuracy and a linear model tops 90%, so models are hard to tell apart.
Fashion-MNIST is a drop-in replacement with the same shape and API but harder clothing classes (28×28 grayscale, 10 classes, 60,000 training / 10,000 test images).

Here a linear model caps out near 82% (the softmax-from-scratch section): headroom the deeper models of later chapters will spend.

Loading the Data

a reusable DataModule per framework

Wrap it once, reuse everywhere

A DataModule owns the download, transforms, and train/val splits for a given framework, so every model we build later just asks for batches:

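import time
import torch
import torchvision
from torchvision import transforms
from d2l import torch as d2l

class FashionMNIST(d2l.DataModule):
    """The Fashion-MNIST dataset."""
    def __init__(self, batch_size=64, resize=(28, 28)):
        super().__init__()
        self.save_hyperparameters()  # stores batch_size and resize on self
        # Resize each image, then convert it to a float tensor
        trans = transforms.Compose([transforms.Resize(resize),
                                    transforms.ToTensor()])
        # Training split (downloaded to self.root on first use)
        self.train = torchvision.datasets.FashionMNIST(
            root=self.root, train=True, transform=trans, download=True)
        # Held-out split, used here for validation
        self.val = torchvision.datasets.FashionMNIST(
            root=self.root, train=False, transform=trans, download=True)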

60,000 train, 10,000 test

Instantiate it, resizing to 32×32 to match the ConvNet inputs in later chapters:

data = FashionMNIST(resize=(32, 32))
len(data.train), len(data.val)

(60000, 10000)

Ten classes × 6,000 training images each = 60,000; the test set has 1,000 per class.
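The per-class arithmetic can be checked from the labels themselves. The following is a minimal sketch, assuming the labels are available as a plain sequence of integers. `class_counts` is a hypothetical helper, and the label list is a synthetic stand-in built to have the stated balance, not the real dataset.

```python
from collections import Counter

def class_counts(labels):
    """Count how many examples carry each integer label (hypothetical helper)."""
    return Counter(int(y) for y in labels)

# Synthetic stand-in for the training labels: 10 classes, 6,000 of each.
# This is NOT the real data; it only mirrors the stated class balance.
fake_train_labels = [c for c in range(10) for _ in range(6000)]

counts = class_counts(fake_train_labels)
print(len(fake_train_labels))        # 60000
print(sorted(set(counts.values())))  # [6000]: every class has the same count
```

With the real data, one might pass the dataset's label sequence instead (torchvision keeps it in the dataset's `targets` attribute, e.g. `data.train.targets`).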

One image: channel-first

Loading · layout

PyTorch and MXNet store images channel-first, c × h × w, with the color axis before height and width:

X, y = next(iter(data.train_dataloader()))
X[0].shape  # channel-first: (channels, height, width)

Shape is (1, 32, 32): one grayscale channel, then 32×32 pixels.
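To make the two layouts concrete, here is a small framework-free sketch that converts a channel-first image to channel-last using nested lists. The helpers `chw_to_hwc` and `shape` are hypothetical names introduced only for illustration, and the 2×3 image is a toy example.

```python
def chw_to_hwc(image):
    """Convert a nested-list image from (channels, height, width)
    to (height, width, channels)."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    return [[[image[k][i][j] for k in range(c)]
             for j in range(w)]
            for i in range(h)]

def shape(nested):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# A toy 1-channel, 2x3 "image" in channel-first layout.
chw = [[[1, 2, 3],
        [4, 5, 6]]]
hwc = chw_to_hwc(chw)
print(shape(chw))  # (1, 2, 3)
print(shape(hwc))  # (2, 3, 1)
```

On a PyTorch tensor the same reordering would be `X[0].permute(1, 2, 0)`; the pixel values are unchanged, only the axis order differs.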

Labels as words, not integers

The dataset stores labels as integers 0–9. A tiny helper maps them to names so our spot-checks are readable:

@d2l.add_to_class(FashionMNIST)
def text_labels(self, indices):
    """Return text labels."""
    labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
              'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [labels[int(i)] for i in indices]
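As a quick illustration of the mapping, the same lookup can be written as a standalone function and called on a few indices. This is a sketch outside the class: the module-level name `FASHION_MNIST_LABELS` is an assumption made for the example, while the label order is copied from the method above.

```python
# Label names in index order, copied from the text_labels method above.
FASHION_MNIST_LABELS = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                        'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']

def text_labels(indices):
    """Map integer class indices (0-9) to their text labels."""
    # int(i) also accepts integer-valued scalars such as 7.0
    return [FASHION_MNIST_LABELS[int(i)] for i in indices]

print(text_labels([0, 9, 2]))  # ['t-shirt', 'ankle boot', 'pullover']
```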

Reading Minibatches

the iterator that feeds training

The data iterator

Minibatches

get_dataloader shuffles the training split and serves a batch_size-sized minibatch each step:

@d2l.add_to_class(FashionMNIST)
def get_dataloader(self, train):
    data = self.train if train else self.val
    return torch.utils.data.DataLoader(data, self.batch_size, shuffle=train,
                                       num_workers=self.num_workers)
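To see what shuffling and batching amount to, the sketch below yields minibatches of example indices in plain Python. It illustrates the idea only and is not how DataLoader is implemented; `minibatch_indices` and its `seed` parameter are hypothetical names introduced for the example.

```python
import math
import random

def minibatch_indices(num_examples, batch_size, shuffle, seed=None):
    """Yield lists of example indices, one list per minibatch."""
    indices = list(range(num_examples))
    if shuffle:
        # Shuffle once per pass so each epoch sees a new order
        random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        # The final slice may be shorter than batch_size
        yield indices[start:start + batch_size]

batches = list(minibatch_indices(60000, 64, shuffle=True, seed=0))
print(len(batches), math.ceil(60000 / 64))  # 938 938
print(len(batches[0]))                      # 64
print(len(batches[-1]))                     # 32: the last batch is smaller
```

Because 60,000 is not a multiple of 64, the last minibatch holds only 32 examples. DataLoader behaves the same way by default (`drop_last=False`).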

What one minibatch looks like


Pull one batch and read its shapes off directly:

X, y = next(iter(data.train_dataloader()))
print(X.shape, X.dtype, y.shape, y.dtype)

torch.Size([64, 1, 32, 32]) torch.float32 torch.Size([64]) torch.int64

64 images, one grayscale channel, 32×32 pixels, plus 64 integer labels arriving as a matching vector.

Loading is not the bottleneck: measure it

Minibatches · timing

Time one full pass over all 60,000 training images:

tic = time.time()
for X, y in data.train_dataloader():
    continue
f'{time.time() - tic:.2f} sec'

'2.27 sec'

A full pass takes seconds, not minutes (the exact figure depends on the machine). For the ConvNets of later chapters, one forward plus backward pass costs 10–100× the corresponding I/O, so a well-built loader keeps data loading off the critical path. If loading ever does become the bottleneck, prefetch and raise num_workers.
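The same measurement can be wrapped in a small reusable helper. This is a sketch under stated assumptions: `time_full_pass` is a hypothetical function, and `fake_loader` is a stand-in iterable used only so the example runs without the dataset. Any iterable of batches, including a DataLoader, could be passed instead.

```python
import time

def time_full_pass(batches):
    """Iterate over every batch once; return (elapsed_seconds, num_batches)."""
    tic = time.perf_counter()
    n = 0
    for _ in batches:
        n += 1  # do nothing with the batch: we only measure loading
    return time.perf_counter() - tic, n

# Stand-in "loader": 938 dummy batches of 64 items each (not real images).
fake_loader = (list(range(64)) for _ in range(938))
elapsed, n = time_full_pass(fake_loader)
print(f'{n} batches in {elapsed:.4f} sec')
```

`time.perf_counter` is used here rather than `time.time` because it is monotonic and intended for measuring short intervals. Timing one training step the same way gives the compute side of the comparison.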

Looking at the Data

always eyeball what you train on

See the data before you model it

Visualization

A visualize method tiles one validation batch, each image captioned with its class name. Eyeballing data is a cheap, powerful sanity check:

@d2l.add_to_class(FashionMNIST)
def visualize(self, batch, nrows=1, ncols=8, labels=None):
    X, y = batch
    if not labels:
        labels = self.text_labels(y)
    d2l.show_images(X.squeeze(1), nrows, ncols, titles=labels)

batch = next(iter(data.val_dataloader()))
data.visualize(batch)

Recap

Wrap-up

Fashion-MNIST: 10 clothing classes, 28×28 grayscale, harder than MNIST but the same size and API.
A DataModule owns each framework’s download, transforms, and train/val loaders.

Channel axis differs: PyTorch/MXNet c × h × w, TensorFlow/JAX h × w × c (the loader hides it).
Always look at your data; a full loading pass costs seconds, so training speed is set by the model, not I/O.
Next: a linear classifier on this data, and its ~82% ceiling (the softmax-from-scratch section).