The Object Detection Dataset

Detection Datasets

The classic detection benchmarks (PASCAL VOC, COCO) are too large for a teaching demo. This section instead uses the banana detection dataset: 1000 training images and 100 validation images, each 256×256 pixels, with exactly one banana per image whose position, scale, and rotation are randomized.

The point isn’t to push detection accuracy; it’s to walk through the data plumbing every detector needs:

Read images and per-image lists of (class, x_1, y_1, x_2, y_2).
Pad the per-image label list to a fixed length so it fits in a tensor.
Yield (images, labels) minibatches. Labels have shape (batch, max_objects, 5).
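The padding step can be sketched as follows. This is a minimal illustration rather than d2l code: the helper name `pad_labels`, its `max_objects` and `pad_class` parameters, and the sample boxes are all assumptions made for the example.

```python
import torch

def pad_labels(labels_per_image, max_objects, pad_class=-1.0):
    """Pad per-image (class, x1, y1, x2, y2) rows to a fixed count.

    `pad_labels`, `max_objects`, and `pad_class` are illustrative names,
    not part of d2l.
    """
    batch = torch.zeros(len(labels_per_image), max_objects, 5)
    batch[:, :, 0] = pad_class  # mark every slot as padding first
    for i, rows in enumerate(labels_per_image):
        if len(rows) > max_objects:
            raise ValueError(f'image {i} has more than {max_objects} objects')
        if rows:  # overwrite the leading slots with the real objects
            batch[i, :len(rows)] = torch.tensor(rows, dtype=torch.float32)
    return batch

# Hypothetical labels: three images with one, two, and zero objects.
labels = [
    [[0, 0.1, 0.2, 0.4, 0.5]],
    [[0, 0.3, 0.3, 0.6, 0.7], [1, 0.5, 0.1, 0.9, 0.4]],
    [],
]
padded = pad_labels(labels, max_objects=2)
print(padded.shape)  # torch.Size([3, 2, 5])
```

Every image now contributes the same number of rows, so the labels stack into a single tensor. Rows whose class is -1 carry no object.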

Download

The tiny banana dataset is intentionally simple: one class, one object per image, and box corners stored in pixels that the loader rescales to [0, 1]. That keeps the loading code easy to follow before SSD adds anchor matching.

%matplotlib inline
from d2l import torch as d2l
import torch
import torchvision
import os
import pandas as pd

d2l.DATA_HUB['banana-detection'] = (
    d2l.DATA_URL + 'banana-detection.zip',
    '5de26c8fce5ccdea9f91267273464dc968d20d72')

Reading the dataset

Read every image, parse the CSV annotation file, and return the list of images together with an aligned label tensor:

def read_data_bananas(is_train=True):
    """Read the banana detection dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    # Resolve the split directory once and reuse it for labels and images
    split_dir = os.path.join(data_dir, 'bananas_train' if is_train
                             else 'bananas_val')
    csv_data = pd.read_csv(os.path.join(split_dir, 'label.csv'))
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        images.append(torchvision.io.read_image(
            os.path.join(split_dir, 'images', img_name)))
        # Here `target` contains (class, upper-left x, upper-left y,
        # lower-right x, lower-right y), where all the images have the same
        # banana class (index 0)
        targets.append(list(target))
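    # Dividing by the 256-pixel edge normalizes the corners to [0, 1]; the
    # class column is divided too, which is harmless only because it is 0
    return images, torch.tensor(targets).unsqueeze(1) / 256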
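Dividing the whole target tensor by 256 works here only because the single class index is 0. For a dataset with several classes, one possible variant scales the corner columns and leaves the class column alone. The helper `normalize_boxes` and the sample pixel values below are hypothetical and are not part of d2l.

```python
import torch

def normalize_boxes(targets, edge_size):
    """Scale pixel corners to [0, 1] while leaving the class column as is.

    `targets` has shape (num_images, 5) with rows
    (class, x1, y1, x2, y2) in pixels. `normalize_boxes` is a
    hypothetical helper, not a d2l function.
    """
    targets = targets.to(torch.float32).clone()  # do not modify the input
    targets[:, 1:] /= edge_size                  # corners only
    return targets.unsqueeze(1)  # add the per-image object dimension

# Made-up pixel annotations for two 256-pixel images.
pixel_targets = torch.tensor([[0, 64, 32, 192, 128],
                              [2, 0, 0, 256, 256]])
labels = normalize_boxes(pixel_targets, edge_size=256)
print(labels.shape)  # torch.Size([2, 1, 5])
print(labels[1, 0])  # class 2 is preserved; corners become 0 and 1
```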

Custom Dataset class

Wrap the reader in a torch.utils.data.Dataset subclass so that a standard DataLoader can batch it:

class BananasDataset(torch.utils.data.Dataset):
    """A customized dataset to load the banana detection dataset."""
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        split = 'training' if is_train else 'validation'
        print(f'read {len(self.features)} {split} examples')

    def __getitem__(self, idx):
        return (self.features[idx].float(), self.labels[idx])

    def __len__(self):
        return len(self.features)

Train + val loaders

The batch shape check shows why detection labels need an extra object dimension: images have the usual (batch, channels, height, width) layout, while labels are (batch, max_objects, 5).

def load_data_bananas(batch_size):
    """Load the banana detection dataset."""
    train_iter = torch.utils.data.DataLoader(BananasDataset(is_train=True),
                                             batch_size, shuffle=True)
    val_iter = torch.utils.data.DataLoader(BananasDataset(is_train=False),
                                           batch_size)
    return train_iter, val_iter

batch_size, edge_size = 32, 256
train_iter, _ = load_data_bananas(batch_size)
batch = next(iter(train_iter))
batch[0].shape, batch[1].shape

read 1000 training examples
read 100 validation examples
(torch.Size([32, 3, 256, 256]), torch.Size([32, 1, 5]))
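The default batching works here because every image has exactly one label row. When object counts differ, the default collation cannot stack the labels. One way to handle that is a custom `collate_fn` that pads to the longest list in each batch. The sketch below assumes a made-up `ToyDetectionDataset` with one to three placeholder boxes per image and uses `torch.nn.utils.rnn.pad_sequence` for the padding. It is not how d2l loads the banana data.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class ToyDetectionDataset(Dataset):
    """Hypothetical dataset whose images carry 1 to 3 boxes each."""
    def __init__(self, num_images=6):
        self.images = [torch.zeros(3, 8, 8) for _ in range(num_images)]
        # Image i gets (i % 3) + 1 identical placeholder boxes of class 0.
        self.labels = [torch.tensor([[0.0, 0.1, 0.1, 0.5, 0.5]] * (i % 3 + 1))
                       for i in range(num_images)]

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

    def __len__(self):
        return len(self.images)

def pad_collate(samples):
    """Stack images and pad label rows to the longest list in the batch."""
    images, labels = zip(*samples)
    # Padded rows are filled entirely with -1, so their class is -1 too.
    padded = pad_sequence(list(labels), batch_first=True, padding_value=-1.0)
    return torch.stack(images), padded

loader = DataLoader(ToyDetectionDataset(), batch_size=3,
                    collate_fn=pad_collate)
X, Y = next(iter(loader))
print(X.shape, Y.shape)  # torch.Size([3, 3, 8, 8]) torch.Size([3, 3, 5])
```

Padding to the batch maximum keeps tensors small, but max_objects then varies from batch to batch. Padding to a dataset-wide constant keeps the label shape fixed.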

A batch with annotations

Each label row is (class, x_1, y_1, x_2, y_2), with corner coordinates normalized to [0, 1], which is why the plotting code multiplies them by edge_size. In this dataset max_objects = 1, but the same layout supports variable object counts by padding.

imgs = (batch[0][:10].permute(0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, batch[1][:10]):
    d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])
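With max_objects greater than 1, padded rows have to be filtered out before drawing boxes or computing a loss. A minimal sketch, assuming padding rows are marked with class -1 and using a hand-written label tensor in place of a real batch:

```python
import torch

# Hypothetical padded batch: 2 images, max_objects = 3, class -1 = padding.
labels = torch.tensor([
    [[0, 0.1, 0.1, 0.4, 0.4], [1, 0.5, 0.5, 0.9, 0.8], [-1, 0, 0, 0, 0]],
    [[0, 0.2, 0.3, 0.6, 0.7], [-1, 0, 0, 0, 0], [-1, 0, 0, 0, 0]],
])

valid = labels[:, :, 0] >= 0          # (batch, max_objects) boolean mask
objects_per_image = valid.sum(dim=1)  # number of real objects per image
real_boxes = labels[valid][:, 1:]     # (num_real_objects, 4) corners

print(objects_per_image)  # tensor([2, 1])
print(real_boxes.shape)   # torch.Size([3, 4])
```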

Recap

Detection minibatch = images + per-image variable-length list of (class, x_1, y_1, x_2, y_2).
Standard fix: pad each list to a fixed max_objects, marking padding rows with class -1 so they can be ignored.
Plumbing learned here is reused by SSD; real datasets (COCO, OpenImages) just have more classes and more objects per image.