The Object Detection Dataset

Detection Datasets

The classic detection benchmarks (PASCAL VOC, COCO) are too large for a teaching demo. Instead, this section uses the banana detection dataset: 1000 training images and 100 validation images, all the same size (256 × 256), each containing one banana placed at a random position, scale, and rotation.

The point isn’t to push detection accuracy; it’s to walk through the data plumbing every detector needs:

Read images and per-image lists of (class, x_1, y_1, x_2, y_2).
Pad the per-image label list to a fixed length so it fits in a tensor.
Yield (images, labels) minibatches. Labels have shape (batch, max_objects, 5).
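
The padding step can be sketched as follows. This is a minimal illustration rather than part of the d2l library: the helper name pad_labels and the toy label lists are hypothetical, and padded rows are marked with class -1, the convention noted in the recap.

```python
import numpy as np

def pad_labels(labels_per_image, max_objects, pad_class=-1.0):
    """Pad per-image label lists to shape (num_images, max_objects, 5).

    Hypothetical helper for illustration; not part of d2l.
    """
    # Start with every row marked as padding: class = pad_class, box = 0
    padded = np.zeros((len(labels_per_image), max_objects, 5),
                      dtype=np.float32)
    padded[:, :, 0] = pad_class
    for i, labels in enumerate(labels_per_image):
        if len(labels) > max_objects:
            raise ValueError(f'image {i} has more than {max_objects} objects')
        if len(labels) > 0:
            # Overwrite the leading rows with the real annotations
            padded[i, :len(labels)] = np.asarray(labels, dtype=np.float32)
    return padded

# Toy annotations: one object, two objects, and no objects
labels_per_image = [
    [(0, 0.1, 0.2, 0.5, 0.6)],
    [(0, 0.3, 0.3, 0.7, 0.9), (1, 0.0, 0.1, 0.2, 0.4)],
    [],
]
padded = pad_labels(labels_per_image, max_objects=2)
print(padded.shape)  # (3, 2, 5)
```

In the banana dataset every image has exactly one object, so no padding is needed and max_objects is 1.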

Download

The banana dataset is intentionally simple: one class, one object per image, and box coordinates that the loader normalizes to [0, 1]. That keeps the data pipeline easy to follow before SSD adds anchor matching on top of it.

%matplotlib inline
from d2l import jax as d2l
import jax
from jax import numpy as jnp
import optax
import numpy as np
import os
import pandas as pd
from PIL import Image

d2l.DATA_HUB['banana-detection'] = (
    d2l.DATA_URL + 'banana-detection.zip',
    '5de26c8fce5ccdea9f91267273464dc968d20d72')

Reading the dataset

Read all images, parse the CSV-style annotation file, return aligned arrays of images and label tensors:

def read_data_bananas(is_train=True):
    """Read the banana detection dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    # Training and validation splits live in sibling directories
    split_dir = os.path.join(
        data_dir, 'bananas_train' if is_train else 'bananas_val')
    csv_data = pd.read_csv(os.path.join(split_dir, 'label.csv'))
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        img = Image.open(os.path.join(split_dir, 'images', img_name))
        # Convert from (height, width, channel) to (channel, height, width)
        images.append(jnp.array(img).transpose(2, 0, 1))
        # Here `target` contains (class, upper-left x, upper-left y,
        # lower-right x, lower-right y), where all the images have the same
        # banana class (index 0)
        targets.append(list(target))
    # Add an object axis of size 1, then divide by the image edge length
    # (256) to normalize the coordinates; the class index 0 is unaffected
    return images, jnp.expand_dims(jnp.array(targets), axis=1) / 256
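
Dividing the whole label array by 256 works here only because the class index is always 0 and the images are square with edge length 256. For a dataset with non-zero class indices or non-square images, one option is to scale just the coordinate columns. The sketch below assumes pixel-coordinate labels in the same (class, x_1, y_1, x_2, y_2) layout; the helper name normalize_boxes is hypothetical.

```python
import numpy as np

def normalize_boxes(targets, width, height):
    """Scale pixel corner coordinates to [0, 1]; leave the class column alone.

    Hypothetical helper for illustration; not part of d2l.
    """
    targets = np.asarray(targets, dtype=np.float32)
    # Column 0 is the class index, so it is divided by 1 (unchanged);
    # x coordinates are divided by the width, y coordinates by the height
    scale = np.array([1, width, height, width, height], dtype=np.float32)
    return targets / scale

# One box of class 2 in a 256 x 128 image, in pixel coordinates
pixel_targets = [[2, 64, 32, 192, 96]]
normalized = normalize_boxes(pixel_targets, width=256, height=128)
print(normalized)  # class stays 2; the box becomes (0.25, 0.25, 0.75, 0.75)
```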

Custom Dataset class

Wrap the reader in a small dataset class that exposes __getitem__ and __len__, so that it can be indexed and batched like any other dataset:

class BananasDataset:
    """A customized dataset to load the banana detection dataset."""
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        print(f'read {len(self.features)} '
              f'{"training" if is_train else "validation"} examples')

    def __getitem__(self, idx):
        return (self.features[idx].astype(jnp.float32), self.labels[idx])

    def __len__(self):
        return len(self.features)

Train + val loaders

The batch shape check shows why detection labels need an extra object dimension: images have the layout (batch, channels, height, width), while labels have the layout (batch, max_objects, 5).

def load_data_bananas(batch_size):
    """Load the banana detection dataset."""
    train_dataset = BananasDataset(is_train=True)
    val_dataset = BananasDataset(is_train=False)
    train_iter = d2l.ArrayDataLoader(
        jnp.stack(train_dataset.features), train_dataset.labels,
        batch_size=batch_size, shuffle=True)
    val_iter = d2l.ArrayDataLoader(
        jnp.stack(val_dataset.features), val_dataset.labels,
        batch_size=batch_size)
    return train_iter, val_iter

batch_size, edge_size = 32, 256
train_iter, _ = load_data_bananas(batch_size)
batch = next(iter(train_iter))
batch[0].shape, batch[1].shape

read 1000 training examples
read 100 validation examples
((32, 3, 256, 256), (32, 1, 5))
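
The internals of d2l.ArrayDataLoader are not shown in this section. As a rough mental model only, and not the library's implementation, a loader over two aligned arrays needs to do little more than the following. The function name iterate_minibatches and the small stand-in arrays are hypothetical.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, shuffle=False, seed=0):
    """Yield (features, labels) minibatches from two aligned arrays.

    Hypothetical stand-in for illustration; not the d2l implementation.
    """
    num_examples = len(features)
    indices = np.arange(num_examples)
    if shuffle:
        # Shuffle the indices, not the data, so the two arrays stay aligned
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield features[batch_idx], labels[batch_idx]

# Small stand-ins for the image and label arrays
features = np.zeros((10, 3, 8, 8), dtype=np.uint8)
labels = np.zeros((10, 1, 5), dtype=np.float32)
batches = list(iterate_minibatches(features, labels, batch_size=4,
                                   shuffle=True))
print([x.shape[0] for x, _ in batches])  # [4, 4, 2]
```

The last batch is smaller because 10 is not a multiple of 4; this sketch keeps it rather than dropping it.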

A batch with annotations

Each label row is (class, x_1, y_1, x_2, y_2) with corners normalized to [0, 1], so multiplying by edge_size recovers pixel coordinates for plotting. In this dataset max_objects = 1, but the same layout supports variable object counts by padding.
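
With more than one object per image, code that consumes a label batch has to skip the padded rows. The sketch below uses a made-up batch with max_objects = 2 and assumes padding is marked with class -1; the array values are illustrative only.

```python
import numpy as np

# Made-up label batch: 2 images, max_objects = 2, rows of
# (class, x_1, y_1, x_2, y_2); class -1 marks a padded row
labels = np.array([
    [[0, 0.1, 0.2, 0.5, 0.6], [-1, 0.0, 0.0, 0.0, 0.0]],
    [[0, 0.3, 0.3, 0.7, 0.9], [1, 0.0, 0.1, 0.2, 0.4]],
], dtype=np.float32)

# Boolean mask of real objects, shape (batch, max_objects)
valid = labels[..., 0] >= 0
# One valid object in image 0, two in image 1
objects_per_image = valid.sum(axis=1)

# Convert normalized corners back to pixels and keep only the real boxes
edge_size = 256
pixel_boxes = labels[..., 1:5] * edge_size
first_image_boxes = pixel_boxes[0][valid[0]]
print(objects_per_image, first_image_boxes.shape)
```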

imgs = jnp.transpose(batch[0][:10], (0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, np.array(batch[1][:10])):
    d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])

Recap

Detection minibatch = images + per-image variable-length list of (class, x_1, y_1, x_2, y_2).
Standard fix: pad each list to a fixed max_objects, using class -1 to mark padded rows that should be ignored.
Plumbing learned here is reused by SSD; real datasets (COCO, OpenImages) just have more classes and more objects per image.