%matplotlib inline
from d2l import torch as d2l
import torch
import torchvision
import os
import pandas as pd

The classic detection benchmarks (PASCAL VOC, COCO) are too large for a teaching demo. This section uses the banana detection dataset instead: 1000 images, one banana per image, a fixed image size, and three random parameters (position, scale, rotation).
The point isn’t to push detection accuracy; it’s to walk through the data plumbing every detector needs: building (images, labels) minibatches, where the labels have shape (batch, max_objects, 5).

The tiny banana dataset is intentionally simple: one class, one object per image, and normalized box coordinates. That keeps the loader visible before SSD adds anchor matching.
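To make "normalized box coordinates" concrete, here is a minimal sketch of turning one pixel-coordinate annotation row into a label. The helper `normalize_label`, the constant `IMAGE_SIZE`, and the sample row are hypothetical; the edge length of 256 is an assumption taken from the `/ 256` in the loader. Note that the loader divides the whole tensor, class column included, which is harmless only because the single class has index 0; this sketch scales the corners alone.

```python
IMAGE_SIZE = 256  # assumed edge length in pixels, taken from the `/ 256` in the loader

def normalize_label(row, image_size=IMAGE_SIZE):
    """Turn one pixel-coordinate row (class, x1, y1, x2, y2) into a
    max_objects=1 label: a list holding one normalized 5-element row."""
    cls, *corners = row
    # Scale only the four corners; the class index is left untouched.
    return [[float(cls)] + [c / image_size for c in corners]]

# Hypothetical annotation row, not taken from the real label.csv.
label = normalize_label([0, 64, 32, 192, 128])
print(label)  # [[0.0, 0.25, 0.125, 0.75, 0.5]]
```

The outer list plays the role of the max_objects dimension, so stacking such labels for a batch gives the (batch, max_objects, 5) layout.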
Read all images, parse the CSV-style annotation file, return aligned arrays of images and label tensors:
def read_data_bananas(is_train=True):
    """Read the banana detection dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    csv_fname = os.path.join(data_dir, 'bananas_train' if is_train
                             else 'bananas_val', 'label.csv')
    csv_data = pd.read_csv(csv_fname)
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        images.append(torchvision.io.read_image(
            os.path.join(data_dir, 'bananas_train' if is_train else
                         'bananas_val', 'images', f'{img_name}')))
        # Here `target` contains (class, upper-left x, upper-left y,
        # lower-right x, lower-right y), where all the images have the same
        # banana class (index 0)
        targets.append(list(target))
    return images, torch.tensor(targets).unsqueeze(1) / 256

Wrap the loader in a framework-native Dataset so we get a standard DataLoader:
class BananasDataset(torch.utils.data.Dataset):
    """A customized dataset to load the banana detection dataset."""
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        print('read ' + str(len(self.features)) + (f' training examples' if
              is_train else f' validation examples'))

    def __getitem__(self, idx):
        return (self.features[idx].float(), self.labels[idx])

    def __len__(self):
        return len(self.features)

The batch shape check should show why detection labels need an extra object dimension: images have the usual framework layout, while labels are (batch, max_objects, 5).
Each label row is (class, x1, y1, x2, y2) with normalized corners. In this dataset max_objects = 1, but the same layout supports variable object counts: pad each image’s labels to max_objects rows and give the padding rows class -1 so they can be ignored.
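The padding idea and the batch shape check can be sketched together without downloading anything. This is an illustration on synthetic data, not the dataset's own loader: `pad_labels`, the zero-filled images, and the box values are all hypothetical, and `TensorDataset` stands in for `BananasDataset`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def pad_labels(boxes_per_image, max_objects):
    """Pad each image's (n_i, 5) rows to (max_objects, 5), using -1 as filler."""
    padded = torch.full((len(boxes_per_image), max_objects, 5), -1.0)
    for i, boxes in enumerate(boxes_per_image):
        if len(boxes) > max_objects:
            raise ValueError(f'image {i} has more than {max_objects} objects')
        if boxes:  # images without objects keep all-padding rows
            padded[i, :len(boxes)] = torch.tensor(boxes, dtype=torch.float32)
    return padded

# Synthetic stand-ins: 4 RGB images of 256x256 with 1, 2, 0, and 1 objects.
images = torch.zeros(4, 3, 256, 256)
boxes_per_image = [
    [[0, 0.10, 0.20, 0.40, 0.50]],
    [[0, 0.05, 0.05, 0.30, 0.30], [0, 0.60, 0.55, 0.90, 0.95]],
    [],
    [[0, 0.25, 0.125, 0.75, 0.5]],
]
labels = pad_labels(boxes_per_image, max_objects=2)

loader = DataLoader(TensorDataset(images, labels), batch_size=2)
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([2, 3, 256, 256])
print(batch_labels.shape)  # torch.Size([2, 2, 5])

# Rows whose class is -1 are padding; a loss would skip them.
valid = batch_labels[..., 0] >= 0
print(valid)
```

Because every image contributes exactly max_objects rows, the default collation can stack labels into one tensor, and the class column doubles as a validity mask.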