Single Shot MultiBox Detector (SSD; Liu et al., 2016) is the prototypical single-stage detector. One forward pass produces class scores and box offsets for every anchor at every scale; non-maximum suppression (NMS) then removes overlapping duplicates.
The architecture: a CNN trunk, then a pyramid of feature maps at decreasing resolutions. Each level has its own pair of 3×3 convolutional heads, one for class scores and one for box offsets. Predictions from all levels are concatenated.
SSD = base network + several multiscale feature blocks; each block has its own anchor predictor.
Scaling to objects in images
Objects appear at different pixel sizes. SSD handles this by predicting from several feature maps at once:
early maps: many spatial cells, small receptive fields, small anchor boxes;
late maps: few spatial cells, large receptive fields, large anchor boxes.
The model does not resize every candidate region. It learns classification and offset heads at each scale, then pools all anchors into one detection set before NMS.
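A minimal sketch of the multiscale idea, assuming one square anchor per cell and made-up anchor scales (the function `grid_anchors` and the values in `levels` are hypothetical, not taken from the model). It only shows that finer maps contribute many small anchors and coarser maps few large ones, all collected into one list.

```python
def grid_anchors(n, scale):
    """Square anchors (x_min, y_min, x_max, y_max) in normalized [0, 1]
    image coordinates, one centered on each cell of an n x n feature map."""
    half = scale / 2
    anchors = []
    for row in range(n):
        for col in range(n):
            cx = (col + 0.5) / n  # x of the cell center
            cy = (row + 0.5) / n  # y of the cell center
            anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors

# Hypothetical (map size, anchor scale) pairs: finer maps get smaller anchors.
levels = [(32, 0.2), (16, 0.37), (8, 0.54), (4, 0.71), (1, 0.88)]

all_anchors = []
for n, scale in levels:
    level_anchors = grid_anchors(n, scale)
    print(f"{n}x{n} map: {len(level_anchors)} anchors of side {scale}")
    all_anchors.extend(level_anchors)

print(len(all_anchors))  # 1361 with one anchor per cell
```

Anchors near the border extend past the image here; a real implementation may clip them or leave them as they are.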
Class and box prediction heads
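For a feature map with a anchors per pixel and q classes, the class head is a 3×3 conv with a(q+1) output channels (the extra class is background); the box head outputs 4a channels. Channel counts and spatial sizes differ between scales, so each output is moved to channel-last layout, flattened, and then concatenated: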
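from mxnet import np, npx

def flatten_pred(pred):
    # Move channels last, then flatten to (batch, height * width * channels)
    return npx.batch_flatten(pred.transpose(0, 2, 3, 1))

def concat_preds(preds):
    # Join the flattened predictions of all scales along dimension 1
    return np.concatenate([flatten_pred(p) for p in preds], axis=1)

# Y1, Y2: head outputs from two different scales
concat_preds([Y1, Y2]).shape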
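The bookkeeping behind the heads and the flattening can be checked without any framework. The sketch below is plain Python; the helper names (`head_channels`, `flat_length`, `flat_index`) and the example sizes (5 and 3 anchors, 10 classes, 20×20 and 10×10 maps) are assumptions chosen for illustration.

```python
def head_channels(num_anchors, num_classes):
    """Output channels of the class head and the box head at one level."""
    return num_anchors * (num_classes + 1), 4 * num_anchors

def flat_length(channels, height, width):
    """Values per image after flattening a (channels, height, width) output."""
    return height * width * channels

def flat_index(row, col, channel, width, channels):
    """Position of one value after moving channels last and flattening."""
    return (row * width + col) * channels + channel

q = 10  # assumed number of foreground classes
cls_ch_1, box_ch_1 = head_channels(num_anchors=5, num_classes=q)
cls_ch_2, box_ch_2 = head_channels(num_anchors=3, num_classes=q)
print(cls_ch_1, box_ch_1)  # 55 20
print(cls_ch_2, box_ch_2)  # 33 12

# Class predictions of a 20x20 level and a 10x10 level, concatenated.
total = flat_length(cls_ch_1, 20, 20) + flat_length(cls_ch_2, 10, 10)
print(total)  # 25300

# Channel-last layout keeps all predictions of one cell contiguous:
print(flat_index(0, 0, 0, width=20, channels=cls_ch_1))   # 0
print(flat_index(0, 0, 54, width=20, channels=cls_ch_1))  # 54
print(flat_index(0, 1, 0, width=20, channels=cls_ch_1))   # 55
```

Keeping each cell's values contiguous is what lets the flattened vector be reshaped later into one row per anchor.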
Downsampling block
Halves the feature map resolution between scales using two 3×3 conv-BN-ReLU layers followed by a 2×2 max pool.
Showing the whole class definition on a slide hides the idea; the important contract is the output shape and anchor ordering. Every anchor needs one class vector and one four-number offset vector.
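The shape contract of the downsampling block can be checked with the standard output-size formula. This is a sketch under assumptions: padding 1 on the 3×3 convolutions, stride 2 on the pooling layer, and a 32×32 starting map; the helper names are hypothetical.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def down_sample_out(size):
    # Two 3x3 convolutions; padding=1 (an assumption) keeps the size unchanged.
    for _ in range(2):
        size = conv_out(size, kernel=3, padding=1)
    # 2x2 max pooling with stride 2 halves the size.
    return conv_out(size, kernel=2, stride=2)

sizes = [32]  # assumed side length of the first feature map
for _ in range(3):
    sizes.append(down_sample_out(sizes[-1]))
print(sizes)  # [32, 16, 8, 4]
```

Only the pooling layer changes the resolution; the convolutions change the channel count and add depth.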
TinySSD output shapes
For a 256×256 image, the five feature maps create (32² + 16² + 8² + 4² + 1) × 4 = 5444 anchors. With one foreground class, expect:
anchors: (1, 5444, 4), shared by every image in the batch;
class logits: (batch, 5444, 2) for background/banana;
box offsets: (batch, 21776), four offsets per anchor, flattened.
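The anchor count above is simple arithmetic and worth verifying once. The sketch uses only numbers stated in the text; the variable names are illustrative, and whether offsets are stored per anchor or flattened depends on the implementation.

```python
map_sizes = [32, 16, 8, 4, 1]  # side lengths of the five feature maps
anchors_per_cell = 4
num_classes = 1                # one foreground class (banana)

num_anchors = sum(n * n for n in map_sizes) * anchors_per_cell
print(num_anchors)  # 5444

batch = 32  # hypothetical batch size
cls_shape = (batch, num_anchors, num_classes + 1)
num_offsets = num_anchors * 4  # four offsets per anchor

print(cls_shape)    # (32, 5444, 2)
print(num_offsets)  # 21776
```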
def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float((cls_preds.argmax(axis=-1).astype(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((np.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())
Training
Standard SGD loop with two evaluation metrics: class error (1 - class accuracy) and box mean absolute error. Read them together: class accuracy is dominated by the many background anchors, while box error only makes sense on matched positive anchors.
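A toy calculation shows why the two metrics have to be read together. The number of positive anchors (10) and the prediction values are invented for illustration; `masked_abs_error` is a hypothetical plain-Python stand-in for the masked box error.

```python
num_anchors = 5444
num_positive = 10  # hypothetical number of anchors matched to a banana

cls_labels = [1] * num_positive + [0] * (num_anchors - num_positive)
all_background = [0] * num_anchors  # a detector that finds nothing

correct = sum(p == y for p, y in zip(all_background, cls_labels))
accuracy = correct / num_anchors
print(f"class accuracy: {accuracy:.4f}")  # class accuracy: 0.9982

def masked_abs_error(preds, labels, masks):
    """Sum of |label - pred| over coordinates whose mask is 1."""
    return sum(abs(y - p) * m for p, y, m in zip(preds, labels, masks))

# One positive anchor (mask 1) followed by one background anchor (mask 0).
preds = [0.25, 0.0, 0.25, 0.0, 9.0, 9.0, 9.0, 9.0]
labels = [0.0] * 8
masks = [1, 1, 1, 1, 0, 0, 0, 0]
print(masked_abs_error(preds, labels, masks))  # 0.5
```

A detector that predicts background everywhere still scores about 99.8% class accuracy here, and the large errors on the background anchor do not touch the box metric.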
from mxnet import autograd
from d2l import mxnet as d2l

num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # Sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    for features, target in train_iter:
        timer.start()
        X = features.as_in_ctx(device)
        Y = target.as_in_ctx(device)
        with autograd.record():
            # Generate multiscale anchor boxes and predict their classes and
            # offsets
            anchors, cls_preds, bbox_preds = net(X)
            # Label the classes and offsets of these anchor boxes
            bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(
                anchors, Y)
            # Calculate the loss function using the predicted and labeled
            # values of the classes and offsets
            l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                          bbox_masks)
        l.backward()
        trainer.step(batch_size)
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.size,
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.size)
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter._dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')
Visualize all predictions with confidence ≥ 0.9. The useful thing to notice is not the raw tensor length but whether NMS leaves one tight, high-confidence box per banana.
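What "NMS leaves one box per banana" means can be sketched with a greedy, framework-free version. The IoU threshold of 0.5 and the example boxes are assumptions, and applying the 0.9 confidence cut before suppression is a simplification; library implementations differ in ordering and details.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.9):
    """Greedy NMS: indices of kept boxes, highest score first."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [
    (0.10, 0.10, 0.40, 0.40),  # banana A, best box
    (0.12, 0.11, 0.42, 0.41),  # banana A, near-duplicate
    (0.60, 0.55, 0.90, 0.85),  # banana B
    (0.05, 0.60, 0.20, 0.80),  # low-confidence box
]
scores = [0.98, 0.95, 0.93, 0.40]
print(nms(boxes, scores))  # [0, 2]
```

The near-duplicate of banana A is suppressed because its IoU with the best box is about 0.82, and the low-confidence box never enters the comparison.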
SSD = base CNN + multiscale feature pyramid + per-level class & offset heads.
One forward pass → all anchor predictions; NMS at the end. No region proposal step.
Loss = class cross-entropy over all anchors + L1 offset loss over positive anchors only.
SSD and RetinaNet are anchor-based dense single-stage detectors. YOLO is a related single-stage family, while modern anchor-free detectors remove explicit anchors but keep dense classification/localization over feature maps.
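The loss line in the summary can be made concrete with a small plain-Python sketch. It assumes an unweighted, unnormalized sum of the two terms and applies cross-entropy to every anchor; the function names and example values are hypothetical, and a real implementation may average or weight the terms differently.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one anchor (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def ssd_loss(cls_logits, cls_labels, box_preds, box_labels, box_masks):
    """Class cross-entropy over all anchors plus masked L1 offset loss."""
    cls_term = sum(cross_entropy(z, y)
                   for z, y in zip(cls_logits, cls_labels))
    box_term = sum(abs(t - p) * m
                   for p, t, m in zip(box_preds, box_labels, box_masks))
    return cls_term + box_term

# Anchor 0 is background (mask 0); anchor 1 is matched to an object (mask 1).
cls_logits = [[2.0, 0.0], [0.0, 2.0]]
cls_labels = [0, 1]
box_preds = [5.0, 5.0, 5.0, 5.0, 0.25, 0.0, 0.0, 0.25]
box_labels = [0.0] * 8
box_masks = [0, 0, 0, 0, 1, 1, 1, 1]

loss = ssd_loss(cls_logits, cls_labels, box_preds, box_labels, box_masks)
print(f"{loss:.3f}")  # 0.754
```

The large offset errors on the background anchor contribute nothing, because its mask is zero.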