Single Shot Multibox Detection

Single Shot Detection

Single Shot MultiBox Detector (Liu et al., 2016) is the prototype single-stage detector. One forward pass produces class scores and box offsets for every anchor at every scale; NMS keeps the survivors.

The architecture: a CNN trunk, then a pyramid of feature maps at decreasing resolutions. Each level has its own pair of 1×1-style heads — one for class scores, one for box offsets. Predictions from all levels are concatenated.

SSD = base network + several multiscale feature blocks; each block has its own anchor predictor.

Scaling to objects in images

Objects appear at different pixel sizes. SSD handles this by predicting from several feature maps at once:

early maps: many spatial cells, small receptive fields, small anchor boxes;
deeper maps: fewer spatial cells, larger receptive fields, larger anchor boxes.

The model does not resize every candidate region. It learns classification and offset heads at each scale, then pools all anchors into one detection set before NMS.

Class and box prediction heads

For a feature map with a anchors per pixel and q classes, the class head is a 3×3 conv with a(q+1) output channels; the box head outputs 4a:

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, gluon, image, init, np, npx
from mxnet.gluon import nn

npx.set_np()

def cls_predictor(num_anchors, num_classes):
    return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3,
                     padding=1)

def bbox_predictor(num_anchors):
    return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)

Concatenating across scales

Each level produces predictions of a different shape; flatten and concat them so the loss can run on a single tensor:

def forward(x, block):
    block.initialize()
    return block(x)

Y1 = forward(np.zeros((2, 8, 20, 20)), cls_predictor(5, 10))
Y2 = forward(np.zeros((2, 16, 10, 10)), cls_predictor(3, 10))
Y1.shape, Y2.shape

((2, 55, 20, 20), (2, 33, 10, 10))

def flatten_pred(pred):
    return npx.batch_flatten(pred.transpose(0, 2, 3, 1))

def concat_preds(preds):
    return np.concatenate([flatten_pred(p) for p in preds], axis=1)

concat_preds([Y1, Y2]).shape

(2, 25300)

Downsampling block

Halves the feature map resolution between scales — two 3×3 conv-BN-ReLU layers + 2×2 max pool:

def down_sample_blk(num_channels):
    blk = nn.Sequential()
    for _ in range(2):
        blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
                nn.BatchNorm(in_channels=num_channels),
                nn.Activation('relu'))
    blk.add(nn.MaxPool2D(2))
    return blk

forward(np.zeros((2, 3, 20, 20)), down_sample_blk(10)).shape

(2, 10, 10, 10)

Base network

A small CNN that takes the input image down to the first useful resolution:

def base_net():
    blk = nn.Sequential()
    for num_filters in [16, 32, 64]:
        blk.add(down_sample_blk(num_filters))
    return blk

forward(np.zeros((2, 3, 256, 256)), base_net()).shape

(2, 64, 32, 32)

Five-block pyramid

Stack base network + a few downsampling blocks. Each level exposes its feature map for anchor prediction:

def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 4:
        blk = nn.GlobalMaxPool2D()
    else:
        blk = down_sample_blk(128)
    return blk

def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)

Per-level scales

Bigger anchor scales at deeper levels (small feature map → large receptive field → large anchors):

five feature levels use progressively larger anchors;
each level predicts class logits and box offsets;
predictions are concatenated across levels before loss or decoding.

TinySSD model

The full model is a feature pyramid plus two lightweight heads per level:

\text{image} \rightarrow \{(\text{anchors}_\ell, \text{class}_\ell, \text{box}_\ell)\}_{\ell=1}^{5}.

Showing the whole class definition on a slide hides the idea; the important contract is the output shape and anchor ordering. Every anchor needs one class vector and one four-number offset vector.

TinySSD output shapes

For a 256 \times 256 image, the five feature maps create (32^2 + 16^2 + 8^2 + 4^2 + 1) \times 4 = 5444 anchors. With one foreground class, expect:

anchors: (batch, 5444, 4);
class logits: (batch, 5444, 2) for background/banana;
offsets: (batch, 5444 * 4).

output anchors: (1, 5444, 4)
output class preds: (32, 5444, 2)
output bbox preds: (32, 21776)

Loading data + init

batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)

read 1000 training examples
read 100 validation examples

device, net = d2l.try_gpu(), TinySSD(num_classes=1)
net.initialize(init=init.Xavier(), ctx=device)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.2, 'wd': 5e-4})

Multi-task loss

Two loss terms:

Classification — cross-entropy over class scores.
Localization — L_1 on box offsets, computed only on positive anchors (ignore the rest).

\mathcal{L} = \text{CE}(\hat{\mathbf{c}}, \mathbf{c}) + \frac{1}{N_+}\sum_i m_i \lVert \hat{\mathbf{t}}_i - \mathbf{t}_i\rVert_1,

where m_i=1 only for anchors matched to an object.

cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
bbox_loss = gluon.loss.L1Loss()

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    cls = cls_loss(cls_preds, cls_labels)
    bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
    return cls + bbox

def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float((cls_preds.argmax(axis=-1).astype(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((np.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())

Training

Standard SGD loop, two evaluation metrics (class accuracy, box mean abs error). Read them together: class accuracy is dominated by many background anchors, while box error only makes sense on matched positive anchors.

num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # Sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    for features, target in train_iter:
        timer.start()
        X = features.as_in_ctx(device)
        Y = target.as_in_ctx(device)
        with autograd.record():
            # Generate multiscale anchor boxes and predict their classes and
            # offsets
            anchors, cls_preds, bbox_preds = net(X)
            # Label the classes and offsets of these anchor boxes
            bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors,
                                                                      Y)
            # Calculate the loss function using the predicted and labeled
            # values of the classes and offsets
            l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                          bbox_masks)
        l.backward()
        trainer.step(batch_size)
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.size,
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.size)
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter._dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')

class err 3.59e-03, bbox mae 3.89e-03
1476.3 examples/sec on gpu(0)

Inference

Forward pass → anchors + class scores + offsets → invert offsets → NMS → keep boxes above a confidence threshold:

img = image.imread('../img/banana.jpg')
feature = image.imresize(img, 256, 256).astype('float32')
X = np.expand_dims(feature.transpose(2, 0, 1), axis=0)

def predict(X):
    anchors, cls_preds, bbox_preds = net(X.as_in_ctx(device))
    cls_probs = npx.softmax(cls_preds).transpose(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)

Detect bananas

Visualize all predictions with confidence ≥ 0.9. The useful thing to notice is not the raw tensor length, but whether NMS leaves one tight high-confidence box per banana:

def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img.asnumpy())
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[:2]
        bbox = [row[2:6] * np.array((w, h, w, h), ctx=row.ctx)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output, threshold=0.9)

sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = np.arange(-2, 2, 0.1)
d2l.set_figsize()

for l, s in zip(lines, sigmas):
    y = npx.smooth_l1(x, scalar=s)
    d2l.plt.plot(x.asnumpy(), y.asnumpy(), l, label='sigma=%.1f' % s)
d2l.plt.legend();

def focal_loss(gamma, x):
    return -(1 - x) ** gamma * np.log(x)

x = np.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = d2l.plt.plot(x.asnumpy(), focal_loss(gamma, x).asnumpy(), l,
                     label='gamma=%.1f' % gamma)
d2l.plt.legend();

Recap

SSD = base CNN + multiscale feature pyramid + per-level class & offset heads.
One forward pass → all anchor predictions; NMS at the end. No region proposal step.
Loss = class cross-entropy + offset L_1, only on positive anchors.
SSD and RetinaNet are anchor-based dense single-stage detectors. YOLO is a related single-stage family, while modern anchor-free detectors remove explicit anchors but keep dense classification/localization over feature maps.