Multiscale Object Detection

Multiscale Detection

A single feature map cannot detect objects at all scales: small objects shrink to almost nothing on deep, low-resolution feature maps, while large objects exceed the receptive field of early, high-resolution ones. The fix is to generate anchors on multiple feature maps, each tuned to a different size range.

The recipe — used by SSD and FPN:

Early feature map (high resolution) → small anchors for small objects.
Middle feature map → medium anchors.
Deep feature map (low resolution, large receptive field) → large anchors.

Each feature map gets its own classification + regression heads. Predictions from all maps are concatenated, then NMS prunes the result.
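The merge-then-prune step can be sketched in plain Python. This is a minimal, class-agnostic illustration under stated assumptions, not the SSD or FPN implementation: the helper names `iou` and `merge_and_nms`, the 0.5 threshold, and the example boxes are all hypothetical, and real detectors usually run NMS per class on tensors.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_and_nms(per_level_preds, iou_threshold=0.5):
    """Concatenate (box, score) predictions from all levels, then run greedy NMS."""
    # Concatenate across feature maps; the level of origin no longer matters.
    all_preds = [p for level in per_level_preds for p in level]
    # Visit predictions from highest to lowest score.
    all_preds.sort(key=lambda p: p[1], reverse=True)
    kept = []
    for box, score in all_preds:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# Hypothetical predictions: two levels fire on the same object, one on another.
fine = [((0.10, 0.10, 0.30, 0.30), 0.90)]
middle = [((0.12, 0.11, 0.31, 0.29), 0.60), ((0.50, 0.50, 0.90, 0.90), 0.80)]
kept = merge_and_nms([fine, middle])
print(len(kept))  # 2: the 0.60 box is suppressed by the overlapping 0.90 box
```

Sorting the concatenated list before pruning is what lets a confident prediction from one level suppress a weaker duplicate from another.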

Setup

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import image, np, npx

npx.set_np()

img = image.imread('../img/catdog.jpg')
h, w = img.shape[:2]
h, w

(561, 728)

Anchors on a feature map

Center n + m - 1 anchors on each pixel of the feature map, where n is the number of sizes and m is the number of aspect ratios. Pixel positions back-project to image coordinates, so a smaller feature map means fewer candidate centers, each responsible for a larger region of the image:

def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # Values on the first two dimensions do not affect the output
    fmap = np.zeros((1, 10, fmap_h, fmap_w))
    anchors = npx.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = np.array((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img.asnumpy()).axes,
                    anchors[0] * bbox_scale)
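To make the n + m - 1 tiling concrete, here is a rough pure-Python sketch of how such anchors could be laid out. It is an assumption-laden illustration, not the `npx.multibox_prior` implementation: the helper name `sketch_anchors` is hypothetical, and the library may normalize anchor widths differently for non-square inputs.

```python
import math


def sketch_anchors(fmap_h, fmap_w, sizes, ratios):
    """Return normalized (x1, y1, x2, y2) anchors, n + m - 1 per feature-map pixel."""
    # Pair every size with the first ratio, then the first size with the
    # remaining ratios: n + (m - 1) combinations in total.
    combos = [(s, ratios[0]) for s in sizes] + [(sizes[0], r) for r in ratios[1:]]
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            # Center of pixel (i, j), back-projected to normalized image coordinates.
            cx, cy = (j + 0.5) / fmap_w, (i + 0.5) / fmap_h
            for s, r in combos:
                half_w = s * math.sqrt(r) / 2  # assumed width convention
                half_h = s / math.sqrt(r) / 2  # assumed height convention
                anchors.append((cx - half_w, cy - half_h, cx + half_w, cy + half_h))
    return anchors


anchors = sketch_anchors(4, 4, sizes=[0.15], ratios=[1, 2, 0.5])
print(len(anchors))  # 4 * 4 * (1 + 3 - 1) = 48
```

The coordinates are normalized to [0, 1], which is why the notebook code multiplies by `bbox_scale` before drawing.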

Small objects on a fine map

4 × 4 feature map, small anchor scale → dense coverage of small image regions. Notice the many anchor centers: that density is what small objects need.

display_anchors(fmap_w=4, fmap_h=4, s=[0.15])

Medium objects on a coarser map

A 2 × 2 feature map with a larger anchor scale gives fewer anchors, each covering more area:

display_anchors(fmap_w=2, fmap_h=2, s=[0.4])

Large objects on the coarsest map

1 × 1 feature map, anchor scale 0.8: a single anchor center in the middle of the image, carrying three anchors (one per aspect ratio) that together span most of the image. This level cannot localize tiny details, but it matches objects that occupy most of the image.

display_anchors(fmap_w=1, fmap_h=1, s=[0.8])
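The three calls can be compared numerically with a few lines of arithmetic. This sketch uses the image size printed in Setup and assumes that, on a square feature map, a ratio-1 anchor of scale s spans s times the image width and height after multiplying by `bbox_scale`; the variable names are illustrative only.

```python
img_h, img_w = 561, 728  # image size reported in the Setup section

# (feature-map side, anchor scale) for the three display_anchors calls
levels = [(4, 0.15), (2, 0.4), (1, 0.8)]

summary = []
for side, scale in levels:
    cell_w = img_w / side  # image width covered by one feature-map pixel
    cell_h = img_h / side  # image height covered by one feature-map pixel
    box_w = scale * img_w  # assumed pixel width of the ratio-1 anchor
    box_h = scale * img_h  # assumed pixel height of the ratio-1 anchor
    summary.append((side * side, cell_w, cell_h, box_w, box_h))
    print(f"{side}x{side}: {side * side} centers, "
          f"cell {cell_w:.1f} x {cell_h:.1f} px, "
          f"ratio-1 anchor about {box_w:.1f} x {box_h:.1f} px")
```

Halving the feature-map side quarters the number of centers, while the larger scale makes each anchor cover a bigger share of the image.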

Recap

Multiscale = anchors generated on several feature maps at different resolutions.
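Each feature map handles its own size range; the total anchor count is the sum over maps of H · W · (n + m - 1), so it grows linearly with the number of sizes per map and is dominated by the highest-resolution maps.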
The basis of SSD’s pyramid; FPN improves on this with top-down feature merging.
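Modern detectors such as RetinaNet and FCOS build on FPN-style multiscale features; the original DETR predicts from a single-scale feature map, and follow-ups such as Deformable DETR add multiscale features back.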
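As a quick check on anchor counts, the sketch below tallies the three feature maps used in this section, each with one size and three aspect ratios. The helper name `anchors_per_map` is hypothetical and introduced only for illustration.

```python
def anchors_per_map(fmap_h, fmap_w, num_sizes, num_ratios):
    """Anchors on one feature map: h * w centers, n + m - 1 anchors per center."""
    return fmap_h * fmap_w * (num_sizes + num_ratios - 1)


# The three maps from this section: one size and three ratios each.
maps = [(4, 4), (2, 2), (1, 1)]
counts = [anchors_per_map(fh, fw, num_sizes=1, num_ratios=3) for fh, fw in maps]
total = sum(counts)
print(counts, total)  # [48, 12, 3] 63
```

Most of the anchors come from the finest map, which is typical: the per-map count scales with the number of feature-map pixels.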