Multiscale Object Detection

A single feature map cannot detect objects at all scales: small objects shrink to a fraction of a pixel on deep, low-resolution maps, while large objects exceed the receptive field of early, high-resolution ones. The fix is to generate anchors on multiple feature maps, each tuned to a different size range.

The recipe — used by SSD and FPN:

Early feature map (high resolution) → small anchors for small objects.
Middle feature map → medium anchors.
Deep feature map (low resolution, large receptive field) → large anchors.

Each feature map gets its own classification + regression heads. Predictions from all maps are concatenated, then NMS prunes the result.
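
The concatenate-then-prune step can be sketched in NumPy. This is a minimal illustration under stated assumptions, not SSD's or d2l's implementation: the boxes and scores are made-up values for a single class, and `iou` and `nms` are hypothetical helper names implementing plain greedy NMS.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Hypothetical predictions from two feature-map levels (normalized coords).
fine_boxes = np.array([[0.10, 0.10, 0.30, 0.30],
                       [0.12, 0.11, 0.31, 0.32]])
fine_scores = np.array([0.90, 0.60])
coarse_boxes = np.array([[0.20, 0.20, 0.90, 0.90]])
coarse_scores = np.array([0.80])

# Concatenate along the anchor axis, then prune once over all levels.
all_boxes = np.concatenate([fine_boxes, coarse_boxes], axis=0)
all_scores = np.concatenate([fine_scores, coarse_scores], axis=0)
kept = nms(all_boxes, all_scores)
print(kept)  # [0, 2]
```

The second fine-level box overlaps the first heavily and is suppressed; the coarse-level box barely overlaps either and survives, which is why pruning runs after concatenation rather than per level.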

Setup

%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf

img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]
h, w

(561, 728)

Anchors on a feature map

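Tile each pixel of the feature map with n + m - 1 anchors, where n is the number of sizes and m the number of aspect ratios. Pixel positions back-project to image coordinates as evenly spaced anchor centers, so a smaller feature map means fewer candidate centers, each backed by a larger receptive field: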

def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # Only the last two dimensions (height, width) set the anchor grid;
    # values on the first two dimensions do not affect the output
    fmap = tf.zeros((1, 10, fmap_h, fmap_w))
    anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = tf.constant((w, h, w, h), dtype=tf.float32)
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)
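
To make the n + m - 1 count concrete, here is a rough NumPy sketch of anchor tiling. It is not the code behind `d2l.multibox_prior`: `anchor_sketch` is a hypothetical name, and the shape convention (width = s * sqrt(r), height = s / sqrt(r) in normalized coordinates, with no correction for the image's aspect ratio) is an assumption made for illustration.

```python
import numpy as np

def anchor_sketch(fmap_h, fmap_w, sizes, ratios):
    """Return (fmap_h * fmap_w * (n + m - 1), 4) boxes as (x1, y1, x2, y2)."""
    # Pair every size with the first ratio, then the first size with the
    # remaining ratios: n + m - 1 shapes instead of n * m.
    shapes = [(s, ratios[0]) for s in sizes]
    shapes += [(sizes[0], r) for r in ratios[1:]]
    # Pixel centers of the feature map in normalized image coordinates.
    cy = (np.arange(fmap_h) + 0.5) / fmap_h
    cx = (np.arange(fmap_w) + 0.5) / fmap_w
    boxes = []
    for y in cy:
        for x in cx:
            for s, r in shapes:
                half_w = s * np.sqrt(r) / 2  # assumed: width = s * sqrt(r)
                half_h = s / np.sqrt(r) / 2  # assumed: height = s / sqrt(r)
                boxes.append([x - half_w, y - half_h, x + half_w, y + half_h])
    return np.array(boxes)

anchors_4x4 = anchor_sketch(4, 4, sizes=[0.15], ratios=[1, 2, 0.5])
print(anchors_4x4.shape)  # (48, 4): 16 centers x (1 + 3 - 1) shapes
```

Because the centers come from the feature map's grid and the sizes are given in normalized units, the same function covers every level; only `fmap_h`, `fmap_w`, and `sizes` change.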

Small objects on a fine map

4 × 4 feature map with a small anchor scale (0.15): 16 evenly spaced anchor centers, each with 3 anchors, give dense coverage of small image regions. That density is what small objects need.

display_anchors(fmap_w=4, fmap_h=4, s=[0.15])

Medium objects on a coarser map

2 × 2 feature map with a larger anchor scale (0.4): only 4 anchor centers, so fewer anchors, each covering more area:

display_anchors(fmap_w=2, fmap_h=2, s=[0.4])

Large objects on the coarsest map

1 × 1 feature map, anchor scale 0.8: a single anchor center in the middle of the image, with one anchor per aspect ratio, each spanning most of the image. This level cannot localize tiny details, but it matches objects that occupy most of the image.

display_anchors(fmap_w=1, fmap_h=1, s=[0.8])
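
As a quick check on how the anchor budget splits across levels, the arithmetic for the three `display_anchors` calls above can be written out. The sketch only applies the n + m - 1 rule per pixel; the variable names are illustrative.

```python
# (feature-map height, width, number of sizes) for the three calls above.
levels = [(4, 4, 1), (2, 2, 1), (1, 1, 1)]
num_ratios = 3  # ratios=[1, 2, 0.5] in display_anchors

# Each pixel carries (n + m - 1) anchors.
per_level = [fh * fw * (n + num_ratios - 1) for fh, fw, n in levels]
total = sum(per_level)
print(per_level, total)  # [48, 12, 3] 63
```

The finest map accounts for most of the anchors, which matches the intuition that small objects need the densest sampling.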

Recap

Multiscale = anchors generated on several feature maps at different resolutions.
Each feature map handles its own size range; the total anchor count is the sum over maps of height × width × (n + m - 1), so the highest-resolution map contributes the most.
The basis of SSD’s pyramid; FPN improves on this with top-down feature merging.
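Many modern detectors build on the same idea: RetinaNet and FCOS predict on FPN levels (FCOS without anchors), while the original DETR uses a single feature map and Deformable DETR adds multiscale features.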