Region-based CNNs (R-CNNs)

The R-CNN Family

SSD does it all in one forward pass. The R-CNN family takes a different approach: first propose regions of interest, then classify and refine each one. Slower per image but historically more accurate and easier to extend (masks, keypoints).

The lineage:

R-CNN (2014) — selective search + per-region CNN.
Fast R-CNN (2015) — one CNN forward, RoI pooling per region. ~100× speedup.
Faster R-CNN (2015) — learn the region proposals too.
Mask R-CNN (2017) — adds a per-RoI segmentation head.

R-CNN

For each of ~2k selective-search proposals, warp to fixed size, run a CNN, classify with an SVM, regress a refined box. Conceptually clear, computationally horrible — 2k forward passes per image:

R-CNN: per-proposal forward passes.
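
To make the cost concrete, here is a minimal NumPy sketch of the warping step, assuming nearest-neighbor resizing and three made-up proposals. The helper warp_proposal and the box values are illustrative, not part of the original R-CNN code. Each row of the resulting batch would need its own CNN forward pass.

```python
import numpy as np

def warp_proposal(image, box, out_size):
    """Crop box = (x1, y1, x2, y2) from an HWC image and resize it to
    out_size = (height, width) with nearest-neighbor sampling."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    out_h, out_w = out_size
    # Nearest-neighbor index maps from output pixels back into the crop.
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows][:, cols]

rng = np.random.default_rng(0)
image = rng.random((60, 80, 3)).astype(np.float32)
# Hypothetical proposals; a real pipeline would get ~2k from selective search.
proposals = [(0, 0, 40, 30), (10, 5, 70, 55), (20, 20, 36, 52)]

batch = np.stack([warp_proposal(image, b, (32, 32)) for b in proposals])
print(batch.shape)  # (3, 32, 32, 3): one CNN input per proposal
```

With ~2k proposals the batch would have ~2k rows, and overlapping proposals recompute the same pixels many times.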

Fast R-CNN

One forward pass on the whole image. Proposals come from the same selective search, but they index into the shared feature map via RoI pooling, which crops and resizes a variable rectangle to a fixed-size feature:

Fast R-CNN: shared backbone + RoI pooling per proposal.

RoI pooling

Variable rectangle in feature space → fixed grid (e.g. 2 × 2). Each output cell is the max over its sub-region of the rectangle. Differentiable, fast, batchable:

2 × 2 RoI pooling: max-pool each sub-region of the proposal to a fixed-size output.

from d2l import tensorflow as d2l
import tensorflow as tf
import numpy as np

# TF uses NHWC; we store the 4x4 feature map as (1, 4, 4, 1)
X = tf.cast(tf.reshape(tf.range(16), (1, 4, 4, 1)), tf.float32)
X

<tf.Tensor: shape=(1, 4, 4, 1), dtype=float32, numpy=
array([[[[ 0.],
         [ 1.],
         [ 2.],
         [ 3.]],

...
         [11.]],

        [[12.],
         [13.],
         [14.],
         [15.]]]], dtype=float32)>

RoI coordinates

Each RoI row stores (batch_id, x1, y1, x2, y2) in input-image coordinates. spatial_scale maps those coordinates onto the shared feature map before pooling:

rois = np.array([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]], dtype=np.float32)
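
As a quick check of what spatial_scale does, the sketch below projects the two RoIs onto a feature map, assuming spatial_scale = 0.1 (the value used with roi_pool in this section) and rounding to whole cells. The names batch_ids and feat_boxes are illustrative.

```python
import numpy as np

rois = np.array([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]], dtype=np.float32)
spatial_scale = 0.1  # assumed: feature map is 1/10 the input resolution

batch_ids = rois[:, 0].astype(int)
# Scale (x1, y1, x2, y2) into feature-map units, then round to whole cells.
feat_boxes = np.round(rois[:, 1:] * spatial_scale).astype(int)
print(batch_ids)   # [0 0]
print(feat_boxes)  # [[0 0 2 2]
                   #  [0 1 3 3]]
```

The rounding here is the quantization step that RoI align later removes.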

RoI pooling output

No matter how large the proposal is, the result has fixed shape (num_rois, pooled_h, pooled_w, channels) in TensorFlow's NHWC layout. That fixed-size tensor is what lets Fast R-CNN batch all region heads:

# TensorFlow does not ship a built-in RoI pooling op in the public API,
# so we implement max-RoI pooling manually.  X is NHWC here.
def roi_pool(X, rois, pooled_size, spatial_scale):
    """RoI max-pooling for NHWC tensors."""
    num_rois = rois.shape[0]
    ph, pw = pooled_size
    outputs = []
    for i in range(num_rois):
        roi = rois[i]
        batch_idx = int(roi[0])
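        # Project image coordinates onto the feature map, rounding to cells.
        x1 = int(np.round(float(roi[1]) * spatial_scale))
        y1 = int(np.round(float(roi[2]) * spatial_scale))
        x2 = int(np.round(float(roi[3]) * spatial_scale))
        y2 = int(np.round(float(roi[4]) * spatial_scale))
        # (x2, y2) is treated as exclusive; keep at least one cell per axis.
        x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)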
        # roi_feat shape: (h_roi, w_roi, C)
        roi_feat = X[batch_idx, y1:y2, x1:x2, :]
        h = roi_feat.shape[0]
        w = roi_feat.shape[1]
        bin_h = np.linspace(0, h, ph + 1).astype(int)
        bin_w = np.linspace(0, w, pw + 1).astype(int)
        pooled_bins = []
        for pi in range(ph):
            row_bins = []
            for pj in range(pw):
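                # Guard against empty bins when the RoI is smaller than
                # the pooled grid: every bin covers at least one cell.
                h_end = max(bin_h[pi+1], bin_h[pi] + 1)
                w_end = max(bin_w[pj+1], bin_w[pj] + 1)
                sub = roi_feat[bin_h[pi]:h_end, bin_w[pj]:w_end, :]
                row_bins.append(tf.reduce_max(sub, axis=[0, 1]))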
            pooled_bins.append(tf.stack(row_bins, axis=0))  # (pw, C)
        # stack rows -> (ph, pw, C)
        outputs.append(tf.stack(pooled_bins, axis=0))
    # (num_rois, ph, pw, C)
    return tf.stack(outputs, axis=0)

roi_pool(X, rois, pooled_size=(2, 2), spatial_scale=0.1)

<tf.Tensor: shape=(2, 2, 2, 1), dtype=float32, numpy=
array([[[[ 0.],
         [ 1.]],

        [[ 4.],
         [ 5.]]],
...

       [[[ 4.],
         [ 6.]],

        [[ 8.],
         [10.]]]], dtype=float32)>
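
Because every RoI yields the same shape, the per-region heads reduce to ordinary batched matrix multiplies. The sketch below illustrates this with untrained random weights. The names num_classes, W_cls and W_box are hypothetical, and a real head would use trained fully connected layers.

```python
import numpy as np

# Pooled features for the two RoIs above, shape (num_rois, ph, pw, C).
pooled = np.array([[[[0.], [1.]], [[4.], [5.]]],
                   [[[4.], [6.]], [[8.], [10.]]]], dtype=np.float32)
num_rois = pooled.shape[0]
flat = pooled.reshape(num_rois, -1)          # (2, 4)

num_classes = 3                              # hypothetical, incl. background
rng = np.random.default_rng(0)
W_cls = rng.normal(size=(flat.shape[1], num_classes)).astype(np.float32)
W_box = rng.normal(size=(flat.shape[1], 4 * num_classes)).astype(np.float32)

cls_scores = flat @ W_cls                    # (2, 3): one row per RoI
box_deltas = flat @ W_box                    # (2, 12): 4 offsets per class
print(cls_scores.shape, box_deltas.shape)
```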

Faster R-CNN

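Replace selective search with a learnable Region Proposal Network (RPN). The RPN is a small CNN head on the shared backbone: it scores predefined anchor boxes for objectness and regresses offsets for them, and the top-scoring boxes become the proposals that the second-stage head classifies and refines. End-to-end trainable: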

Faster R-CNN: RPN replaces selective search; one network does both stages.
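
The RPN makes its predictions relative to a fixed grid of anchor boxes. The sketch below shows one plausible way to generate them in NumPy. The function make_anchors and the stride, sizes and ratios are illustrative choices, not the values from the Faster R-CNN paper.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, sizes, ratios):
    """Return (feat_h * feat_w * len(sizes) * len(ratios), 4) anchors as
    (x1, y1, x2, y2) in input-image pixels."""
    # Anchor centers sit at the center of each feature-map cell.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                       # each (feat_h, feat_w)
    # Width/height for every (size, ratio) pair; ratio = width / height.
    s, r = np.meshgrid(sizes, ratios, indexing="ij")
    w = (s * np.sqrt(r)).ravel()
    h = (s / np.sqrt(r)).ravel()
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (cells, 2)
    half = np.stack([w, h], axis=1) / 2                     # (k, 2)
    top_left = centers[:, None, :] - half[None, :, :]
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

anchors = make_anchors(feat_h=4, feat_w=4, stride=10,
                       sizes=[8, 16], ratios=[0.5, 1.0, 2.0])
print(anchors.shape)  # (96, 4): 4 * 4 cells, 6 anchors each
print(anchors[1])     # size 8, ratio 1 at the first cell: [1. 1. 9. 9.]
```

For each anchor the RPN would output an objectness score and four box offsets; only the anchor grid itself is shown here.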

Mask R-CNN

Add a third per-RoI head — a small FCN that produces a binary mask. Switching from RoI pool to RoI align (no quantization rounding) was crucial for getting masks sharp enough to be useful:

Mask R-CNN: Faster R-CNN + per-RoI mask FCN.
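
To illustrate what "no quantization rounding" means, here is a simplified RoI align sketch in NumPy. It assumes a single-channel feature map, integer coordinates at cell centers, and one bilinear sample at the center of each output bin. The Mask R-CNN paper averages several samples per bin, so treat this as an approximation of the idea rather than the reference implementation.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feat, box, pooled_size):
    """Simplified RoI align: one sample at the center of each output bin.
    box = (x1, y1, x2, y2) in feature-map units, left unrounded."""
    x1, y1, x2, y2 = box
    ph, pw = pooled_size
    bin_h, bin_w = (y2 - y1) / ph, (x2 - x1) / pw
    out = np.empty((ph, pw), dtype=np.float32)
    for i in range(ph):
        for j in range(pw):
            out[i, j] = bilinear(feat, y1 + (i + 0.5) * bin_h,
                                 x1 + (j + 0.5) * bin_w)
    return out

feat = np.arange(16, dtype=np.float32).reshape(4, 4)
aligned = roi_align(feat, (0.0, 1.0, 3.0, 3.0), (2, 2))
print(aligned)  # [[ 6.75  8.25]
                #  [10.75 12.25]]
```

The sample points fall between cells (for example at x = 0.75), so the output varies smoothly as the box shifts, which is what keeps mask boundaries aligned with the image.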

Recap

Two-stage detectors: propose regions, then classify and refine each one.
Fast R-CNN’s contribution: shared backbone + RoI pooling, ~100× faster than R-CNN.
Faster R-CNN’s contribution: learn the proposal too — end-to-end trainable.
Mask R-CNN’s contribution: per-RoI mask head + RoI align — instance segmentation as a small extension.
Single-stage (SSD) wins on speed; two-stage often wins on accuracy. Both remain widely used in production.