Padding and Stride

Padding and stride control resolution

A plain convolution always shrinks its input. With an n \times n image and a k \times k kernel:

n \times n \;\longrightarrow\; (n - k + 1) \times (n - k + 1).

Stack ten 5×5 layers on a 240×240 image:

240 \to 236 \to 232 \to \ldots \to 200.

30% of the area gone — all of it from the boundary. Two knobs to control this: padding (fight the shrink) and stride (lean into it, on purpose).

The boundary problem

Even before stacking, single convs use boundary pixels much less than central ones. Each pixel only contributes when the kernel window covers it — interior pixels appear in many windows, corner pixels in just one:

Pixel utilization: 1×1, 2×2, and 3×3 kernels. Larger kernels make the boundary problem worse.

Information at the edges is systematically underweighted. Padding fixes both the shrinking and the underweighting.

Padding: add zeros around the input

Add a “frame” of zero-valued pixels around the input. The kernel can now slide further, including positions where its window hangs off the original image:

3×3 input padded to 5×5; 2×2 kernel gives a 4×4 output. The shaded element is 0{\cdot}0 + 0{\cdot}1 + 0{\cdot}2 + 0{\cdot}3 = 0.

With p_h rows and p_w columns of padding total:

(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1).

To preserve shape: pick p_h = k_h - 1, p_w = k_w - 1.

Why kernels are usually odd

For odd k, (k-1)/2 is an integer — we can pad symmetrically, the same on both sides, and the output position (i, j) corresponds to a window centered on input (i, j). Clean to reason about.

Standard sizes: 1, 3, 5, 7.
“SAME padding” = p = (k-1)/2 → output shape = input shape.
Even k forces a left/right asymmetry — pad floor on one side, ceil on the other.

That’s why every modern CNN you see uses 3×3 kernels with padding 1, or 5×5 with padding 2, or 7×7 with padding 3.

Padding helper

Define a helper to wrap input/output reshaping, then ask the basic question: 8×8 input, 3×3 kernel, padding=1 — what’s the output shape?

from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

Shape-preserving padding

# We define a helper function to calculate convolutions. It initializes 
# the convolutional layer weights and performs corresponding dimensionality 
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    conv2d.initialize()
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns are added
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = np.random.uniform(size=(8, 8))
comp_conv2d(conv2d, X).shape

(8, 8)

Same shape — the padded conv is “shape-preserving”. With an asymmetric kernel, mirror the asymmetry in the padding:

# We use a convolution kernel with height 5 and width 3. The padding on
# either side of the height and width are 2 and 1, respectively
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape

(8, 8)

Stride: skipping positions on purpose

The opposite problem: sometimes the input is huge and we want to shrink fast. Move the kernel by s > 1 at each step — skipping intermediate positions:

Cross-correlation with vertical stride 3 and horizontal stride 2. The kernel jumps three rows down and two columns right.

Computational benefit: s\times fewer positions to evaluate in each direction, so s_h s_w \times fewer operations. Statistical benefit: aggressive downsampling forces the network to summarize.

Stride formula

With strides s_h vertically and s_w horizontally:

\Big\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \Big\rfloor \times \Big\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \Big\rfloor.

Two helpful special cases:

SAME-padded (p = k - 1): output is \lfloor (n + s - 1)/s \rfloor.
SAME + s divides n: output is exactly n / s.

So the standard “halve the resolution” recipe — kernel 3, padding 1, stride 2 — turns n \times n into \lceil n/2 \rceil \times \lceil n/2 \rceil.

Stride in code

Halving an 8×8 input:

conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape

(4, 4)

A more aggressive (and asymmetric) version — kernel 3×5, padding 0×1, stride 3×4:

conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape

(2, 2)

The output formula above predicts the shape; the code just confirms it.

Dilation: nine weights, a wider view

A dilated convolution spaces the kernel taps d pixels apart. Effective kernel size k + (k-1)(d-1): a 3×3 kernel at dilation 2 samples a 5×5 window at unchanged cost.

A 3×3 kernel at dilation 1 and dilation 2.

# Dilation 2 gives the 3x3 kernel a 5x5 footprint, so both convolutions
# shrink the 8x8 input to 4x4
conv2d = nn.Conv2D(1, kernel_size=3, dilation=2)
conv5 = nn.Conv2D(1, kernel_size=5)
comp_conv2d(conv2d, X).shape, comp_conv2d(conv5, X).shape

((4, 4), (4, 4))

Doubling d at every layer grows the receptive field exponentially with depth. Standard tool in dense prediction (FCN, DeepLab), where the output must keep full resolution.

Three patterns to remember

Most production CNNs are built from these three:

	kernel	padding	stride	output
Preserve	3	1	1	n \times n
Halve	3	1	2	n/2 \times n/2
Patchify	k	0	k	n/k \times n/k

Preserve: ResNet feature mixing. Halve: standard downsample layer. Patchify: ViT turns 224×224 into a 14×14 grid of 16×16 patches in one shot.

Recap

Vanilla conv shrinks: n \to n - k + 1. Cumulative shrinkage destroys boundary information.
Padding p grows the input — pick p = k - 1 to preserve shape (“SAME”). Odd kernels make this symmetric.
Stride s skips positions — s = 2 is the standard downsampler; large s patchifies.
Dilation d spreads the taps: effective kernel k + (k-1)(d-1), exponential receptive-field growth at fixed cost.
One formula covers all three: \text{out} = \Big\lfloor \frac{n - d(k-1) - 1 + p}{s} \Big\rfloor + 1.
Pad and stride per-axis; height and width are independent.