import tensorflow as tfA plain convolution always shrinks its input. With an n \times n image and a k \times k kernel:
n \times n \;\longrightarrow\; (n - k + 1) \times (n - k + 1).
Stack ten 5×5 layers on a 240×240 image:
240 \to 236 \to 232 \to \ldots \to 200.
30% of the area gone — all of it from the boundary. Two knobs to control this: padding (fight the shrink) and stride (lean into it, on purpose).
Even before stacking, single convs use boundary pixels much less than central ones. Each pixel only contributes when the kernel window covers it — interior pixels appear in many windows, corner pixels in just one:
Pixel utilization: 1×1, 2×2, and 3×3 kernels. Larger kernels make the boundary problem worse.
Information at the edges is systematically underweighted. Padding fixes both the shrinking and the underweighting.
Add a “frame” of zero-valued pixels around the input. The kernel can now slide further, including positions where its window hangs off the original image:
3×3 input padded to 5×5; 2×2 kernel gives a 4×4 output. The shaded element is 0{\cdot}0 + 0{\cdot}1 + 0{\cdot}2 + 0{\cdot}3 = 0.
With p_h rows and p_w columns of padding total:
(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1).
To preserve shape: pick p_h = k_h - 1, p_w = k_w - 1.
For odd k, (k-1)/2 is an integer — we can pad symmetrically, the same on both sides, and the output position (i, j) corresponds to a window centered on input (i, j). Clean to reason about.
That’s why every modern CNN you see uses 3×3 kernels with padding 1, or 5×5 with padding 2, or 7×7 with padding 3.
Define a helper to wrap input/output reshaping, then ask the basic question: 8×8 input, 3×3 kernel, padding=1 — what’s the output shape?
# We define a helper function to calculate convolutions. It initializes
# the convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
# (1, 1) indicates that batch size and the number of channels are both 1
X = tf.reshape(X, (1, ) + X.shape + (1, ))
Y = conv2d(X)
# Strip the first two dimensions: examples and channels
return tf.reshape(Y, Y.shape[1:3])
# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same')
X = tf.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shapeTensorShape([8, 8])
Same shape — the padded conv is “shape-preserving”. With an asymmetric kernel, mirror the asymmetry in the padding:
TensorShape([8, 8])
The opposite problem: sometimes the input is huge and we want to shrink fast. Move the kernel by s > 1 at each step — skipping intermediate positions:
Cross-correlation with vertical stride 3 and horizontal stride 2. The kernel jumps three rows down and two columns right.
Computational benefit: s\times fewer positions to evaluate in each direction, so s_h s_w \times fewer operations. Statistical benefit: aggressive downsampling forces the network to summarize.
With strides s_h vertically and s_w horizontally:
\Big\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \Big\rfloor \times \Big\lfloor \frac{n_w - k_w + p_w + s_w}{s_w} \Big\rfloor.
Two helpful special cases:
So the standard “halve the resolution” recipe — kernel 3, padding 1, stride 2 — turns n \times n into \lceil n/2 \rceil \times \lceil n/2 \rceil.
Halving an 8×8 input:
TensorShape([4, 4])
A more aggressive (and asymmetric) version — kernel 3×5, padding 0×1, stride 3×4:
# tf.keras.Conv2D accepts only 'same'/'valid' for `padding`; we use a
# ZeroPadding2D layer to apply padding=(0, 1) (matching the MX/PT/JAX
# tabs) before the convolution.
conv2d = tf.keras.Sequential([
tf.keras.layers.ZeroPadding2D(padding=(0, 1)),
tf.keras.layers.Conv2D(1, kernel_size=(3, 5), padding='valid',
strides=(3, 4))])
comp_conv2d(conv2d, X).shapeTensorShape([2, 2])
The output formula above predicts the shape; the code just confirms it.
Most production CNNs are built from these three:
| kernel | padding | stride | output | |
|---|---|---|---|---|
| Preserve | 3 | 1 | 1 | n \times n |
| Halve | 3 | 1 | 2 | n/2 \times n/2 |
| Patchify | k | 0 | k | n/k \times n/k |
Preserve: ResNet feature mixing. Halve: standard downsample layer. Patchify: ViT turns 224×224 into a 14×14 grid of 16×16 patches in one shot.