from d2l import tensorflow as d2l
import tensorflow as tf
A fully-connected layer on a 1-megapixel RGB image needs roughly 3 million weights per output unit. That is wildly wasteful, since pixel correlations are local and the same edge detector should work everywhere.
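A convolutional layer swaps this for two strong inductive biases: locality (each output depends only on a small neighborhood of the input) and translation invariance (the same kernel weights are reused at every position).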
Thousands of parameters instead of millions, with exactly the right prior for natural images.
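The parameter savings can be checked with a few lines of arithmetic. This is a rough sketch: the 1000×1000 image size, the 3×3 kernel, and the 64 output channels are assumed values chosen only for illustration.

```python
# Assumed sizes for illustration: a 1000x1000 RGB image and a 3x3 kernel.
height, width, channels = 1000, 1000, 3

# Fully connected: one weight per input value, for every output unit.
fc_weights_per_output = height * width * channels

# Convolution: one 3x3 window per input channel, shared across all positions.
kernel_h, kernel_w = 3, 3
conv_weights_per_output_channel = kernel_h * kernel_w * channels

# Assumed layer width: 64 output channels (hypothetical, not from the text).
conv_weights_layer = 64 * conv_weights_per_output_channel

print(fc_weights_per_output)            # 3000000
print(conv_weights_per_output_channel)  # 27
print(conv_weights_layer)               # 1728
```

The convolutional count does not change if the image gets larger, because the same kernel is reused at every position.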
Slide a small kernel \mathbf{K} over the input \mathbf{X}. At each position, multiply elementwise and sum:
Y[i, j] = \sum_{a, b} X[i+a, j+b]\, K[a, b].
Cross-correlation: 3×3 input × 2×2 kernel → 2×2 output. Shaded element: 0{\cdot}0 + 1{\cdot}1 + 3{\cdot}2 + 4{\cdot}3 = 19.
The output is smaller than the input by k - 1 along each dimension, where k is the kernel size along that dimension. Padding, covered in the next section, undoes this shrinkage.
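The shrinkage rule can be written as a small helper. This is a sketch assuming no padding and a stride of 1; `conv_output_shape` is a hypothetical name introduced here, not a library function.

```python
def conv_output_shape(input_shape, kernel_shape):
    """Output (height, width) of a cross-correlation with no padding, stride 1."""
    n_h, n_w = input_shape
    k_h, k_w = kernel_shape
    if k_h > n_h or k_w > n_w:
        raise ValueError("kernel must fit inside the input")
    # Each dimension loses (kernel size - 1) positions.
    return (n_h - k_h + 1, n_w - k_w + 1)

print(conv_output_shape((3, 3), (2, 2)))  # (2, 2)
print(conv_output_shape((6, 8), (1, 2)))  # (6, 7)
print(conv_output_shape((8, 6), (1, 2)))  # (8, 5)
```

The three calls correspond to the input and kernel shapes used in this section.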
Two nested loops over output positions. Each cell is a slice multiplied elementwise with the kernel and summed:
Verify against the figure — 3×3 input × 2×2 kernel → 2×2 output with the worked-out values:
<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
array([[19., 25.],
       [37., 43.]], dtype=float32)>
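The two-loop computation can be reproduced without any tensor library. The sketch below uses nested Python lists in place of tensors; `corr2d_lists` is a hypothetical name, distinct from the `corr2d` used elsewhere in this section. The 3×3 input with values 0 to 8 is an assumption inferred from the shaded-element arithmetic and the printed result.

```python
def corr2d_lists(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    Y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):        # loop over output rows
        for j in range(out_w):    # loop over output columns
            # Multiply the h x w window at (i, j) elementwise with K and sum.
            Y[i][j] = float(sum(X[i + a][j + b] * K[a][b]
                                for a in range(h) for b in range(w)))
    return Y

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # assumed input, inferred from the figure
K = [[0, 1], [2, 3]]
print(corr2d_lists(X, K))  # [[19.0, 25.0], [37.0, 43.0]]
```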
Wrap the operator as a learnable Keras layer. Two parameters: the kernel weights and a scalar bias:
class Conv2D(tf.keras.layers.Layer):
    def __init__(self, kernel_size):
        super().__init__()
        # Store the kernel shape here: Keras calls build() with the input
        # shape, so the kernel shape cannot be passed to build() directly.
        self.kernel_size = kernel_size

    def build(self, input_shape):
        initializer = tf.random_normal_initializer()
        self.weight = self.add_weight(name='w', shape=self.kernel_size,
                                      initializer=initializer)
        self.bias = self.add_weight(name='b', shape=(1, ),
                                    initializer=initializer)

    def call(self, inputs):
        return corr2d(inputs, self.weight) + self.bias
These are the only learnable parameters of a single-channel conv layer. A 3×3 conv has nine weights (plus one bias) regardless of input size. That is the parameter saving the inductive bias buys us.
Build a 6×8 image with two vertical edges: 1s (white) in the two outer columns on each side, 0s (black) in the middle four columns:
<tf.Variable 'Variable:0' shape=(6, 8) dtype=float32, numpy=
array([[1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.],
       [1., 1., 0., 0., 0., 0., 1., 1.]], dtype=float32)>
Cross-correlate the image with the difference kernel [1, -1]. The output is +1 at each white→black transition, -1 at each black→white transition, and zero everywhere else:
<tf.Variable 'Variable:0' shape=(6, 7) dtype=float32, numpy=
array([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
       [ 0.,  1.,  0.,  0.,  0., -1.,  0.]], dtype=float32)>
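The edge map above can be reproduced with plain lists. This is a minimal sketch, not the tensor code that produced the printed output; `corr2d_lists` is a hypothetical list-based stand-in for the cross-correlation operator.

```python
def corr2d_lists(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[float(sum(X[i + a][j + b] * K[a][b]
                       for a in range(h) for b in range(w)))
             for j in range(out_w)] for i in range(out_h)]

# 6x8 image: 1s in the two outer columns on each side, 0s in the middle four.
X = [[1, 1, 0, 0, 0, 0, 1, 1] for _ in range(6)]
K = [[1, -1]]  # difference between each pixel and its right-hand neighbor

Y = corr2d_lists(X, K)
print(len(Y), len(Y[0]))  # 6 7
print(Y[0])               # [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0]
```

Every row of `Y` is identical because every row of `X` is identical. The kernel responds only where horizontally adjacent pixels differ.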
Transpose the image so the edges are now horizontal. The same kernel detects nothing:
<tf.Variable 'Variable:0' shape=(8, 5) dtype=float32, numpy=
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]], dtype=float32)>
Filters are direction-sensitive. Real ConvNets stack many filters per layer to cover all directions / patterns.
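Direction sensitivity can be seen by applying two kernels to the transposed image. In this sketch, `corr2d_lists` is a hypothetical list-based stand-in for the cross-correlation operator. The transposed kernel `K_t` is an assumed second filter added for illustration; it does not appear in the text above.

```python
def corr2d_lists(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[float(sum(X[i + a][j + b] * K[a][b]
                       for a in range(h) for b in range(w)))
             for j in range(out_w)] for i in range(out_h)]

X = [[1, 1, 0, 0, 0, 0, 1, 1] for _ in range(6)]
X_t = [list(col) for col in zip(*X)]  # 8x6 transpose: edges are now horizontal

K = [[1, -1]]      # compares left/right neighbors
K_t = [[1], [-1]]  # compares top/bottom neighbors (assumed second filter)

Y_same = corr2d_lists(X_t, K)
Y_rot = corr2d_lists(X_t, K_t)

print(all(v == 0 for row in Y_same for v in row))  # True
print([row[0] for row in Y_rot])  # [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0]
```

The original kernel sees no change along each row, while the transposed kernel recovers the same +1/-1 pattern down each column.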
We don’t have to design kernels by hand. Random init, SGD on squared error against ground truth \mathbf{Y}:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = tf.keras.layers.Conv2D(1, (1, 2), use_bias=False)
# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, height, width, channel), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = tf.reshape(X, (1, 6, 8, 1))
Y = tf.reshape(Y, (1, 6, 7, 1))
lr = 3e-2 # Learning rate
for i in range(10):
    with tf.GradientTape() as g:
        Y_hat = conv2d(X)
        l = (Y_hat - Y) ** 2
    # Update the kernel
    update = tf.multiply(lr, g.gradient(l, conv2d.trainable_weights)[0])
    conv2d.kernel.assign(conv2d.kernel - update)
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {tf.reduce_sum(l):.3f}')
epoch 2, loss 6.740
epoch 4, loss 1.719
epoch 6, loss 0.529
epoch 8, loss 0.188
epoch 10, loss 0.072
After 10 steps the loss is near zero, and the learned kernel is essentially [1, -1]:
<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 1.0162398, -0.9621782]], dtype=float32)>
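To see what the gradient step computes, the derivative of the summed squared error can be written out by hand. For a 1×2 kernel, Ŷ[i, j] = K[0] X[i, j] + K[1] X[i, j+1], so ∂L/∂K[b] = Σ 2 (Ŷ[i, j] - Y[i, j]) X[i, j+b]. The sketch below applies that formula with plain lists. It assumes a zero initial kernel instead of a random one so the run is reproducible, which means its numbers will not match the printed losses above.

```python
# Image with two vertical edges and the target edge map for the [1, -1] kernel.
X = [[1, 1, 0, 0, 0, 0, 1, 1] for _ in range(6)]
Y = [[0, 1, 0, 0, 0, -1, 0] for _ in range(6)]

k = [0.0, 0.0]  # assumption: zero init instead of random, for reproducibility
lr = 3e-2       # learning rate

for step in range(10):
    grad = [0.0, 0.0]
    loss = 0.0
    for i in range(6):
        for j in range(7):
            y_hat = k[0] * X[i][j] + k[1] * X[i][j + 1]
            err = y_hat - Y[i][j]
            loss += err ** 2
            # d(err^2)/dk[b] = 2 * err * X[i][j + b]
            grad[0] += 2 * err * X[i][j]
            grad[1] += 2 * err * X[i][j + 1]
    # Plain gradient descent step on both kernel entries.
    k = [k[0] - lr * grad[0], k[1] - lr * grad[1]]

print([round(v, 2) for v in k])  # approximately [0.99, -0.99]
```

Under these assumptions the kernel approaches [1, -1] after ten steps, the same filter that was designed by hand earlier.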
The rest of the chapter is built on this idea — let gradient descent discover what filters the data needs.
The receptive field of an output cell = the set of input positions that can affect it.
Local kernels + depth = global reach without the parameter cost of large kernels.
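The growth of the receptive field with depth can be sketched numerically. The helper below assumes stride-1 layers with square kernels and no pooling or dilation; `receptive_field` is a hypothetical name introduced for illustration.

```python
def receptive_field(num_layers, kernel_size):
    """Side length of the receptive field after stacking stride-1 conv layers."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1  # each layer widens the field by k - 1
    return rf

for depth in (1, 2, 3):
    print(depth, receptive_field(depth, 3))  # prints: 1 3, then 2 5, then 3 7

# Weights per input/output channel pair:
# three stacked 3x3 kernels versus one 7x7 kernel with the same reach.
stacked_weights = 3 * (3 * 3)
single_weights = 7 * 7
print(stacked_weights, single_weights)  # 27 49
```

Three stacked 3×3 layers see a 7×7 input region while using 27 weights per channel pair, compared with 49 for a single 7×7 kernel.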
Hubel & Wiesel-style filters in the visual cortex. Trained CNN filters look strikingly similar.