from d2l import tensorflow as d2l
import tensorflow as tf

Real images have channels: RGB has 3, a modern CNN’s deep feature map has hundreds (64 → 2048 is typical).
Going deeper, networks trade spatial resolution for channel depth: the overall capacity stays roughly comparable, but the representation shifts from where things are to what kinds of features are present.
This deck:
With c_i input channels, the kernel becomes c_i \times k_h \times k_w — a 2D filter per input channel. The output is the sum of per-channel cross-correlations:
\mathbf{Y} = \sum_{c=1}^{c_i} \mathbf{X}_c * \mathbf{K}_c.
Two input channels: per-channel cross-correlation, then sum. (1{\cdot}1 + 2{\cdot}2 + 4{\cdot}3 + 5{\cdot}4) + (0{\cdot}0 + 1{\cdot}1 + 3{\cdot}2 + 4{\cdot}3) = 56.
Verify against the figure — same numbers:
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 56.,  72.],
       [104., 120.]], dtype=float32)>
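The multi-input-channel computation behind this output can be sketched as follows. This is a minimal illustration, not necessarily the deck's own code: `corr2d` and `corr2d_multi_in` are helper names defined here, and the values of `X` and `K` are assumptions chosen to match the arithmetic in the figure caption.

```python
import tensorflow as tf

def corr2d(X, K):
    """2D cross-correlation of a single-channel input X with kernel K."""
    k_h, k_w = K.shape
    out_h, out_w = X.shape[0] - k_h + 1, X.shape[1] - k_w + 1
    rows = []
    for i in range(out_h):
        # Slide the kernel along one output row, summing elementwise products.
        row = [tf.reduce_sum(X[i:i + k_h, j:j + k_w] * K) for j in range(out_w)]
        rows.append(tf.stack(row))
    return tf.stack(rows)

def corr2d_multi_in(X, K):
    """Cross-correlate each input channel with its own filter, then sum."""
    return tf.add_n([corr2d(x, k) for x, k in zip(X, K)])

# Assumed example values: 2 input channels of size 3x3, one 2x2 filter per channel.
X = tf.constant([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                 [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = tf.constant([[[0.0, 1.0], [2.0, 3.0]],
                 [[1.0, 2.0], [3.0, 4.0]]])

Y = corr2d_multi_in(X, K)
print(Y.numpy())
```

The top-left entry is 19 from the first channel plus 37 from the second, which gives the 56 in the caption.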
Each output channel comes from its own set of input-channel filters. Stack c_o such filter sets to get a 4-D kernel of shape c_o \times c_i \times k_h \times k_w:
\mathbf{Y}_j = \sum_{c=1}^{c_i} \mathbf{X}_c * \mathbf{K}_{j, c} \quad\text{for}\quad j = 1, \ldots, c_o.
Intuition: each of the c_o output channels is a different combination of inputs, learned to detect a different feature. The network discovers an entire “feature dictionary” per layer.
Apply the multi-input-channel function c_o times and stack the results along a new leading axis:
Build a 3-output-channel kernel by stacking three copies of K, each offset by a constant (K, K+1, K+2):
TensorShape([3, 2, 2, 2])
<tf.Tensor: shape=(3, 2, 2), dtype=float32, numpy=
array([[[ 56.,  72.],
        [104., 120.]],

       [[ 76., 100.],
        [148., 172.]],

       [[ 96., 128.],
        [192., 224.]]], dtype=float32)>
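One way to reproduce the kernel shape and the three output channels above is sketched below. The helper names (`corr2d`, `corr2d_multi_in`, `corr2d_multi_in_out`), the values of `X` and `K`, and the offsets `K + 1` and `K + 2` are assumptions; they are consistent with the printed numbers but are not confirmed as the original code.

```python
import tensorflow as tf

def corr2d(X, K):
    """Single-channel 2D cross-correlation."""
    k_h, k_w = K.shape
    return tf.stack([
        tf.stack([tf.reduce_sum(X[i:i + k_h, j:j + k_w] * K)
                  for j in range(X.shape[1] - k_w + 1)])
        for i in range(X.shape[0] - k_h + 1)])

def corr2d_multi_in(X, K):
    """Sum of per-channel cross-correlations."""
    return tf.add_n([corr2d(x, k) for x, k in zip(X, K)])

def corr2d_multi_in_out(X, K):
    """K has shape (c_o, c_i, k_h, k_w); run one multi-input pass per
    output channel and stack the results along a new leading axis."""
    return tf.stack([corr2d_multi_in(X, k) for k in K], axis=0)

# Assumed example values: 2 input channels, one 2x2 filter per channel.
X = tf.constant([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                 [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = tf.constant([[[0.0, 1.0], [2.0, 3.0]],
                 [[1.0, 2.0], [3.0, 4.0]]])

# Three offset copies give a kernel of shape (c_o, c_i, k_h, k_w) = (3, 2, 2, 2).
K3 = tf.stack([K, K + 1, K + 2], axis=0)
Y = corr2d_multi_in_out(X, K3)
print(K3.shape)
print(Y.numpy())
```

The first output channel matches the single-output result because its filter set is the unmodified `K`.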
A conv layer with c_o outputs, c_i inputs, and a k_h \times k_w kernel has
c_o \cdot c_i \cdot k_h \cdot k_w \;+\; c_o
learnable parameters. Standard sizes:
When c_i and c_o grow together, the parameter count grows quadratically in the channel width: doubling both roughly quadruples it. That’s why deeper layers in CNNs widen, but not too much.
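The formula is easy to evaluate directly. The sketch below uses a hypothetical helper, `conv_param_count`, and illustrative layer widths that are assumptions rather than sizes taken from the deck.

```python
def conv_param_count(c_o, c_i, k_h, k_w):
    """Weights (c_o * c_i * k_h * k_w) plus one bias per output channel."""
    return c_o * c_i * k_h * k_w + c_o

# Illustrative 3x3 layers (assumed widths, not from the text).
counts = {}
for c_i, c_o in [(3, 64), (64, 64), (128, 128), (256, 256)]:
    counts[(c_i, c_o)] = conv_param_count(c_o, c_i, 3, 3)
    print(f"{c_i:>3} -> {c_o:>3}, 3x3 kernel: {counts[(c_i, c_o)]:,} parameters")
```

Going from 128 → 128 to 256 → 256 channels takes the count from 147,584 to 590,080, very nearly a factor of four.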
A 1×1 kernel has no spatial structure — it doesn’t look at neighbors. So why use it?
Because it acts as a per-pixel fully connected layer across channels. At every spatial position, it computes a linear combination of the c_i input channels into the c_o output channels:
1×1 conv: 3 input channels × 2 output channels. Each output pixel = a 2×3 matrix-vector product on the input channel vector at that position.
At each pixel, the 1×1 conv applies the same c_o \times c_i matrix to the input channel vector. Flatten the two spatial axes into one and the whole layer is a single matrix multiply:
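A minimal sketch of that reshape-and-multiply view, checked against TensorFlow's built-in `tf.nn.conv2d`. The function name `conv1x1_as_matmul`, the channels-first layout `(c_i, h, w)`, and the random test shapes are assumptions made for illustration.

```python
import tensorflow as tf

def conv1x1_as_matmul(X, K):
    """X: (c_i, h, w); K: (c_o, c_i, 1, 1). Returns Y of shape (c_o, h, w)."""
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X_flat = tf.reshape(X, (c_i, h * w))   # one column per pixel
    K_mat = tf.reshape(K, (c_o, c_i))      # the shared c_o x c_i matrix
    return tf.reshape(tf.matmul(K_mat, X_flat), (c_o, h, w))

# 3 input channels, 2 output channels, on an assumed 4x4 input.
X = tf.random.normal((3, 4, 4))
K = tf.random.normal((2, 3, 1, 1))
Y = conv1x1_as_matmul(X, K)

# Reference: the built-in convolution expects NHWC input and an HWIO filter.
X_nhwc = tf.transpose(X, (1, 2, 0))[tf.newaxis]
K_hwio = tf.transpose(K, (2, 3, 1, 0))
Y_ref = tf.nn.conv2d(X_nhwc, K_hwio, strides=1, padding="VALID")
Y_ref = tf.transpose(Y_ref[0], (2, 0, 1))

max_diff = float(tf.reduce_max(tf.abs(Y - Y_ref)))
print("max abs difference:", max_diff)
```

The two results should agree up to floating-point rounding, since both apply the same 2×3 matrix at every pixel.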
Modern architectures use them constantly:
An h \times w image with a k \times k kernel and c_i \to c_o channels takes
\mathcal{O}(h \cdot w \cdot k^2 \cdot c_i \cdot c_o)
operations. For a 256×256 image, 5×5 kernel, 128→128 channels: ~27 billion multiply-adds, or over 53 billion operations if multiplications and additions are counted separately. That’s per layer; multiply by depth.
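The estimate can be checked with a few lines of arithmetic. The helper `conv_macs` is hypothetical, and the sketch assumes the output keeps the input's h × w resolution and ignores the bias additions.

```python
def conv_macs(h, w, k, c_i, c_o):
    """Multiply-accumulates for one conv layer with an h x w output:
    each output value needs k * k * c_i products, and there are h * w * c_o of them."""
    return h * w * k * k * c_i * c_o

macs = conv_macs(256, 256, 5, 128, 128)
print(f"{macs:,} multiply-accumulates")
print(f"{2 * macs:,} operations counting multiplies and adds separately")
```

This gives 26,843,545,600 multiply-accumulates, or 53,687,091,200 operations when each multiply and each add is counted on its own.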
This is why:
- Convs benefit massively from GPU/TPU acceleration.
- Channel reduction tricks (1×1 bottlenecks, depthwise separable, group conv, ResNeXt) get serious attention in efficiency-driven architectures.