Custom Layers

torch.nn ships 100+ layers, but occasionally — a new architecture, an unusual normalization, a custom block — you need one the framework doesn’t have.

Writing one is trivial: subclass nn.Module, override forward. Two flavors:

  • Stateless — pure transforms. Just override forward.
  • Stateful — your own Linear, low-rank weight, etc. Wrap learnable tensors in nn.Parameter.

The custom layer composes with built-ins automatically: it slots into Sequential, shows up in parameters(), moves with to(device), and is saved in state_dict() checkpoints.

Stateless layer: a centering operator

Subtract the mean of the input (taken over all elements) from every element. Nothing to learn, so this is a pure transform:

import torch
from torch import nn

class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        # X.mean() averages over all elements of X, not per row
        return X - X.mean()

Standalone use:

layer = CenteredLayer()
layer(torch.tensor([1.0, 2, 3, 4, 5]))  # -> tensor([-2., -1., 0., 1., 2.])

The output mean is (numerically) zero by construction.

Composes with built-ins

Drop the custom layer into a Sequential like any other:

# The input below has 8 features, so in_features=8
net = nn.Sequential(nn.Linear(8, 128), CenteredLayer())
Y = net(torch.rand(4, 8))
Y.mean()

The framework can’t tell CenteredLayer apart from Linear or ReLU — they’re all just nn.Modules.
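
The claim above can be checked directly. The following is a minimal sketch, not a definitive test: it restates CenteredLayer in PyTorch so it runs on its own, and it uses to(torch.float64) as a stand-in for to(device) so that no GPU is assumed.

```python
import torch
from torch import nn


class CenteredLayer(nn.Module):
    def forward(self, X):
        # Subtract the mean over all elements
        return X - X.mean()


net = nn.Sequential(nn.Linear(8, 128), CenteredLayer())

# Every child of the Sequential is an nn.Module, custom or built-in
all_modules = all(isinstance(m, nn.Module) for m in net.children())

# CenteredLayer is stateless, so only the Linear contributes parameters
param_names = [name for name, _ in net.named_parameters()]

# .to() walks the whole container, custom layer included
net = net.to(torch.float64)
Y = net(torch.rand(4, 8, dtype=torch.float64))
```

Here param_names lists only the entries of the Linear layer, which is what "stateless" means in practice: the custom layer adds computation but nothing for the optimizer to update.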

Stateful layer: hand-rolled Linear

Implement a fully-connected layer from scratch. The one important step: wrap learnable tensors in nn.Parameter so they’re auto-registered for training:

import torch
from torch import nn

class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        # nn.Parameter registers these tensors with the module
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.zeros(units))

    def forward(self, X):
        # Use the parameters directly (not .data) so autograd tracks them
        linear = X @ self.weight + self.bias
        # ReLU is folded into this layer
        return torch.relu(linear)

linear = MyLinear(5, 3)
dict(linear.named_parameters())

What nn.Parameter buys you

After linear = MyLinear(5, 3):

  • linear.weight and linear.bias are tracked parameters.
  • linear.parameters() yields both — feed to the optimizer.
  • state_dict() saves them; linear.to('cuda') moves them.

All for free, just by declaring nn.Parameter in __init__.
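
To see what registration changes, here is a small sketch comparing two hypothetical modules (WithParameter and WithPlainTensor are illustrative names, not part of torch.nn). One wraps its weight in nn.Parameter, the other stores a plain tensor; only the first is visible to parameters(), state_dict(), and the optimizer.

```python
import torch
from torch import nn


class WithParameter(nn.Module):
    """Hypothetical layer: weight wrapped in nn.Parameter."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return x * self.weight


class WithPlainTensor(nn.Module):
    """Hypothetical layer: weight stored as a plain tensor attribute."""

    def __init__(self):
        super().__init__()
        self.weight = torch.ones(3, requires_grad=True)

    def forward(self, x):
        return x * self.weight


tracked = WithParameter()
untracked = WithPlainTensor()

tracked_names = [name for name, _ in tracked.named_parameters()]
untracked_names = [name for name, _ in untracked.named_parameters()]
tracked_keys = list(tracked.state_dict().keys())
untracked_keys = list(untracked.state_dict().keys())

# One SGD step on the tracked layer: loss = sum(x * w), so dloss/dw = x
optimizer = torch.optim.SGD(tracked.parameters(), lr=0.1)
loss = tracked(torch.tensor([1.0, 2.0, 3.0])).sum()
loss.backward()
optimizer.step()
updated_weight = tracked.weight.detach().clone()
```

The plain-tensor version still computes the same forward pass, but the module does not know about its weight, so it would be skipped by the optimizer, left out of checkpoints, and left behind by to(device).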

Test drive

linear(torch.rand(2, 5))

Stack two MyLinears, using the same Sequential plumbing as built-in layers:

net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))
net(torch.rand(2, 64))
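
One way to gain confidence in a hand-rolled layer is to compare it against the built-in equivalent. The sketch below restates a PyTorch MyLinear so it runs on its own, copies its weights into nn.Linear, and checks that the outputs agree and that gradients reach both parameters. The variable names (mine, reference) are illustrative.

```python
import torch
from torch import nn


class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.zeros(units))

    def forward(self, X):
        # Affine map followed by ReLU
        return torch.relu(X @ self.weight + self.bias)


torch.manual_seed(0)
mine = MyLinear(5, 3)
reference = nn.Linear(5, 3)

with torch.no_grad():
    # nn.Linear stores its weight as (out_features, in_features),
    # so the hand-rolled (in_units, units) weight is transposed
    reference.weight.copy_(mine.weight.T)
    reference.bias.copy_(mine.bias)

X = torch.rand(2, 5)
matches = torch.allclose(mine(X), torch.relu(reference(X)), atol=1e-5)

# Gradients flow to both registered parameters
mine(X).sum().backward()
has_grads = all(p.grad is not None for p in mine.parameters())
```

The transpose is the one detail worth remembering: the hand-rolled layer multiplies X @ weight, while nn.Linear computes X @ weight.T.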

When to write a custom layer

Real-world cases that justify a custom layer:

  • Novel architectural blocks — gated linear units, factorized weight matrices, low-rank parameterizations (LoRA).
  • Custom normalization — group norm with non-standard groups, layer-norm variants.
  • Tied/shared weights with structure — embedding + output projection sharing in language models.
  • Frozen “buffers” — running statistics in BatchNorm, position-specific masks. Use register_buffer for non-trainable tensors that should still travel with the module (saved, moved to GPU, etc.).
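
As an illustration of the last case, here is a minimal sketch of a layer that keeps a running mean in a buffer. RunningCenteredLayer and its momentum argument are hypothetical names chosen for this example; the update rule is a simple exponential moving average in the spirit of BatchNorm's running statistics, not a reproduction of it.

```python
import torch
from torch import nn


class RunningCenteredLayer(nn.Module):
    """Hypothetical layer: centers each feature using a running mean."""

    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        # Non-trainable state that is still saved and moved with the module
        self.register_buffer("running_mean", torch.zeros(num_features))

    def forward(self, X):
        if self.training:
            batch_mean = X.mean(dim=0)
            with torch.no_grad():
                # Exponential moving average of the per-feature mean
                self.running_mean.mul_(1 - self.momentum)
                self.running_mean.add_(self.momentum * batch_mean)
            return X - batch_mean
        # In eval mode, use the stored statistics instead of the batch
        return X - self.running_mean


layer = RunningCenteredLayer(3)
X = torch.tensor([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
out = layer(X)  # modules start in training mode

num_params = len(list(layer.parameters()))
state_keys = list(layer.state_dict().keys())
```

The buffer appears in state_dict() but not in parameters(), so it is checkpointed and follows to(device) without ever being handed to the optimizer.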

Recap

  • Custom layer = nn.Module subclass with a forward.
  • Stateless: just override forward. Stateful: wrap learnable tensors in nn.Parameter.
  • Use register_buffer for non-trainable state that should still travel with the module.
  • Composes with built-in layers exactly as a built-in would. No special handling.
  • The escape hatch when the standard layer zoo doesn’t cover what you actually need.