Custom Layers

torch.nn ships 100+ layers, but occasionally — a new architecture, an unusual normalization, a custom block — you need one the framework doesn’t have.

Writing one is trivial: subclass nn.Module, override forward. Two flavors:

  • Stateless — pure transforms. Just override forward.
  • Stateful — your own Linear, low-rank weight, etc. Wrap learnable tensors in nn.Parameter.

The custom layer composes with built-ins automatically: Sequential, parameters(), to(device), and state_dict() checkpointing all work unchanged.

Stateless layer: a centering operator

Subtract the mean of the input, taken over all of its elements, from every entry. Nothing to learn, so this is a pure transform:

from d2l import torch as d2l
import torch
from torch import nn
from torch.nn import functional as F

class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()

Standalone use:

layer = CenteredLayer()
layer(d2l.tensor([1.0, 2, 3, 4, 5]))
tensor([-2., -1.,  0.,  1.,  2.])

The output mean is (numerically) zero — by construction.
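CenteredLayer subtracts a single global mean. If per-row centering is what you want (for example, centering each sample in a batch separately), a minimal sketch could look like the following. The class name RowCenteredLayer is hypothetical and is not part of the original material.

```python
import torch
from torch import nn


class RowCenteredLayer(nn.Module):  # hypothetical name, for illustration only
    """Subtract each row's own mean rather than the global mean."""

    def forward(self, X):
        # keepdim=True keeps the reduced axis so the means broadcast against X
        return X - X.mean(dim=-1, keepdim=True)


X = torch.tensor([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
row_centered = RowCenteredLayer()(X)   # every row now has mean zero
global_centered = X - X.mean()         # what CenteredLayer computes
print(row_centered)
print(global_centered)
```

With the global version only the overall mean is zero; individual rows can still have non-zero means.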

Composes with built-ins

Drop the custom layer into a Sequential like any other:

net = nn.Sequential(nn.LazyLinear(128), CenteredLayer())
Y = net(d2l.rand(4, 8))
Y.mean()
tensor(7.2177e-09, grad_fn=<MeanBackward0>)

The framework can’t tell CenteredLayer apart from Linear or ReLU — they’re all just nn.Modules.

Stateful layer: hand-rolled Linear

Implement a fully-connected layer from scratch (this version also folds a ReLU into forward). The one important step: wrap learnable tensors in nn.Parameter so they’re auto-registered for training:

class MyDense(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        # Scale by 1/sqrt(in_units) (Xavier-ish) so pre-activations stay at a
        # similar scale regardless of input width
        self.weight = nn.Parameter(torch.randn(in_units, units) / in_units**0.5)
        self.bias = nn.Parameter(torch.zeros(units))

    def forward(self, X):
        linear = torch.matmul(X, self.weight) + self.bias
        return F.relu(linear)

dense = MyDense(5, 3)
dense.weight
Parameter containing:
tensor([[ 0.3633, -0.0541, -0.7462],
        [ 0.4781,  0.6925,  0.9072],
        [ 0.2740,  0.0132, -0.5635],
        [ 0.8720, -0.0861, -0.3831],
        [ 0.1147, -0.3057, -0.7548]], requires_grad=True)

What nn.Parameter buys you

After dense = MyDense(5, 3):

  • dense.weight and dense.bias are tracked parameters.
  • dense.parameters() yields both, ready to hand to an optimizer.
  • state_dict() saves them; dense.to('cuda') moves them.

All for free, just by declaring nn.Parameter in __init__.
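To see the registration happen, the sketch below redefines the dense layer so it runs on its own, then lists what the module tracks. The extra scratch attribute is a hypothetical addition, included only to show that a plain tensor attribute is not registered.

```python
import torch
from torch import nn
from torch.nn import functional as F


class MyDense(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units) / in_units**0.5)
        self.bias = nn.Parameter(torch.zeros(units))
        # Hypothetical extra field: a plain tensor is NOT registered as a parameter
        self.scratch = torch.zeros(units)

    def forward(self, X):
        return F.relu(torch.matmul(X, self.weight) + self.bias)


dense = MyDense(5, 3)
param_names = [name for name, _ in dense.named_parameters()]
state_keys = list(dense.state_dict().keys())
num_params = sum(p.numel() for p in dense.parameters())

print(param_names)  # ['weight', 'bias']
print(state_keys)   # ['weight', 'bias']
print(num_params)   # 5*3 weights + 3 biases = 18
```

Because scratch is an ordinary attribute, it would be skipped by the optimizer, by state_dict(), and by to(device).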

Test drive

dense(torch.rand(2, 5))
tensor([[0.5718, 0.1859, 0.0000],
        [0.3187, 0.0000, 0.0000]], grad_fn=<ReluBackward0>)

Stack two MyDense layers; the Sequential plumbing is the same as for built-in layers:

net = nn.Sequential(MyDense(64, 8), MyDense(8, 1))
net(torch.rand(2, 64))
tensor([[0.1139],
        [0.0000]], grad_fn=<ReluBackward0>)

When to write a custom layer

Real-world cases that justify a custom layer:

  • Novel architectural blocks — gated linear units, factorized weight matrices, low-rank parameterizations (LoRA).
  • Custom normalization — group norm with non-standard groups, layer-norm variants.
  • Tied/shared weights with structure — embedding + output projection sharing in language models.
  • Frozen “buffers” — running statistics in BatchNorm, position-specific masks. Use register_buffer for non-trainable tensors that should still travel with the module (saved, moved to GPU, etc.).
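As a rough illustration of the register_buffer case, here is a small layer that centers its input with a running mean. The class name RunningMeanCenter and the momentum-style update are assumptions made for this sketch; it is not how BatchNorm itself is implemented.

```python
import torch
from torch import nn


class RunningMeanCenter(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.momentum = momentum
        # Non-trainable state: saved in state_dict and moved by .to(device),
        # but never returned by parameters()
        self.register_buffer("running_mean", torch.zeros(num_features))

    def forward(self, X):
        if self.training:
            # Update the statistic outside of autograd
            with torch.no_grad():
                batch_mean = X.mean(dim=0)
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
        return X - self.running_mean


layer = RunningMeanCenter(3)
out = layer(torch.ones(4, 3))

print(list(layer.parameters()))         # [] : nothing for the optimizer to train
print(list(layer.state_dict().keys()))  # ['running_mean']
print(layer.running_mean)               # 0.9 * 0 + 0.1 * 1 = 0.1 per feature
```

Calling layer.eval() freezes the statistic, since the update only runs while self.training is True.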

Recap

  • Custom layer = nn.Module subclass with a forward.
  • Stateless: just override forward. Stateful: wrap learnable tensors in nn.Parameter.
  • Use register_buffer for non-trainable state that should still travel with the module.
  • A custom layer composes with built-in layers exactly as a built-in does; no special handling is needed.
  • The escape hatch when the standard layer zoo doesn’t cover what you actually need.