from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

Modern networks aren’t flat stacks. ResNet-152 has 152 weighted layers, almost all of them convolutions, organized into a handful of repeating patterns. Transformers stack 12, 24, or even 96 structurally identical blocks. Writing them one layer at a time would be miserable.
The module abstraction (nn.Module in PyTorch, flax.linen.Module in JAX, nn.Block in MXNet Gluon, which the code here uses) handles the recursion. A module can be a single layer, a block of layers, or the whole model; all three are instances of the same base class.
Layers compose into modules; modules compose into models.
The framework asks five things of every module:
forward(x).
Subclass nn.Module (nn.Block in Gluon), write __init__ and forward, and the base class supplies the rest of the bookkeeping automatically.
For a linear chain of layers, nn.Sequential does everything: construct it, call it, done.
Sequential is itself a module. Internally it stores its children in a list, and its forward walks them in order. “List of layers, run them in sequence”: that is all there is to it.
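The "list of layers, run in sequence" idea can be sketched without any framework. The class below is a hypothetical toy (the names Chain, double and add_one are made up for illustration), not the Gluon or PyTorch implementation; it only shows that a container which is itself callable can be nested inside another container.

```python
class Chain:
    """Toy stand-in for a sequential container (hypothetical, framework-free)."""

    def __init__(self, *steps):
        # Children are kept in an ordinary list, in insertion order.
        self.steps = list(steps)

    def __call__(self, x):
        # The forward pass: feed each step's output into the next step.
        for step in self.steps:
            x = step(x)
        return x


def double(x):
    return 2 * x


def add_one(x):
    return x + 1


inner = Chain(double, add_one)   # x -> 2x + 1
outer = Chain(inner, double)     # a Chain is itself a valid step
result = outer(3)                # (2 * 3 + 1) * 2 = 14
```

Because a Chain accepts anything callable, the same object works as a layer, a block, or the whole model.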
Sequential is good when the topology is a chain. For anything else, define your own subclass. The pattern: name sub-modules in __init__, write forward to use them:
class MLP(nn.Block):
    def __init__(self):
        # Call the constructor of the MLP parent class nn.Block to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.Dense(256, activation='relu')
        self.out = nn.Dense(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(self.hidden(X))

The two attributes self.hidden and self.out aren’t ordinary fields: assigning a module to an attribute of another module registers it as a child. From this moment on:
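net.parameters() includes both layers’ weights and biases.
net.to('cuda') moves both to the GPU.
net.state_dict() gives a flat dict of every parameter.
(These are the PyTorch method names; in Gluon, net.collect_params() returns the corresponding dictionary of parameters.)
Total user code: ~6 lines.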
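How can a plain attribute assignment register a child? One common mechanism is to intercept __setattr__ in the base class. The sketch below is a hypothetical miniature (ToyModule, ToyDense and ToyMLP are invented names, lists of floats stand in for tensors, and the input width of 20 is an assumption); real frameworks do considerably more, but the registration idea is the same.

```python
class ToyModule:
    """Hypothetical miniature of a module base class (not framework code)."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the two registries.
        object.__setattr__(self, "_children", {})
        object.__setattr__(self, "_params", {})

    def __setattr__(self, name, value):
        # Assigning a module registers it as a child; in this toy, a list of
        # floats is treated as a parameter.
        if isinstance(value, ToyModule):
            self._children[name] = value
        elif isinstance(value, list):
            self._params[name] = value
        object.__setattr__(self, name, value)

    def named_parameters(self, prefix=""):
        # Own parameters first, then recurse into children with dotted names.
        for name, param in self._params.items():
            yield prefix + name, param
        for name, child in self._children.items():
            yield from child.named_parameters(prefix + name + ".")


class ToyDense(ToyModule):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = [0.0] * (n_in * n_out)
        self.bias = [0.0] * n_out


class ToyMLP(ToyModule):
    def __init__(self):
        super().__init__()
        self.hidden = ToyDense(20, 256)   # input width 20 is an assumption
        self.out = ToyDense(256, 10)


net = ToyMLP()
names = [name for name, _ in net.named_parameters()]
# names == ['hidden.weight', 'hidden.bias', 'out.weight', 'out.bias']
```

The user code in ToyMLP never mentions the registries; the base class picks the children up as a side effect of assignment.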
What does nn.Sequential actually do? Almost nothing. A working reimplementation fits in a handful of lines:
class MySequential(nn.Block):
    def __init__(self):
        super().__init__()
        # Keep strong refs ourselves: _children in Gluon 2.0 holds weakrefs,
        # so we'd otherwise lose blocks to GC right after add() returns.
        self._layers = []

    def add(self, block):
        # block is an instance of a Block subclass. register_child tracks it
        # for parameter discovery; the strong ref in self._layers keeps it
        # alive (matches the upstream nn.Sequential pattern).
        self._layers.append(block)
        self.register_child(block)

    def forward(self, X):
        # _children.values() yields weakrefs; call them to dereference.
        for block in self._children.values():
            X = block()(X)
        return X

forward is just Python

This is the superpower of the module abstraction: forward is normal Python. Use loops, conditionals, random tensors, anything you’d write in numpy:
from mxnet import gluon

class FixedHiddenMLP(nn.Block):
    def __init__(self):
        super().__init__()
        # Random weight parameters wrapped in gluon.Constant are not updated
        # during training (i.e., constant parameters)
        self.rand_weight = gluon.Constant(np.random.uniform(size=(20, 20)))
        self.dense = nn.Dense(20, activation='relu')

    def forward(self, X):
        X = self.dense(X)
        # Use the created constant parameters, as well as the relu and dot
        # functions
        X = npx.relu(np.dot(X, self.rand_weight.data()) + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.dense(X)
        # Control flow
        while np.abs(X).sum() > 1:
            X /= 2
        return X.sum()

The while loop, the fixed rand_weight, even calling self.dense twice (parameter sharing!) all work, and gradients flow correctly through all of them.
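Why does reuse work? The gradient with respect to a shared parameter is the sum of the contributions from each use. The following is a scalar toy under stated assumptions (a single weight w instead of a Dense layer, finite differences instead of autograd, and made-up values for w and x); it checks that sum numerically, including the data-dependent halving loop.

```python
def relu(z):
    return z if z > 0 else 0.0


def forward(w, x):
    # The same scalar weight w is used twice, like calling self.dense twice.
    h = relu(w * x)   # first use
    y = w * h         # second use
    # Data-dependent control flow: halve until the magnitude is at most 1.
    while abs(y) > 1:
        y /= 2
    return y


w, x, eps = 1.5, 2.0, 1e-6

# Central finite difference as a framework-free stand-in for autograd.
numeric = (forward(w + eps, x) - forward(w - eps, x)) / (2 * eps)

# Before the loop y = w * w * x = 4.5, which is halved three times (scale 1/8).
scale = 1 / 8
grad_first_use = w * x * scale          # path through h = relu(w * x)
grad_second_use = relu(w * x) * scale   # path through y = w * h
analytic = grad_first_use + grad_second_use   # 0.375 + 0.375 = 0.75
```

The loop only changes the constant scale factor for a given input, so it does not get in the way of differentiation; an autograd engine records whichever branch actually ran.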
Modules nest to any depth. A NestMLP holds a Sequential; a top-level Sequential holds a NestMLP, a Dense layer, and a FixedHiddenMLP:
class NestMLP(nn.Block):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential()
        self.net.add(nn.Dense(64, activation='relu'),
                     nn.Dense(32, activation='relu'))
        self.dense = nn.Dense(16, activation='relu')

    def forward(self, X):
        return self.dense(self.net(X))

# Example input (assumed shape): a batch of 2 examples with 20 features
X = np.random.uniform(size=(2, 20))
chimera = nn.Sequential()
chimera.add(NestMLP(), nn.Dense(20), FixedHiddenMLP())
chimera.initialize()
chimera(X)

The framework recursively walks this tree to find every parameter. Most modern architectures are built this way: a ResNet is made of stages of residual blocks of conv+BN+ReLU; a Transformer is made of blocks of attention+FFN. Same recursion every time.
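As a worked example of that recursive walk, the sketch below counts the trainable parameters of the chimera network by hand. It is framework-free and rests on assumptions: the input width is taken to be 20, the nested dictionary and its keys are a hypothetical description of the tree rather than what Gluon stores, and the constant rand_weight is left out because it is not trained.

```python
def dense(n_in, n_out):
    # A leaf: one weight matrix plus one bias vector, recorded as sizes.
    return {"weight": n_in * n_out, "bias": n_out}


def count_params(node):
    # A leaf value is an integer size; any dict is an inner node whose
    # values are sub-trees. Recurse until only integers remain.
    if isinstance(node, int):
        return node
    return sum(count_params(child) for child in node.values())


chimera_tree = {
    "0": {  # NestMLP
        "net": {"0": dense(20, 64), "1": dense(64, 32)},
        "dense": dense(32, 16),
    },
    "1": dense(16, 20),
    # FixedHiddenMLP: its Dense layer is called twice but registered once,
    # so the shared parameters are counted once.
    "2": {"dense": dense(20, 20)},
}

total = count_params(chimera_tree)
# 1344 + 2080 + 528 + 340 + 420 = 4712
```

The walk never needs to know how deep the tree is, which is why the same bookkeeping serves a two-layer MLP and a 152-layer ResNet alike.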
Sequential is a module of only a few lines that runs its children in order; for arbitrary topologies, subclass and write forward.
forward is plain Python: control flow, parameter sharing, and fixed buffers are all welcome.