from typing import List
from d2l import jax as d2l
from flax import linen as nn
import jax
from jax import numpy as jnp
Modern networks aren’t flat stacks. ResNet-152 has 152 layers, almost all of them convolutional, organized into a handful of repeating patterns. Transformers stack 12, 24, or even 96 identical blocks. Writing them one layer at a time would be miserable.
The module abstraction (nn.Module in PyTorch, flax.linen.Module in JAX) handles the recursion. A module can be a single layer, a block of layers, or the whole model; all three are subclasses of the same base class.
Layers compose into modules; modules compose into models.
The framework asks five things of every module:
forward(x).
Subclass nn.Module, write __init__ + forward (setup + __call__ in Flax, which the code in this section uses), and the base class supplies the bookkeeping automatically.
For a linear chain of layers, nn.Sequential does everything. Construct, call, done:
(2, 10)
Sequential is a module. Internally it stores its children in a list and the forward walks them in order. “List of layers, run them in sequence” — that’s all.
Sequential is good when the topology is a chain. For anything else, define your own subclass. The pattern: name sub-modules in __init__, write forward to use them:
The two attributes self.hidden and self.out aren’t ordinary fields — assigning a Module to a Module attribute registers it as a child. From this moment on:
net.parameters() includes both layers’ weights/biases.
net.to('cuda') moves both to GPU.
net.state_dict() gives a flat dict of every parameter.
Total user code: ~6 lines.
What does nn.Sequential actually do? Almost nothing. Its implementation fits in 4 lines:
forward is just Python
This is the superpower of the module abstraction: forward is normal Python. Use loops, conditionals, random tensors, anything you’d write in NumPy:
class FixedHiddenMLP(nn.Module):
    def setup(self):
        # Random weights that are not parameters: they receive no gradient
        # updates and therefore stay constant during training
        self.rand_weight = jax.random.uniform(d2l.get_key(), (20, 20))
        self.dense = nn.Dense(20)

    def __call__(self, X):
        X = self.dense(X)
        X = nn.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters between two fully connected layers
        X = self.dense(X)
        # Control flow
        while jnp.abs(X).sum() > 1:
            X /= 2
        return X.sum()
The while loop, the fixed rand_weight, even reusing self.dense twice (parameter sharing!) all work, and gradients flow through all of it correctly:
Array(0.11348616, dtype=float32)
Modules nest to any depth. A NestMLP holds a Sequential; a top-level Sequential holds a NestMLP + a Dense layer + a FixedHiddenMLP:
class NestMLP(nn.Module):
    def setup(self):
        self.net = nn.Sequential([nn.Dense(64), nn.relu,
                                  nn.Dense(32), nn.relu])
        self.dense = nn.Dense(16)

    def __call__(self, X):
        return self.dense(self.net(X))

chimera = nn.Sequential([NestMLP(), nn.Dense(20), FixedHiddenMLP()])
params = chimera.init(d2l.get_key(), X)
chimera.apply(params, X)
Array(-0.00752359, dtype=float32)
The framework recursively walks this tree to find every parameter. Every modern architecture is built this way: ResNet = blocks of ResBlocks of conv+BN+ReLU. Transformer = blocks of attention+FFN. Same recursion every time.
Sequential is a 4-line module that runs children in order; for arbitrary topologies, subclass and write forward.
forward is plain Python: control flow, parameter sharing, fixed buffers all welcome.