Normalization Layers

BatchNorm stabilizes deep nets

Batch Normalization (Ioffe & Szegedy, 2015) is one of the most effective techniques for stabilizing the training of deep networks.

At each layer, normalize activations within the minibatch to zero mean / unit variance, then rescale with learned \gamma and \beta:

\text{BN}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x} - \hat\mu_\mathcal{B}}{\sqrt{\hat\sigma_\mathcal{B}^2 + \epsilon}} + \beta.

Why it works

Lets you train much deeper nets — gradients stay well-conditioned through the depth.
Allows higher learning rates; mildly regularizing.
At test time, use running estimates of the mean / variance collected during training (there may be no minibatch to compute statistics from).
Spawned a family — LayerNorm (per-example, used in Transformers), GroupNorm, InstanceNorm.
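
The running estimates mentioned above are typically exponential moving averages of the minibatch statistics. A minimal sketch, assuming synthetic activations with true mean 3 and standard deviation 2 and a decay of 0.9; the names `decay`, `running_mean`, and `running_var` are illustrative and not taken from any library:

```python
import jax
from jax import numpy as jnp

key = jax.random.PRNGKey(0)
decay = 0.9  # Assumed weight on the old running value
running_mean, running_var = jnp.zeros(()), jnp.ones(())

for step in range(200):
    key, subkey = jax.random.split(key)
    # Synthetic minibatch of 64 activations with true mean 3 and true std 2
    batch = 3.0 + 2.0 * jax.random.normal(subkey, (64,))
    running_mean = decay * running_mean + (1 - decay) * batch.mean()
    running_var = decay * running_var + (1 - decay) * batch.var()

# After enough steps the running values sit near the population values 3 and 4
print(running_mean, running_var)
```

Each individual minibatch statistic is noisy, but the average over many steps is much steadier, which is why it is a reasonable stand-in at prediction time.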

From scratch

Compute per-channel mean and variance over the minibatch (and spatial dims, for conv); normalize, then scale + shift:

from d2l import jax as d2l
from flax import nnx
from jax import numpy as jnp
import jax

def batch_norm(X, deterministic, gamma, beta, moving_mean, moving_var, eps,
               momentum):
    # Use `deterministic` to determine whether the current mode is training
    # mode or prediction mode
    if deterministic:
        # In prediction mode, use mean and variance obtained by moving average
        X_hat = (X - moving_mean[...]) / jnp.sqrt(moving_var[...] + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance of each feature over the batch dimension
            mean = X.mean(axis=0)
            var = ((X - mean) ** 2).mean(axis=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance over batch and spatial dimensions. Here we
            # need to maintain the shape of `X`, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(axis=(0, 1, 2), keepdims=True)
            var = ((X - mean) ** 2).mean(axis=(0, 1, 2), keepdims=True)
        # In training mode, the current mean and variance are used
        X_hat = (X - mean) / jnp.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean[...] = momentum * moving_mean[...] + (1.0 - momentum) * mean
        moving_var[...] = momentum * moving_var[...] + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y
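
To see why the convolutional branch reduces over axes (0, 1, 2), here is a small standalone check. It assumes channels-last (NHWC) input as in the function above; the input shape and the per-channel offsets and scales are made up for illustration:

```python
import jax
from jax import numpy as jnp

# Hypothetical NHWC activations: batch 8, 5x5 spatial, 3 channels, with a
# different scale and offset per channel
X = jax.random.normal(jax.random.PRNGKey(0), (8, 5, 5, 3))
X = X * jnp.array([1.0, 2.0, 3.0]) + jnp.array([-1.0, 0.0, 4.0])

# One mean and variance per channel; keepdims keeps them broadcastable
mean = X.mean(axis=(0, 1, 2), keepdims=True)
var = ((X - mean) ** 2).mean(axis=(0, 1, 2), keepdims=True)
X_hat = (X - mean) / jnp.sqrt(var + 1e-5)

print(mean.shape)                  # (1, 1, 1, 3)
print(X_hat.mean(axis=(0, 1, 2)))  # Close to 0 for every channel
print(X_hat.var(axis=(0, 1, 2)))   # Close to 1 for every channel
```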

Wrapping as a `Module`

Buffers for moving_mean / moving_var (updated only during training); learnable gamma / beta parameters:

class BatchNorm(nnx.Module):
    deterministic: bool

    # `num_features`: the number of outputs for a fully connected layer
    # or the number of output channels for a convolutional layer.
    # `num_dims`: 2 for a fully connected layer and 4 for a convolutional layer
    # Use `deterministic` to determine whether the current mode is training
    # mode or prediction mode
    def __init__(self, num_features, num_dims, deterministic=False):
        self.deterministic = deterministic
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, 1, 1, num_features)

        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nnx.Param(jnp.ones(shape))
        self.beta = nnx.Param(jnp.zeros(shape))

        # The variables that are not model parameters are initialized to 0 and
        # 1. Store them as `nnx.BatchStat` variables, which keeps them out of
        # the trainable parameters
        self.moving_mean = nnx.BatchStat(jnp.zeros(shape))
        self.moving_var = nnx.BatchStat(jnp.ones(shape))

    def set_view(self, *, deterministic):
        self.deterministic = deterministic

    def __call__(self, X):
        return batch_norm(X, self.deterministic, self.gamma, self.beta,
                          self.moving_mean, self.moving_var,
                          eps=1e-5, momentum=0.9)

LeNet + BatchNorm

Drop a BatchNorm layer between each conv/linear and its activation:

class BNLeNetScratch(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        self.net = nnx.Sequential(
            nnx.Conv(1, 6, kernel_size=(5, 5), rngs=rngs),
            BatchNorm(6, num_dims=4),
            nnx.sigmoid,
            lambda x: nnx.avg_pool(x, window_shape=(2, 2), strides=(2, 2)),
            nnx.Conv(6, 16, kernel_size=(5, 5), rngs=rngs),
            BatchNorm(16, num_dims=4),
            nnx.sigmoid,
            lambda x: nnx.avg_pool(x, window_shape=(2, 2), strides=(2, 2)),
            lambda x: x.reshape((x.shape[0], -1)),
            nnx.Linear(7 * 7 * 16, 120, rngs=rngs),
            BatchNorm(120, num_dims=2), nnx.sigmoid,
            nnx.Linear(120, 84, rngs=rngs),
            BatchNorm(84, num_dims=2), nnx.sigmoid,
            nnx.Linear(84, num_classes, rngs=rngs))
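
The `7 * 7 * 16` input size of the first linear layer follows from the spatial sizes. A sketch of the arithmetic, assuming 28×28 Fashion-MNIST inputs, `'SAME'` padding in `nnx.Conv` (assumed here to be its default when `padding` is not passed), and unpadded pooling; the helper names `conv_same` and `pool_valid` are hypothetical:

```python
def conv_same(size):
    # 'SAME' padding with stride 1 keeps the spatial size unchanged
    return size

def pool_valid(size, window=2, stride=2):
    # Unpadded pooling: floor((size - window) / stride) + 1
    return (size - window) // stride + 1

size = 28                            # Assumed input height and width
size = pool_valid(conv_same(size))   # 28 -> 28 -> 14
size = pool_valid(conv_same(size))   # 14 -> 14 -> 7
flat = size * size * 16              # 16 channels after the second conv
print(size, flat)                    # 7 784
```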

Train

Trains noticeably faster than vanilla LeNet — same accuracy in fewer epochs:

trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNetScratch(lr=0.1)
trainer.fit(model, data)

After training, gamma and beta are non-trivial — the layer learned the scale/shift it wants:

model.net.layers[1].gamma[...].reshape((-1,)), \
model.net.layers[1].beta[...].reshape((-1,))

(Array([2.6707244, 1.8966421, 1.8147523, 1.6768903, 1.9702979, 1.7232087],
       dtype=float32),
 Array([ 1.2342167 ,  0.5609814 , -0.26910412,  0.81538904,  0.20941882,
         0.6389893 ], dtype=float32))

The framework version

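nnx.BatchNorm covers both conv and linear layers (by default it normalizes over every axis except the last, feature, axis): same idea, much faster, and the switch between minibatch statistics and running averages is handled by the module's use_running_average setting: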

class BNLeNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        # Flax's `momentum` (default 0.99) is the decay applied to the OLD
        # running stats; PyTorch/MXNet apply momentum=0.1 to the NEW stats,
        # i.e. a decay of 0.9 on the old ones. Pass 0.9 to match that and the
        # scratch implementation above.
        self.net = nnx.Sequential(
            nnx.Conv(1, 6, kernel_size=(5, 5), rngs=rngs),
            nnx.BatchNorm(6, momentum=0.9, rngs=rngs), nnx.sigmoid,
            lambda x: nnx.avg_pool(x, window_shape=(2, 2), strides=(2, 2)),
            nnx.Conv(6, 16, kernel_size=(5, 5), rngs=rngs),
            nnx.BatchNorm(16, momentum=0.9, rngs=rngs), nnx.sigmoid,
            lambda x: nnx.avg_pool(x, window_shape=(2, 2), strides=(2, 2)),
            lambda x: x.reshape((x.shape[0], -1)),
            nnx.Linear(7 * 7 * 16, 120, rngs=rngs),
            nnx.BatchNorm(120, momentum=0.9, rngs=rngs), nnx.sigmoid,
            nnx.Linear(120, 84, rngs=rngs),
            nnx.BatchNorm(84, momentum=0.9, rngs=rngs), nnx.sigmoid,
            nnx.Linear(84, num_classes, rngs=rngs))

trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = BNLeNet(lr=0.1)
trainer.fit(model, data)
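
The momentum comment in `BNLeNet` contrasts two conventions for the same running-average update. A small sketch of why 0.9 in one convention matches 0.1 in the other; the function names and the numbers are made up for illustration and do not come from either library:

```python
from jax import numpy as jnp

old, batch = jnp.array(1.0), jnp.array(5.0)  # Made-up running and batch values

def update_decay_old(old, batch, momentum):
    # Convention where `momentum` weights the OLD running statistic
    return momentum * old + (1 - momentum) * batch

def update_weight_new(old, batch, momentum):
    # Convention where `momentum` weights the NEW batch statistic
    return (1 - momentum) * old + momentum * batch

a = update_decay_old(old, batch, momentum=0.9)
b = update_weight_new(old, batch, momentum=0.1)
print(a, b)  # Both are approximately 1.4
```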

The two real problems

BatchNorm couples examples through the batch statistics:

Minibatch coupling: the mean / variance estimates get noisier as batches shrink, and detection / segmentation models often train with only 1–2 images per device.
Train/serve discrepancy: minibatch statistics in training, moving averages at prediction. One layer, two functions.
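
Both problems can be seen in a few lines. A minimal sketch, assuming training-mode normalization with gamma = 1 and beta = 0 on synthetic Gaussian data; `bn_train` is a hypothetical helper, not a library function:

```python
import jax
from jax import numpy as jnp

def bn_train(X, eps=1e-5):
    # Training-mode normalization with minibatch statistics (gamma=1, beta=0)
    mean = X.mean(axis=0)
    var = ((X - mean) ** 2).mean(axis=0)
    return (X - mean) / jnp.sqrt(var + eps)

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(k1, (1, 4))         # The example we track
others_a = jax.random.normal(k2, (3, 4))  # Its batch-mates in batch A
others_b = jax.random.normal(k3, (3, 4))  # Its batch-mates in batch B

out_a = bn_train(jnp.concatenate([x, others_a]))[0]
out_b = bn_train(jnp.concatenate([x, others_b]))[0]
# Same input, different outputs: the result depends on the rest of the batch
print(out_a, out_b)

# Spread of the batch mean across 1000 synthetic batches of size n
spread = {}
for n in (2, 128):
    draws = jax.random.normal(jax.random.PRNGKey(n), (1000, n))
    spread[n] = draws.mean(axis=1).std()
print(spread)  # The spread at n=2 is several times larger than at n=128
```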

GroupNorm (Wu & He, 2018): standardize each example within groups of channels. No batch in the statistics: batch size 1 works, and training equals serving.

GroupNorm in code

Per-(example, group) mean 0 / variance 1, at any batch size:

X = jax.random.normal(jax.random.PRNGKey(0), (32, 8, 8, 16))  # Channels last
gn = nnx.GroupNorm(16, num_groups=4, rngs=nnx.Rngs(1))
Y = gn(X)
# Collect each (example, group) pair's elements
G = jnp.transpose(Y, (0, 3, 1, 2)).reshape(32, 4, -1)
(jnp.abs(G.mean(axis=2)).max(), G.var(axis=2).mean(),
 jnp.allclose(gn(X[:1]), Y[:1], atol=1e-6))

(Array(2.6077032e-08, dtype=float32),
 Array(0.999999, dtype=float32),
 Array(True, dtype=bool))

GroupNorm is a common choice in detection / segmentation heads and diffusion U-Nets, where per-device batches are small.
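
For intuition about what the layer computes, here is a from-scratch sketch in plain `jax.numpy`. It assumes channels-last input, groups made of consecutive channels, and an `eps` of 1e-6, and it omits the learned scale and bias; `group_norm` is a hypothetical helper rather than the library implementation:

```python
import jax
from jax import numpy as jnp

def group_norm(X, num_groups, eps=1e-6):
    # X: (N, H, W, C), channels last; consecutive channels share a group
    N, H, W, C = X.shape
    Xg = X.reshape(N, H, W, num_groups, C // num_groups)
    # Statistics per (example, group): reduce over H, W, in-group channels
    mean = Xg.mean(axis=(1, 2, 4), keepdims=True)
    var = Xg.var(axis=(1, 2, 4), keepdims=True)
    return ((Xg - mean) / jnp.sqrt(var + eps)).reshape(N, H, W, C)

X = jax.random.normal(jax.random.PRNGKey(0), (32, 8, 8, 16))
Y = group_norm(X, num_groups=4)
# No batch axis in the reduction, so one example alone gives the same result
same = jnp.allclose(group_norm(X[:1], num_groups=4), Y[:1], atol=1e-5)
print(same)
```

Because the reduction never touches axis 0, the function is identical in training and prediction and needs no running statistics.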

Recap

BatchNorm normalizes activations to zero mean / unit variance within each minibatch, then rescales with learned \gamma, \beta.
Track running statistics during training; use them at inference (no minibatch at test time).
Enables much deeper networks, higher LRs, faster convergence; mildly regularizing.
Spawned a family — LayerNorm (per-example, used in Transformers), GroupNorm, InstanceNorm.