Lazy Initialization

Lazy initialization lets you declare a layer’s output size without specifying its input size:

nn.LazyLinear(256)   # only out_features; in_features is inferred later

The framework defers allocating the weight tensor until the first forward pass — when it has seen real data and can infer shapes from the upstream output.

Without lazy layers you spell out both sizes: nn.Linear(in_features=20, out_features=256). With them you write nn.LazyLinear(256): less arithmetic, and fewer shape bugs when you change the architecture.
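To make the comparison concrete, here is a minimal sketch that builds both variants and checks that they end up with the same weight shape. The batch of 2 examples with 20 features is an assumption taken from the numbers above, and the variable names are illustrative.

```python
import torch
from torch import nn

X = torch.rand(2, 20)  # assumed batch: 2 examples, 20 features each

eager = nn.Linear(in_features=20, out_features=256)  # input size written by hand
lazy = nn.LazyLinear(256)                            # input size left to inference

lazy(X)  # first forward pass: in_features is taken from X.shape[-1]

print(eager.weight.shape)  # torch.Size([256, 20])
print(lazy.weight.shape)   # torch.Size([256, 20])
same_shape = eager.weight.shape == lazy.weight.shape
```

If the upstream feature size later changes from 20 to something else, only the eager line needs editing; the lazy line stays as it is.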

The cascade

declare layer  -->  shapes UNKNOWN, no params yet
       │
       │  nn.LazyLinear(256)
       ▼
declare model  -->  same — placeholders
       │
       │  net = Sequential(...)
       ▼
forward(X)     -->  X.shape known → infer first layer
       │              first layer output → second layer input
       │              ... cascade through the model
       ▼
parameters allocated, model usable, optimizer can see them
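The cascade can be observed directly. The sketch below is one way to do it, assuming a three-layer stack and 20 input features (both made up for illustration); pending is a hypothetical helper, not a PyTorch function.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                    nn.LazyLinear(32), nn.ReLU(),
                    nn.LazyLinear(10))

def pending(model):
    """Return the names of parameters that are still placeholders."""
    return [name for name, p in model.named_parameters()
            if isinstance(p, UninitializedParameter)]

before = pending(net)     # the weights have no shape yet
net(torch.rand(4, 20))    # one forward pass; 20 input features is an assumption
after = pending(net)      # nothing is left to infer

# Each layer's input size equals the previous layer's output size.
shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}
print(before)
print(after)
print(shapes)
```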

Why this matters more than it seems

Hand-counting input dims is painful in real architectures:

  • A CNN’s flattened feature map depends on the input image size and every previous layer’s stride/padding.
  • Adding a layer in the middle changes every following layer’s in_features.
  • Variable-length sequences (RNNs, Transformers) make shapes data-dependent.

Pre-lazy code was full of “compute the flatten size by hand” comments such as 16 × 5 × 5 = 400. Lazy init removes that bookkeeping: declare outputs, let inputs come from data.
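For a sense of the bookkeeping involved, the sketch below reproduces a 400 by hand. It assumes a LeNet-style stack (28×28 input, two 5×5 convolutions, two 2×2 poolings, 16 final channels); those layer settings are an assumption chosen to match the number above, and conv_out is a hypothetical helper.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                   # assumed 28x28 input image
size = conv_out(size, kernel=5, padding=2)  # conv 5x5, padding 2 -> 28
size = conv_out(size, kernel=2, stride=2)   # pool 2x2, stride 2  -> 14
size = conv_out(size, kernel=5)             # conv 5x5            -> 10
size = conv_out(size, kernel=2, stride=2)   # pool 2x2, stride 2  -> 5

channels = 16                               # output channels of the last conv
flat_features = channels * size * size      # 16 * 5 * 5
print(flat_features)                        # 400
```

Changing the input resolution or any kernel, stride, or padding value changes the result, which is exactly the arithmetic a lazy layer performs for you on the first forward pass.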

Setup

from d2l import torch as d2l
import torch
from torch import nn
net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

Before forward: no parameters yet

Inspect the first layer’s weight. It is a placeholder, not an allocated tensor:

net[0].weight
<UninitializedParameter>

The framework has registered the intent to create a weight, but can’t allocate one until it sees the input shape.
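A small sketch of what "placeholder" means in practice, assuming a recent PyTorch version: the weight is an UninitializedParameter, and asking for its shape fails because there is none yet. The variable names are illustrative.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

w = net[0].weight
is_placeholder = isinstance(w, UninitializedParameter)
print(type(w).__name__, is_placeholder)

# The output size was declared up front; only the input size is unknown.
declared_out = net[0].out_features

# Reading the shape of a placeholder raises, since no tensor has been allocated.
try:
    w.shape
    shape_error = None
except RuntimeError as err:
    shape_error = err
print(shape_error is not None)
```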

One forward pass materializes everything

Pass a batch with the intended feature size through the network. The framework now knows X.shape == (2, 20), so the first layer becomes Linear(20, 256); its output has 256 features, so the second layer becomes Linear(256, 10):

X = torch.rand(2, 20)
net(X)

net[0].weight.shape
torch.Size([256, 20])

After this, every layer has a concrete weight and bias that you can inspect, save, and optimize.
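As a quick check of that claim, the sketch below lists the layer types and the state_dict shapes after one forward pass. It reuses the two-layer network and the (2, 20) input from this section; the variable names are illustrative.

```python
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
net(torch.rand(2, 20))  # dry run materializes every parameter

# After the first forward pass the lazy layers are ordinary nn.Linear layers.
layer_types = [type(m).__name__ for m in net]
print(layer_types)

# state_dict() now holds concrete tensors that can be saved or loaded.
shapes = {k: tuple(v.shape) for k, v in net.state_dict().items()}
for name, shape in shapes.items():
    print(name, shape)
```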

Tying lazy init to a custom initializer

Lazy init combines naturally with custom initialization. Run one forward pass to materialize the parameters, then apply your initializer:

@d2l.add_to_class(d2l.Module)
def apply_init(self, inputs, init=None):
    self.forward(*inputs)
    if init is not None:
        self.net.apply(init)

This is what d2l.Module.apply_init(...) does behind the scenes. The same pattern works for loading pretrained weights, swapping random init for a curated one, etc.
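The same two steps can be written without the d2l helper. The following is a sketch in plain PyTorch; init_normal is a hypothetical initializer (small Gaussian weights, zero biases) chosen only for illustration.

```python
import torch
from torch import nn

def init_normal(module):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

net(torch.rand(2, 20))  # step 1: dry run so that every weight has a shape
net.apply(init_normal)  # step 2: the initializer now sees real tensors

bias_is_zero = bool((net[0].bias == 0).all())
max_abs_weight = net[0].weight.abs().max().item()
print(bias_is_zero, max_abs_weight)
```

The order matters: applied before the dry run, the initializer would be handed placeholders with no shape to fill.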

Recap

  • Lazy init: declare layer outputs, let inputs come from data.
  • Parameters are allocated on the first forward pass, once the input shape is known.
  • Saves you from hand-computing in_features for every layer in deep / variable-shape architectures.
  • Combine with custom initialization by doing one dummy forward pass and then applying the initializer; apply_init does both.
  • Limitation: create the optimizer, e.g. torch.optim.SGD(net.parameters(), lr=...), only after the parameters exist, so pass data through once first.
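The ordering in the last point can be sketched as follows: dry run first, optimizer second, then training steps as usual. The data, learning rate, and loss are made up for illustration.

```python
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
X = torch.rand(8, 20)               # made-up inputs
y = torch.randint(0, 10, (8,))      # made-up class labels

net(X)  # dry run: the parameters now exist
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

before = net[2].bias.detach().clone()
loss = nn.functional.cross_entropy(net(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# One SGD step has moved the (previously placeholder) parameters.
changed = not torch.equal(before, net[2].bias.detach())
print(changed)
```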