Lazy Initialization

Lazy initialization lets you declare a layer’s output size without specifying its input size:

nn.LazyLinear(256)   # only num_outputs!

The framework defers allocating the weight tensor until the first forward pass — when it has seen real data and can infer shapes from the upstream output.
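To make the deferral concrete, here is a minimal pure-Python sketch of a lazily built dense layer. `ToyLazyDense` and its `parameters` method are hypothetical names used only for illustration, not the API of any framework; a real layer would also pick a device and use a proper initializer.

```python
import random


class ToyLazyDense:
    """Illustrative lazy dense layer: weights are allocated on the first call."""

    def __init__(self, num_outputs):
        self.num_outputs = num_outputs
        self.weight = None  # placeholder until the input width is known
        self.bias = None

    def parameters(self):
        # Empty until the first forward pass has run.
        return [] if self.weight is None else [self.weight, self.bias]

    def __call__(self, batch):
        if self.weight is None:
            num_inputs = len(batch[0])  # inferred from the data, not declared
            self.weight = [
                [random.gauss(0.0, 0.01) for _ in range(self.num_outputs)]
                for _ in range(num_inputs)
            ]
            self.bias = [0.0] * self.num_outputs
        # Plain matrix multiply plus bias: (batch, in) @ (in, out) + (out,)
        return [
            [
                sum(x * w_row[j] for x, w_row in zip(row, self.weight)) + self.bias[j]
                for j in range(self.num_outputs)
            ]
            for row in batch
        ]


layer = ToyLazyDense(256)
before = len(layer.parameters())        # nothing allocated yet
out = layer([[0.5] * 20, [1.0] * 20])   # batch of 2 rows, 20 features each
weight_shape = (len(layer.weight), len(layer.weight[0]))
```

Only the constructor argument (256) is declared; the 20 comes from the first batch the layer sees.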

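Without lazy initialization you write nn.Linear(in_features=20, out_features=256) and must know the 20 up front. With it you write nn.LazyLinear(256): less arithmetic, and fewer bugs when you change the architecture. In Keras, which the code below uses, tf.keras.layers.Dense(256) already behaves this way.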

The cascade

declare layer  -->  shapes UNKNOWN, no params yet
       │
       │  nn.LazyLinear(256)
       ▼
declare model  -->  same — placeholders
       │
       │  net = Sequential(...)
       ▼
forward(X)     -->  X.shape known → infer first layer
       │              first layer output → second layer input
       │              ... cascade through the model
       ▼
parameters allocated, model usable, optimizer can see them
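The cascade in the diagram can be sketched in a few lines of plain Python. The helper `infer_dense_shapes` is a hypothetical name; it only does the shape bookkeeping for a stack of dense layers and allocates nothing.

```python
def infer_dense_shapes(input_width, layer_outputs):
    """Return (weight_shape, bias_shape) per dense layer, front to back."""
    shapes = []
    width = input_width  # known only once real data arrives
    for num_outputs in layer_outputs:
        shapes.append(((width, num_outputs), (num_outputs,)))
        width = num_outputs  # this layer's output feeds the next layer
    return shapes


# An input with 20 features flowing through layers declared with 256 and 10 outputs.
shapes = infer_dense_shapes(20, [256, 10])
print(shapes)
```

Each layer needs only the width handed down by the layer before it, which is why a single forward pass is enough to resolve the whole model.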

Why this matters more than it seems

Hand-counting input dims is painful in real architectures:

A CNN’s flattened feature map depends on the input image size and every previous layer’s stride/padding.
Adding a layer in the middle changes every following layer’s in_features.
Variable-length sequences (RNNs, Transformers) make shapes data-dependent.

Pre-lazy code was full of “compute the flatten size by hand” comments such as 16 × 5 × 5 = 400. Lazy init removes that bookkeeping: declare outputs, let inputs come from data.
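As an illustration of the arithmetic being avoided, the sketch below derives the 400 by hand. It assumes a LeNet-style stack (28×28 input, two 5×5 convolutions with padding 2 and then 0, each followed by 2×2 pooling with stride 2, and 16 final channels). The text does not say which network the number comes from, so treat this configuration as an assumption.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                    # assumed input height and width
size = conv_out(size, kernel=5, padding=2)   # conv 5x5, padding 2 -> 28
size = conv_out(size, kernel=2, stride=2)    # pool 2x2, stride 2  -> 14
size = conv_out(size, kernel=5)              # conv 5x5            -> 10
size = conv_out(size, kernel=2, stride=2)    # pool 2x2, stride 2  -> 5
channels = 16
flat_features = channels * size * size       # 16 * 5 * 5 = 400
print(flat_features)
```

Start from a 32×32 input instead and the same stack flattens to 576 features, so every hand-written in_features downstream would have to be edited.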

Setup

import tensorflow as tf

net = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(10),
])

Before forward: no parameters yet

Inspect each layer’s weights. The lists are empty because nothing has been allocated yet:

[layer.get_weights() for layer in net.layers]

[[], []]

The framework has registered the intent to create a weight, but can’t allocate one until it sees the input shape.

One forward pass materializes everything

Pass a batch through the network. Once the framework sees X.shape == (2, 20), it infers that the first layer’s kernel is (20, 256); that layer outputs 256 features, so the second layer’s kernel is (256, 10):

X = tf.random.uniform((2, 20))
net(X)
[w.shape for w in net.get_weights()]

[(20, 256), (256,), (256, 10), (10,)]

After this, every layer has a concrete weight and bias that you can inspect, save, and optimize.
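A quick sanity check on the shapes printed above: multiplying out each shape and summing gives the model's total parameter count. This is plain arithmetic on the printed shapes, not a framework call.

```python
import math

# Shapes as printed after the forward pass: kernel and bias for each Dense layer.
shapes = [(20, 256), (256,), (256, 10), (10,)]

per_tensor = [math.prod(shape) for shape in shapes]
total = sum(per_tensor)
print(per_tensor, total)
```

The first kernel dominates the count, and its size depends directly on the inferred input width of 20.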

Tying lazy init to a custom initializer

Lazy initialization combines naturally with custom initialization: run one forward pass to materialize the parameters, then apply your initializer.

This is what d2l.Module.apply_init(...) does behind the scenes. The same pattern works for loading pretrained weights, swapping random init for a curated one, etc.
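A minimal sketch of the materialize-then-initialize pattern in Keras, using the same two-layer network as above. The standalone `apply_init` function here is a hypothetical helper written for this example, not the d2l method itself, and the choice of a normal initializer with standard deviation 0.01 is an assumption.

```python
import tensorflow as tf


def apply_init(net, inputs, initializer=None):
    """Hypothetical helper: materialize parameters, then re-initialize kernels."""
    net(inputs)  # one forward pass builds every layer
    if initializer is None:
        return
    for layer in net.layers:
        if isinstance(layer, tf.keras.layers.Dense):
            kernel = layer.kernel
            # Draw fresh values with the now-known shape and overwrite in place.
            kernel.assign(initializer(shape=kernel.shape, dtype=kernel.dtype))


net = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(10),
])
X = tf.random.uniform((2, 20))
apply_init(net, X, tf.keras.initializers.RandomNormal(stddev=0.01))
kernel_shapes = [tuple(layer.kernel.shape) for layer in net.layers]
```

In Keras the more common route is to pass kernel_initializer= when constructing the layer, since Keras applies it at build time anyway. The explicit version above is mainly useful when the new values are only available after the model has been defined, as with pretrained weights.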

Recap

Lazy init: declare layer outputs, let inputs come from data.
Parameter buffers are allocated on the first forward pass after seeing the input shape.
Saves you from hand-computing in_features for every layer in deep / variable-shape architectures.
Combine with custom initialization by doing one dummy forward pass and then applying the initializer, as apply_init does.
Limitation: you can’t construct an optimizer such as optim.SGD(net.parameters()) until the parameters exist, so pass data through once first.