Parameter Management

Managing Parameters

A neural network is a tree of parameters: the weight matrices and bias vectors that gradient descent updates. Training is one thing you do with them; this deck covers the others.

  • Inspection — debug, sanity-check init, view features.
  • Iteration — optimizers, weight decay, checkpointing all walk every parameter.
  • Sharing (“tying”) — make two layers refer to the same tensor (tied embeddings, autoencoders).

The parameter tree

A nested module is just a tree. Each module is a node; each parameter is a leaf:

net  (Sequential)
├─ 0: Linear      ├─ weight  (8, 4)
│                 └─ bias    (8,)
├─ 1: ReLU         (no params)
└─ 2: Linear      ├─ weight  (1, 8)
                  └─ bias    (1,)

Two access patterns:

  • By path: net[2].weight — direct.
  • By traversal: walk the tree, yield every leaf.

Frameworks give you both, plus serialization built on the same traversal.
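To make the two access patterns concrete, here is a minimal sketch of a hand-written traversal next to the built-in one. The helper walk and the model tree_net are hypothetical names introduced for illustration, and the layers use explicit input sizes (an assumption) so that shapes are known without a forward pass. It is a simplified picture of what the framework does, not its actual implementation.

```python
import torch
from torch import nn

def walk(module, prefix=""):
    """Yield (dotted_name, parameter) for every leaf under `module`."""
    # Parameters owned directly by this node
    for name, param in module.named_parameters(recurse=False):
        yield prefix + name, param
    # Recurse into child modules, extending the dotted path
    for child_name, child in module.named_children():
        yield from walk(child, prefix + child_name + ".")

# Hypothetical model with explicit sizes, mirroring the tree drawn above
tree_net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

by_path = tree_net[2].weight.shape                    # direct access by path
by_walk = [name for name, _ in walk(tree_net)]        # hand-written traversal
builtin = [name for name, _ in tree_net.named_parameters()]  # framework traversal
```

On this model the hand-written walk and named_parameters() visit the same leaves in the same order; the built-in version additionally handles details such as shared parameters.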

A toy model

import torch
from torch import nn
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape
torch.Size([2, 1])
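The forward pass above matters because lazy layers have no weight shape until data flows through them. The following sketch, using a standalone layer named lazy (a name chosen for illustration), shows the state before and after the first call.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

lazy = nn.LazyLinear(8)
# Before any input is seen, the weight is a placeholder without a shape
uninit_before = isinstance(lazy.weight, UninitializedParameter)

lazy(torch.rand(2, 4))  # first forward pass infers in_features = 4
uninit_after = isinstance(lazy.weight, UninitializedParameter)
weight_shape = tuple(lazy.weight.shape)
```

This is why the deck calls net(X) once before inspecting any parameters.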

Direct access

Index into a Sequential like a list; each layer exposes its parameters as attributes (net[2].weight, net[2].bias) and, keyed by name, through state_dict():

net[2].state_dict()
OrderedDict([('weight',
              tensor([[-0.1232,  0.1454,  0.2363, -0.3100, -0.1172, -0.2252, -0.2725,  0.2280]])),
             ('bias', tensor([0.3460]))])

Two parameters per Linear layer: a weight matrix and a bias vector. state_dict() returns plain tensors keyed by name. Accessing the attribute directly (net[2].bias) returns a Parameter (PyTorch) or similar wrapper that carries the tensor, its gradient, and extra metadata.

Tensor inside the parameter

.data (PyTorch) unwraps the parameter to a plain tensor for inspection:

type(net[2].bias), net[2].bias.data
(torch.nn.parameter.Parameter, tensor([0.3460]))

.grad is the gradient buffer — populated by backward(), otherwise None. Useful for custom optimizers or diagnosing dead neurons:

net[2].weight.grad is None
True
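A small sketch of the gradient buffer's life cycle, assuming a standalone layer (named layer here for illustration) and a sum of outputs as a stand-in loss:

```python
import torch
from torch import nn

layer = nn.Linear(4, 1)
grad_before = layer.weight.grad      # None: backward() has not run yet

x = torch.rand(2, 4)
layer(x).sum().backward()            # stand-in loss: sum of all outputs
grad_after = layer.weight.grad       # now a tensor shaped like the weight

# For this loss, d(loss)/d(weight) is the sum of the input rows
expected = x.sum(dim=0, keepdim=True)
```

A parameter whose .grad stays None or all-zero after backward() is a hint that it is not contributing to the loss.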

Recursive traversal

For everything-at-once, use named_parameters(). It walks the whole tree and yields (name, param) pairs at the leaves — names use dotted paths through the nesting:

[(name, param.shape) for name, param in net.named_parameters()]
[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]

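parameters() yields the same leaves without the names; it is the iterator optim.SGD(net.parameters(), …) consumes. state_dict() collects the same parameters (plus any buffers) into an ordered dict, which is what gets serialized when you save a checkpoint. Walk the tree once, use it many ways.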
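As an illustration of reusing the traversal, the sketch below splits parameters into optimizer groups by name and round-trips a checkpoint through memory. The model, the rule of skipping weight decay on biases, and the hyperparameter values are all assumptions chosen for the example, not recommendations from the deck.

```python
import io
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Use the dotted names to sort parameters into two optimizer groups
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},   # illustrative value
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1,
)

# Checkpoint round trip through an in-memory buffer instead of a file
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

restored = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
restored.load_state_dict(torch.load(buffer))
```

Both steps rely on the same names and the same leaves that named_parameters() reports.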

Parameter tying

Reuse the same module instance at multiple positions in your architecture, and the framework treats them as one parameter set — same memory, gradients accumulate across uses.

Common cases:

  • Tied embeddings: input embedding and output softmax projection in a language model share weights, saving |V| · d parameters (vocabulary size |V|, embedding dimension d).
  • Autoencoders: decoder uses transposed encoder weights.
  • Recurrent layers: same kernel applied at every time step (the original tying mechanism).
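The tied-embedding case can be sketched as follows. TinyLM, its attribute names, and its sizes are hypothetical and chosen only to keep the example small; a real language model would have layers between the embedding and the projection.

```python
import torch
from torch import nn

class TinyLM(nn.Module):
    """Hypothetical toy model; names and sizes are illustrative only."""
    def __init__(self, vocab_size=10, dim=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)          # weight: (V, d)
        self.proj = nn.Linear(dim, vocab_size, bias=False)  # weight: (V, d)
        self.proj.weight = self.embed.weight                # tie: one Parameter, two names

    def forward(self, tokens):
        return self.proj(self.embed(tokens))

lm = TinyLM()
logits = lm(torch.tensor([[1, 2, 3]]))
n_params = sum(p.numel() for p in lm.parameters())   # shared weight counted once
```

Because both weights have shape (V, d), the assignment needs no transpose; with V = 10 and d = 4 the model holds 40 parameters instead of 80.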

# Give the shared layer a name so that we can refer to its parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
# Check whether the parameters are the same object (tied, not just equal)
assert net[2].weight is net[4].weight
net[2].weight.data[0, 0] = 100
# Modifying one affects the other since they share the same tensor
assert net[2].weight.data[0, 0] == net[4].weight.data[0, 0]

Modify net[2].weight and net[4].weight reflects the same change — they are the same tensor, not just equal.
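To check the claim that gradients accumulate across uses, this sketch compares a tied pair of layers against an untied reference that starts from identical weights. The layer sizes, the names tied, untied, first and second, and the sum-of-outputs loss are assumptions made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
shared = nn.Linear(3, 3, bias=False)
tied = nn.Sequential(shared, shared)     # same instance used twice

# Untied reference: two separate layers initialized to the same weights
first = nn.Linear(3, 3, bias=False)
second = nn.Linear(3, 3, bias=False)
with torch.no_grad():
    first.weight.copy_(shared.weight)
    second.weight.copy_(shared.weight)
untied = nn.Sequential(first, second)

x = torch.rand(2, 3)
tied(x).sum().backward()
untied(x).sum().backward()

# The tied gradient should equal the sum of the per-use gradients
grad_sum = first.weight.grad + second.weight.grad
n_tied_params = len(list(tied.parameters()))   # shared parameter listed once
```

The single gradient on shared.weight matches the two untied gradients added together, and the traversal reports the shared parameter only once.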

Recap

  • A module is a tree; parameters live at the leaves.
  • Direct access: net[i].weight, .bias, .grad.
  • Recursive traversal: named_parameters() and state_dict() walk the whole tree.
  • Same iterator powers optimizers, weight decay, checkpointing.
  • Tied parameters = reuse the same module instance — gradients accumulate; one buffer in memory.