Parameter Management

Managing Parameters

A neural network is a tree of parameters: the weight matrices and bias vectors that gradient descent updates. Training is one thing you do with them; this deck covers the others.

Inspection — debug, sanity-check init, view features.
Iteration — optimizers, weight decay, checkpointing all walk every parameter.
Sharing (“tying”) — make two layers refer to the same tensor (tied embeddings, autoencoders).

The parameter tree

A nested module is just a tree. Each module is a node; each parameter is a leaf:

net  (Sequential)
├─ 0: Flatten      (no params)
├─ 1: Dense       ├─ kernel  (4, 8)
│                 └─ bias    (8,)
└─ 2: Dense       ├─ kernel  (8, 1)
                  └─ bias    (1,)

Two access patterns:

By path: net.layers[2].weights goes straight to one layer's parameters.
By traversal: walk the tree, yield every leaf.

Frameworks give you both, plus serialization built on the same traversal.
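Both access patterns can be sketched without any framework. In the minimal example below, a nested dict stands in for the module tree; the dict layout and the walk helper are hypothetical illustrations, not Keras internals, and the shapes mirror this deck's toy model (Flatten, Dense(8), Dense(1) on 4 input features).

```python
import numpy as np

# Hypothetical stand-in for a module tree: dicts are modules, arrays are leaves.
tree = {
    "0": {},  # Flatten: no parameters
    "1": {"kernel": np.zeros((4, 8)), "bias": np.zeros(8)},
    "2": {"kernel": np.zeros((8, 1)), "bias": np.zeros(1)},
}

def walk(node, prefix=""):
    """Yield (dotted_path, leaf) for every array in the tree, depth first."""
    for key, child in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(child, dict):
            yield from walk(child, path)
        else:
            yield path, child

# By path: a direct lookup.
w2 = tree["2"]["kernel"]

# By traversal: visit every leaf exactly once.
shapes = {path: leaf.shape for path, leaf in walk(tree)}
total = sum(leaf.size for _, leaf in walk(tree))
print(shapes)
print(total)  # 32 + 8 + 8 + 1 = 49
```

Counting parameters, applying an update, and writing a checkpoint can all be phrased as a loop over walk(tree), which is why one traversal serves so many purposes.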

A toy model

import tensorflow as tf

net = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation=tf.nn.relu),
    tf.keras.layers.Dense(1),
])

X = tf.random.uniform((2, 4))
net(X).shape

TensorShape([2, 1])

Direct access

Index into net.layers like a list; each layer exposes its parameters through its .weights attribute:

net.layers[2].weights

[<Variable path=sequential/dense_1/kernel, shape=(8, 1), dtype=float32, value=[[ 0.15129125]
  [ 0.3634373 ]
  [-0.5235572 ]
  [-0.43507627]
  [-0.41616136]
  [ 0.6064105 ]
  [-0.5269911 ]
  [ 0.38409138]]>,
 <Variable path=sequential/dense_1/bias, shape=(1,), dtype=float32, value=[0.]>]

Two parameters per Dense layer: a kernel (weight matrix) and a bias vector. Each is a Variable, the Keras counterpart of PyTorch's Parameter: a wrapper that carries the tensor plus metadata such as its path, shape, dtype, and trainable flag.
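To see what the wrapper carries in Keras, the short sketch below rebuilds the toy model and reads a few attributes off the last layer's variables. It assumes a Dense layer, which exposes its two variables as .kernel and .bias in addition to the .weights list.

```python
import tensorflow as tf

net = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
net(tf.random.uniform((2, 4)))  # the first call creates the variables

last = net.layers[2]
# Each variable carries its name, shape, and a trainable flag
# alongside the values themselves.
for v in (last.kernel, last.bias):
    print(v.name, tuple(v.shape), v.trainable)
```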

Tensor inside the parameter

tf.convert_to_tensor unwraps the variable to a plain tensor for inspection (the counterpart of .data in PyTorch):

type(net.layers[2].weights[1]), tf.convert_to_tensor(net.layers[2].weights[1])

(keras.src.backend.Variable,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>)

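In PyTorch, .grad is the gradient buffer, populated by backward() and None otherwise. TensorFlow does not store gradients on the variable; they are computed on demand with tf.GradientTape. Either way, gradients are useful for custom optimizers or diagnosing dead neurons.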
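In TensorFlow, the gradient for each parameter can be obtained with tf.GradientTape. The sketch below is a minimal illustration on the toy model; the summed-output loss is an assumption chosen only to produce a scalar, and the all-zero-gradient check is just one possible symptom of a dead ReLU unit on a given batch.

```python
import tensorflow as tf

net = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
X = tf.random.uniform((2, 4))
net(X)  # build the layers so the variables exist before recording

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(net(X))  # toy scalar objective (assumption)

# One gradient tensor per variable, in the same order and with the same shapes.
grads = tape.gradient(loss, net.trainable_variables)
for var, grad in zip(net.trainable_variables, grads):
    print(tuple(var.shape), tuple(grad.shape))

# grads[0] belongs to the hidden layer's kernel, shape (4, 8).
# A column of exact zeros means that hidden unit received no signal on this batch.
hidden_kernel_grad = grads[0]
dead = tf.reduce_all(tf.equal(hidden_kernel_grad, 0.0), axis=0)
print(dead.numpy())
```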

Recursive traversal

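For everything at once, use get_weights(). It walks the whole tree and returns every parameter value as a flat list of NumPy arrays, in layer order. The names are dropped; net.weights returns the Variable objects themselves, whose paths spell out the nesting (the role named_parameters() plays in PyTorch):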

net.get_weights()

[array([[ 0.16767919,  0.3248908 ,  0.5219191 ,  0.20520914,  0.47057313,
         -0.1526336 ,  0.6161278 , -0.5017144 ],
        [ 0.3268463 ,  0.52250785,  0.04043335,  0.6704082 , -0.22528923,
          0.40379363, -0.63631016, -0.36722666],
        [ 0.52791613, -0.19134533, -0.5103644 ,  0.40397602,  0.18622148,
          0.47954518,  0.20551544,  0.28378856],
...
        [-0.43507627],
        [-0.41616136],
        [ 0.6064105 ],
        [-0.5269911 ],
        [ 0.38409138]], dtype=float32),
 array([0.], dtype=float32)]
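When the names matter, the same traversal is available as Variable objects through net.weights. The sketch below pairs each variable's name with its shape; it assumes Keras 3 exposes the full path as .path and falls back to .name for older tf.keras versions.

```python
import tensorflow as tf

net = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
net(tf.random.uniform((2, 4)))  # the first call creates the variables

# net.weights is the flat, ordered list of variables behind get_weights().
# Assumption: Keras 3 variables have .path; older versions only have .name.
named = {getattr(v, "path", v.name): tuple(v.shape) for v in net.weights}
for name, shape in named.items():
    print(name, shape)
```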

The optimizer consumes the same traversal through net.trainable_variables, and net.save_weights() serializes it when you write a checkpoint (in PyTorch: optim.SGD(net.parameters(), …) and state_dict()). Walk the tree once, use it many ways.
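To make the "walk every parameter" idea concrete, here is a hand-written SGD step with weight decay that loops over net.trainable_variables. It is a sketch rather than what a Keras optimizer does internally; the learning rate, decay factor, zero targets, and squared-error loss are all hypothetical choices.

```python
import tensorflow as tf

net = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
X = tf.random.uniform((2, 4))
y = tf.zeros((2, 1))  # hypothetical targets
net(X)                # build the layers

lr, wd = 0.1, 0.01    # hypothetical learning rate and weight-decay factor
before = [v.numpy().copy() for v in net.trainable_variables]

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(net(X) - y))
grads = tape.gradient(loss, net.trainable_variables)

# The update is one loop over every leaf of the parameter tree.
for v, g in zip(net.trainable_variables, grads):
    v.assign_sub(lr * (g + wd * tf.convert_to_tensor(v)))

print(net.count_params())  # 4*8 + 8 + 8*1 + 1 = 49
```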

Parameter tying

Reuse the same module instance at multiple positions in your architecture, and the framework treats them as one parameter set — same memory, gradients accumulate across uses.

Common cases:

Tied embeddings: input embedding and output softmax projection in a language model share weights, which saves |V| · d parameters.
Autoencoders: decoder uses transposed encoder weights.
Recurrent layers: same kernel applied at every time step (the original tying mechanism).
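The saving from tied embeddings can be checked with a few lines of NumPy. In this sketch the vocabulary size V, width d, and token ids are hypothetical; a single matrix E serves as the embedding table on the input side and, transposed, as the output projection.

```python
import numpy as np

V, d = 10_000, 256  # hypothetical vocabulary size and embedding width
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(V, d))  # the single shared matrix

tokens = np.array([3, 17, 42])  # hypothetical token ids
h = E[tokens]                   # input side: embedding lookup, shape (3, d)
logits = h @ E.T                # output side: projection to the vocabulary, shape (3, V)

untied = 2 * V * d              # separate input and output matrices
tied = V * d                    # one shared matrix
print(untied - tied)            # V * d = 2560000 parameters saved
```

The autoencoder case follows the same pattern: the decoder reuses the encoder matrix through a transpose instead of allocating its own.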

# Keras keeps both references to the shared layer in net.layers,
# but the shared layer's parameters are tied
shared = tf.keras.layers.Dense(8, activation=tf.nn.relu)
net = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(8, activation=tf.nn.relu),
    shared,
    shared,
    tf.keras.layers.Dense(1),
])

net(X)
# Check whether the parameters are the same object
print(net.layers[2].weights[0] is net.layers[3].weights[0])

True

Modify net.layers[2].weights[0] and net.layers[3].weights[0] reflects the same change: they are the same variable, not just equal.
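Both consequences of tying, one buffer and summed gradients, can be checked on a case small enough to verify by hand. The sketch below applies a single Dense layer twice by calling it directly; the 1x1 kernel fixed at 3.0 and the input 2.0 are assumptions chosen to keep the arithmetic simple.

```python
import tensorflow as tf

# One layer instance, used twice.
shared = tf.keras.layers.Dense(
    1, use_bias=False,
    kernel_initializer=tf.keras.initializers.Constant(3.0),
)
x = tf.constant([[2.0]])
shared(x)  # build the layer so the kernel exists

with tf.GradientTape() as tape:
    y = tf.reduce_sum(shared(shared(x)))  # y = w * (w * x) = 18

# Each use contributes w * x = 6, and the tape sums them: dy/dw = 2 * w * x = 12.
grad = tape.gradient(y, [shared.kernel])[0]
print(float(y), grad.numpy())

# One buffer: a single assignment changes both uses of the layer.
shared.kernel.assign(tf.constant([[5.0]]))
y_after = shared(shared(x))  # 5 * 5 * 2 = 50
print(y_after.numpy())
```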

Recap

A module is a tree; parameters live at the leaves.
Direct access: net.layers[i].weights for the kernel and bias; gradients are computed separately with tf.GradientTape.
Recursive traversal: net.weights / get_weights() walk the whole tree.
Same iterator powers optimizers, weight decay, checkpointing.
Tied parameters = reuse the same module instance — gradients accumulate; one buffer in memory.