from d2l import tensorflow as d2l
import tensorflow as tf

A single RNN layer is already deep in time, but within one time step the path from input to output passes through just one nonlinearity.
Stacking RNN layers makes the model deep along the layer axis too. Each layer takes the hidden states of the layer below as its input sequence, and the topmost layer feeds the readout.
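$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}).$$
Here $\mathbf{H}_t^{(0)} = \mathbf{X}_t$ is the input at time step $t$, so the bottom layer reads the raw inputs.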
Typical sizes: width 64–2048, depth 1–8.
Layer $l$ at time $t$ depends on layer $l$ at time $t-1$ and layer $l-1$ at time $t$.
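The recurrence above can be traced one time step at a time. The following is a minimal NumPy sketch, not the d2l implementation: `deep_rnn_step` is a hypothetical helper, the activation is assumed to be tanh for every layer, and the shapes and random inputs are made up for illustration.

```python
import numpy as np

def deep_rnn_step(x_t, h_prev, params):
    """Advance every layer by one time step.

    x_t:    (batch, num_inputs) input at time t
    h_prev: list of (batch, num_hiddens) states from time t-1, one per layer
    params: list of (W_xh, W_hh, b_h) tuples, one per layer
    """
    h_new = []
    layer_input = x_t  # plays the role of the layer-0 "state" at time t
    for (W_xh, W_hh, b_h), h in zip(params, h_prev):
        # Same layer at t-1 (h) and the layer below at t (layer_input).
        h_t = np.tanh(layer_input @ W_xh + h @ W_hh + b_h)
        h_new.append(h_t)
        layer_input = h_t  # the layer above reads this state at the same t
    return h_new

rng = np.random.default_rng(0)
batch, num_inputs, num_hiddens, num_layers = 2, 5, 4, 3  # assumed sizes
params = []
for l in range(num_layers):
    in_width = num_inputs if l == 0 else num_hiddens
    params.append((rng.normal(scale=0.1, size=(in_width, num_hiddens)),
                   rng.normal(scale=0.1, size=(num_hiddens, num_hiddens)),
                   np.zeros(num_hiddens)))

# Start from zero states and feed six random inputs.
h = [np.zeros((batch, num_hiddens)) for _ in range(num_layers)]
for t in range(6):
    h = deep_rnn_step(rng.normal(size=(batch, num_inputs)), h, params)
```

After the loop, `h` holds one state per layer, each of shape `(batch, num_hiddens)`.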
A StackedRNNScratch is just a list of RNNScratch cells. Each layer’s input width is num_hiddens, except the bottom layer, which sees the raw inputs of width num_inputs.
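The width rule can be checked without any framework. The sketch below is a plain-Python illustration, not the d2l class: `layer_shapes` is a hypothetical helper, and the input width of 28 is an assumed value standing in for a small character-level vocabulary.

```python
def layer_shapes(num_inputs, num_hiddens, num_layers):
    """Return the per-layer parameter shapes of a stacked RNN."""
    shapes = []
    for i in range(num_layers):
        # Only the bottom layer reads raw inputs; the rest read hidden states.
        in_width = num_inputs if i == 0 else num_hiddens
        shapes.append({"W_xh": (in_width, num_hiddens),
                       "W_hh": (num_hiddens, num_hiddens),
                       "b_h": (num_hiddens,)})
    return shapes

shapes = layer_shapes(num_inputs=28, num_hiddens=32, num_layers=2)
for i, s in enumerate(shapes):
    print(f"layer {i}: {s}")
```

Only `W_xh` of the bottom layer depends on the input width. Every other parameter is sized by `num_hiddens` alone.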
The forward pass walks the layers, feeding each layer’s output sequence into the next.
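The layer-by-layer walk can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the d2l implementation: `rnn_layer` and `stacked_forward` are hypothetical names, the activation is assumed to be tanh, and sequences are assumed to be time-major with shape `(num_steps, batch, width)`.

```python
import numpy as np

def rnn_layer(inputs, params, h0=None):
    """Run one tanh RNN layer over a (num_steps, batch, width) sequence."""
    W_xh, W_hh, b_h = params
    h = np.zeros((inputs.shape[1], W_hh.shape[0])) if h0 is None else h0
    outputs = []
    for x_t in inputs:  # iterate over time steps
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        outputs.append(h)
    return np.stack(outputs), h

def stacked_forward(inputs, all_params, Hs=None):
    """Feed each layer's full output sequence into the layer above it."""
    Hs = [None] * len(all_params) if Hs is None else list(Hs)
    outputs = inputs
    for i, params in enumerate(all_params):
        outputs, Hs[i] = rnn_layer(outputs, params, Hs[i])
    return outputs, Hs

rng = np.random.default_rng(0)
num_steps, batch, num_inputs, num_hiddens, num_layers = 7, 2, 5, 4, 3
all_params = [
    (rng.normal(scale=0.1,
                size=(num_inputs if i == 0 else num_hiddens, num_hiddens)),
     rng.normal(scale=0.1, size=(num_hiddens, num_hiddens)),
     np.zeros(num_hiddens))
    for i in range(num_layers)
]
X = rng.normal(size=(num_steps, batch, num_inputs))
outputs, Hs = stacked_forward(X, all_params)
print(outputs.shape)  # (7, 2, 4)
```

The returned `outputs` is the top layer's sequence, and `Hs` holds the final state of every layer, so a later call can resume from where this one stopped.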
Two-layer stack on The Time Machine, trained with a lower learning rate (lr=2) because deeper recurrent models are harder to optimize:
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
with d2l.try_gpu():
    rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
                                  num_hiddens=32, num_layers=2)
    model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1)
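trainer.fit(model, data)

In PyTorch, nn.GRU(..., num_layers=L, dropout=p) collapses the stack into one library call and adds dropout between layers, a standard regularizer for stacked RNNs. The Keras version below builds the same kind of stack by passing a list of GRUCell objects to tf.keras.layers.RNN: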
class GRU(d2l.RNN):
    """The multilayer GRU model."""
    def __init__(self, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        # One GRUCell per layer; Keras stacks a list of cells automatically.
        gru_cells = [tf.keras.layers.GRUCell(num_hiddens, dropout=dropout)
                     for _ in range(num_layers)]
        self.rnn = tf.keras.layers.RNN(gru_cells, return_sequences=True,
                                       return_state=True)

    def forward(self, X, state=None):
        # Keras RNN layers default to batch-major input, so swap the first
        # two axes on the way in and again on the way out.
        outputs, *state = self.rnn(tf.transpose(X, perm=[1, 0, 2]), state)
        state = [s[0] if isinstance(s, list) else s for s in state]
        return tf.transpose(outputs, perm=[1, 0, 2]), state

A two-layer GRU language model trains with the same Trainer.
In PyTorch, nn.GRU(..., num_layers=L, dropout=p) is the production one-liner.
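The dropout between layers mentioned for the multilayer GRU can be sketched in NumPy. This is an illustration under stated assumptions, not any library's implementation: `dropout` and `stack_with_dropout` are hypothetical helpers, dropout is assumed to act on the output of every layer except the top one, and `np.tanh` stands in for a real recurrent layer.

```python
import numpy as np

def dropout(x, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def stack_with_dropout(inputs, layers, p, rng, training=True):
    """Apply layers in order, with dropout on every output except the top one."""
    outputs = inputs
    for i, layer in enumerate(layers):
        outputs = layer(outputs)
        is_top = i == len(layers) - 1
        if training and not is_top:
            outputs = dropout(outputs, p, rng)
    return outputs

rng = np.random.default_rng(0)
layers = [np.tanh, np.tanh]  # stand-ins for two recurrent layers
X = rng.normal(size=(3, 2, 4))
train_out = stack_with_dropout(X, layers, p=0.5, rng=rng)
eval_out = stack_with_dropout(X, layers, p=0.5, rng=rng, training=False)
```

With `training=False` the stack is deterministic, which matches the usual practice of disabling dropout at evaluation time.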