from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import rnn
npx.set_np()

A single RNN layer is already deep in time, but within one time step, input-to-output is just one nonlinearity.
Stacking RNN layers makes the model deep along the layer axis too. Each layer takes the previous layer’s hidden states as its input sequence; the topmost layer feeds the readout (output layer).
\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}).
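The recurrence above needs a base case and a readout to be complete. A minimal way to write them, assuming a stack of L layers, a minibatch input \mathbf{X}_t with n examples and d features, h hidden units per layer, and q outputs (the names \mathbf{X}_t, \mathbf{W}_{hq}, \mathbf{b}_q, n, d, h, q, and L are notation assumed here, not taken from the text above):

```latex
\mathbf{H}_t^{(0)} = \mathbf{X}_t \in \mathbb{R}^{n \times d}, \qquad
\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h} \quad (l = 1, \ldots, L),

\mathbf{W}_{xh}^{(1)} \in \mathbb{R}^{d \times h}, \qquad
\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h} \quad (l \geq 2), \qquad
\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}, \qquad
\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h},

\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q, \qquad
\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}, \quad
\mathbf{b}_q \in \mathbb{R}^{1 \times q}.
```

Only the top layer's hidden state enters the readout; the lower layers influence the output solely through the layers above them.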
Typical sizes: width (hidden units per layer) 64–2048, depth (number of layers) 1–8.
Layer l at time t depends on layer l at time t-1 and on layer l-1 at time t.
A StackedRNNScratch is just a list of RNNScratch cells — each layer’s input width is num_hiddens (except the bottom layer, which sees raw inputs):
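The input-width rule can be checked with plain arithmetic. The following is a small sketch in pure Python, not the d2l class itself: the helper names `stacked_rnn_shapes` and `count_params` are hypothetical, and the vocabulary size of 28 is an assumed value chosen only for illustration.

```python
def stacked_rnn_shapes(num_inputs, num_hiddens, num_layers):
    """Return the (W_xh, W_hh, b_h) shapes of each layer in a stacked RNN."""
    shapes = []
    for layer in range(num_layers):
        # Only the bottom layer sees raw inputs; every other layer reads
        # the hidden states of the layer directly below it.
        in_width = num_inputs if layer == 0 else num_hiddens
        shapes.append({
            "W_xh": (in_width, num_hiddens),
            "W_hh": (num_hiddens, num_hiddens),
            "b_h": (num_hiddens,),
        })
    return shapes


def count_params(shapes):
    """Total number of scalar parameters across all layers."""
    total = 0
    for layer in shapes:
        for shape in layer.values():
            size = 1
            for dim in shape:
                size *= dim
            total += size
    return total


# Assumed sizes: 28 input features, 32 hidden units, 2 layers.
shapes = stacked_rnn_shapes(num_inputs=28, num_hiddens=32, num_layers=2)
for i, layer in enumerate(shapes):
    print(i, layer)
print(count_params(shapes))  # 1952 + 2080 = 4032
```

With these sizes the bottom layer holds 28·32 + 32·32 + 32 = 1952 parameters and the second layer 32·32 + 32·32 + 32 = 2080, so extra depth costs roughly 2h² parameters per layer regardless of the input width.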
The forward pass walks the layers, feeding each layer’s output sequence into the next:
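The layer-by-layer walk might look roughly like this in plain NumPy. This is a standalone sketch rather than the d2l implementation: the function names are hypothetical, tanh is assumed as the activation, and sequences are assumed to have shape (num_steps, batch_size, width).

```python
import numpy as np


def init_stack(num_inputs, num_hiddens, num_layers, sigma=0.01, seed=0):
    """Create (W_xh, W_hh, b_h) for each layer of the stack."""
    rng = np.random.default_rng(seed)
    params = []
    for layer in range(num_layers):
        in_width = num_inputs if layer == 0 else num_hiddens
        params.append((
            rng.normal(0.0, sigma, (in_width, num_hiddens)),     # W_xh
            rng.normal(0.0, sigma, (num_hiddens, num_hiddens)),  # W_hh
            np.zeros(num_hiddens),                               # b_h
        ))
    return params


def rnn_layer(inputs, state, W_xh, W_hh, b_h):
    """Run one tanh RNN layer over a whole sequence."""
    if state is None:
        state = np.zeros((inputs.shape[1], W_hh.shape[0]))
    outputs = []
    for X in inputs:  # one iteration per time step
        state = np.tanh(X @ W_xh + state @ W_hh + b_h)
        outputs.append(state)
    return np.stack(outputs), state


def stacked_forward(inputs, params, states=None):
    """Feed each layer's output sequence into the layer above it."""
    if states is None:
        states = [None] * len(params)
    outputs = inputs
    new_states = []
    for layer_params, state in zip(params, states):
        outputs, state = rnn_layer(outputs, state, *layer_params)
        new_states.append(state)  # keep the final state of every layer
    return outputs, new_states


params = init_stack(num_inputs=5, num_hiddens=4, num_layers=3)
X = np.random.default_rng(1).normal(size=(7, 2, 5))  # (steps, batch, inputs)
outputs, states = stacked_forward(X, params)
print(outputs.shape, len(states), states[0].shape)  # (7, 2, 4) 3 (2, 4)
```

The returned `outputs` is the top layer's hidden-state sequence, which is what a readout would consume. The list `states` holds one final state per layer, so a later call can resume every layer where it stopped.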
Two-layer stack on The Time Machine, trained with a lower learning rate (lr=2) because deeper recurrent models are harder to optimize:
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
                              num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)

rnn.GRU(..., num_layers=L, dropout=p) collapses the stack into one library call and adds dropout between layers, the standard regularizer for stacked RNNs:
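To make "dropout between layers" concrete, here is a NumPy sketch of the idea, assuming inverted dropout applied to the output sequence of every layer except the last, and only during training. It uses a plain tanh RNN instead of a GRU for brevity, all function names are hypothetical, and the library's fused kernels may differ in detail.

```python
import numpy as np


def dropout(x, p, rng):
    """Inverted dropout: zero entries with probability p, rescale the rest."""
    if p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)


def run_layer(inputs, W_xh, W_hh, b_h):
    """One tanh RNN layer over a (num_steps, batch_size, width) sequence."""
    state = np.zeros((inputs.shape[1], W_hh.shape[0]))
    outputs = []
    for X in inputs:
        state = np.tanh(X @ W_xh + state @ W_hh + b_h)
        outputs.append(state)
    return np.stack(outputs)


def stacked_forward(inputs, params, p=0.0, training=True, rng=None):
    """Stacked forward pass with dropout on all but the top layer's output."""
    if rng is None:
        rng = np.random.default_rng()
    outputs = inputs
    for i, layer_params in enumerate(params):
        outputs = run_layer(outputs, *layer_params)
        is_last = i == len(params) - 1
        if training and not is_last:
            outputs = dropout(outputs, p, rng)
    return outputs


rng = np.random.default_rng(0)
params = [(rng.normal(0, 0.5, (in_width, 4)),  # W_xh
           rng.normal(0, 0.5, (4, 4)),         # W_hh
           np.zeros(4))                        # b_h
          for in_width in (5, 4)]
X = rng.normal(size=(6, 3, 5))

train_out = stacked_forward(X, params, p=0.5, rng=rng)
eval_out = stacked_forward(X, params, p=0.5, training=False)
no_drop = stacked_forward(X, params, p=0.0)
print(train_out.shape, np.allclose(eval_out, no_drop))  # (6, 3, 4) True
```

Dropout here acts only on the vertical connections (layer to layer), not on the recurrent connection within a layer, so the hidden state carried through time is left intact.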
Two-layer GRU LM, same Trainer:
Decode from a prefix:
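Prefix decoding with a stacked model has to carry one hidden state per layer from step to step. The NumPy sketch below illustrates this with greedy (argmax) decoding; it is not the d2l `predict` method. It assumes a toy vocabulary, random untrained weights, a tanh RNN, and a non-empty prefix, so the generated text is meaningless and only the control flow matters.

```python
import numpy as np

vocab = list("abcdefgh ")  # toy vocabulary, assumed for illustration
char_to_idx = {ch: i for i, ch in enumerate(vocab)}


def init_model(vocab_size, num_hiddens, num_layers, seed=0):
    """Random (untrained) stacked tanh RNN plus a linear readout."""
    rng = np.random.default_rng(seed)
    layers = []
    for layer in range(num_layers):
        in_width = vocab_size if layer == 0 else num_hiddens
        layers.append((rng.normal(0, 0.1, (in_width, num_hiddens)),
                       rng.normal(0, 0.1, (num_hiddens, num_hiddens)),
                       np.zeros(num_hiddens)))
    W_hq = rng.normal(0, 0.1, (num_hiddens, vocab_size))
    b_q = np.zeros(vocab_size)
    return layers, W_hq, b_q


def step(token, states, layers, W_hq, b_q):
    """Advance every layer by one time step; return logits and new states."""
    x = np.zeros(len(b_q))
    x[token] = 1.0  # one-hot encoding of the input token
    new_states = []
    for (W_xh, W_hh, b_h), h in zip(layers, states):
        x = np.tanh(x @ W_xh + h @ W_hh + b_h)  # output feeds the next layer
        new_states.append(x)
    return x @ W_hq + b_q, new_states


def predict(prefix, num_preds, layers, W_hq, b_q):
    """Warm up on a non-empty prefix, then decode greedily."""
    states = [np.zeros(W_hh.shape[0]) for _, W_hh, _ in layers]
    for ch in prefix:  # warm-up: only the states and last logits are kept
        logits, states = step(char_to_idx[ch], states, layers, W_hq, b_q)
    out = list(prefix)
    for _ in range(num_preds):
        token = int(np.argmax(logits))  # greedy choice of the next token
        out.append(vocab[token])
        logits, states = step(token, states, layers, W_hq, b_q)
    return "".join(out)


layers, W_hq, b_q = init_model(len(vocab), num_hiddens=8, num_layers=2)
text = predict("abc", 5, layers, W_hq, b_q)
print(repr(text))
```

Unlike a layer-by-layer pass over a whole sequence, decoding must proceed time step by time step, because each generated token is the next input to the bottom layer.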
rnn.GRU(..., num_layers=L, dropout=p) is the production one-liner.