Concise Implementation of Recurrent Neural Networks

Concise RNNs

The same character-level language model, now built on the framework’s built-in rnn.RNN layer (from mxnet.gluon). The hand-written cell, unroll loop, and output projection boil down to a few lines:

rnn.RNN(num_hiddens) handles the recurrence (the input size is inferred on the first call), including cuDNN-accelerated kernels on GPU.
Reuse the RNNLMScratch head — it doesn’t care whether the cell is hand-rolled.
Same Trainer, same gradient clipping, same data.

End result: faster training, ~5× fewer lines of code, identical mathematics.
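
To make concrete what the built-in layer replaces, here is a minimal pure-Python sketch of the recurrence H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h), unrolled over one sequence. The function name rnn_unroll, the toy weights, and the tanh activation (as in the from-scratch version) are assumptions for illustration; the framework layer's default activation and internal layout may differ.

```python
import math

def rnn_unroll(inputs, W_xh, W_hh, b_h, h0=None):
    """Unroll a vanilla RNN over one sequence (illustrative sketch).

    inputs: list of input vectors, one per time step
    W_xh:   num_inputs x num_hiddens matrix (list of rows)
    W_hh:   num_hiddens x num_hiddens matrix (list of rows)
    b_h:    bias vector of length num_hiddens
    """
    num_hiddens = len(b_h)
    h = [0.0] * num_hiddens if h0 is None else list(h0)
    outputs = []
    for x in inputs:
        new_h = []
        for j in range(num_hiddens):
            pre = b_h[j]
            pre += sum(x[i] * W_xh[i][j] for i in range(len(x)))
            pre += sum(h[k] * W_hh[k][j] for k in range(num_hiddens))
            new_h.append(math.tanh(pre))  # tanh is an assumption here
        h = new_h
        outputs.append(h)
    return outputs, h

# Hypothetical toy sizes: 2 inputs, 3 hidden units, 4 time steps
W_xh = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
W_hh = [[0.2, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.2]]
b_h = [0.0, 0.0, 0.0]
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
outputs, h_last = rnn_unroll(seq, W_xh, W_hh, b_h)
```

The function returns one hidden state per time step plus the final state, which mirrors the (outputs, H) pair that a recurrent layer hands back.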

The model

The built-in RNN layer supplies the recurrence; the rest of the LM scaffold is inherited from the from-scratch base class:

from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn, rnn
npx.set_np()  # Enable NumPy-compatible array semantics in MXNet

class RNN(d2l.Module):
    """The RNN model implemented with high-level APIs."""
    def __init__(self, num_hiddens):
        super().__init__()
        self.save_hyperparameters()
        # Single-layer RNN; the input size is inferred on the first call
        self.rnn = rnn.RNN(num_hiddens)

    def forward(self, inputs, H=None):
        # inputs: (num_steps, batch_size, num_inputs)
        if H is None:
            # Zero initial state, sized by the batch dimension
            H, = self.rnn.begin_state(inputs.shape[1], ctx=inputs.ctx)
        outputs, (H, ) = self.rnn(inputs, (H, ))
        return outputs, H

class RNNLM(d2l.RNNLMScratch):
    """The RNN-based language model implemented with high-level APIs."""
    def init_params(self):
        # flatten=False applies the dense layer to the last axis only
        self.linear = nn.Dense(self.vocab_size, flatten=False)
        self.initialize()
    def output_layer(self, hiddens):
        # (num_steps, batch_size, vocab_size) -> (batch_size, num_steps, vocab_size)
        return d2l.swapaxes(self.linear(hiddens), 0, 1)
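
The swapaxes call in output_layer converts the layer's time-major output into batch-major form. A small pure-Python sketch of that axis swap on nested lists follows; the helper name swap_time_batch and the toy values are hypothetical, and real code would use the framework's array operation instead.

```python
def swap_time_batch(x):
    """Swap the first two axes of a nested list (time-major <-> batch-major)."""
    return [list(per_batch) for per_batch in zip(*x)]

# Hypothetical toy output: 3 time steps, batch of 2, 1 score per position
time_major = [
    [[0.1], [0.2]],   # step 0: sequence 0, sequence 1
    [[0.3], [0.4]],   # step 1
    [[0.5], [0.6]],   # step 2
]
batch_major = swap_time_batch(time_major)
```

After the swap, each top-level entry holds one sequence's scores across all time steps, which is the arrangement a per-sequence loss expects.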

Sanity check

An untrained model still runs: its predictions are essentially random characters, but the shapes line up. This check separates API wiring from learning quality:

data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn = RNN(num_hiddens=32)
model = RNNLM(rnn, vocab_size=len(data.vocab), lr=1)
model.predict('it has', 20, data.vocab)

Training and decoding

Same Trainer, with gradient clipping set through its gradient_clip_val=1 argument:

trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
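
As a rough illustration of what gradient_clip_val=1 asks for, the sketch below rescales gradients so that their global L2 norm does not exceed the threshold. It assumes clipping by global norm; the function name clip_gradients and the toy gradient values are hypothetical, and the Trainer's own implementation operates on framework arrays.

```python
import math

def clip_gradients(grads, clip_val):
    """Rescale gradient vectors so their global L2 norm is at most clip_val."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm > clip_val:
        scale = clip_val / norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, norm

# Hypothetical gradients for two parameters; the global norm is 5
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_gradients(grads, clip_val=1.0)

# Gradients already inside the threshold are left unchanged
small, small_norm = clip_gradients([[0.3, 0.4]], clip_val=1.0)
```

Clipping changes only the length of the gradient, not its direction, which is why it tames exploding gradients in long unrolls without altering which way the update points.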

ppl = float(model.board.data['val_ppl'][-1].y)
pred = model.predict('time traveller', 20, data.vocab, d2l.try_gpu())
print(f'perplexity {ppl:.1f}, {pred!r}')
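
For reading the reported number: perplexity is the exponential of the average negative log-probability that the model assigns to the true next token. The sketch below computes it from a list of such probabilities; the function name and the example probabilities are hypothetical, and the vocabulary size of 28 is only an illustrative assumption.

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-probability of the true next tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# A model guessing uniformly over a hypothetical 28-token vocabulary
uniform = perplexity([1 / 28] * 10)

# A model that assigns high probability to each true token
confident = perplexity([0.9, 0.8, 0.95])
```

A uniform guesser scores a perplexity equal to the vocabulary size, and a perfect predictor scores 1, so lower values indicate a better model.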

The output looks like simple English-shaped text, reflecting the same character-level statistics the from-scratch version learned, in much less training time.

Recap

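rnn.RNN bundles the cell, the unroll loop, and (with cuDNN) GPU kernels in one stock layer.
Reuse the from-scratch LM wrapper; only the cell changes.
The same scaffold accepts rnn.LSTM, rnn.GRU, etc. as near drop-in replacements with better long-range gradient behavior (rnn.LSTM carries two state arrays, so the state handling in forward must account for the extra one).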
The framework version trains noticeably faster than the from-scratch version on the same hardware.