Recurrent Neural Networks

A recurrent neural network carries a hidden state \mathbf{h}_t across time steps, a learned summary of all input seen so far:

\mathbf{h}_t = \phi(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b}).

The same weights are applied at every step, so the parameter count is constant regardless of sequence length. In principle the effective context is unbounded: there is no fixed-size window as in an n-gram model.

Unrolled in time

An RNN unrolled across three time steps; the same weights are reused at every step.

Setup

from d2l import jax as d2l
import jax
from jax import numpy as jnp

The recurrence in code

The naive form uses two matrix multiplies and sums the results:

# Batch of 3 inputs with 1 feature each; 4 hidden units
X = jax.random.normal(d2l.get_key(), (3, 1))
W_xh = jax.random.normal(d2l.get_key(), (1, 4))
H = jax.random.normal(d2l.get_key(), (3, 4))
W_hh = jax.random.normal(d2l.get_key(), (4, 4))
d2l.matmul(X, W_xh) + d2l.matmul(H, W_hh)

Array([[ 7.8358064 , -1.1775837 , -1.4894798 , -3.3220472 ],
       [ 6.549281  ,  1.0249726 ,  0.38784432, -2.247775  ],
       [ 3.1509278 ,  0.7940084 , -1.0252644 ,  0.93234503]],      dtype=float32)

Equivalently, concatenate the input and hidden state along the feature axis and multiply by the two weight matrices stacked along rows. The result is the same up to floating-point rounding, with a single matmul:

d2l.matmul(d2l.concat((X, H), 1), d2l.concat((W_xh, W_hh), 0))

Array([[ 7.8358064 , -1.1775837 , -1.4894797 , -3.322047  ],
       [ 6.549281  ,  1.0249726 ,  0.38784423, -2.247775  ],
       [ 3.1509278 ,  0.7940084 , -1.0252644 ,  0.9323451 ]],      dtype=float32)

The concatenate-then-multiply form is a common implementation trick, although some frameworks keep the two weight matrices and their products separate.
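The snippets above leave out the bias and the activation. A minimal sketch of the full step follows, assuming \phi = tanh and the batch-first (row-vector) layout used above; the function name rnn_step, the 0.1 weight scale, and the PRNG seed are illustrative choices, not part of the original.

```python
import jax
from jax import numpy as jnp


def rnn_step(params, h_prev, x_t):
    """One RNN step on a batch: x_t is (batch, d_in), h_prev is (batch, d_h)."""
    W_xh, W_hh, b = params
    # Batch-first convention: inputs multiply the weights from the left.
    return jnp.tanh(x_t @ W_xh + h_prev @ W_hh + b)


# Hypothetical sizes matching the example above: batch=3, d_in=1, d_h=4.
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = (
    jax.random.normal(k1, (1, 4)) * 0.1,  # W_xh
    jax.random.normal(k2, (4, 4)) * 0.1,  # W_hh
    jnp.zeros(4),                         # b
)
x_t = jax.random.normal(k3, (3, 1))
h_prev = jax.random.normal(k4, (3, 4))

h_t = rnn_step(params, h_prev, x_t)

# 1*4 + 4*4 + 4 = 24 parameters, whatever the sequence length.
n_params = sum(p.size for p in params)
```

Calling rnn_step repeatedly with the same params is the whole recurrence; only h_prev and x_t change from step to step.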

Constant memory per step

\mathbf{h}_t has a fixed size, independent of t.
To form \mathbf{h}_t we need only \mathbf{h}_{t-1} and \mathbf{x}_t; everything older can be dropped.
Inputs of any length are processed with constant per-step memory and compute.
The trade-off: the state is a lossy summary, so the network must learn what to keep. Attention (later) keeps everything instead, at a cost that grows with length.
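The points above can be sketched with jax.lax.scan, which folds a step function over the time axis while carrying only the current state. This is an illustrative sketch for the forward pass: the helper name run_rnn, the sizes, and the tanh activation are assumptions, not taken from the original.

```python
import jax
from jax import numpy as jnp


def run_rnn(params, xs, h0):
    """Fold an RNN over xs of shape (T, d_in), carrying only the current state."""
    W_xh, W_hh, b = params

    def step(h, x_t):
        h_new = jnp.tanh(x_t @ W_xh + h @ W_hh + b)
        return h_new, None  # emit nothing per step: older states are dropped

    h_final, _ = jax.lax.scan(step, h0, xs)
    return h_final


d_in, d_h = 1, 4  # hypothetical sizes
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = (
    jax.random.normal(k1, (d_in, d_h)) * 0.1,  # W_xh
    jax.random.normal(k2, (d_h, d_h)) * 0.1,   # W_hh
    jnp.zeros(d_h),                            # b
)
h0 = jnp.zeros(d_h)

h_short = run_rnn(params, jax.random.normal(k3, (10, d_in)), h0)
h_long = run_rnn(params, jax.random.normal(k4, (1000, d_in)), h0)
print(h_short.shape, h_long.shape)  # both (4,)
```

The state has the same shape after 10 steps and after 1000. Note that this holds for the forward pass; training with backpropagation through time still has to keep intermediate states for the backward pass.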

As a language model

Embedding maps a BPE token id to a vector \mathbf{x}_t.
RNN updates the hidden state \mathbf{h}_t.
Linear head projects \mathbf{h}_t to vocab logits; softmax gives P(x_{t+1} \mid x_{\le t}).
Loss = cross-entropy with the next-token target.
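The four pieces above can be wired together in a few lines. The sketch below is a minimal, untrained version under stated assumptions: the sizes, the parameter names, and the tanh activation are hypothetical, the token ids are random integers standing in for BPE output, and a single unbatched sequence is used.

```python
import jax
from jax import numpy as jnp

vocab, d_emb, d_h = 50, 8, 16  # hypothetical sizes


def init_params(key):
    k = jax.random.split(key, 4)
    return {
        "emb": jax.random.normal(k[0], (vocab, d_emb)) * 0.1,
        "W_xh": jax.random.normal(k[1], (d_emb, d_h)) * 0.1,
        "W_hh": jax.random.normal(k[2], (d_h, d_h)) * 0.1,
        "b": jnp.zeros(d_h),
        "W_out": jax.random.normal(k[3], (d_h, vocab)) * 0.1,
        "b_out": jnp.zeros(vocab),
    }


def logits_fn(params, tokens):
    """tokens: (T,) int ids -> logits: (T, vocab) for the next token at each step."""
    xs = params["emb"][tokens]  # embedding lookup, (T, d_emb)

    def step(h, x_t):
        h = jnp.tanh(x_t @ params["W_xh"] + h @ params["W_hh"] + params["b"])
        return h, h  # keep every h_t: each one feeds the linear head

    _, hs = jax.lax.scan(step, jnp.zeros(d_h), xs)
    return hs @ params["W_out"] + params["b_out"]


def loss_fn(params, inputs, targets):
    """Mean cross-entropy between the predicted distribution and the next token."""
    log_probs = jax.nn.log_softmax(logits_fn(params, inputs), axis=-1)
    return -jnp.mean(log_probs[jnp.arange(targets.shape[0]), targets])


params = init_params(jax.random.PRNGKey(0))
tokens = jax.random.randint(jax.random.PRNGKey(1), (12,), 0, vocab)
loss = loss_fn(params, tokens[:-1], tokens[1:])
grads = jax.grad(loss_fn)(params, tokens[:-1], tokens[1:])
```

With small random weights the logits are close to zero, so the initial loss should sit near log(vocab), the loss of a uniform guess.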

Teacher forcing

Targets are the inputs shifted forward by one token; each step predicts the next token.

Training conditions on gold prefixes; generation conditions on the model’s own outputs. That mismatch (often called exposure bias) is the rollout-error problem again.
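The difference between the two regimes can be made concrete with a toy example. In the sketch below, the token ids are made up and toy_next_token is a stand-in for a trained model (an assumption for illustration only); the point is what each regime feeds back in as input.

```python
from jax import numpy as jnp

tokens = jnp.array([5, 2, 9, 7, 3])  # hypothetical token ids for one sequence

# Teacher forcing: the model always reads the gold prefix.
inputs, targets = tokens[:-1], tokens[1:]  # [5, 2, 9, 7] -> [2, 9, 7, 3]


def generate(next_token, prompt, n_new):
    """Free-running decoding: each prediction is appended and read back as input."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


# Stand-in for a trained model (illustration only):
# it predicts (last token + 1) mod 10.
toy_next_token = lambda prefix: (prefix[-1] + 1) % 10

sample = generate(toy_next_token, [5, 2], 3)  # [5, 2, 3, 4, 5]
```

During training, a wrong prediction at one step does not change the next input. During generation it does, so an early mistake shifts every later prefix away from anything seen in training.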

Recap

RNN: \mathbf{h}_t = \phi(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b}).
Same parameters at every step; the hidden state is a fixed-size summary of the whole past.
Trains by backprop through time: gradients flow from \mathbf{h}_T back to every earlier hidden state.
Each such gradient is a long product of per-step Jacobians, which can vanish or explode on long sequences; the gated LSTM and GRU in the next chapter mitigate this.
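The vanishing case is easy to observe numerically. The sketch below measures the norm of the Jacobian of \mathbf{h}_T with respect to \mathbf{h}_0 for a toy recurrence; it assumes tanh, drops the inputs and bias for brevity, and uses a hypothetical recurrent matrix of 0.5 times the identity so that every per-step Jacobian has norm at most 0.5.

```python
import jax
from jax import numpy as jnp


def final_state(h0, W_hh, T):
    """Run T steps of h_t = tanh(h_{t-1} @ W_hh); inputs and bias are omitted."""
    h = h0
    for _ in range(T):
        h = jnp.tanh(h @ W_hh)
    return h


def jacobian_norm(W_hh, T, h0):
    # d h_T / d h_0 is a product of T per-step Jacobians, here of shape (4, 4).
    J = jax.jacobian(lambda h: final_state(h, W_hh, T))(h0)
    return float(jnp.linalg.norm(J))


h0 = jnp.full((4,), 0.1)
W_small = 0.5 * jnp.eye(4)  # hypothetical weights with spectral norm 0.5

for T in (1, 5, 20, 40):
    print(T, jacobian_norm(W_small, T, h0))
```

The printed norm shrinks roughly geometrically with T, so an error signal at step T barely reaches the early states. With a recurrent matrix whose spectral norm exceeds 1 and an activation that does not saturate, the same product can grow instead.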