Vanilla RNNs hit a ceiling: gradients vanish across long sequences. LSTMs (Hochreiter & Schmidhuber, 1997) fix this by giving each unit a memory cell with a self-loop of weight 1 and three learned gates.
Forget gate \mathbf{F}_t: keep or wipe memory.
Input gate \mathbf{I}_t: let new content in.
Output gate \mathbf{O}_t: expose or hide memory.
For two decades, LSTMs were the go-to sequence model for speech, translation, and language modeling, until Transformers took over (2017).
Gates at a glance
The input \mathbf{X}_t and previous hidden state \mathbf{H}_{t-1} feed three sigmoid gates.
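Written out, the three gates might look as follows. This is a sketch that borrows the parameter names from the forward code (W_xi, W_hi, b_i and so on) and assumes \sigma is the elementwise sigmoid, with \mathbf{X}_t of shape (batch_size, num_inputs) and \mathbf{H}_{t-1} of shape (batch_size, num_hiddens):

```latex
\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i), \\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f), \\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o).
\end{aligned}
```

Because of the sigmoid, every gate entry lies in (0, 1), so each gate acts as a soft switch rather than a hard on/off.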
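If \mathbf{F}_t \approx 1 and \mathbf{I}_t \approx 0, the cell carries its value unchanged across arbitrarily many steps, and the gradient along the cell path is scaled by \mathbf{F}_t \approx 1 at each step instead of shrinking. That is the constant error carousel that counters vanishing gradients.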
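To make the carousel concrete, here is a small scalar sketch of the cell update C_t = F_t * C_{t-1} + I_t * C_tilde_t with the gates frozen to constants. The helper names (`roll_cell`, `grad_through_cell`) and all numeric values are illustrative assumptions, and the gradient shown is only the direct cell-to-cell path, ignoring the indirect route through the hidden state.

```python
def roll_cell(c0, forget, write, candidate, steps):
    """Apply C_t = forget * C_{t-1} + write * candidate for `steps` steps.

    Hypothetical helper: the gates are fixed scalars here, whereas a real
    LSTM recomputes them from the input and hidden state at every step.
    """
    c = c0
    for _ in range(steps):
        c = forget * c + write * candidate
    return c


def grad_through_cell(forget, steps):
    """dC_T/dC_0 along the direct cell path: the product of the forget gates."""
    return forget ** steps


# Forget gate fully open, input gate shut: the memory is carried unchanged.
held = roll_cell(c0=0.7, forget=1.0, write=0.0, candidate=-1.0, steps=1000)

# A slightly leaky forget gate erases the memory over the same horizon.
leaky = roll_cell(c0=0.7, forget=0.9, write=0.0, candidate=-1.0, steps=1000)

print("held:", held)
print("leaky:", leaky)
print("gradient, F=1.0:", grad_through_cell(1.0, 1000))
print("gradient, F=0.9:", grad_through_cell(0.9, 1000))
```

With the forget gate at exactly 1 both the stored value and the gradient survive 1000 steps, while at 0.9 both collapse toward zero, which is the contrast the carousel argument relies on.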
Walk the sequence; at each step compute the three gates and the candidate (input node) \tilde{\mathbf{C}}_t, update \mathbf{C}, then \mathbf{H}. Carry both states forward.
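The per-step state update can be summarized as below. This is a sketch read off the forward code, assuming \odot denotes the elementwise product and \tilde{\mathbf{C}}_t is the candidate cell content:

```latex
\begin{aligned}
\tilde{\mathbf{C}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c), \\
\mathbf{C}_t &= \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t, \\
\mathbf{H}_t &= \mathbf{O}_t \odot \tanh(\mathbf{C}_t).
\end{aligned}
```

Only \mathbf{H}_t is emitted as the step's output; \mathbf{C}_t stays internal and is passed along with \mathbf{H}_t to the next step.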
def forward(self, inputs, H_C=None):
    if H_C is None:
        # Initial state with shape: (batch_size, num_hiddens)
        H = d2l.zeros((inputs.shape[1], self.num_hiddens),
                      device=inputs.device)
        C = d2l.zeros((inputs.shape[1], self.num_hiddens),
                      device=inputs.device)
    else:
        H, C = H_C
    outputs = []
    for X in inputs:
        I = d2l.sigmoid(d2l.matmul(X, self.W_xi) +
                        d2l.matmul(H, self.W_hi) + self.b_i)
        F = d2l.sigmoid(d2l.matmul(X, self.W_xf) +
                        d2l.matmul(H, self.W_hf) + self.b_f)
        O = d2l.sigmoid(d2l.matmul(X, self.W_xo) +
                        d2l.matmul(H, self.W_ho) + self.b_o)
        C_tilde = d2l.tanh(d2l.matmul(X, self.W_xc) +
                           d2l.matmul(H, self.W_hc) + self.b_c)
        C = F * C + I * C_tilde
        H = O * d2l.tanh(C)
        outputs.append(H)
    return outputs, (H, C)
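To try the same loop without d2l or a tensor library, the following is a scalar stand-in: one input feature, one hidden unit, batch size 1. The function name `lstm_forward_scalar`, the parameter dictionary, and its values are hypothetical and chosen only for illustration. The example also checks that carrying (H, C) across two chunks gives the same outputs as one pass over the whole sequence.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_forward_scalar(inputs, p, state=None):
    """Scalar analogue of the forward pass: returns outputs and final (H, C)."""
    H, C = (0.0, 0.0) if state is None else state
    outputs = []
    for x in inputs:
        I = sigmoid(x * p["w_xi"] + H * p["w_hi"] + p["b_i"])  # input gate
        F = sigmoid(x * p["w_xf"] + H * p["w_hf"] + p["b_f"])  # forget gate
        O = sigmoid(x * p["w_xo"] + H * p["w_ho"] + p["b_o"])  # output gate
        C_tilde = math.tanh(x * p["w_xc"] + H * p["w_hc"] + p["b_c"])
        C = F * C + I * C_tilde   # update the memory cell
        H = O * math.tanh(C)      # expose a gated view of the cell
        outputs.append(H)
    return outputs, (H, C)


# Hypothetical parameter values, not trained weights.
params = {
    "w_xi": 0.5, "w_hi": -0.3, "b_i": 0.0,
    "w_xf": 0.2, "w_hf": 0.4, "b_f": 1.0,
    "w_xo": -0.1, "w_ho": 0.6, "b_o": 0.0,
    "w_xc": 0.8, "w_hc": -0.5, "b_c": 0.0,
}
seq = [0.1, -0.4, 0.9, 0.3, -0.7, 0.2]

# One pass over the whole sequence.
full, final_state = lstm_forward_scalar(seq, params)

# Two passes, handing the state from the first chunk to the second.
first, mid_state = lstm_forward_scalar(seq[:3], params)
second, chunked_state = lstm_forward_scalar(seq[3:], params, state=mid_state)

print(full)
print(first + second)
```

Passing the returned (H, C) back in as `state` is what lets a long sequence be processed in pieces without losing memory between them.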
Training the from-scratch LSTM
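Same RNNLMScratch head, same Trainer, same gradient clipping; only the cell changed. A high learning rate (lr=4) is workable because the gates keep activations bounded: the hidden state \mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t) always stays within (-1, 1).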
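For readers who have not seen gradient clipping spelled out, here is a minimal sketch of the common global-norm variant on a flat list of gradient values. The function name `clip_by_global_norm` and the threshold of 1.0 are assumptions for illustration; the notes do not state which threshold the Trainer uses.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their overall L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads], norm
    return list(grads), norm


# A gradient with norm 5 is shrunk to norm 1; its direction is preserved.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)

# A gradient already under the threshold passes through untouched.
untouched, _ = clip_by_global_norm([0.1, 0.2], max_norm=1.0)

print(norm_before, clipped)
print(untouched)
```

Clipping caps the step size when a gradient spikes, which is one reason a large learning rate can remain usable in recurrent training.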