BiRNN classifier

Sentiment Analysis: Using Recurrent Neural Networks

Sentiment RNN

Sentiment classification on IMDb: pretrained word vectors → bidirectional LSTM → linear head. Standard pre-Transformer text-classification recipe.

The encoder reads the review left-to-right and right-to-left; the last-layer hidden states at the first and last time steps are concatenated and fed to a binary classifier. GloVe supplies pretrained word vectors, and the LSTM on top specializes them for sentiment.

Pipeline

GloVe embeddings → BiLSTM → output classifier.

Setup

from d2l import torch as d2l
import torch
from torch import nn

batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)

Class definition: embedding → bidirectional LSTM → concatenate the last-layer outputs at the first and last time steps → 2-way decoder. The decoder input has width 4h: each of the two time steps contributes a forward and a backward state of width h.

class BiRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens,
                 num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Set `bidirectional` to True to get a bidirectional RNN
        self.encoder = nn.LSTM(embed_size, num_hiddens, num_layers=num_layers,
                                bidirectional=True)
        self.decoder = nn.Linear(4 * num_hiddens, 2)

    def forward(self, inputs):
        # The shape of `inputs` is (batch size, no. of time steps). Because
        # LSTM requires its input's first dimension to be the temporal
        # dimension, the input is transposed before obtaining token
        # representations. The output shape is (no. of time steps, batch size,
        # word vector dimension)
        embeddings = self.embedding(inputs.T)
        self.encoder.flatten_parameters()
        # Returns hidden states of the last hidden layer at different time
        # steps. The shape of `outputs` is (no. of time steps, batch size,
        # 2 * no. of hidden units)
        outputs, _ = self.encoder(embeddings)
        # Concatenate the hidden states at the initial and final time steps as
        # the input of the fully connected layer. Its shape is (batch size,
        # 4 * no. of hidden units)
        encoding = torch.cat((outputs[0], outputs[-1]), dim=1)
        outs = self.decoder(encoding)
        return outs
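
To see where the 4h width comes from, the following minimal sketch runs a standalone bidirectional nn.LSTM on random input and checks the shapes. The sizes T, B, E, H are illustrative assumptions, not the tutorial's settings. It also checks, relying on PyTorch's documented output layout, that the forward half of the last step and the backward half of the first step coincide with the final states returned in h_n.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Illustrative sizes: time steps, batch size, embedding size, hidden units
T, B, E, H = 7, 3, 5, 4

lstm = nn.LSTM(E, H, num_layers=2, bidirectional=True)
x = torch.randn(T, B, E)  # time-major input, as in BiRNN.forward
outputs, (h_n, c_n) = lstm(x)

# Each time step carries the top layer's forward and backward states
outputs_shape = tuple(outputs.shape)  # (T, B, 2 * H)

# First and last time steps side by side: 4 * H features per example
encoding = torch.cat((outputs[0], outputs[-1]), dim=1)
encoding_shape = tuple(encoding.shape)  # (B, 4 * H)

# h_n is ordered by (layer, direction); its last two entries are the
# top layer's forward and backward final states
fwd_final_matches = torch.allclose(outputs[-1][:, :H], h_n[-2])
bwd_final_matches = torch.allclose(outputs[0][:, H:], h_n[-1])
print(outputs_shape, encoding_shape, fwd_final_matches, bwd_final_matches)
```

The remaining two halves, outputs[0][:, :H] and outputs[-1][:, H:], are states after each direction has read only a single token, so most of the sequence-level signal is likely carried by the two halves checked above.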

BiRNN instance

Instantiate a 2-layer BiLSTM with 100-dimensional embeddings and 100 hidden units. Default recurrent-weight initialization differs across frameworks, so the weight matrices of the linear and LSTM layers are explicitly re-initialized with Xavier uniform:

embed_size, num_hiddens, num_layers, devices = 100, 100, 2, d2l.try_all_gpus()
net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)

def init_weights(module):
    # Xavier-initialize weight matrices; biases keep their default values
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
    if isinstance(module, nn.LSTM):
        # Use the public parameter iterator rather than private attributes
        for name, param in module.named_parameters():
            if "weight" in name:
                nn.init.xavier_uniform_(param)

net.apply(init_weights);
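
init_weights touches every LSTM parameter whose name contains "weight". As a rough illustration of what that covers, this sketch lists those parameters for a standalone LSTM with the same sizes as above (embedding 100, hidden 100, 2 layers, bidirectional). It is a separate toy module, not the tutorial's net.

```python
from torch import nn

# Same sizes as the tutorial's encoder, built standalone for inspection
lstm = nn.LSTM(100, 100, num_layers=2, bidirectional=True)

# Weight matrices are the ones re-initialized; biases are left alone
weight_shapes = {name: tuple(p.shape)
                 for name, p in lstm.named_parameters() if "weight" in name}
bias_names = [name for name, _ in lstm.named_parameters() if "bias" in name]

for name, shape in weight_shapes.items():
    print(f"{name:24s} {shape}")
```

Each matrix stacks the four LSTM gates along its first dimension, which is why the leading size is 4 * 100. The second-layer input weights have 200 columns because that layer consumes the concatenated forward and backward states of the first layer. Note that Xavier initialization is applied to each stacked matrix as a whole, not gate by gate.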

Loading pretrained GloVe

Use 100-dim GloVe vectors trained on Wikipedia + Gigaword. Initialize the embedding layer from them; the vectors can then be frozen or fine-tuned (here they are frozen by setting requires_grad to False):

glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
embeds.shape
torch.Size([49346, 100])
net.embedding.weight.data.copy_(embeds)
net.embedding.weight.requires_grad = False
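
The copy-then-freeze pattern can be sketched on a toy scale as follows. The random 10 x 4 matrix is a stand-in for the GloVe matrix, and the small nn.Sequential model is hypothetical. The sketch also shows nn.Embedding.from_pretrained, which should be an equivalent one-step alternative, and confirms that a frozen embedding adds no trainable parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Stand-in for the GloVe matrix: 10 tokens x 4 dimensions (random values)
pretrained = torch.randn(10, 4)

# Option 1: copy into an existing layer, then freeze it
emb_copy = nn.Embedding(10, 4)
emb_copy.weight.data.copy_(pretrained)
emb_copy.weight.requires_grad = False

# Option 2: build a frozen layer directly from the matrix
emb_direct = nn.Embedding.from_pretrained(pretrained, freeze=True)

same_weights = torch.equal(emb_copy.weight, emb_direct.weight)

# Hypothetical tiny model: frozen embedding followed by a linear head
model = nn.Sequential(emb_direct, nn.Linear(4, 2))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(same_weights, trainable, total)
```

Only the linear head (4 * 2 weights plus 2 biases) remains trainable here. To fine-tune the embeddings instead, one would leave requires_grad at True, or pass freeze=False in the second form.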

Training

Standard cross-entropy + Adam. Watch validation accuracy, not just training loss; sentiment models overfit quickly on IMDb if the embedding and classifier are too large:

lr, num_epochs = 0.01, 5
trainer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss(reduction="none")
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

loss 0.259, train acc 0.895, test acc 0.857
4782.3 examples/sec on [device(type='cuda', index=0)]
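
With reduction="none", the loss returns one value per example instead of a scalar, so the training loop has to reduce it (typically by summing or averaging) before calling backward. How d2l.train_ch13 does this is not shown here. The sketch below uses random logits and hypothetical labels as stand-ins for a real batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 2)           # stand-in for net(X): 4 reviews, 2 classes
labels = torch.tensor([1, 0, 1, 1])  # hypothetical labels

# One loss value per example
per_example = nn.CrossEntropyLoss(reduction="none")(logits, labels)
# Default reduction is "mean": a single scalar
mean_loss = nn.CrossEntropyLoss()(logits, labels)

per_example_shape = tuple(per_example.shape)
matches_mean = torch.allclose(per_example.mean(), mean_loss)

# Accuracy comes from the argmax over the class dimension
accuracy = (logits.argmax(dim=1) == labels).float().mean().item()
print(per_example_shape, matches_mean, accuracy)
```

Keeping per-example losses is convenient when the loop wants to accumulate the summed loss and the example count separately, for instance to report an average over an epoch.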

def predict_sentiment(net, vocab, sequence):
    """Predict the sentiment of a text sequence."""
    sequence = torch.tensor(vocab[sequence.split()], device=d2l.try_gpu())
    label = torch.argmax(net(sequence.reshape(1, -1)), dim=1)
    return 'positive' if label == 1 else 'negative'

Predict on new reviews

The final check should classify clearly positive and clearly negative synthetic reviews differently. This is not a full evaluation, but it catches label/order mistakes in the pipeline.

predict_sentiment(net, vocab, 'this movie is so great')
'positive'
predict_sentiment(net, vocab, 'this movie is so bad')
'negative'
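
For a spot check like this, it can help to look at class probabilities and not only the argmax. The following is a sketch of one possible variant, not part of the original pipeline: predict_proba is a hypothetical helper, and TinyNet is a stand-in for the trained BiRNN with the same input contract (a batch of token indices in, two logits out). It assumes index 1 means positive, as in predict_sentiment.

```python
import torch
from torch import nn

def predict_proba(net, token_ids):
    """Return (P(negative), P(positive)) for one sequence of token indices."""
    net.eval()                 # switch off dropout etc., if the model has any
    with torch.no_grad():      # no gradients needed at inference time
        logits = net(token_ids.reshape(1, -1))
    return torch.softmax(logits, dim=1).squeeze(0)

class TinyNet(nn.Module):
    """Hypothetical stand-in model: mean-pooled embeddings + linear head."""
    def __init__(self, vocab_size=20, embed_size=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.decoder = nn.Linear(embed_size, 2)

    def forward(self, inputs):
        # inputs: (batch size, no. of time steps)
        return self.decoder(self.embedding(inputs).mean(dim=1))

torch.manual_seed(0)
probs = predict_proba(TinyNet(), torch.tensor([3, 7, 1, 12]))
print(probs)
```

With the trained network, a probability close to 0.5 on a clearly polar review would be a hint that something in the pipeline deserves a closer look.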

Recap

  • BiLSTM-on-GloVe: a strong pre-Transformer baseline for text classification.
  • Pretrained embeddings carry general-purpose word semantics; LSTM specializes for sentiment.
  • Easily beaten today by fine-tuned BERT, but a clean template for sequence-to-label tasks more broadly.