BiRNN classifier

Sentiment Analysis: Using Recurrent Neural Networks

Sentiment classification on IMDb: pretrained word vectors → bidirectional LSTM → linear head. Standard pre-Transformer text-classification recipe.

The encoder reads the review left-to-right and right-to-left; its outputs at the first and last time steps are concatenated and fed to a binary classifier. GloVe supplies general-purpose word representations, and the LSTM on top learns the sentiment-specific part.

Pipeline

GloVe embeddings → BiLSTM → output classifier.

Setup

from d2l import tensorflow as d2l
import tensorflow as tf
import keras
import numpy as np

batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)
# d2l.load_array uses shuffle(buffer_size=1000), which is too small for
# the IMDb training set (25000 examples ordered as 12500 positives then
# 12500 negatives). Reshuffle the full dataset so each epoch sees a
# properly mixed class distribution, matching the PyTorch/JAX behavior.
train_iter = (train_iter.unbatch()
              .shuffle(25000, reshuffle_each_iteration=True)
              .batch(batch_size))
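
To see why a 1000-element buffer is not enough for class-sorted data, the sketch below simulates a bounded shuffle buffer in plain Python. It is a simplified model of the fill-then-sample behavior of a streaming shuffle, not TensorFlow's implementation; `buffered_shuffle` and the synthetic label list are illustrative names.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified streaming shuffle: fill a buffer, then emit random slots."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered element
        buffer[idx] = item             # and replace it with the new one
    rng.shuffle(buffer)                # drain what is left in random order
    yield from buffer

# 12500 positives (label 1) followed by 12500 negatives (label 0).
labels = [1] * 12500 + [0] * 12500

small = list(buffered_shuffle(labels, buffer_size=1000))
full = list(buffered_shuffle(labels, buffer_size=25000))

print(sum(small[:64]))   # 64: the first batch contains only positives
print(sum(full[:64]))    # a mix of both classes
```

With the small buffer, no negative example can be emitted until the reader has passed index 12500, so roughly the first 11500 outputs are all positive. A buffer as large as the dataset removes this ordering effect.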

Class definition: embedding → bidirectional LSTM → concatenate the encoder outputs at the first and last time steps → 2-way decoder. The decoder input has width 4h: two directions times two endpoint time steps, each of width h.

class BiRNN(d2l.Classifier):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 **kwargs):
        super().__init__(**kwargs)
        self.embedding = keras.layers.Embedding(vocab_size, embed_size)
        # Stack num_layers bidirectional LSTM layers. Every layer returns
        # the full sequence so that the outputs at the first and last time
        # steps can be concatenated downstream.
        self.encoder = keras.Sequential([
            keras.layers.Bidirectional(
                keras.layers.LSTM(num_hiddens, return_sequences=True))
            for _ in range(num_layers)
        ])
        self.decoder = keras.layers.Dense(2)

    def call(self, inputs, training=False):
        # inputs shape: (batch_size, num_steps)
        embeddings = self.embedding(inputs)
        # outputs shape: (batch_size, num_steps, 2 * num_hiddens)
        outputs = self.encoder(embeddings, training=training)
        # Concatenate hidden states at initial and final time steps
        # Shape: (batch_size, 4 * num_hiddens)
        encoding = tf.concat([outputs[:, 0, :], outputs[:, -1, :]], axis=1)
        outs = self.decoder(encoding)
        return outs
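
The endpoint concatenation can be checked with a toy NumPy sketch. A running sum stands in for the recurrence here, so this is not an LSTM; it only illustrates the shapes and which endpoint has seen the whole sequence in each direction. All names and sizes are illustrative.

```python
import numpy as np

batch_size, num_steps, num_hiddens = 2, 5, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(batch_size, num_steps, num_hiddens))

# Toy "recurrence": the state at step t is the sum of all inputs seen so far.
forward = np.cumsum(x, axis=1)                       # reads left to right
backward = np.cumsum(x[:, ::-1, :], axis=1)[:, ::-1, :]  # reads right to left

# Per-step outputs of a bidirectional layer: (batch, steps, 2 * num_hiddens)
outputs = np.concatenate([forward, backward], axis=-1)

# Concatenate the first and last time steps: (batch, 4 * num_hiddens)
encoding = np.concatenate([outputs[:, 0, :], outputs[:, -1, :]], axis=1)

print(outputs.shape)   # (2, 5, 6)
print(encoding.shape)  # (2, 12)
```

In this toy model, the forward half of the last step and the backward half of the first step both equal the sum over the whole sequence, which is why both endpoints are kept: each one carries a full-sequence summary for one direction.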

BiRNN instance

Instantiate a 2-layer BiLSTM with 100-dimensional embeddings and 100 hidden units. Frameworks initialize recurrent weights differently, but the model contract is the same:

embed_size, num_hiddens, num_layers, devices = 100, 100, 2, d2l.try_all_gpus()
net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)
# Build the model by calling it once on a dummy input
dummy_input = tf.zeros((1, 500), dtype=tf.int32)
net(dummy_input)
<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[ 0.01679529, -0.00268278]], dtype=float32)>
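
A back-of-the-envelope parameter count helps put the model size in context. The sketch assumes the default Keras LSTM parameterization (an input kernel, a recurrent kernel, and a single bias vector for each of the four gate transforms) and the vocabulary size of 49346 that appears in this walkthrough; `lstm_params` is a helper introduced only for this illustration.

```python
def lstm_params(input_size, num_hiddens):
    # Four transforms (input, forget, output, candidate), each with
    # input weights, recurrent weights, and a bias.
    return 4 * (num_hiddens * (input_size + num_hiddens) + num_hiddens)

vocab_size, embed_size, num_hiddens = 49346, 100, 100

embedding = vocab_size * embed_size
layer1 = 2 * lstm_params(embed_size, num_hiddens)       # two directions
layer2 = 2 * lstm_params(2 * num_hiddens, num_hiddens)  # input is 2h wide
decoder = 4 * num_hiddens * 2 + 2                       # Dense(2) on a 4h input

print(embedding)                  # 4934600
print(layer1, layer2, decoder)    # 160800 240800 802
print(layer1 + layer2 + decoder)  # 402402
```

Under these assumptions the embedding table accounts for more than 90% of all parameters, so whether it is trained or frozen largely determines how many weights the optimizer updates.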

Loading pretrained GloVe

Use 100-dimensional GloVe vectors (glove.6b.100d, trained on Wikipedia + Gigaword). Initialize the embedding layer from them, then either freeze or fine-tune it; here we freeze it by setting trainable = False:

glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
embeds.shape
TensorShape([49346, 100])
net.embedding.set_weights([np.array(embeds)])
net.embedding.trainable = False
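
The lookup that builds the embedding matrix can be pictured with a small NumPy sketch. The 4-dimensional table and the toy vocabulary are hypothetical, and the zero-vector fallback for tokens missing from the pretrained table is an assumption of this sketch rather than a statement about d2l.TokenEmbedding internals.

```python
import numpy as np

# Hypothetical 4-dimensional "pretrained" table; real GloVe rows are 100-d.
pretrained = {
    'great': np.array([0.9, 0.1, 0.0, 0.3]),
    'bad': np.array([-0.8, 0.2, 0.1, -0.4]),
    'movie': np.array([0.0, 0.7, 0.5, 0.1]),
}
# Toy vocabulary; position i is the index the model sees for that token.
idx_to_token = ['<unk>', 'movie', 'great', 'bad', 'zzyzx']

dim = 4
# Row i holds the vector for idx_to_token[i]; tokens without a pretrained
# vector fall back to zeros (assumption of this sketch).
embeds = np.stack([pretrained.get(tok, np.zeros(dim)) for tok in idx_to_token])

print(embeds.shape)  # (5, 4)
```

The rows must follow the order of the model's own vocabulary, because the embedding layer is indexed by vocabulary index, not by the row order of the pretrained file.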

Training

Standard cross-entropy + Adam. Watch validation accuracy (computed here on the test split), not just training loss; sentiment models overfit quickly on IMDb if the embedding and classifier are too large:

lr, num_epochs = 0.01, 5
net.compile(optimizer=keras.optimizers.Adam(lr),
            loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['accuracy'])
net.fit(train_iter, validation_data=test_iter, epochs=num_epochs)
Final epoch metrics: accuracy: 0.8697 - loss: 0.3103 - val_accuracy: 0.8482 - val_loss: 0.3664
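
To interpret the loss values, it helps to compute sparse categorical cross-entropy from logits by hand for a single example. The function below is a minimal pure-Python sketch of the formula (negative log-softmax at the true label), not the Keras implementation; `sparse_ce_from_logits` is an illustrative name.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

uninformed = sparse_ce_from_logits([0.0, 0.0], 1)     # equal logits
confident_right = sparse_ce_from_logits([-2.0, 2.0], 1)
confident_wrong = sparse_ce_from_logits([-2.0, 2.0], 0)

print(round(uninformed, 4))       # 0.6931, i.e. ln 2
print(round(confident_right, 4))  # 0.0181
print(round(confident_wrong, 4))  # 4.0181
```

A model that assigns equal logits to both classes scores ln 2 ≈ 0.693, the chance-level baseline for two balanced classes, so the validation loss of 0.3664 reported above is well below chance. Confident mistakes are penalized heavily, which is why validation loss can rise during overfitting even while accuracy stays flat.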

Predict on new reviews

The helper below splits a review on whitespace, maps the tokens to vocabulary indices, and returns the predicted label. As a final check, the model should classify clearly positive and clearly negative synthetic reviews differently. This is not a full evaluation, but it catches label-order mistakes in the pipeline.

def predict_sentiment(net, vocab, sequence):
    """Predict the sentiment of a text sequence."""
    sequence = tf.constant(vocab[sequence.split()], dtype=tf.int32)
    sequence = tf.reshape(sequence, (1, -1))
    label = tf.argmax(net(sequence, training=False), axis=1)
    return 'positive' if int(label[0]) == 1 else 'negative'

predict_sentiment(net, vocab, 'this movie is so great')
'positive'
predict_sentiment(net, vocab, 'this movie is so bad')
'negative'
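
This kind of polarity check can be wrapped in a small harness. The sketch below is framework-free: `failed_cases`, `toy_predict`, and `flipped_predict` are hypothetical names, and the keyword rule merely stands in for the trained model so the example runs on its own.

```python
def failed_cases(predict, cases):
    """Return (text, expected, got) for every case where predict disagrees."""
    failures = []
    for text, expected in cases:
        got = predict(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures

# Stand-in for the trained model: a keyword rule.
def toy_predict(text):
    return 'positive' if 'great' in text.split() else 'negative'

# The same rule with its labels swapped, mimicking a label-order bug.
def flipped_predict(text):
    return 'negative' if toy_predict(text) == 'positive' else 'positive'

cases = [('this movie is so great', 'positive'),
         ('this movie is so bad', 'negative')]

print(failed_cases(toy_predict, cases))           # []
print(len(failed_cases(flipped_predict, cases)))  # 2
```

With the real model, the same harness could be called with `lambda text: predict_sentiment(net, vocab, text)` as the `predict` argument. If every clear-cut case fails, the label mapping is the first thing to inspect.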

Recap

  • BiLSTM-on-GloVe: a strong pre-Transformer baseline for text classification.
  • Pretrained embeddings carry general-purpose word semantics; LSTM specializes for sentiment.
  • Easily beaten today by fine-tuned BERT, but a clean template for sequence-to-label tasks more broadly.