Sentiment Analysis: Using Convolutional Neural Networks

textCNN

textCNN (Kim, 2014) — a 1D conv net for sentiment. Different architecture, same task as the RNN deck.

Why CNNs on text? Each filter is a learned n-gram detector. Run several filter widths in parallel (3, 4, 5 words) for multi-scale coverage. Max-over-time pool collapses position; concat → linear → softmax. Fast, strong, parallelizable.
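The "n-gram detector" idea can be made concrete with a hand-built filter. The sketch below is only an illustration: the toy vocabulary, the one-hot "embeddings", and the helper names are assumptions, and a real textCNN learns its filter weights over dense vectors instead.

```python
import numpy as np

# Toy vocabulary with one-hot "embeddings" (an assumption for readability;
# the real model uses dense GloVe vectors).
toy_vocab = ["the", "movie", "was", "not", "good", "bad"]
embed = np.eye(len(toy_vocab))

def encode(sentence):
    # (num_tokens, embed_dim) matrix, one row per token
    return np.stack([embed[toy_vocab.index(tok)] for tok in sentence.split()])

# Width-2 filter: row 0 responds to "not", row 1 responds to "good".
kernel = np.zeros((2, len(toy_vocab)))
kernel[0, toy_vocab.index("not")] = 1.0
kernel[1, toy_vocab.index("good")] = 1.0

def conv1d_single_filter(x, kernel):
    # Slide the filter over every window of `width` consecutive tokens
    width = kernel.shape[0]
    return np.array([np.sum(x[i:i + width] * kernel)
                     for i in range(x.shape[0] - width + 1)])

scores = conv1d_single_filter(encode("the movie was not good"), kernel)
partial = conv1d_single_filter(encode("the movie was good"), kernel)
print(scores)          # [0. 0. 0. 2.]  full match on the window "not good"
print(partial.max())   # 1.0            only "good" matches
```

The filter's response peaks exactly where the bigram occurs, which is the signal max-over-time pooling later picks up.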

Pipeline

GloVe → 1D conv filters of varying widths → max-pool → classifier.

Setup

from d2l import tensorflow as d2l
import tensorflow as tf
import keras
import numpy as np

batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)
# d2l.load_array uses shuffle(buffer_size=1000), which is too small for
# the IMDb training set (25000 examples ordered as 12500 positives then
# 12500 negatives). Reshuffle the full dataset so each epoch sees a
# properly mixed class distribution, matching the PyTorch/JAX behavior.
train_iter = (train_iter.unbatch()
              .shuffle(25000, reshuffle_each_iteration=True)
              .batch(batch_size))
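Why the buffer size matters can be seen with a simplified simulation of a streaming shuffle. This is a pure-Python sketch of the general fill-then-sample idea, not tf.data's actual implementation; `buffered_shuffle` and the scaled-down dataset are hypothetical.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified streaming shuffle: sample from a fixed-size buffer."""
    rng = random.Random(seed)
    buf = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)       # still filling the buffer
            continue
        j = rng.randrange(buffer_size)
        yield buf[j]               # emit a random buffered element
        buf[j] = item              # refill its slot with the new element
    rng.shuffle(buf)               # drain whatever is left at the end
    yield from buf

# Scaled-down stand-in for IMDb: all positives (1) first, then all negatives (0).
labels = [1] * 1000 + [0] * 1000
batch_size = 64

small = list(buffered_shuffle(labels, buffer_size=100))[:batch_size]
full = list(buffered_shuffle(labels, buffer_size=len(labels)))[:batch_size]

print(sum(small))   # 64: the first batch contains positives only
print(sum(full))    # a mix of both classes
```

With a small buffer, early batches can only draw from the front of the sorted data, so they are single-class; a buffer covering the whole dataset removes that ordering effect.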

1D convolution

Sliding kernel over a 1D sequence. Output element = elementwise multiply + sum of an n-token window:

1D conv: kernel (1, 2) slides over input; first output is 0×1 + 1×2 = 2.

def corr1d(X, K):
    w = K.shape[0]
    Y = [tf.reduce_sum(X[i: i + w] * K) for i in range(X.shape[0] - w + 1)]
    return tf.stack(Y)

X, K = d2l.tensor([0, 1, 2, 3, 4, 5, 6]), d2l.tensor([1, 2])
corr1d(X, K)

<tf.Tensor: shape=(6,), dtype=int32, numpy=array([ 2,  5,  8, 11, 14, 17], dtype=int32)>

Multi-channel 1D conv

Embedding dim = input channels. Each kernel has the same channel count as the input and produces one output channel; multiple kernels give multiple output channels.

3-channel 1D conv.

def corr1d_multi_in(X, K):
    # Iterate over the 0th (channel) dimension of `X` and `K`, apply
    # `corr1d` per channel, then sum the per-channel results
    return sum(corr1d(x, k) for x, k in zip(X, K))

X = d2l.tensor([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]])
K = d2l.tensor([[1, 2], [3, 4], [-1, -3]])
corr1d_multi_in(X, K)

<tf.Tensor: shape=(6,), dtype=int32, numpy=array([ 2,  8, 14, 20, 26, 32], dtype=int32)>

Equivalent 2D-conv view

Multi-channel 1D conv is equivalent to a 2D conv with kernel height = input height, so the kernel slides only along the sequence axis.
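The equivalence between multi-channel 1D cross-correlation and a 2D cross-correlation whose kernel spans the full input height can be checked numerically. A minimal sketch, using NumPy rather than TensorFlow so it runs without d2l; `corr2d` is a plain reference implementation written for this check.

```python
import numpy as np

def corr1d(x, k):
    w = k.shape[0]
    return np.array([np.sum(x[i:i + w] * k) for i in range(x.shape[0] - w + 1)])

def corr2d(X, K):
    # Plain 2D cross-correlation, stride 1, no padding
    h, w = K.shape
    out = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + h, j:j + w] * K)
    return out

X = np.array([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]])
K = np.array([[1, 2], [3, 4], [-1, -3]])

multi_channel = sum(corr1d(x, k) for x, k in zip(X, K))  # channel-wise, then sum
as_2d = corr2d(X, K)                                     # channels stacked as rows

print(multi_channel)   # [ 2  8 14 20 26 32]
print(as_2d.shape)     # (1, 6): the kernel has no room to move vertically
```

Because the kernel height equals the input height, the 2D output has a single row, and that row matches the multi-channel 1D result.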

Max-over-time pooling

Take the max over the time axis for each filter. Resulting feature is independent of where in the sequence the n-gram appeared. One scalar per filter, regardless of sentence length:

Max-over-time = max along the sequence axis.
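A small sketch of the position independence, assuming feature maps laid out as (num_steps, num_filters); the arrays and the helper name `max_over_time` are made up for illustration.

```python
import numpy as np

def max_over_time(feature_map):
    # feature_map: (num_steps, num_filters) -> (num_filters,)
    return feature_map.max(axis=0)

# Two filters' responses over a 3-step sequence
short = np.array([[0.1, 0.9],
                  [0.7, 0.2],
                  [0.3, 0.4]])
# Same responses, but preceded by 5 steps where nothing fires
long = np.vstack([np.zeros((5, 2)), short])

print(max_over_time(short))   # [0.7 0.9]
print(max_over_time(long))    # [0.7 0.9]
```

Both sequences pool to the same two-element vector even though their lengths differ and the strong responses sit at different positions.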

textCNN model

Embedding (frozen GloVe + a fine-tunable copy) → parallel 1D convs at widths 3, 4, 5 → max-over-time → concat → dropout → linear:

class TextCNN(d2l.Classifier):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super().__init__(**kwargs)
        self.embedding = keras.layers.Embedding(vocab_size, embed_size)
        # Second embedding layer, kept frozen (not updated during training)
        self.constant_embedding = keras.layers.Embedding(vocab_size, embed_size)
        self.dropout = keras.layers.Dropout(0.5)
        self.decoder = keras.layers.Dense(2)
        # Create multiple one-dimensional convolutional layers
        self.convs = [keras.layers.Conv1D(c, k, activation='relu')
                      for c, k in zip(num_channels, kernel_sizes)]
        self.pool = keras.layers.GlobalMaxPooling1D()

    def call(self, inputs, training=False):
        # Concatenate two embedding layer outputs with shape
        # (batch_size, num_steps, 2 * embed_size) along the last axis
        embeddings = tf.concat(
            [self.embedding(inputs), self.constant_embedding(inputs)], axis=2)
        # For each convolutional layer, apply conv → global max pooling
        # and collect a (batch_size, num_channels) vector per kernel
        encoding = tf.concat(
            [self.pool(conv(embeddings)) for conv in self.convs], axis=1)
        outputs = self.decoder(self.dropout(encoding, training=training))
        return outputs
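To trace the tensor shapes through the forward pass, here is a NumPy re-implementation with random weights. It is a sketch under stated assumptions (stride 1, no padding, dropout omitted as at inference time, tiny made-up sizes), not the Keras layers themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_steps, vocab_size, embed_size = 2, 10, 50, 4
kernel_sizes, num_channels = [3, 4, 5], [6, 6, 6]

tokens = rng.integers(0, vocab_size, size=(batch, num_steps))
emb_a = rng.normal(size=(vocab_size, embed_size))   # stands in for the trainable table
emb_b = rng.normal(size=(vocab_size, embed_size))   # stands in for the frozen table
x = np.concatenate([emb_a[tokens], emb_b[tokens]], axis=2)   # (batch, steps, 2*embed)

def conv1d_relu(x, w, b):
    # x: (batch, steps, in_ch); w: (k, in_ch, out_ch); b: (out_ch,)
    k = w.shape[0]
    windows = np.stack([x[:, i:i + k, :] for i in range(x.shape[1] - k + 1)],
                       axis=1)                       # (batch, steps-k+1, k, in_ch)
    out = np.einsum('btki,kio->bto', windows, w) + b
    return np.maximum(out, 0.0)                      # ReLU

pooled = []
for k, c in zip(kernel_sizes, num_channels):
    w = rng.normal(size=(k, 2 * embed_size, c))
    b = np.zeros(c)
    pooled.append(conv1d_relu(x, w, b).max(axis=1))  # max over time: (batch, c)

encoding = np.concatenate(pooled, axis=1)            # (batch, sum(num_channels))
w_out = rng.normal(size=(encoding.shape[1], 2))
logits = encoding @ w_out                            # (batch, 2)

print(x.shape, encoding.shape, logits.shape)   # (2, 10, 8) (2, 18) (2, 2)
```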

textCNN instance

The concrete model uses 100 channels at each kernel width. After max-over-time pooling, the classifier sees sum(nums_channels) = 300 features, independent of review length.
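The shape arithmetic can be sketched in plain Python, assuming stride 1 and no padding (the Keras Conv1D defaults); `textcnn_shapes` is a hypothetical helper, not part of d2l.

```python
kernel_sizes, nums_channels = [3, 4, 5], [100, 100, 100]

def textcnn_shapes(num_steps):
    # Each conv output has (num_steps - k + 1) time steps and c channels
    conv_shapes = [(num_steps - k + 1, c)
                   for k, c in zip(kernel_sizes, nums_channels)]
    # Max-over-time keeps one value per channel, so only c survives
    num_features = sum(c for _, c in conv_shapes)
    return conv_shapes, num_features

print(textcnn_shapes(500))   # ([(498, 100), (497, 100), (496, 100)], 300)
print(textcnn_shapes(50))    # ([(48, 100), (47, 100), (46, 100)], 300)
```

The time dimension changes with the input length, but the pooled feature count stays at 300.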

embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
devices = d2l.try_all_gpus()
net = TextCNN(len(vocab), embed_size, kernel_sizes, nums_channels)
# Build the model by calling it once on a dummy input
dummy_input = tf.zeros((1, 500), dtype=tf.int32)
net(dummy_input)

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.04802025, 0.01694743]], dtype=float32)>

Loading pretrained GloVe

Both embedding tables start from the same GloVe vectors: one stays fixed as a semantic anchor, the other is fine-tuned for sentiment-specific cues.

glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.set_weights([np.array(embeds)])
net.constant_embedding.set_weights([np.array(embeds)])
net.constant_embedding.trainable = False
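What the two tables look like can be sketched with a tiny made-up vocabulary. Everything here is hypothetical: the `pretrained` dictionary, its 2-d vectors, and the zero-vector fallback for tokens missing from the pretrained set are assumptions for illustration, and the "training step" is a fake uniform gradient.

```python
import numpy as np

# Hypothetical pretrained vectors (2-d instead of 100-d for readability)
pretrained = {"great": np.array([0.5, 0.1]), "bad": np.array([-0.4, 0.2])}
idx_to_token = ["<unk>", "<pad>", "great", "bad", "zzyzx"]
embed_size = 2

# Row i holds the vector for vocabulary index i; tokens without a
# pretrained vector fall back to zeros (assumption).
embeds = np.stack([pretrained.get(tok, np.zeros(embed_size))
                   for tok in idx_to_token])

trainable = embeds.copy()   # plays the role of net.embedding
frozen = embeds.copy()      # plays the role of net.constant_embedding

# One fake gradient step: only the trainable copy is updated
grad = np.ones_like(trainable)
trainable -= 0.1 * grad

print(embeds.shape)                      # (5, 2)
print(np.array_equal(frozen, embeds))    # True
print(np.array_equal(trainable, embeds)) # False
```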

Training

CNNs train fast because all windows are processed in parallel. Use the metric output to compare with the BiLSTM deck: similar accuracy, less sequential computation.

lr, num_epochs = 0.001, 5
net.compile(optimizer=keras.optimizers.Adam(lr),
            loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['accuracy'])
net.fit(train_iter, validation_data=test_iter, epochs=num_epochs)

Final epoch metrics: accuracy: 0.9693 - loss: 0.0917 - val_accuracy: 0.8626 - val_loss: 0.3731
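For reading the loss values: with `from_logits=True` the loss is softmax cross-entropy computed directly on the two output logits. A NumPy sketch of that formula follows (a reference computation, not the Keras implementation). For two classes, an uninformed prediction scores log 2 ≈ 0.693, which is a useful baseline for the reported val_loss.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

uninformed = sparse_ce_from_logits([[0.0, 0.0]], [1])   # equal logits: log(2)
confident = sparse_ce_from_logits([[2.0, 0.0]], [0])    # correct class favored
print(round(uninformed, 3))   # 0.693
print(round(confident, 3))    # 0.127
```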

d2l.predict_sentiment(net, vocab, 'this movie is so great')

'positive'

d2l.predict_sentiment(net, vocab, 'this movie is so bad')

'negative'
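Roughly, prediction maps tokens to indices, runs the network on a batch of one, and takes the argmax of the two logits. The sketch below is a hypothetical stand-in, not d2l's `predict_sentiment`: `predict_label`, `toy_net`, and the toy vocabulary are invented, and it assumes class index 1 means positive.

```python
import numpy as np

def predict_label(net_fn, token_to_idx, sentence, unk_idx=0):
    # Whitespace tokenization; unknown tokens map to unk_idx
    ids = np.array([[token_to_idx.get(tok, unk_idx) for tok in sentence.split()]])
    logits = net_fn(ids)                              # (1, 2)
    label = int(np.argmax(logits, axis=1)[0])
    return 'positive' if label == 1 else 'negative'   # assumes 1 = positive

# Toy stand-in for a trained network: each token carries a fixed score
token_to_idx = {'<unk>': 0, 'great': 1, 'bad': 2}
token_score = np.array([0.0, 1.5, -1.5])

def toy_net(ids):
    s = token_score[ids].sum(axis=1)    # (batch,)
    return np.stack([-s, s], axis=1)    # logits for [negative, positive]

print(predict_label(toy_net, token_to_idx, 'this movie is so great'))  # positive
print(predict_label(toy_net, token_to_idx, 'this movie is so bad'))    # negative
```

One practical point for the real model: with unpadded convolutions, the widest kernel (5) needs at least 5 tokens, which both example sentences have exactly.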

Recap

textCNN = parallel 1D convs over word embeddings + max pooling + linear head.
Each filter learns an n-gram detector; different widths give multi-scale coverage.
Comparable accuracy to BiLSTM on IMDb (val_accuracy 0.8626 after 5 epochs in this run) with zero recurrence, so all windows are computed in parallel and training is fast.
The shape (parallel filter widths, pooled features) is the template for many text-classification CNNs that followed.