from d2l import torch as d2l
import torch
from torch import nn
batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)textCNN (Kim, 2014) — a 1D conv net for sentiment. Different architecture, same task as the RNN deck.
Why CNNs on text? Each filter is a learned n-gram detector. Run several filter widths in parallel (3, 4, 5 words) for multi-scale coverage. Max-over-time pool collapses position; concat → linear → softmax. Fast, strong, parallelizable.
GloVe → 1D conv filters of varying widths → max-pool → classifier.
Sliding kernel over a 1D sequence. Output element = elementwise multiply + sum of an n-token window:
1D conv: kernel (1, 2) slides over input; first output is 0 \cdot 1 + 1 \cdot 2 = 2.
Embedding dim = input channels. Kernel has the same channel count; output is single-channel (or multi if you have multiple kernels).
3-channel 1D conv.
tensor([ 2., 5., 8., 11., 14., 17.])
Equivalent to a 2D conv with kernel height = input height:
def corr1d_multi_in(X, K):
# First, iterate through the 0th dimension (channel dimension) of `X` and
# `K`. Then, add them together
return sum(corr1d(x, k) for x, k in zip(X, K))
X = d2l.tensor([[0, 1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6, 7],
[2, 3, 4, 5, 6, 7, 8]])
K = d2l.tensor([[1, 2], [3, 4], [-1, -3]])
corr1d_multi_in(X, K)tensor([ 2., 8., 14., 20., 26., 32.])
Take the max over the time axis for each filter. Resulting feature is independent of where in the sequence the n-gram appeared. One scalar per filter, regardless of sentence length:
Max-over-time = max along the sequence axis.
Embedding (frozen GloVe + a fine-tunable copy) → parallel 1D convs at widths 3, 4, 5 → max-over-time → concat → dropout → linear:
class TextCNN(nn.Module):
def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
**kwargs):
super(TextCNN, self).__init__(**kwargs)
self.embedding = nn.Embedding(vocab_size, embed_size)
# The embedding layer not to be trained
self.constant_embedding = nn.Embedding(vocab_size, embed_size)
self.dropout = nn.Dropout(0.5)
self.decoder = nn.Linear(sum(num_channels), 2)
# The max-over-time pooling layer has no parameters, so this instance
# can be shared
self.pool = nn.AdaptiveMaxPool1d(1)
self.relu = nn.ReLU()
# Create multiple one-dimensional convolutional layers
self.convs = nn.ModuleList()
for c, k in zip(num_channels, kernel_sizes):
self.convs.append(nn.Conv1d(2 * embed_size, c, k))
def forward(self, inputs):
# Concatenate two embedding layer outputs with shape (batch size, no.
# of tokens, token vector dimension) along vectors
embeddings = torch.cat((
self.embedding(inputs), self.constant_embedding(inputs)), dim=2)
# Per the input format of one-dimensional convolutional layers,
# rearrange the tensor so that the second dimension stores channels
embeddings = embeddings.permute(0, 2, 1)
# For each one-dimensional convolutional layer, after max-over-time
# pooling, a tensor of shape (batch size, no. of channels, 1) is
# obtained. Remove the last dimension and concatenate along channels
encoding = torch.cat([
torch.squeeze(self.relu(self.pool(conv(embeddings))), dim=-1)
for conv in self.convs], dim=1)
outputs = self.decoder(self.dropout(encoding))
return outputsThe concrete model uses 100 channels at each kernel width. After max-over-time pooling, the classifier sees sum(num_channels) features, independent of review length.
embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
devices = d2l.try_all_gpus()
net = TextCNN(len(vocab), embed_size, kernel_sizes, nums_channels)
def init_weights(module):
if type(module) in (nn.Linear, nn.Conv1d):
nn.init.xavier_uniform_(module.weight)
net.apply(init_weights);Both embedding tables start from the same GloVe vectors: one stays fixed as a semantic anchor, the other is fine-tuned for sentiment-specific cues.
CNNs train fast because all windows are processed in parallel. Use the metric output to compare with the BiLSTM deck: similar accuracy, less sequential computation.
loss 0.091, train acc 0.969, test acc 0.862
14373.9 examples/sec on [device(type='cuda', index=0)]