Natural Language Inference: Fine-Tuning BERT

BERT for NLI

With one trick, a pretrained BERT can be fine-tuned for NLI and, at full model size, gets close to state-of-the-art: feed <cls> premise <sep> hypothesis <sep> through the encoder and put a 3-way classifier on the <cls> representation.

This is why BERT mattered: arbitrary sentence-pair classification reduces to a few lines of fine-tuning code on top of a pretrained encoder.

Pipeline

BERT encoder + small MLP head on <cls>.

Setup

from d2l import jax as d2l
import jax
from jax import numpy as jnp
from flax import nnx
import optax
import numpy as np
import json
import os

Loading pretrained BERT

We use a small pretrained BERT (the one we trained ourselves in the previous chapter, or a downloaded checkpoint). The framework-specific checkpoint conversion helpers are implementation plumbing, so the slide shows only the teaching contract:

register a checkpoint URL and checksum;
load the vocabulary;
instantiate the same BERT architecture;
copy pretrained weights into the encoder.
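
The first two steps of the contract can be sketched in plain Python. This is a minimal sketch, not the actual helper: it assumes the checkpoint is verified with a SHA-1 digest and that the vocabulary is stored as a JSON list of tokens in a file called `vocab.json`. The names `sha1_of_file` and `load_vocab` are hypothetical.

```python
import hashlib
import json
import os
import tempfile

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

def load_vocab(path):
    """Load a JSON list of tokens and build the token -> index map."""
    with open(path) as f:
        idx_to_token = json.load(f)
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

# Toy stand-in for a downloaded checkpoint directory (hypothetical layout).
tmp_dir = tempfile.mkdtemp()
vocab_path = os.path.join(tmp_dir, 'vocab.json')
with open(vocab_path, 'w') as f:
    json.dump(['<pad>', '<cls>', '<sep>', '<unk>', 'a', 'dog'], f)

# In practice the expected digest is registered next to the checkpoint URL;
# here it is computed from the toy file so the check is self-contained.
expected_sha1 = sha1_of_file(vocab_path)
assert sha1_of_file(vocab_path) == expected_sha1

idx_to_token, token_to_idx = load_vocab(vocab_path)
print(token_to_idx['<cls>'], len(idx_to_token))
```

Checking the digest before loading catches a truncated or stale download early, before it shows up as a confusing shape mismatch when the weights are copied.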

Instantiate pretrained BERT

The loaded encoder returns contextual token representations and the <cls> representation. Fine-tuning reuses that backbone and adds only a small task head.

Encoding sentence pairs

Tokenize each (premise, hypothesis) pair into BERT input format: <cls> + premise + <sep> + hypothesis + <sep> with segment IDs distinguishing the two halves:

class SNLIBERTDataset:
    def __init__(self, dataset, max_len, vocab=None):
        all_premise_hypothesis_tokens = [[
            p_tokens, h_tokens] for p_tokens, h_tokens in zip(
            *[d2l.tokenize([s.lower() for s in sentences])
              for sentences in dataset[:2]])]
        
        self.labels = np.asarray(dataset[2], dtype=np.int32)
        self.vocab = vocab
        self.max_len = max_len
        (self.all_token_ids, self.all_segments,
         self.valid_lens) = self._preprocess(all_premise_hypothesis_tokens)
        print('read ' + str(len(self.all_token_ids)) + ' examples')

    def _preprocess(self, all_premise_hypothesis_tokens):
        # This Python token/list processing is inexpensive enough here that a
        # list comprehension avoids multiprocessing setup and serialization.
        out = [self._preprocess_pair(tokens)
               for tokens in all_premise_hypothesis_tokens]
        all_token_ids = [
            token_ids for token_ids, segments, valid_len in out]
        all_segments = [segments for token_ids, segments, valid_len in out]
        valid_lens = [valid_len for token_ids, segments, valid_len in out]
        return (np.asarray(all_token_ids, dtype=np.int32),
                np.asarray(all_segments, dtype=np.int32),
                np.asarray(valid_lens, dtype=np.float32))

    def _preprocess_pair(self, premise_hypothesis_tokens):
        p_tokens, h_tokens = premise_hypothesis_tokens
        self._truncate_pair_of_tokens(p_tokens, h_tokens)
        tokens, segments = d2l.get_tokens_and_segments(p_tokens, h_tokens)
        token_ids = self.vocab[tokens] + [self.vocab['<pad>']] \
                             * (self.max_len - len(tokens))
        segments = segments + [0] * (self.max_len - len(segments))
        valid_len = len(tokens)
        return token_ids, segments, valid_len

    def _truncate_pair_of_tokens(self, p_tokens, h_tokens):
        # Reserve slots for '<cls>', '<sep>', and '<sep>' tokens for the BERT
        # input
        while len(p_tokens) + len(h_tokens) > self.max_len - 3:
            if len(p_tokens) > len(h_tokens):
                p_tokens.pop()
            else:
                h_tokens.pop()

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx]), self.labels[idx]

    def __len__(self):
        return len(self.all_token_ids)

# Reduce `batch_size` if there is an out of memory error. In the original BERT
# model, `max_len` = 512
batch_size, max_len = 512, 128
data_dir = d2l.download_extract('SNLI')
train_set = SNLIBERTDataset(d2l.read_snli(data_dir, True), max_len, vocab)
test_set = SNLIBERTDataset(d2l.read_snli(data_dir, False), max_len, vocab)
train_iter = d2l.load_array(
    (train_set.all_token_ids, train_set.all_segments,
     train_set.valid_lens, train_set.labels), batch_size, is_train=True)
test_iter = d2l.load_array(
    (test_set.all_token_ids, test_set.all_segments,
     test_set.valid_lens, test_set.labels), batch_size, is_train=False)

read 549367 examples
read 9824 examples
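
To see what a single encoded example looks like, here is a small standalone sketch of the same truncate, concatenate, and pad steps on a toy pair. It assumes the special-token strings `<cls>`, `<sep>`, and `<pad>` and works on tokens rather than vocabulary indices. `encode_pair` is a hypothetical helper, not part of `d2l`.

```python
def encode_pair(p_tokens, h_tokens, max_len, pad='<pad>'):
    """Build BERT-style tokens, segment IDs, and valid length for one pair."""
    p_tokens, h_tokens = list(p_tokens), list(h_tokens)  # do not mutate inputs
    # Leave room for '<cls>' and two '<sep>' tokens; trim the longer side first.
    while len(p_tokens) + len(h_tokens) > max_len - 3:
        if len(p_tokens) > len(h_tokens):
            p_tokens.pop()
        else:
            h_tokens.pop()
    tokens = ['<cls>'] + p_tokens + ['<sep>'] + h_tokens + ['<sep>']
    # Segment 0 covers '<cls>', the premise, and the first '<sep>';
    # segment 1 covers the hypothesis and the final '<sep>'.
    segments = [0] * (len(p_tokens) + 2) + [1] * (len(h_tokens) + 1)
    valid_len = len(tokens)
    # Pad both sequences to max_len; padded positions get segment 0.
    tokens += [pad] * (max_len - valid_len)
    segments += [0] * (max_len - valid_len)
    return tokens, segments, valid_len

tokens, segments, valid_len = encode_pair(
    ['a', 'dog', 'runs'], ['an', 'animal', 'moves'], max_len=10)
print(tokens)
print(segments)
print(valid_len)
```

With `max_len=10` the pair fits without truncation: nine real tokens plus one `<pad>`. The `valid_len` value is what lets the encoder mask out padded positions in attention.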

Classifier head

A tiny MLP on the <cls> representation produces 3 outputs (entailment, contradiction, neutral). The encoder weights are not frozen: they are fine-tuned end-to-end together with the head:

class BERTClassifier(nnx.Module):
    def __init__(self, bert, rngs=None):
        self.bert = bert
        rngs = nnx.Rngs(0) if rngs is None else rngs
        self.output = nnx.Linear(bert.hidden.out_features, 3, rngs=rngs)

    def __call__(self, tokens_X, segments_X, valid_lens_x):
        encoded_X = self.bert.encoder(tokens_X, segments_X, valid_lens_x)
        return self.output(jnp.tanh(self.bert.hidden(encoded_X[:, 0, :])))

net = BERTClassifier(bert)
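
The shapes flowing through the head can be checked in isolation. The sketch below replaces the encoder output with random numbers and uses hypothetical parameter arrays (`W1`, `b1`, `W2`, `b2`) in place of the pretrained hidden layer and the new output layer. The sizes are illustrative, not those of the checkpoint.

```python
import jax
from jax import numpy as jnp

batch_size, seq_len, num_hiddens, num_classes = 2, 8, 16, 3
k_enc, k_w1, k_w2 = jax.random.split(jax.random.PRNGKey(0), 3)

# Stand-in for the encoder output: one vector per token position.
encoded_X = jax.random.normal(k_enc, (batch_size, seq_len, num_hiddens))

# Hypothetical parameters for the hidden (pooling) layer and the output layer.
W1 = jax.random.normal(k_w1, (num_hiddens, num_hiddens)) * 0.1
b1 = jnp.zeros(num_hiddens)
W2 = jax.random.normal(k_w2, (num_hiddens, num_classes)) * 0.1
b2 = jnp.zeros(num_classes)

cls_repr = encoded_X[:, 0, :]          # (batch_size, num_hiddens): position 0 is <cls>
pooled = jnp.tanh(cls_repr @ W1 + b1)  # (batch_size, num_hiddens), bounded by tanh
logits = pooled @ W2 + b2              # (batch_size, num_classes)
print(cls_repr.shape, pooled.shape, logits.shape)
```

Only position 0 feeds the classifier, so the sequence length drops out of the head entirely. The other token representations influence the prediction only through self-attention inside the encoder.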

Fine-tuning

Standard cross-entropy + Adam with a low learning rate. The usual BERT recipe is around 2e-5; the run below uses 1e-4 for 5 epochs on a small BERT. A few epochs are enough: the model already knows language, and we are only teaching it the specific task. Held-out accuracy is the main signal, since training loss can keep falling after the classifier starts overfitting SNLI artifacts. The loop below reports it on the SNLI test split:

lr, num_epochs = 1e-4, 5
optimizer = nnx.Optimizer(net, optax.adam(lr), wrt=nnx.Param)

@nnx.jit
def train_step(net, optimizer, tokens_X, segments_X, valid_lens_x, labels):
    def loss_fn(model):
        logits = model(tokens_X, segments_X, valid_lens_x)
        return optax.softmax_cross_entropy_with_integer_labels(
            logits, labels).mean()
    loss, grads = nnx.value_and_grad(loss_fn)(net)
    optimizer.update(net, grads)
    return loss

@nnx.jit
def eval_step(net, tokens_X, segments_X, valid_lens_x, labels):
    logits = nnx.view(net, deterministic=True)(
        tokens_X, segments_X, valid_lens_x)
    return (logits.argmax(axis=-1) == labels).sum()

for epoch in range(num_epochs):
    train_loss, n_train = jnp.array(0.0), 0
    for batch in train_iter:
        tokens_X, segments_X, valid_lens_x, labels = (
            batch[0], batch[1], batch[2], batch[3])
        loss = train_step(net, optimizer, tokens_X, segments_X,
                          valid_lens_x, labels)
        train_loss += loss * len(labels)
        n_train += len(labels)
    # Evaluate on test set
    n_correct, n_test = jnp.array(0), 0
    for batch in test_iter:
        tokens_X, segments_X, valid_lens_x, labels = (
            batch[0], batch[1], batch[2], batch[3])
        n_correct += eval_step(
            net, tokens_X, segments_X, valid_lens_x, labels)
        n_test += len(labels)
    train_loss, n_correct = float(train_loss), int(n_correct)
    print(f'epoch {epoch + 1}, loss {train_loss / n_train:.4f}, '
          f'test acc {n_correct / n_test:.4f}')

epoch 1, loss 0.7752, test acc 0.7356
epoch 2, loss 0.6329, test acc 0.7565
epoch 3, loss 0.5652, test acc 0.7741
epoch 4, loss 0.5165, test acc 0.7827
epoch 5, loss 0.4798, test acc 0.7856
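
To put the loss values in context, the sketch below computes mean softmax cross-entropy with integer labels by hand, which is the same quantity the training step minimizes, along with accuracy on toy logits. The helper `cross_entropy` and the toy numbers are illustrative assumptions, not values from the run above.

```python
import jax
from jax import numpy as jnp

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Pick the log-probability assigned to the correct class of each example.
    picked = jnp.take_along_axis(log_probs, labels[:, None], axis=-1)
    return -picked.mean()

# Toy logits for three examples over the 3 NLI classes.
logits = jnp.array([[2.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0]])
labels = jnp.array([0, 2, 2])

loss = cross_entropy(logits, labels)
accuracy = (logits.argmax(axis=-1) == labels).mean()

# A classifier that outputs equal logits for all 3 classes has loss ln(3).
uniform_loss = cross_entropy(jnp.zeros((4, 3)), jnp.array([0, 1, 2, 0]))
print(float(loss), float(accuracy), float(uniform_loss))
```

The uniform-guess loss of ln 3 ≈ 1.0986 is a useful reference point: an epoch-1 training loss of 0.7752 is already well below it.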

Recap

Sentence-pair classification = encode <cls> A <sep> B <sep>, classify the <cls> representation.
Same recipe handles NLI, paraphrase, semantic similarity, and many more.
Typical fine-tuning hyperparameters for full-size BERT: batch 32, lr ~2e-5, 2-4 epochs (the small-model run above used batch 512, lr 1e-4, 5 epochs). Short, cheap, and reproducible.
BERT ended the pre-2018 NLI architecture wars: per-task model design became largely obsolete.