Language Models

A language model assigns a probability to a sequence of tokens:

P(x_1, x_2, \dots, x_T) = \prod_{t=1}^T P(x_t \mid x_{<t}).

That decomposition is the heart of autoregressive LMs: predict the next token given everything before it.
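As a quick numeric illustration of the chain rule, the sketch below multiplies a handful of conditional probabilities into a joint probability. The values in `cond_probs` are made up for illustration; they do not come from any trained model.

```python
import math

# Hypothetical values of P(x_t | x_<t) for a 4-token sequence (illustrative only).
cond_probs = [0.5, 0.25, 0.8, 0.1]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(cond_probs)

# Summing log-probabilities gives the same quantity and avoids underflow
# when the sequence is long.
log_joint = sum(math.log(p) for p in cond_probs)

print(joint, math.exp(log_joint))
```

Working in log space is the usual choice in practice, since a product of thousands of probabilities quickly underflows to zero in floating point.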

Training data: shifted targets

Five (input, target) pairs from a length-5 partition: targets are inputs shifted by one.
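A minimal sketch of that partitioning, assuming a character-level toy corpus and no random starting offset. The helper `shifted_pairs` is a hypothetical name, not part of d2l.

```python
def shifted_pairs(tokens, num_steps):
    """Split tokens into non-overlapping (input, target) pairs of length num_steps."""
    pairs = []
    # Stop early enough that each target still has num_steps tokens.
    for start in range(0, len(tokens) - num_steps, num_steps):
        x = tokens[start:start + num_steps]
        y = tokens[start + 1:start + num_steps + 1]  # same span, shifted by one
        pairs.append((x, y))
    return pairs

tokens = list("the time machine by h g wells")  # toy corpus of 29 characters
pairs = shifted_pairs(tokens, num_steps=5)
for x, y in pairs:
    print(repr("".join(x)), "->", repr("".join(y)))
```

With 29 characters and num_steps=5 this yields five pairs; the leftover characters at the end are dropped.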

What this deck sets up

  • n-gram models with Laplace smoothing — the classical baseline.
  • Perplexity = \exp(H), with H the average cross-entropy in nats (equivalently 2^{H} with H in bits): the standard quality metric.
  • Partitioning the corpus into training minibatches.

n-gram approximation

The full context x_{<t} grows with time. An n-gram model uses a Markov approximation:

P(x_t \mid x_{<t}) \approx P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}).

For bigrams:

\hat P(x_t \mid x_{t-1}) = \frac{n(x_{t-1}, x_t)}{n(x_{t-1})}.

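The raw count ratio assigns probability zero to any bigram that never occurs in the training text. Laplace smoothing adds a small constant to every count so that unseen events keep a nonzero probability.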
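A minimal sketch of a smoothed bigram estimate, assuming the add-alpha form \hat P(x_t \mid x_{t-1}) = (n(x_{t-1}, x_t) + \alpha) / (n(x_{t-1}) + \alpha |\mathcal{V}|), where \alpha = 1 is add-one (Laplace) smoothing. The function name `bigram_laplace` and the toy sentence are illustrative, not from the source.

```python
from collections import Counter

def bigram_laplace(tokens, alpha=1.0):
    """Return (prob, vocab) where prob(prev, cur) is the smoothed bigram estimate."""
    vocab = sorted(set(tokens))
    context = Counter(tokens[:-1])                 # n(x_{t-1}): token counts as a context
    bigram = Counter(zip(tokens[:-1], tokens[1:])) # n(x_{t-1}, x_t)

    def prob(prev, cur):
        # Add alpha to every bigram count; the denominator grows by alpha * |V|
        # so the distribution over cur still sums to one.
        return (bigram[(prev, cur)] + alpha) / (context[prev] + alpha * len(vocab))

    return prob, vocab

tokens = "the cat sat on the mat".split()
prob, vocab = bigram_laplace(tokens)
print(prob("the", "cat"))   # seen bigram
print(prob("the", "sat"))   # unseen bigram, still nonzero
```

Because `Counter` returns 0 for missing keys, unseen bigrams and unseen contexts need no special casing.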

Perplexity

Perplexity is exponentiated average negative log-likelihood:

\operatorname{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right).

Lower is better. Perfect prediction gives 1; a uniform guess over |\mathcal{V}| tokens gives |\mathcal{V}|.
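The two limiting cases can be checked numerically. The sketch below takes the probabilities a model assigned to the correct tokens and evaluates the formula above; it also computes the base-2 form to show that \exp(H) in nats and 2^{H} in bits agree. The vocabulary size and the `mixed` probabilities are illustrative assumptions.

```python
import math

def perplexity(probs):
    """Exponentiated average negative log-likelihood (natural log)."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

def perplexity_base2(probs):
    """Same quantity computed as 2**H, with H measured in bits."""
    avg_bits = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** avg_bits

vocab_size = 28  # illustrative vocabulary size, not read from the corpus
perfect = perplexity([1.0] * 10)               # model is always certain and right
uniform = perplexity([1.0 / vocab_size] * 10)  # model guesses uniformly
mixed = [0.5, 0.25, 0.125, 0.125]              # hypothetical per-token probabilities

print(perfect, uniform, perplexity(mixed), perplexity_base2(mixed))
```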

Setup

We’ll build LMs on the Time Machine corpus introduced in the previous chapter:

from d2l import torch as d2l
import torch

For each batch we draw random subsequences of length num_steps from the corpus. Targets are inputs shifted by one — “predict the next token”:

@d2l.add_to_class(d2l.TimeMachine)
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    # Every window of num_steps + 1 consecutive tokens, one per start index.
    array = d2l.tensor([corpus[i:i+num_steps+1]
                        for i in range(len(corpus)-num_steps)])
    # Inputs drop the last token of each window; targets drop the first.
    self.X, self.Y = array[:,:-1], array[:,1:]

Random minibatch sampling: the loader takes the first num_train windows for training and the next num_val for validation, then draws minibatches from that split:

@d2l.add_to_class(d2l.TimeMachine)
def get_dataloader(self, train):
    # First num_train windows for training, the next num_val for validation.
    idx = slice(0, self.num_train) if train else slice(
        self.num_train, self.num_train + self.num_val)
    return self.get_tensorloader([self.X, self.Y], train, idx)
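To see the same split-then-batch logic without d2l, here is a standard-library sketch. It assumes that get_tensorloader shuffles examples only when train is true, which is how a PyTorch DataLoader with shuffle=train would behave. The helper `minibatches` and the toy integer corpus are hypothetical.

```python
import random

def minibatches(examples, batch_size, shuffle, seed=0):
    """Yield lists of examples; shuffle the order first when requested."""
    indices = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # seeded for reproducibility
    for i in range(0, len(indices), batch_size):
        yield [examples[j] for j in indices[i:i + batch_size]]

corpus = list(range(20))  # toy corpus of token indices
num_steps = 4
# All windows of num_steps + 1 tokens, split into (input, target) pairs.
windows = [corpus[i:i + num_steps + 1] for i in range(len(corpus) - num_steps)]
examples = [(w[:-1], w[1:]) for w in windows]

num_train, num_val = 10, 5
train = examples[:num_train]
val = examples[num_train:num_train + num_val]

train_batches = list(minibatches(train, batch_size=2, shuffle=True))
val_batches = list(minibatches(val, batch_size=2, shuffle=False))
for x, y in train_batches[0]:
    print("X:", x, "Y:", y)
```

Shuffling only the training split keeps validation batches in a fixed order, which makes validation metrics comparable across epochs.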

A small example with batch_size=2 and num_steps=10; the tensors hold integer token indices:

data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break
X: tensor([[ 3,  5, 13,  2,  1,  3,  9,  2, 10,  2],
        [ 1, 16,  5, 12, 21, 19,  1, 15,  4,  6]]) 
Y: tensor([[ 5, 13,  2,  1,  3,  9,  2, 10,  2,  1],
        [16,  5, 12, 21, 19,  1, 15,  4,  6,  1]])

Each batch yields an (X, Y) pair where Y is X shifted by one position.

Recap

  • A language model factors P(x_{1:T}) = \prod_t P(x_t \mid x_{<t}).
  • Perplexity = \exp(\text{avg cross-entropy}): the reciprocal of the geometric mean of the per-token probabilities, roughly the effective number of equally likely choices for the next token.
  • n-gram with Laplace smoothing is the classical baseline (one that neural LMs generally outperform).
  • Training data = randomly sampled length-num_steps subsequences; target = input shifted by one. Same protocol drives every RNN / Transformer LM in the book.