from d2l import tensorflow as d2l
import tensorflow as tf

A language model assigns a probability to a sequence of tokens:
P(x_1, x_2, \dots, x_T) = \prod_{t=1}^T P(x_t \mid x_{<t}).
That decomposition is the heart of every modern LM — predict the next token given everything before it.
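As a rough illustration of the factorization, the sketch below sums log-conditionals over a short sequence. The lookup table `COND` and the helper `sequence_log_prob` are hypothetical, and the probabilities are made-up numbers rather than estimates from any corpus.

```python
import math

# Hypothetical toy model: P(token | context), keyed by (context, token).
# The numbers are invented purely for illustration.
COND = {
    ((), "the"): 0.5,
    (("the",), "time"): 0.4,
    (("the", "time"), "machine"): 0.9,
}

def sequence_log_prob(tokens, cond):
    """Sum log P(x_t | x_{<t}) over t, i.e. the log of the chain-rule product."""
    total = 0.0
    for t, tok in enumerate(tokens):
        context = tuple(tokens[:t])  # everything before position t
        total += math.log(cond[(context, tok)])
    return total

log_p = sequence_log_prob(["the", "time", "machine"], COND)
prob = math.exp(log_p)  # equals the product 0.5 * 0.4 * 0.9
```

Working in log space avoids numerical underflow when the product runs over many tokens.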
Five (input, target) pairs from a length-5 partition: targets are inputs shifted by one.
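The full context x_{<t} grows with t, so its statistics quickly become impossible to estimate from counts. An n-gram model truncates the context to the previous n-1 tokens (a Markov approximation):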
P(x_t \mid x_{<t}) \approx P(x_t \mid x_{t-n+1}, \ldots, x_{t-1}).
For bigrams:
\hat P(x_t \mid x_{t-1}) = \frac{n(x_{t-1}, x_t)}{n(x_{t-1})}.
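A bigram that never occurs in the training data has count zero and therefore estimated probability zero. Smoothing, for example adding a small constant to every count, keeps such unseen events from getting probability zero.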
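The count-based bigram estimate can be sketched in a few lines. The text does not name a smoothing method, so additive (Laplace) smoothing is assumed here as one common choice. The function `bigram_model` and the parameter `alpha` are illustrative names, not part of d2l.

```python
from collections import Counter

def bigram_model(tokens, vocab, alpha=1.0):
    """Return prob(prev, cur), an additively smoothed bigram estimate.

    Assumption: additive (Laplace) smoothing with constant alpha.
    """
    # n(x_{t-1}): how often each token appears as a left context.
    # The final token has no successor, so it is not counted as a context.
    context_counts = Counter(tokens[:-1])
    # n(x_{t-1}, x_t): how often each adjacent pair appears.
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    vocab_size = len(vocab)

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + alpha) / (
            context_counts[prev] + alpha * vocab_size)

    return prob

tokens = "a b a b a c".split()
vocab = sorted(set(tokens))
p = bigram_model(tokens, vocab)
p_seen = p("a", "b")    # pair observed twice
p_unseen = p("a", "a")  # pair never observed, yet nonzero after smoothing
```

With alpha > 0, each conditional distribution still sums to one over the vocabulary, because the same constant is added to every numerator and alpha times the vocabulary size to the denominator.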
Perplexity is exponentiated average negative log-likelihood:
\operatorname{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right).
Lower is better. Perfect prediction gives 1; a uniform guess over |\mathcal{V}| tokens gives |\mathcal{V}|.
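The two reference points above can be checked numerically. This is a minimal sketch: `perplexity` is a hypothetical helper that takes the probabilities a model assigned to the tokens that actually occurred, and `vocab_size = 28` is an arbitrary value chosen for the example.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

vocab_size = 28  # arbitrary illustrative vocabulary size
uniform = perplexity([1.0 / vocab_size] * 50)  # uniform guess at every step
perfect = perplexity([1.0] * 50)               # always certain and correct
```

Here `uniform` comes out equal to `vocab_size` up to floating-point error, and `perfect` equals 1, matching the two limiting cases.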
We'll build LMs on the Time Machine corpus introduced in the previous chapter.
For each batch we draw random subsequences of length num_steps from the corpus. Targets are inputs shifted by one — “predict the next token”:
@d2l.add_to_class(d2l.TimeMachine)
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    # One window of num_steps + 1 consecutive tokens per start offset
    array = d2l.tensor([corpus[i:i+num_steps+1]
                        for i in range(len(corpus)-num_steps)])
    # Inputs drop the last token; targets drop the first (shift by one)
    self.X, self.Y = array[:, :-1], array[:, 1:]

Sequential sampling is the alternative: consecutive batches keep adjacent subsequences, which is useful for stateful RNNs.
A small example (tokens shown as integer vocabulary indices, batch_size=2, num_steps=10):
X: tf.Tensor(
[[ 1 3 9 2 1 16 7 10 13 2]
[18 9 3 1 18 2 3 1 7 6]], shape=(2, 10), dtype=int32)
Y: tf.Tensor(
[[ 3 9 2 1 16 7 10 13 2 10]
[ 9 3 1 18 2 3 1 7 6 2]], shape=(2, 10), dtype=int32)
Each batch yields an (X, Y) pair where Y is X shifted by one position.
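The shift-by-one construction can be reproduced without TensorFlow. The sketch below mirrors the slicing used in `__init__` on plain Python lists. The helper `make_xy` and the eight-token `corpus` are hypothetical and stand in for the real dataset.

```python
def make_xy(corpus, num_steps):
    """Build (input, target) windows; each target is its input shifted by one."""
    # Every window holds num_steps + 1 consecutive tokens.
    windows = [corpus[i:i + num_steps + 1]
               for i in range(len(corpus) - num_steps)]
    X = [w[:-1] for w in windows]  # drop the last token
    Y = [w[1:] for w in windows]   # drop the first token
    return X, Y

corpus = list(range(8))  # stand-in token indices 0..7
X, Y = make_xy(corpus, num_steps=3)
first_pair = (X[0], Y[0])  # ([0, 1, 2], [1, 2, 3])
```

Each row of `Y` overlaps its row of `X` in all but one position, so at every time step the model is asked for the token that comes next.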
To summarize: the corpus is cut into subsequences of length num_steps, and each target is its input shifted by one token. The same protocol drives every RNN and Transformer LM in the book.