import collections
import re
from d2l import mxnet as d2l
from mxnet import np, npx
import random
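npx.set_np()
Text is not a tensor out of the box. The pipeline: load the raw text as strings, split the strings into tokens, build a vocabulary that maps tokens to integer indices, and convert the text into sequences of indices.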
Running example: H. G. Wells’s The Time Machine (32 k tokens) — small enough to index in a notebook, big enough to train a language model on. The other half of the chapter looks at the statistics of natural-language text: long-tail word distributions, stop words, bigrams.
Word-level splits on whitespace; character-level keeps individual characters:
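The two tokenization levels can be sketched as below. This is a minimal illustration, not the chapter's own code: the helper names `preprocess` and `tokenize` and the lowercase, letters-only cleaning step are assumptions made for the example.

```python
import re

def preprocess(text):
    """Lowercase and replace every run of non-letters with one space (assumed cleaning rule)."""
    return re.sub('[^A-Za-z]+', ' ', text).lower().strip()

def tokenize(text, level='word'):
    """Split preprocessed text into word tokens or character tokens."""
    if level == 'word':
        return text.split()      # split on whitespace
    if level == 'char':
        return list(text)        # every character, spaces included
    raise ValueError(f'unknown token level: {level}')

sample = preprocess('The Time Machine, by H. G. Wells')
word_tokens = tokenize(sample, 'word')
char_tokens = tokenize(sample, 'char')
```

Character-level tokens keep the spaces, so the character sequence is much longer than the word sequence while drawing on a far smaller vocabulary.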
A Vocab class maps tokens ↔︎ integer indices, with reserved slots for <unk> (rare/OOV tokens) and a few specials:
class Vocab:
    """Vocabulary for text."""
    def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens, ordered by descending frequency.
        # Reserve <unk> at index 0 so vocab[0] is the unknown token.
        self.idx_to_token = ['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs
            if freq >= min_freq and token not in reserved_tokens]
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if hasattr(indices, '__len__') and len(indices) > 1:
            return [self.idx_to_token[int(index)] for index in indices]
        return self.idx_to_token[indices]

    @property
    def unk(self):  # Index for the unknown token
        return self.token_to_idx['<unk>']
Pass the tokenized corpus and (optionally) a min-frequency threshold to drop very rare tokens:
Wrap the whole pipeline in a TimeMachine.build() so models just call data.build(...) to get tensors:
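As a rough illustration of what such a wrapper does end to end, here is a standalone sketch. It is not the d2l `TimeMachine.build()` implementation: `build_corpus` is a hypothetical helper, the cleaning rule is an assumption, and it returns a plain Python list of indices rather than a framework tensor.

```python
import collections
import re

def build_corpus(raw_text, min_freq=0):
    """Hypothetical helper: raw string -> (index list, token_to_idx, idx_to_token)."""
    # Assumed cleaning rule: keep letters only, lowercase, split on whitespace.
    tokens = re.sub('[^A-Za-z]+', ' ', raw_text).lower().split()
    counter = collections.Counter(tokens)
    # Index 0 is reserved for '<unk>'; kept tokens follow by descending frequency.
    idx_to_token = ['<unk>'] + [
        token for token, freq in counter.most_common() if freq >= min_freq]
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    # Tokens dropped by min_freq fall back to the '<unk>' index.
    corpus = [token_to_idx.get(token, 0) for token in tokens]
    return corpus, token_to_idx, idx_to_token

corpus, token_to_idx, idx_to_token = build_corpus(
    'the time traveller smiled the time machine', min_freq=2)
```

With `min_freq=2`, only the two repeated words keep their own index and every other word maps to index 0. Converting `corpus` to a tensor would be the final step in a real data class.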
Tokenize words, count occurrences, sort by decreasing count.
After the first few words, the frequency n_i of the i-th most frequent word falls close to a straight line on log-log axes (Zipf's law):
n_i \propto \frac{1}{i^\alpha}, \qquad \log n_i = -\alpha \log i + c.
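The exponent in the relation above can be estimated by a least-squares line fit in log-log space. The sketch below is one simple way to do that using only the standard library; the function name `zipf_exponent` and the `skip` parameter (for ignoring the first few ranks) are assumptions for illustration, and the toy data is synthetic, not taken from the book.

```python
import collections
import math

def zipf_exponent(tokens, skip=0):
    """Estimate alpha as the negated least-squares slope of log(freq) vs. log(rank).

    Needs at least two ranks after skipping the first `skip` ones.
    """
    freqs = sorted(collections.Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1 + skip, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs[skip:]]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Synthetic corpus whose i-th word appears 120 / i times, i.e. alpha = 1.
toy = [f'w{i}' for i in range(1, 7) for _ in range(120 // i)]
alpha = zipf_exponent(toy)
```

On real text the fit is only approximate, and the first few ranks usually deviate from the line, which is what `skip` is meant to accommodate.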
Interpretation:
Bigrams count consecutive word pairs:
(w_t, w_{t+1}).
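Counting consecutive pairs amounts to zipping the word list with a copy of itself shifted by one. A minimal sketch follows; the helper name `ngram_counts` is hypothetical, and joining each pair with `--` is an assumption chosen so that keys look like `the--time`.

```python
import collections

def ngram_counts(words, n):
    """Count every run of n consecutive words, joined with '--'."""
    # shifted[i] is the word list starting at offset i; zip stops at the shortest.
    shifted = [words[i:] for i in range(n)]
    return collections.Counter('--'.join(gram) for gram in zip(*shifted))

words = 'the time traveller and the time machine'.split()
bigrams = ngram_counts(words, 2)
top_bigram = bigrams.most_common(1)
```

A sequence of T words yields T - 1 bigrams, so the total count is one less than the number of tokens.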
The most common bigrams are still dominated by function-word phrases. One exception in this corpus is semantically meaningful: the--time.
Pedagogical point: increasing context length makes the counts more specific, but also much sparser.
Trigrams count consecutive triples:
(w_t, w_{t+1}, w_{t+2}).
The vocabulary grows quickly with n, while the corpus size is fixed. Most possible triples are never observed.
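To make the sparsity concrete, one can compare the number of distinct n-grams actually observed with the number that the vocabulary could form in principle. The sketch below does this on a toy sentence; `ngram_coverage` is a hypothetical helper, and the toy input is illustrative only.

```python
def ngram_coverage(words, n):
    """Return (distinct n-grams observed, n-grams possible) for a word list."""
    observed = set(zip(*(words[i:] for i in range(n))))
    possible = len(set(words)) ** n   # |V|^n combinations of vocabulary words
    return len(observed), possible

words = 'the time traveller and the time machine'.split()
coverage = {n: ngram_coverage(words, n) for n in (1, 2, 3)}
```

Here the number of possible n-grams grows from 5 to 25 to 125 while the number observed stays at 5, so the observed fraction shrinks rapidly as n increases.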
This is the classic n-gram tradeoff:
Plot unigram, bigram, and trigram frequencies on log-log axes — all three follow Zipf-like power laws, with steeper slopes (and sparser high-frequency regimes) for higher n:
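This long-tail sparsity is a key reason neural language models, which embed each token in a continuous space, generalize better than n-gram count tables: they can share statistical strength across similar tokens instead of treating every unseen n-gram as a zero count.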