Converting Raw Text into Sequence Data

Text as Sequence Data

Text isn’t tensors out of the box. The pipeline:

  1. Read the raw string.
  2. Tokenize — split into characters, words, or subwords.
  3. Build a vocabulary — map each token to an integer index.
  4. Index the corpus → a sequence of ints.
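
The four steps can be sketched on a toy string with the standard library alone. This is a minimal illustration, not the TimeMachine class used in this chapter; the names toy_text, toy_tokens, token_to_idx, and indexed are made up for the example, and the vocabulary here is ordered by first appearance rather than by frequency.

```python
import re

# Step 1: a raw string (a stand-in for text read from a file)
toy_text = "The Time Traveller (for so it will be convenient)"

# Step 2: tokenize; here, lowercase words made of letters only
toy_tokens = re.sub('[^A-Za-z]+', ' ', toy_text).lower().split()

# Step 3: build a vocabulary; index 0 is reserved for unknown tokens
token_to_idx = {'<unk>': 0}
for token in toy_tokens:
    token_to_idx.setdefault(token, len(token_to_idx))

# Step 4: index the corpus, falling back to <unk> for unseen tokens
indexed = [token_to_idx.get(token, 0) for token in toy_tokens]
print(toy_tokens)
print(indexed)
```

Any token that was never added to token_to_idx, such as 'machine' here, maps to index 0.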

Running example: H. G. Wells’s The Time Machine (about 32 k words), small enough to index in a notebook, big enough to train a language model on. The second half of the chapter looks at the statistics of natural-language text: long-tail word distributions, stop words, bigrams, and trigrams.

Read

import collections
import random
import re

import torch
from d2l import torch as d2l

class TimeMachine(d2l.DataModule):
    """The Time Machine dataset."""
    def _download(self):
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()

data = TimeMachine()
raw_text = data._download()
raw_text[:60]
'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'

@d2l.add_to_class(TimeMachine)
def _preprocess(self, text):
    # Replace every run of non-letter characters with one space, then lowercase
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]
'the time machine by h g wells i the time traveller for so it'

Tokenize

Word-level tokenization splits on whitespace; character-level tokenization treats every character, including spaces, as a token. The example below uses characters:

@d2l.add_to_class(TimeMachine)
def _tokenize(self, text):
    return list(text)

tokens = data._tokenize(text)
','.join(tokens[:30])
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
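
For comparison, a word-level split can be sketched as follows. This is a standalone illustration on a short sample string (text_sample is a made-up name), assuming the text has already been preprocessed to lowercase letters and single spaces.

```python
text_sample = 'the time machine by h g wells'

# Character-level: every character, including spaces, is a token
char_tokens = list(text_sample)

# Word-level: split on whitespace, so spaces drop out of the token stream
word_tokens = text_sample.split()

# Sequence length and number of distinct tokens for each choice
print(len(char_tokens), len(set(char_tokens)))
print(len(word_tokens), len(set(word_tokens)))
```

Characters give a longer sequence over a small, fixed symbol set; words give a shorter sequence whose vocabulary keeps growing with the corpus.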

Vocabulary

A Vocab class maps tokens ↔︎ integer indices, with reserved slots for <unk> (rare/OOV tokens) and a few specials:

class Vocab:
    """Vocabulary for text."""
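    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        # Use None defaults to avoid shared mutable default arguments
        tokens = [] if tokens is None else tokens
        reserved_tokens = [] if reserved_tokens is None else reserved_tokens
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]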
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens, ordered by descending frequency.
        # Reserve <unk> at index 0 so vocab[0] is the unknown token.
        self.idx_to_token = ['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs
            if freq >= min_freq and token not in reserved_tokens]
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

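    def to_tokens(self, indices):
        # Any list or tuple, including one of length 1, maps to a list of tokens
        if isinstance(indices, (list, tuple)) or (
                hasattr(indices, '__len__') and len(indices) > 1):
            return [self.idx_to_token[int(index)] for index in indices]
        return self.idx_to_token[indices]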

    @property
    def unk(self):  # Index for the unknown token
        return self.token_to_idx['<unk>']

Building the vocab

Pass the tokenized corpus and (optionally) a min-frequency threshold to drop very rare tokens:

vocab = Vocab(tokens)
indices = vocab[tokens[:10]]
print('indices:', indices)
print('words:', vocab.to_tokens(indices))
indices: [3, 9, 2, 1, 3, 5, 13, 2, 1, 13]
words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']
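
The effect of a minimum-frequency threshold is easier to see at the word level. The sketch below uses a hypothetical helper, build_token_to_idx, that mimics the frequency-sorted layout with <unk> at index 0; it is a simplified stand-in for illustration, not the Vocab class itself.

```python
import collections

def build_token_to_idx(tokens, min_freq=0):
    """Hypothetical helper: frequency-sorted vocabulary with <unk> at 0."""
    counter = collections.Counter(tokens)
    kept = [token for token, freq in counter.most_common()
            if freq >= min_freq]
    return {token: idx for idx, token in enumerate(['<unk>'] + kept)}

toy_tokens = 'the time traveller smiled at the time machine'.split()
full = build_token_to_idx(toy_tokens)
pruned = build_token_to_idx(toy_tokens, min_freq=2)

# Tokens below the threshold are missing from the mapping, so lookups
# fall back to index 0, the unknown token
encoded = [pruned.get(token, pruned['<unk>']) for token in toy_tokens]
print(len(full), len(pruned))
print(encoded)
```

With min_freq=2 only 'the' and 'time' keep their own indices; every other word collapses into <unk>.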

One-stop dataloading

Wrap the whole pipeline in TimeMachine.build() so callers get the indexed corpus (a list of integer indices, ready to convert to a tensor) and its vocabulary in one call:

@d2l.add_to_class(TimeMachine)
def build(self, raw_text, vocab=None):
    tokens = self._tokenize(self._preprocess(raw_text))
    if vocab is None: vocab = Vocab(tokens)
    corpus = [vocab[token] for token in tokens]
    return corpus, vocab

corpus, vocab = data.build(raw_text)
len(corpus), len(vocab)
(173428, 28)

Word-frequency statistics

Tokenize words, count occurrences, sort by decreasing count.

  • The head of the distribution is mostly function words: “the”, “of”, “and”, “to”, “a”, …
  • These words are common because they carry grammatical structure.
  • In old bag-of-words classifiers they were often removed as stop words; neural sequence models usually keep them.
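
Counting and stop-word filtering can be sketched on the opening sentence of the book. The stop-word set below is a tiny hand-picked list assumed for this example, not a standard one.

```python
import collections

sample = ('the time traveller for so it will be convenient to speak of him '
          'was expounding a recondite matter to us')
words = sample.split()

# Count occurrences and sort by decreasing count
counts = collections.Counter(words)
top = counts.most_common(3)

# A tiny, hand-picked stop-word list (an assumption for this example)
stop_words = {'the', 'of', 'and', 'to', 'a', 'for', 'so', 'it', 'be'}
content_words = [w for w in words if w not in stop_words]
print(top)
print(content_words)
```

Even in one sentence the most frequent word is a function word ('to'), and removing the stop words leaves mostly content words.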

Zipf’s law

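After the first few words, frequency plotted against rank is close to a straight line on log-log axes. With n_i the frequency of the i-th most frequent word, \alpha the exponent, and c a constant: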

n_i \propto \frac{1}{i^\alpha}, \qquad \log n_i = -\alpha \log i + c.

Interpretation:

  • A few tokens appear extremely often.
  • Most tokens are rare.
  • Count-based models estimate the tail poorly: smoothed counts tend to overestimate the frequency of rare words.
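
One way to estimate the exponent \alpha is an ordinary least-squares line through the points (log i, log n_i). The sketch below is a minimal version under that assumption; fit_zipf_exponent is a made-up helper name, and the check uses synthetic frequencies rather than counts from the book.

```python
import math

def fit_zipf_exponent(freqs):
    """Least-squares fit of log(frequency) against log(rank).

    `freqs` must be sorted in decreasing order. Returns (alpha, c)
    for the model log n_i = -alpha * log i + c.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return -slope, y_mean - slope * x_mean

# Synthetic frequencies that follow n_i = 1000 / i exactly (alpha = 1)
synthetic = [1000 / i for i in range(1, 101)]
alpha, c = fit_zipf_exponent(synthetic)
print(round(alpha, 3), round(math.exp(c), 1))
```

On real counts the first few ranks usually deviate from the line, so it can make sense to skip them before fitting.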

Word-frequency plot

Bigrams

Bigrams count consecutive word pairs:

(w_t, w_{t+1}).

The most common bigrams are still dominated by function-word phrases. One exception in this corpus is semantically meaningful: the pair ("the", "time").

Pedagogical point: increasing context length makes the counts more specific, but also much sparser.
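
Bigram counting can be sketched by zipping the word list with itself shifted by one position. The word list here is a toy sequence made up for the example.

```python
import collections

words = 'the time traveller and the time machine and the time'.split()

# Pair each word with its successor: (w_t, w_{t+1})
bigrams = list(zip(words[:-1], words[1:]))
bigram_counts = collections.Counter(bigrams)
print(bigram_counts.most_common(2))
```

A sequence of T words yields T - 1 bigrams, so the number of observations barely changes while the number of distinct keys goes up.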

Trigrams

Trigrams count consecutive triples:

(w_t, w_{t+1}, w_{t+2}).

The number of possible n-grams grows exponentially with n (V^n for a vocabulary of V words), while the corpus size is fixed. Most possible triples are never observed.

This is the classic n-gram tradeoff:

  • larger n captures more context;
  • larger n creates a much longer tail.
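
The tradeoff can be made concrete by comparing, for each n, the number of distinct n-grams observed with the number that could exist in principle. This is a sketch on a toy word list; ngrams is a hypothetical helper written for the example.

```python
import collections

def ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return list(zip(*(words[i:] for i in range(n))))

words = 'the time traveller and the time machine and the time'.split()
vocab_size = len(set(words))

for n in (1, 2, 3):
    counts = collections.Counter(ngrams(words, n))
    singletons = sum(1 for c in counts.values() if c == 1)
    # vocab_size ** n is the number of n-grams that could exist in principle
    print(n, len(counts), vocab_size ** n, singletons)
```

Each printed row gives n, the distinct n-grams seen, the possible n-grams, and how many were seen exactly once; the gap between the second and third columns widens as n grows.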

Zipf at every n

Plot unigram, bigram, and trigram frequencies on log-log axes: all three follow Zipf-like power laws, with a smaller exponent \alpha (a shallower slope) and far fewer high-frequency entries for higher n.

This long-tail sparsity is a key reason why neural language models, which embed each token in a continuous space and can share statistical strength across similar tokens, tend to work better than n-gram count tables.

Recap

  • Pipeline: read → tokenize → build vocab → index; the result is the corpus as a sequence of integer indices (convertible to a LongTensor).
  • Word vs character tokenization is a tradeoff between vocabulary size and sequence length; subword tokenization (e.g., BPE) is the common modern choice.
  • Natural-language frequencies are Zipfian for unigrams, bigrams, and trigrams alike; long-tail sparsity makes neural models a better fit than count-based ones.