Working with Sequences

Sequences are everywhere: text, speech, time series, video. Three concepts set up the rest of the chapter:

  • Autoregressive — predict x_t given (x_{t-\tau}, \ldots, x_{t-1}). Reduces sequence modeling to regression.
  • Markov assumption — only the last \tau steps matter.
  • Multi-step prediction — feeding predictions back as inputs makes errors compound rapidly.
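
In symbols, the three ideas above can be sketched as follows. This uses the section's own notation; the function f, standing for whatever predictor is learned, is a name introduced here for illustration only.

```latex
% Autoregressive model under a Markov assumption of order \tau
P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid x_{t-1}, \ldots, x_{t-\tau})

% One-step prediction: every input is an observed value
\hat{x}_t = f(x_{t-\tau}, \ldots, x_{t-1})

% Two-step prediction: the newest input is itself a prediction
\hat{x}_{t+1} = f(x_{t-\tau+1}, \ldots, x_{t-1}, \hat{x}_t)
```

Each further step replaces one more observed input with a predicted one, which is where the compounding of errors comes from.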

[Figure: A latent autoregressive model]

This deck uses a noisy sine wave as its running example: predicting the next value is much easier than predicting the next 64 values.

Generating data

A noisy sine wave, 1000 time steps:

%matplotlib inline
from d2l import jax as d2l
import jax
from jax import numpy as jnp
import numpy as np

class Data(d2l.DataModule):
    def __init__(self, batch_size=16, T=1000, num_train=600, tau=4):
        self.save_hyperparameters()
        # Time steps 1, ..., T
        self.time = d2l.arange(1, T + 1, dtype=d2l.float32)
        key = d2l.get_key()
        # Sine wave plus Gaussian noise with standard deviation 0.2
        self.x = d2l.sin(0.01 * self.time) + jax.random.normal(key,
                                                               [T]) * 0.2

data = Data()
d2l.plot(data.time, data.x, 'time', 'x', xlim=[1, 1000], figsize=(6, 3))

Autoregressive features

Each example pairs the label x_t with a feature vector of the previous \tau values, \mathbf{x}_t = [x_{t-\tau}, \ldots, x_{t-1}]. Train a linear regressor on the first 600 examples:

@d2l.add_to_class(Data)
def get_dataloader(self, train):
    # Column i is x shifted by i steps, so row j is [x[j], ..., x[j+tau-1]]
    features = [self.x[i : self.T-self.tau+i] for i in range(self.tau)]
    self.features = d2l.stack(features, 1)
    # The label for row j is the value right after its window, x[j+tau]
    self.labels = d2l.reshape(self.x[self.tau:], (-1, 1))
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader([self.features, self.labels], train, i)
model = d2l.LinearRegression(lr=0.01)
trainer = d2l.Trainer(max_epochs=5)
trainer.fit(model, data)
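
To make the windowing concrete, here is a minimal sketch of the same slicing on a six-element toy sequence. It uses plain NumPy instead of d2l so that it runs on its own; the helper name make_windows is hypothetical and not part of d2l.

```python
import numpy as np

def make_windows(x, tau):
    """Return (features, labels) for an order-tau autoregressive model."""
    T = len(x)
    # Column i is x shifted by i steps, so row j is [x[j], ..., x[j+tau-1]]
    features = np.stack([x[i : T - tau + i] for i in range(tau)], axis=1)
    # The label for row j is the value right after its window, x[j+tau]
    labels = x[tau:].reshape(-1, 1)
    return features, labels

x = np.arange(6.0)  # toy sequence 0, 1, ..., 5
features, labels = make_windows(x, tau=2)
print(features)        # rows: [0, 1], [1, 2], [2, 3], [3, 4]
print(labels.ravel())  # 2, 3, 4, 5
```

A sequence of length T yields T - tau examples, because the first tau values have no complete window before them.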

One-step prediction

Predict \hat{x}_t from the true previous \tau values. Looks great:

onestep_preds = model.apply({'params': trainer.state.params}, data.features)
d2l.plot(data.time[data.tau:], [data.labels, onestep_preds], 'time', 'x',
         legend=['labels', '1-step preds'], figsize=(6, 3))
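
One possible reason a linear model on a short window does so well here (an observation added for illustration, not a claim from the deck): a noise-free sinusoid satisfies an exact two-term linear recurrence, x_{t+1} = 2 cos(\omega) x_t - x_{t-1}. The sketch below checks this numerically in plain NumPy for \omega = 0.01, the frequency used in the data above.

```python
import numpy as np

omega = 0.01
t = np.arange(1, 1001, dtype=np.float64)
clean = np.sin(omega * t)  # the sine wave without the added noise

# Predict x_{t+1} from x_t and x_{t-1} with fixed, known coefficients
predicted = 2 * np.cos(omega) * clean[1:-1] - clean[:-2]
max_abs_error = float(np.max(np.abs(predicted - clean[2:])))
print(max_abs_error)  # floating-point rounding error only
```

With noise added, the regressor can no longer recover this recurrence exactly, but a linear function of the last few values remains a reasonable one-step predictor.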

Multi-step prediction

But forecasting more than one step requires feeding predicted values back as inputs, so errors compound. From index num_train + tau = 604 onward, every value below is a prediction, and the input windows fill up with earlier predictions:

multistep_preds = d2l.zeros(data.T)
multistep_preds = multistep_preds.at[:].set(data.x)
# Overwrite everything past the training range with the model's own rollout
for i in range(data.num_train + data.tau, data.T):
    pred = model.apply({'params': trainer.state.params},
                       d2l.reshape(multistep_preds[i-data.tau : i], (1, -1)))
    multistep_preds = multistep_preds.at[i].set(pred.item())
d2l.plot([data.time[data.tau:], data.time[data.num_train+data.tau:]],
         [onestep_preds, multistep_preds[data.num_train+data.tau:]], 'time',
         'x', legend=['1-step preds', 'multistep preds'], figsize=(6, 3))

To see how quality degrades with the horizon, compute k-step-ahead predictions for k = 1, 4, 16, 64:

def k_step_pred(k):
    # The first tau elements are observed values, shifted by 0, ..., tau-1 steps
    features = []
    for i in range(data.tau):
        features.append(data.x[i : i+data.T-data.tau-k+1])
    # The (i+tau)-th element stores the (i+1)-step-ahead predictions
    for i in range(k):
        preds = model.apply({'params': trainer.state.params},
                            d2l.stack(features[i : i+data.tau], 1))
        features.append(d2l.reshape(preds, -1))
    return features[data.tau:]

steps = (1, 4, 16, 64)
preds = k_step_pred(steps[-1])
d2l.plot(data.time[data.tau+steps[-1]-1:],
         [d2l.numpy(preds[k-1]) for k in steps], 'time', 'x',
         legend=[f'{k}-step preds' for k in steps], figsize=(6, 3))

The 1- and 4-step curves track the truth; the 16- and 64-step predictions drift far from it and are of little use. Long-horizon forecasting is hard.
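
The degradation can also be put in numbers. The sketch below is a standalone approximation of the experiment, under stated assumptions: it uses plain NumPy, fits the linear model by closed-form least squares instead of training d2l.LinearRegression, and fixes its own random seed, so its errors will not match the plots above exactly. The helper names make_windows and k_step_errors are hypothetical.

```python
import numpy as np

def make_windows(x, tau):
    """Features are length-tau windows; each label is the next value."""
    T = len(x)
    features = np.stack([x[i : T - tau + i] for i in range(tau)], axis=1)
    return features, x[tau:]

def k_step_errors(x, w, tau, steps):
    """Mean absolute error of k-step-ahead predictions for each k in steps."""
    max_k = max(steps)
    n = len(x) - tau - max_k + 1  # number of usable start positions
    # The first tau columns are observations; predictions are appended after
    cols = [x[i : i + n] for i in range(tau)]
    errors = {}
    for k in range(1, max_k + 1):
        # The latest tau columns form the input window for the next step
        pred = np.stack(cols[-tau:], axis=1) @ w
        cols.append(pred)
        if k in steps:
            target = x[tau + k - 1 : tau + k - 1 + n]
            errors[k] = float(np.mean(np.abs(pred - target)))
    return errors

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
T, tau, num_train = 1000, 4, 600
t = np.arange(1, T + 1)
x = np.sin(0.01 * t) + 0.2 * rng.standard_normal(T)

features, labels = make_windows(x, tau)
# Least-squares fit on the first num_train examples (no bias term)
w, *_ = np.linalg.lstsq(features[:num_train], labels[:num_train], rcond=None)

errors = k_step_errors(x, w, tau, steps=(1, 4, 16, 64))
for k, err in errors.items():
    print(f"{k:>2}-step mean absolute error: {err:.3f}")
```

Only the first window of each rollout consists of observed values; from then on the model consumes its own outputs, so any bias in the fitted weights is applied again at every step.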

Recap

  • Autoregressive: predict x_t given a window of past values.
  • Markov assumption: only the last \tau matter.
  • One-step prediction is easy; in multi-step prediction, errors compound rapidly.
  • Specialized recurrent / attention architectures (RNN, LSTM, Transformer) are the rest of the chapter’s response to this fundamental difficulty.