Synthetic Regression Data

Dive into Deep Learning · §2.3

Build a dataset whose answer you already know
so a failed fit can only be the algorithm’s fault.

Why fabricate the data?

Motivation

On real data, a poor result has three suspects at once: a wrong model, a broken optimizer, or pathological data.

Synthetic data removes the third. We choose the generative law, so the data is provably learnable:

\mathbf{y} = \mathbf{X}\mathbf{w}^* + b^* + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(0,\sigma^2 I).

Recover \mathbf{w}^*,b^* → the method works. Miss them → the bug is yours, full stop.

The dataset lives in a DataModule (introduced in the object-oriented-design section), which owns where the batches come from and stays separate from the model.
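To make the "recover the parameters" test concrete, here is a minimal sketch in plain NumPy. It assumes the ground-truth values used later in this section (w* = [2, -3.4], b* = 4.2, noise 0.01) and swaps in ordinary least squares as the fitting method; that is only a quick sanity check, not the training procedure the following sections build.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.4])   # ground truth, chosen by us
b_true = 4.2
n, sigma = 1000, 0.01

# Sample from the generative law y = X w* + b* + eps.
X = rng.normal(size=(n, 2))
eps = rng.normal(scale=sigma, size=n)
y = X @ w_true + b_true + eps

# Append a column of ones so the bias is estimated jointly with the weights.
X_aug = np.hstack([X, np.ones((n, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = theta[:2], theta[2]

print('error in w:', w_true - w_hat)
print('error in b:', b_true - b_hat)
```

With this little noise the estimation errors should be far smaller than the parameters themselves; a large error would point at the fitting code, since the data is learnable by construction.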

Generating the data

a DataModule that knows the ground truth

A DataModule that builds itself

JAX randomness is functional: thread a key in, split it for independent \mathbf{X} and \boldsymbol{\epsilon} draws (same key in → same dataset out):

import jax
from d2l import jax as d2l

class SyntheticRegressionData(d2l.DataModule):
    """Synthetic data for linear regression."""
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32, key=None):
        super().__init__()
        self.save_hyperparameters()
        # Build the default key at call time; a key created in the signature
        # would be evaluated once, when the function is defined.
        key = jax.random.key(0) if key is None else key
        n = num_train + num_val
        key1, key2 = jax.random.split(key)
        self.X = jax.random.normal(key1, (n, w.shape[0]))
        eps = jax.random.normal(key2, (n, 1)) * noise
        self.y = d2l.matmul(self.X, d2l.reshape(w, (-1, 1))) + b + eps
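The key-threading behavior can be checked in isolation. The sketch below is a standalone illustration, not part of the class above: `draw` is a hypothetical helper, and it uses `jax.random.PRNGKey` (the older spelling of `jax.random.key`) only so that it also runs on older JAX releases.

```python
import jax
import jax.numpy as jnp

def draw(key, n=5, d=2):
    """Hypothetical helper: draw features and noise from one parent key."""
    # Split once so X and eps come from independent streams.
    key_x, key_eps = jax.random.split(key)
    X = jax.random.normal(key_x, (n, d))
    eps = jax.random.normal(key_eps, (n, 1))
    return X, eps

X_a, eps_a = draw(jax.random.PRNGKey(0))
X_b, eps_b = draw(jax.random.PRNGKey(0))   # same key in
X_c, _ = draw(jax.random.PRNGKey(1))       # different key in

print('same key, same X:', bool(jnp.array_equal(X_a, X_b)))
print('different key, same X:', bool(jnp.array_equal(X_a, X_c)))
```

Because no hidden global state is involved, passing the same key always rebuilds the same dataset, which is what makes a synthetic experiment reproducible.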

Fix the ground truth, then peek

Instantiate with the true \mathbf{w}^*=[2,-3.4]^\top, b^*=4.2:

data = SyntheticRegressionData(w=d2l.tensor([2, -3.4]), b=4.2)

Each feature row is a vector in \mathbb{R}^2; each label is a scalar:

print('features:', data.X[0],'\nlabel:', data.y[0])

features: [ 1.0040143 -0.9063372] 
label: [9.265151]
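As a quick worked check, the printed label can be compared against the noiseless value w*·x + b*. The numbers below are copied from the output above; the variable names are illustrative only.

```python
# Values copied from the printed example above.
x = [1.0040143, -0.9063372]
y = 9.265151
w_true, b_true, noise = [2.0, -3.4], 4.2, 0.01

# Noiseless label predicted by the generative law.
noiseless = sum(wi * xi for wi, xi in zip(w_true, x)) + b_true
# Whatever is left over is this example's noise draw.
residual = y - noiseless

print(f'noiseless label: {noiseless:.4f}')
print(f'residual: {residual:.4f} ({residual / noise:.1f} noise std devs)')
```

The residual comes out at a few hundredths, the same order of magnitude as the noise scale 0.01, which is what the generative law predicts.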

Memorize [2, -3.4] and 4.2: the next two sections train models whose only pass mark is giving these numbers back.

Reading the data

minibatches, by hand and by framework

A minibatch sampler, by hand

Roll the minibatch loader ourselves: shuffle the indices (afresh on every training pass), then yield batch_size rows at a time (one batch is 32\times2 features, 32\times1 labels).

import random

@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = d2l.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]

Transparent, but it costs three ways: all data in memory, single-threaded Python, and no prefetching to overlap loading with compute.
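The same sampling logic can be tried outside the class. This is a minimal standalone sketch using NumPy arrays in place of the DataModule's tensors; `minibatches` is a hypothetical name, and the dummy arrays only mimic the shapes of the 1,000 training examples.

```python
import random
import numpy as np

def minibatches(X, y, batch_size, shuffle=True):
    """Yield (X_batch, y_batch) pairs that cover every row exactly once."""
    indices = list(range(len(X)))
    if shuffle:
        random.shuffle(indices)  # a fresh order on every call
    for i in range(0, len(indices), batch_size):
        batch = np.array(indices[i:i + batch_size])
        yield X[batch], y[batch]

# Dummy data with the same shapes as the training split.
X = np.arange(2000, dtype=np.float32).reshape(1000, 2)
y = np.arange(1000, dtype=np.float32).reshape(1000, 1)
shapes = [(xb.shape, yb.shape) for xb, yb in minibatches(X, y, 32)]

print('number of batches:', len(shapes))
print('first batch:', shapes[0])
print('last batch:', shapes[-1])
```

Note that this hand-rolled version keeps the final, smaller batch: 1,000 rows at 32 per batch leaves 8 rows for the last one.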

The built-in loader

same interface, production speed

Hand the work to the framework

JAX ships no data loader, so borrow TensorFlow’s and unwrap its batches to NumPy. The one twist is drop_remainder=train; get_dataloader then slices the train or validation range and calls get_tensorloader.

import tensorflow as tf

class TensorFlowDataLoader:
    """Expose a tf.data.Dataset as re-iterable NumPy batches."""
    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        return self.dataset.as_numpy_iterator()

    def __len__(self):
        return len(self.dataset)

@d2l.add_to_class(d2l.DataModule)
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    # JAX and Flax provide no data loading of their own, so use TensorFlow's.
    # `drop_remainder=train` keeps every *training* minibatch the same
    # shape, so a `@jax.jit`'d step function is compiled once rather than
    # recompiled for the smaller last batch.
    shuffle_buffer = tensors[0].shape[0] if train else 1
    dataset = tf.data.Dataset.from_tensor_slices(tensors).shuffle(
        buffer_size=shuffle_buffer).batch(
            self.batch_size, drop_remainder=train)
    return TensorFlowDataLoader(dataset)
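Why batch shape matters to `jax.jit` can be seen with a toy function. The sketch below is an illustration under stated assumptions: `step` is a stand-in for a training step, and the counter relies on the fact that Python side effects inside a jitted function run only while JAX traces it.

```python
import jax
import jax.numpy as jnp

trace_count = 0

@jax.jit
def step(X):
    global trace_count
    trace_count += 1            # Python side effect: runs only during tracing
    return jnp.mean(X ** 2)

step(jnp.ones((32, 2)))         # first call: traced and compiled
step(jnp.zeros((32, 2)))        # same shape and dtype: cached version reused
full_only = trace_count
step(jnp.ones((8, 2)))          # smaller final batch: traced again
with_partial = trace_count

print('traces with full batches only:', full_only)
print('traces after a partial batch:', with_partial)
```

Dropping the remainder during training therefore trades a handful of examples for a single compiled step function.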

Same interface, drop-in

The caller sees an identical protocol, one minibatch at a time:

X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)

X shape: (32, 2) 
y shape: (32, 1)

The loader reports 31 batches, not 32: drop_remainder=True discards the partial last batch, so every @jax.jit step sees one shape.

len(data.train_dataloader())

31

That drops 8 of the 1,000 training examples each epoch (1000 − 31 × 32 = 8), which is negligible here.
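The batch counts follow from integer arithmetic alone. A short sketch, assuming the defaults used in this section (1,000 training examples, batch size 32):

```python
import math

num_train, batch_size = 1000, 32

full_batches = num_train // batch_size            # partial batch dropped
all_batches = math.ceil(num_train / batch_size)   # partial batch kept
dropped = num_train - full_batches * batch_size   # examples left out

print(full_batches, all_batches, dropped)
```

The dropped count is always smaller than one batch, so its share of the data shrinks as the training set grows.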

Recap

Wrap-up

Synthetic data fixes the answer up front (\mathbf{w}^*=[2,-3.4], b^*=4.2), so a failed fit can only be the algorithm’s fault.
A DataModule packages where batches come from, reusable across models.

Hand-rolled vs. built-in loader: one protocol; the framework version shuffles and batches for us, and can also prefetch and parallelize.
Watch the last batch: a loader either keeps the partial final minibatch or drops it (32 vs. 31 batches here).