Linear Regression

Linear regression in one model

The simplest predictive model: a linear function of the inputs plus a bias.

\hat{y} = \mathbf{w}^\top \mathbf{x} + b.

  • Fit w and b to minimize squared error on training data.
  • Has a closed-form solution for problems where it scales, but the iterative recipe (gradient descent on minibatches) generalizes to everything that follows.
  • Connects neatly to the Gaussian noise model — squared loss = negative log-likelihood under \mathcal{N}(0, \sigma^2).

Squared loss from Gaussian noise

Assume the target is the linear prediction plus fixed-variance Gaussian noise:

y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2).

Then

p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - \hat{y}^{(i)})^2}{2\sigma^2}\right).

For independent examples, the negative log-likelihood is

-\log p(\mathbf{y}\mid\mathbf{X},\mathbf{w},b) = \textrm{const} + \frac{1}{2\sigma^2}\sum_i (y^{(i)}-\hat{y}^{(i)})^2.

With fixed \sigma, maximum likelihood and minimizing squared error choose the same parameters.

The model and the loss

For one example \mathbf{x}^{(i)} \in \mathbb{R}^d and target y^{(i)} \in \mathbb{R}, the model predicts

\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b.

Squared loss on the training set of n examples:

L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2.

Convex in (\mathbf{w}, b) — every local minimum is global.

Two ways to fit

Closed form (when it fits in memory):

\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.

Doesn’t generalize beyond linear models.

Minibatch SGD (the recipe we’ll keep using):

\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\mathbf{w}\,\ell^{(i)}(\mathbf{w}, b).

  • Sample minibatch \mathcal{B}.
  • Compute gradient of the average loss on it.
  • Step in the negative-gradient direction.

Vectorization for speed

Same operation, two implementations. Set up two 10 000-element vectors:

%matplotlib inline
from d2l import jax as d2l
from jax import numpy as jnp
import math
import time
n = 10000
a = d2l.ones(n)
b = d2l.ones(n)

Loop vs. vectorized add

Adding element-by-element in a Python loop:

# JAX arrays are immutable, meaning that once created their contents
# cannot be changed. For updating individual elements, JAX provides
# an indexed update syntax that returns an updated copy
c = d2l.zeros(n)
t = time.time()
for i in range(n):
    c = c.at[i].set(a[i] + b[i])
f'{time.time() - t:.5f} sec'
'5.24810 sec'

The same answer in one library call:

t = time.time()
d = a + b
f'{time.time() - t:.5f} sec'
'0.05605 sec'

Roughly 3 orders of magnitude faster on this size — Python’s interpreter overhead is the killer; the C kernel barely breaks a sweat.

Why squared loss?

Assume each label is the linear prediction plus Gaussian noise:

y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2).

Then minimizing squared error is exactly maximizing the Gaussian log-likelihood of the observed labels.

def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * jnp.exp(-0.5 * (x - mu)**2 / sigma**2)

Visualizing the connection

Plot a few normal densities — different means and variances:

# Use NumPy again for visualization
x = jnp.arange(-7, 7, 0.01)
# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])

Squared loss assumes the errors look like one of these bells centered at the model’s prediction.

Recap

  • Model: \hat{y} = \mathbf{w}^\top \mathbf{x} + b.
  • Loss: mean squared error — convex, single global optimum.
  • Optimizer: minibatch SGD steps in the gradient direction; closed form exists but doesn’t generalize.
  • Vectorize every inner loop — orders of magnitude faster than Python iteration.
  • Squared loss is the MLE under Gaussian noise — sets the template for matching loss functions to noise models.