Linear Regression

Dive into Deep Learning · §2.1

The straight line through the data
and the recipe behind every loss function after it.

Predicting a number

Motivation

Collect house sales: each has an area, an age, a price.
Bigger houses cost more, not exactly, but on average.
Regression draws the line and turns it into a prediction for a house nobody has seen.

Features \mathbf{x}, label y; we model E[Y \mid \mathbf{x}]. Two things are missing: a measure of how wrong we are, and a way to improve. This section supplies both.

The Model

a dot product, a loss, and two ways to minimize

The whole model is one dot product

Stack d features into \mathbf{x}\in\mathbb{R}^d and weights into \mathbf{w}\in\mathbb{R}^d:

\hat{y} = w_1 x_1 + \cdots + w_d x_d + b = \mathbf{w}^\top \mathbf{x} + b.

For the whole dataset at once, the design matrix \mathbf{X}\in\mathbb{R}^{n\times d} holds one example per row:

\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b.

The bias b lets the line miss the origin, making the map affine rather than linear. Learning = choosing (\mathbf{w}, b).
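A minimal JAX sketch of the two formulas above, for one example and for the whole design matrix. The feature values, weights, and bias are made-up illustrative numbers, not fitted values.

```python
import jax.numpy as jnp

# Hypothetical toy data: n = 3 houses, d = 2 features (area, age).
X = jnp.array([[120.0, 10.0],
               [80.0, 30.0],
               [200.0, 5.0]])
w = jnp.array([2.0, -0.5])   # illustrative weights (assumption)
b = 10.0                     # illustrative bias (assumption)

# One example: y_hat = w^T x + b
y_hat_single = jnp.dot(w, X[0]) + b

# Whole dataset: y_hat = X w + b, with the scalar b broadcast to every row
y_hat = X @ w + b
```

The matrix form returns one prediction per row of X, and its first entry equals the single-example dot product.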

Squared loss charges each miss by its square

Average the per-example penalties \tfrac12\bigl(\hat{y}^{(i)} - y^{(i)}\bigr)^2:

L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2}\bigl(\hat{y}^{(i)} - y^{(i)}\bigr)^2.

L is convex in (\mathbf{w}, b), so every local minimum is a global minimum.

Each vertical gap is a residual; the loss sums their squares.

Large errors hurt quadratically: strong incentive to avoid big misses, and outsized sensitivity to anomalous points.
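The loss above can be sketched in a few lines of JAX. The function name squared_loss and the tiny dataset are illustrative assumptions, chosen so the residuals are easy to read off.

```python
import jax.numpy as jnp

def squared_loss(w, b, X, y):
    """Mean of the per-example penalties 0.5 * (y_hat - y)**2."""
    y_hat = X @ w + b
    return jnp.mean(0.5 * (y_hat - y) ** 2)

# Hypothetical data lying exactly on y = 2x.
X = jnp.array([[1.0], [2.0], [3.0]])
y = jnp.array([2.0, 4.0, 6.0])

loss_perfect = squared_loss(jnp.array([2.0]), 0.0, X, y)  # all residuals are zero
loss_off = squared_loss(jnp.array([1.0]), 0.0, X, y)      # residuals are -1, -2, -3
```

With slope 1 the residuals are -1, -2, -3, and the largest one contributes 9 of the 14 units of squared error: doubling a residual quadruples its penalty.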

One corrupted label puts the square on trial

The Model · one bad label

Twenty points sit exactly on y = 2x; we corrupt a single label to 10000 and fit the slope twice, squared loss versus mean absolute error:

import jax.numpy as jnp
x = jnp.arange(1.0, 21.0)
y = 2 * x
y = y.at[5].set(10000)                         # corrupt a single label
w_sq = (x * y).sum() / (x * x).sum()           # closed-form squared-loss fit
w_mae = 0.0                                    # subgradient descent on MAE
for _ in range(2000):
    w_mae -= 0.002 * (jnp.sign(w_mae * x - y) * x).mean()
print(f'true w: 2.00, squared loss: {float(w_sq):.2f}, MAE: {float(w_mae):.2f}')

One bad point in twenty drags the squared-loss slope an order of magnitude from the true 2.0, while the MAE fit barely moves. The probabilistic view below explains both behaviors.

Two ways to reach the minimum

Closed form, by setting the gradient to zero:

\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.

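Exact, but it requires \mathbf{X}^\top \mathbf{X} to be invertible, and closed forms like this are rare beyond linear models with squared loss. Here the bias b is absorbed into \mathbf{w} by appending a column of ones to \mathbf{X}.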
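A minimal sketch of the closed form in JAX, assuming a small noise-free dataset generated from made-up parameters. The bias is handled by appending a column of ones to the design matrix, and the system is solved with jnp.linalg.solve rather than by forming the inverse explicitly, which is the usual numerical choice.

```python
import jax.numpy as jnp

# Hypothetical noise-free data generated from w = (2, -3), b = 1.
X = jnp.array([[1.0, 2.0],
               [2.0, 0.0],
               [3.0, 1.0],
               [4.0, 5.0]])
y = X @ jnp.array([2.0, -3.0]) + 1.0

# Absorb the bias: append a column of ones so the last weight plays the role of b.
X1 = jnp.concatenate([X, jnp.ones((X.shape[0], 1))], axis=1)

# Solve (X1^T X1) theta = X1^T y instead of computing the inverse.
theta = jnp.linalg.solve(X1.T @ X1, X1.T @ y)
w_star, b_star = theta[:-1], theta[-1]
```

Because the data are noise-free, the solution recovers the generating parameters up to floating-point error.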

Minibatch SGD, the iterative recipe reused by every model in this book:

(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{(\mathbf{w}, b)}\,\ell^{(i)}(\mathbf{w}, b).
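For the squared loss the gradients in this update can be written out explicitly. The sketch below assumes \ell^{(i)} is the per-example penalty \tfrac12(\hat{y}^{(i)} - y^{(i)})^2 from the loss slide.

```latex
\ell^{(i)}(\mathbf{w}, b) = \tfrac{1}{2}\bigl(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\bigr)^2
\;\Longrightarrow\;
\nabla_{\mathbf{w}}\,\ell^{(i)} = \mathbf{x}^{(i)}\bigl(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\bigr),
\qquad
\partial_b\,\ell^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}.

\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\bigl(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\bigr),
\qquad
b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \bigl(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\bigr).
```

Each example pulls \mathbf{w} along its own feature vector, scaled by its signed residual.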

The closed form is a projection

The Model · geometry

As \mathbf{w} varies, \mathbf{X}\mathbf{w} sweeps the column space of \mathbf{X}. The best fit is the point of that subspace closest to \mathbf{y}: the orthogonal projection.

The residual \mathbf{y}-\mathbf{X}\mathbf{w}^* is what is left over, and it must be perpendicular to every feature column, exactly the normal equation \mathbf{X}^\top(\mathbf{X}\mathbf{w}^*-\mathbf{y})=\mathbf{0}.

Projecting a vector onto a direction: the residual meets it at a right angle.
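The orthogonality claim can be checked numerically. This is a small sketch on made-up data in which \mathbf{y} deliberately lies outside the column space of \mathbf{X}, so the residual is nonzero.

```python
import jax.numpy as jnp

# Hypothetical data: three points that do not lie on a single line.
# The first column of ones plays the role of the bias.
X = jnp.array([[1.0, 1.0],
               [1.0, 2.0],
               [1.0, 3.0]])
y = jnp.array([1.0, 2.0, 2.0])

w_star = jnp.linalg.solve(X.T @ X, X.T @ y)   # least-squares solution
projection = X @ w_star                       # closest point in the column space
residual = y - projection                     # what is left over
orthogonality = X.T @ residual                # normal equation: should be ~0
```

The residual is not zero, yet its inner product with every column of X vanishes up to floating-point error, which is the normal equation in action.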

Minibatch SGD, step by step

Initialize \mathbf{w}, b at random.
Sample a minibatch \mathcal{B} (size 32–256: a full batch is slow, a single point is noisy).
Average the per-example gradients on \mathcal{B}.
Step a small distance \eta (the learning rate) downhill.

With a constant \eta, SGD never lands on the minimizer: it hovers in a noise ball whose squared radius scales with \eta. Shrinking \eta shrinks the ball, which is why learning-rate schedules exist (see the section on stochastic and adaptive methods).
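The four steps above can be sketched as a short JAX loop. The synthetic data, the generating parameters, the learning rate 0.1, the batch size 32, and the step count are all illustrative assumptions, and jax.grad stands in for the hand-derived gradient.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, X, y):
    """Mean squared loss on a batch; its gradient is the averaged per-example gradient."""
    w, b = params
    return jnp.mean(0.5 * (X @ w + b - y) ** 2)

grad_fn = jax.jit(jax.grad(loss_fn))

# Hypothetical synthetic dataset: y = X w + b + small Gaussian noise.
key = jax.random.PRNGKey(0)
key, k_x, k_noise, k_init = jax.random.split(key, 4)
true_w, true_b = jnp.array([2.0, -3.4]), 4.2
X = jax.random.normal(k_x, (1000, 2))
y = X @ true_w + true_b + 0.01 * jax.random.normal(k_noise, (1000,))

# Step 1: initialize the parameters at random (bias at zero).
w = 0.01 * jax.random.normal(k_init, (2,))
b = jnp.zeros(())
eta, batch_size = 0.1, 32   # illustrative hyperparameters

for step in range(300):
    # Step 2: sample a minibatch of indices without replacement.
    key, k_batch = jax.random.split(key)
    idx = jax.random.choice(k_batch, X.shape[0], (batch_size,), replace=False)
    # Step 3: average the per-example gradients on the minibatch.
    g_w, g_b = grad_fn((w, b), X[idx], y[idx])
    # Step 4: take a small step downhill.
    w, b = w - eta * g_w, b - eta * g_b
```

With a constant learning rate the final parameters land close to, but not exactly on, the generating values, consistent with the noise-ball picture.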

Vectorization

why the inner loop never lives in Python

A thousand interpreter trips, or one kernel call

Add two 1000-element vectors one coordinate at a time, each + a separate trip through the Python interpreter:

import time
from d2l import jax as d2l
# Setup assumed by this snippet: two all-ones vectors of length n = 1000
n = 1000
a = d2l.ones(n)
b = d2l.ones(n)
# JAX arrays are immutable, meaning that once created their contents
# cannot be changed. For updating individual elements, JAX provides
# an indexed update syntax that returns an updated copy
c = d2l.zeros(n)
t = time.time()
for i in range(n):
    c = c.at[i].set(a[i] + b[i])
print(f'{time.time() - t:.5f} sec')

0.92420 sec

Or hand the whole array to one compiled kernel:

t = time.time()
d = a + b
print(f'{time.time() - t:.5f} sec')

0.05508 sec

Identical math at a fraction of the cost: roughly 17× faster in this run (0.92 s versus 0.06 s), and the gap grows with vector length. Push inner loops into the library, never Python.
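A hedged note on measuring the vectorized version: JAX dispatches work asynchronously, and the first call of an operation can include one-time setup cost. The sketch below is one possible way to time the addition more carefully, using a warm-up call and block_until_ready(). It is not the measurement that produced the numbers quoted above, and its timing will vary by machine.

```python
import time
import jax.numpy as jnp

n = 1000
a, b = jnp.ones(n), jnp.ones(n)

# Warm-up: run the operation once so one-time setup cost is not timed.
(a + b).block_until_ready()

t = time.perf_counter()
# block_until_ready() waits until the result has actually been computed.
d = (a + b).block_until_ready()
elapsed = time.perf_counter() - t
print(f'{elapsed:.5f} sec')
```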

Where Losses Come From

squared error is a probabilistic assumption in disguise

Assume the errors are Gaussian

Model each label as the linear prediction plus bell-curve noise:

y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).

Shifting the mean slides the bell; growing the variance flattens it. Note how fast the tails die: under a Gaussian, a huge error is essentially impossible.

Maximum likelihood turns the assumption into the loss

The Gaussian assumption prices every y: P(y\mid\mathbf{x}) = \tfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\bigl(-\tfrac{(y-\hat{y})^2}{2\sigma^2}\bigr). Maximize the likelihood of the dataset = minimize its negative log:

-\log P(\mathbf{y}\mid\mathbf{X}) = \textrm{const} + \frac{1}{2\sigma^2}\sum_i \bigl(y^{(i)}-\hat{y}^{(i)}\bigr)^2.

The constant does not depend on (\mathbf{w}, b), and 1/(2\sigma^2) is a fixed positive factor, so both drop out of the minimization: maximum likelihood under Gaussian noise is squared error. The square was never arbitrary: it is the Gaussian’s (\cdot)^2, inherited.

This also explains the outlier demo: squared loss trusted the Gaussian’s thin tails, so a 10000 where 12 was expected read as impossible and dominated the fit.
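The step from the Gaussian density to the sum of squares can be written out as follows, assuming the n examples are independent given their features, so that the dataset likelihood factorizes.

```latex
P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\Bigl(-\frac{\bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2}{2\sigma^2}\Bigr)

-\log P(\mathbf{y} \mid \mathbf{X})
  = \sum_{i=1}^{n} \Bigl[\tfrac{1}{2}\log(2\pi\sigma^2)
    + \frac{\bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2}{2\sigma^2}\Bigr]
  = \underbrace{\tfrac{n}{2}\log(2\pi\sigma^2)}_{\textrm{const}}
    + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2

\operatorname*{argmin}_{\mathbf{w}, b}\; -\log P(\mathbf{y} \mid \mathbf{X})
  = \operatorname*{argmin}_{\mathbf{w}, b}\; \sum_{i=1}^{n} \tfrac{1}{2}\bigl(\hat{y}^{(i)} - y^{(i)}\bigr)^2
```

The last line holds for any fixed \sigma > 0, and dividing by n to obtain the mean does not change the minimizer either.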

The recipe: match the loss to the noise model

Where losses come from · payoff

Choose a noise model, minimize its negative log-likelihood.

Gaussian → squared error
Laplace → absolute error
Gaussian on \log y → log-price regression
Poisson → \lambda - k\log\lambda (counts)

Laplace’s heavy tails expect the occasional wild point and penalize it only linearly, the robust MAE fit from the outlier demo.
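The table above can be sketched as four per-example penalties in JAX. The function names are illustrative assumptions, and each penalty drops the terms of the negative log-likelihood that do not depend on the prediction (normalizing constants, scale parameters, and the Poisson \log k! term).

```python
import jax.numpy as jnp

def gaussian_nll(y, y_hat):
    """Gaussian noise -> squared error."""
    return 0.5 * (y - y_hat) ** 2

def laplace_nll(y, y_hat):
    """Laplace noise -> absolute error."""
    return jnp.abs(y - y_hat)

def log_gaussian_nll(y, y_hat_log):
    """Gaussian noise on log y -> squared error in log space (requires y > 0)."""
    return 0.5 * (jnp.log(y) - y_hat_log) ** 2

def poisson_nll(k, lam):
    """Poisson counts -> lambda - k * log(lambda) (requires lambda > 0)."""
    return lam - k * jnp.log(lam)

# The corrupted label from the outlier demo: 10000 where 12 was expected.
penalty_sq = gaussian_nll(10000.0, 12.0)    # grows with the square of the miss
penalty_abs = laplace_nll(10000.0, 12.0)    # grows linearly with the miss
```

For the same miss of 9988, the squared penalty is larger than the absolute penalty by a factor of about five thousand, which is why the single corrupted point dominated the squared-loss fit.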

Left: Laplace tails carry far more mass than Gaussian tails of equal variance. Right: the penalties each induces, with Huber between them.

Adding a prior to this likelihood yields weight decay (see the weight-decay section); running the recipe on categorical noise yields softmax (next chapter).

A Neural Network

one neuron, and a name to be careful with

Linear regression is a one-neuron network

Wire every input x_1,\ldots,x_d directly to a single output o_1.

The output is the same weighted sum \sum_i w_i x_i + b, so linear regression is a single-layer, fully connected network: d inputs, one computed neuron, the atom that deep networks stack.

Linear regression drawn as a one-layer network: inputs feed a single output.

Inspiration, not blueprint

The cartoon that inspired the name: dendrites collect inputs x_i, weighted by synaptic strengths w_i; the nucleus sums them; the axon carries the result on.

Planes were inspired by birds, but aeronautics is not ornithology: today’s deep learning draws at least as much on mathematics, statistics, and computer science as on the brain.

A biological neuron: dendrites in, nucleus sums, axon out.

Recap

Wrap-up

Model: \hat{y} = \mathbf{w}^\top \mathbf{x} + b, one dot product per prediction.
Loss: mean squared error; convex, one global optimum.
Closed form = orthogonal projection; minibatch SGD = the workhorse, hovering in a noise ball whose squared radius scales with \eta.

Vectorize: one kernel call, never a Python inner loop.
One bad label in twenty: squared-loss slope 22.88 vs. MAE slope 2.02, the square’s thin-tailed trust exposed.
The recipe: squared loss is Gaussian maximum likelihood; swap the noise model, get the right loss.