%matplotlib inline
from d2l import torch as d2l
import math
import torch
import numpy as np
import time
The simplest predictive model: a linear function of the inputs plus a bias.
\hat{y} = \mathbf{w}^\top \mathbf{x} + b.
Fit \mathbf{w} and b to minimize squared error on training data.
Assume the target is the linear prediction plus fixed-variance Gaussian noise:
y^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2).
Then
p(y^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - \hat{y}^{(i)})^2}{2\sigma^2}\right).
For independent examples, the negative log-likelihood is
-\log p(\mathbf{y}\mid\mathbf{X},\mathbf{w},b) = \textrm{const} + \frac{1}{2\sigma^2}\sum_i (y^{(i)}-\hat{y}^{(i)})^2.
With fixed \sigma, maximum likelihood and minimizing squared error choose the same parameters.
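The equivalence can be checked numerically. The sketch below uses only the standard library and made-up targets and predictions (`y`, `pred_a`, `pred_b`, and `sigma` are illustrative values, not from the text). It evaluates the negative log-likelihood directly from the Gaussian density and compares it with the scaled sum of squared errors; the gap should be the same constant for any set of predictions.

```python
import math

def gaussian_nll(y, y_hat, sigma):
    """Negative log-likelihood of independent Gaussian targets around y_hat."""
    nll = 0.0
    for yi, yhi in zip(y, y_hat):
        density = (math.exp(-(yi - yhi) ** 2 / (2 * sigma ** 2))
                   / math.sqrt(2 * math.pi * sigma ** 2))
        nll -= math.log(density)
    return nll

def scaled_sse(y, y_hat, sigma):
    """Sum of squared errors divided by 2 * sigma^2."""
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / (2 * sigma ** 2)

# Illustrative values (assumptions, not from the text).
y = [1.0, 2.0, 3.5]
sigma = 0.7
pred_a = [0.9, 2.2, 3.0]
pred_b = [1.5, 1.0, 4.0]

# The term that does not depend on the predictions: n/2 * log(2 * pi * sigma^2).
const = len(y) * 0.5 * math.log(2 * math.pi * sigma ** 2)

gap_a = gaussian_nll(y, pred_a, sigma) - scaled_sse(y, pred_a, sigma)
gap_b = gaussian_nll(y, pred_b, sigma) - scaled_sse(y, pred_b, sigma)
print(gap_a, gap_b, const)
```

Because the gap does not depend on the predictions, ranking parameter settings by likelihood and ranking them by squared error give the same order.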
For one example \mathbf{x}^{(i)} \in \mathbb{R}^d and target y^{(i)} \in \mathbb{R}, the model predicts
\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b.
Squared loss on the training set of n examples:
L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \tfrac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2.
Convex in (\mathbf{w}, b) — every local minimum is global.
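Closed form (when the data fits in memory and \mathbf{X}^\top \mathbf{X} is invertible), with the bias absorbed into \mathbf{w} by appending a constant 1 to each input: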
\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.
The closed form doesn’t generalize beyond linear models.
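A minimal sketch of the closed-form solution on synthetic data, using NumPy. The ground-truth values `true_w` and `true_b`, the noise scale, and the dataset size are assumptions chosen for illustration. The bias is handled by appending a column of ones to the design matrix, and the normal equations are solved with `np.linalg.solve` rather than by forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: hypothetical ground-truth parameters plus small noise.
n, d = 200, 3
true_w = np.array([2.0, -3.4, 0.5])
true_b = 4.2
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Absorb the bias: append a column of ones so the last weight plays the role of b.
X_aug = np.hstack([X, np.ones((n, 1))])

# Solve (X^T X) theta = X^T y.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w_hat, b_hat = theta[:-1], theta[-1]

# Cross-check against NumPy's least-squares routine.
theta_lstsq = np.linalg.lstsq(X_aug, y, rcond=None)[0]

print(w_hat, b_hat)
```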
Minibatch SGD (the recipe we’ll keep using):
\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_\mathbf{w}\,\ell^{(i)}(\mathbf{w}, b).
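The update can be sketched in a few lines of NumPy. Everything below is an illustrative setup rather than a reference implementation: the synthetic data, `true_w`, `true_b`, the learning rate `lr`, the batch size, and the number of epochs are all assumptions. For the squared loss \tfrac{1}{2}(\hat{y} - y)^2, the per-example gradient with respect to \mathbf{w} is (\hat{y} - y)\,\mathbf{x}, and with respect to b it is (\hat{y} - y); the bias is updated with the same rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with hypothetical ground-truth parameters.
n, d = 1000, 2
true_w, true_b = np.array([2.0, -3.4]), 4.2
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Parameters and hyperparameters (illustrative choices).
w, b = np.zeros(d), 0.0
lr, batch_size, num_epochs = 0.1, 10, 5

for epoch in range(num_epochs):
    perm = rng.permutation(n)                      # reshuffle every epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]       # indices of minibatch B
        err = X[idx] @ w + b - y[idx]              # y_hat - y on the minibatch
        w -= lr * (X[idx].T @ err) / len(idx)      # averaged gradient w.r.t. w
        b -= lr * err.mean()                       # averaged gradient w.r.t. b

print(w, b)
```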
Same operation, two implementations. Set up two 10 000-element vectors:
Adding element-by-element in a Python loop:
'0.09788 sec'
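The comparison can be reproduced with a sketch like the one below. It uses NumPy arrays as a stand-in (an assumption; the same pattern applies to `torch` tensors, where `a + b` is also elementwise) and `time.perf_counter` for timing. Timings depend on the machine, so no particular numbers are claimed here.

```python
import time
import numpy as np

n = 10000
a = np.ones(n)
b = np.ones(n)

# Implementation 1: add one element at a time in a Python loop.
c = np.zeros(n)
t = time.perf_counter()
for i in range(n):
    c[i] = a[i] + b[i]
loop_sec = time.perf_counter() - t

# Implementation 2: a single vectorized elementwise addition.
t = time.perf_counter()
d = a + b
vec_sec = time.perf_counter() - t

print(f'loop: {loop_sec:.5f} sec, vectorized: {vec_sec:.5f} sec')
```

Both implementations produce the same result; the vectorized form hands the whole loop to optimized library code instead of running it in the Python interpreter.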
Assume each label is the linear prediction plus Gaussian noise:
y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2).
Then minimizing squared error is exactly maximizing the Gaussian log-likelihood of the observed labels.
Plot a few normal densities — different means and variances:
Squared loss assumes the errors look like one of these bells centered at the model’s prediction.
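A sketch of the density behind these curves, written with the standard library only. The (mean, standard deviation) pairs in `params` and the grid are illustrative assumptions. The resulting `xs` and `curves` values can be passed to any plotting routine; here the code only checks that each curve integrates to roughly 1 over the grid.

```python
import math

def normal(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    coef = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coef * math.exp(-0.5 * (x - mu) ** 2 / sigma ** 2)

# Illustrative (mean, standard deviation) pairs.
params = [(0, 1), (0, 2), (3, 1)]

# Evenly spaced grid from -7 to 7.
step = 0.01
xs = [-7 + step * k for k in range(1401)]

# One curve per parameter pair, ready to be plotted against xs.
curves = {(mu, sigma): [normal(x, mu, sigma) for x in xs] for mu, sigma in params}

# Riemann-sum approximation of the area under each curve.
areas = {key: sum(vals) * step for key, vals in curves.items()}
for key, area in areas.items():
    print(key, round(area, 4))
```

Changing the mean shifts the bell without changing its shape, while a larger standard deviation lowers the peak and spreads the mass.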