Maximum Likelihood

Maximum likelihood: pick the parameters that make the observed data most probable.

\hat\theta = \arg\max_\theta \prod_i p(x_i \mid \theta) = \arg\min_\theta -\sum_i \log p(x_i \mid \theta).

The negative log-likelihood (NLL) form is what most classification and regression losses in the book actually optimize:

Cross-entropy = NLL of a categorical p(y \mid x).
MSE = NLL of a Gaussian p(y \mid x) with fixed variance, up to an additive constant and a scale factor.
BPR, softmax-with-temperature, and similar losses are NLLs of suitable distributions as well.
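The first two identities above can be checked numerically. The following is a minimal sketch using only the standard library; the predicted probabilities, the label, the targets, and the helper name `gaussian_nll` are made up for illustration and do not come from the book.

```python
import math

# Cross-entropy against a one-hot target vs. the categorical NLL.
probs = [0.7, 0.2, 0.1]   # hypothetical predicted p(y | x) over 3 classes
label = 0                 # hypothetical observed class
one_hot = [1.0 if k == label else 0.0 for k in range(len(probs))]

cross_entropy = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
categorical_nll = -math.log(probs[label])
print(cross_entropy, categorical_nll)  # the two values agree


def gaussian_nll(y, mu, sigma=1.0):
    """NLL of a single observation y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)


# With sigma fixed at 1, the Gaussian NLL is half the squared error plus a
# constant that does not depend on the prediction mu.
y = 2.0
const = 0.5 * math.log(2 * math.pi)
for mu in (0.0, 1.5, 2.0):
    print(mu, gaussian_nll(y, mu), 0.5 * (y - mu) ** 2 + const)
```

Because the constant and the factor 1/2 do not depend on the prediction, minimizing the Gaussian NLL and minimizing the squared error select the same parameters.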

So “minimize the loss” is “do MLE” in fancy clothes.

A concrete example

Flip a coin with unknown heads probability \theta and observe 9 heads and 4 tails. The likelihood \theta^9 (1 - \theta)^4 peaks at \hat\theta = 9/13, the observed fraction of heads.

%matplotlib inline
from d2l import torch as d2l
import torch

theta = torch.arange(0, 1, 0.001)
p = theta**9 * (1 - theta)**4.

d2l.plot(theta, p, 'theta', 'likelihood')
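
The location of the peak can also be derived by hand. A short derivation, writing \ell(\theta) for the log-likelihood (a symbol introduced here for convenience):

```latex
\ell(\theta) = \log\left(\theta^9 (1 - \theta)^4\right) = 9 \log\theta + 4 \log(1 - \theta)

\frac{d\ell}{d\theta} = \frac{9}{\theta} - \frac{4}{1 - \theta} = 0
\;\Longrightarrow\; 9(1 - \theta) = 4\theta
\;\Longrightarrow\; \hat\theta = \frac{9}{13}

\frac{d^2\ell}{d\theta^2} = -\frac{9}{\theta^2} - \frac{4}{(1 - \theta)^2} < 0
```

The second derivative is negative on (0, 1), so the stationary point is the maximum.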

Numerical optimization (NLL)

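Sums of logs are easier to work with than products. A product of many probabilities underflows in floating point, while the sum of their logs stays in range; the gradient of a sum is a sum of per-example gradients with simple closed forms; and that per-example structure is what lets SGD work on the NLL.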
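
The floating-point point can be seen directly. A minimal sketch in plain Python, using the same counts as the data in this section; the trial value `theta = 0.9` is an arbitrary choice for illustration.

```python
import math

n_H, n_T = 8675309, 256245
theta = 0.9  # arbitrary trial value, chosen only for illustration

# The raw likelihood multiplies millions of numbers below 1 and underflows
# to exactly zero in double precision.
likelihood = theta**n_H * (1 - theta) ** n_T

# The log-likelihood is a sum of two moderately sized terms and stays finite.
log_likelihood = n_H * math.log(theta) + n_T * math.log(1 - theta)

print(likelihood, log_likelihood)
```

Once the likelihood has underflowed to zero it carries no information about which theta is better, whereas the log-likelihood can still be compared and differentiated.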

# Set up our data
n_H = 8675309
n_T = 256245

# Initialize our parameters
theta = torch.tensor(0.5, requires_grad=True)

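# Perform gradient descent on the NLL
lr = 1e-9  # tiny step size: the gradient scales with the ~9e6 observations
for _ in range(100):
    # NLL of n_H heads and n_T tails for a coin with heads probability theta
    loss = -(n_H * torch.log(theta) + n_T * torch.log(1 - theta))
    loss.backward()
    with torch.no_grad():
        theta -= lr * theta.grad  # in-place update, not tracked by autograd
    theta.grad.zero_()  # reset so gradients do not accumulate across steps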

# Check output
theta, n_H / (n_H + n_T)

(tensor(0.9713, requires_grad=True), 0.9713101437890875)
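
For this model the gradient has a closed form, so the same descent can be written without autograd. The following is a sketch in plain Python under the same data, learning rate, and iteration count; the helper name `nll_grad` is introduced here for illustration.

```python
n_H, n_T = 8675309, 256245


def nll_grad(theta):
    """Derivative of -(n_H*log(theta) + n_T*log(1 - theta)) w.r.t. theta."""
    return -(n_H / theta - n_T / (1 - theta))


theta = 0.5
lr = 1e-9
for _ in range(100):
    theta -= lr * nll_grad(theta)  # plain gradient descent step

mle = n_H / (n_H + n_T)  # closed-form maximum likelihood estimate
print(theta, mle)
```

Setting `nll_grad(theta)` to zero and solving for theta gives n_H / (n_H + n_T), which is why the iterates approach the observed fraction of heads.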

Recap

MLE: maximize \sum_i \log p(x_i \mid \theta); equivalently, minimize NLL.
Connects optimization (how models are trained) to probability (this chapter’s main topic).
Most “losses” in DL are NLLs of suitable conditional distributions.
MLE is consistent and asymptotically efficient for well-specified models, under standard regularity conditions.
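
Consistency can be illustrated with a small simulation. This is a sketch only: the true heads probability, the sample sizes, the random seed, and the helper name `bernoulli_mle` are arbitrary choices made for illustration.

```python
import random


def bernoulli_mle(flips):
    """MLE of the heads probability: the observed fraction of heads."""
    return sum(flips) / len(flips)


random.seed(0)    # fixed seed so the run is repeatable
true_theta = 0.7  # hypothetical ground-truth heads probability

estimates = {}
for n in (10, 1_000, 100_000):
    flips = [1 if random.random() < true_theta else 0 for _ in range(n)]
    estimates[n] = bernoulli_mle(flips)
    print(n, estimates[n], abs(estimates[n] - true_theta))
```

The error typically shrinks as n grows, roughly like 1/sqrt(n), although any single run can fluctuate.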