Numerical optimization (NLL)

Maximum Likelihood

Maximum likelihood: pick the parameters that make the observed data most probable.

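\hat\theta = \arg\max_\theta \prod_i p(x_i \mid \theta) = \arg\min_\theta -\sum_i \log p(x_i \mid \theta).

The product form assumes the x_i are independent. The two forms pick the same \theta because the logarithm is strictly increasing, and negating the objective turns the max into a min.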

The negative log-likelihood (NLL) form is what most classification and regression losses in the book actually optimize:

Cross-entropy = NLL of a categorical p(y \mid x).
MSE = NLL of a Gaussian p(y \mid x) with fixed variance.
BPR, softmax with temperature, and similar losses are also NLLs.
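
The MSE item in the list above can be checked numerically. The following is a minimal sketch, not taken from the book: the toy targets, the two candidate prediction sets, and the helper names `mse`, `gaussian_nll`, and `nll_from_mse` are all made up for illustration, and the noise standard deviation `sigma` is assumed fixed at 1. It shows that the Gaussian NLL is the MSE times a positive constant plus a term that does not depend on the predictions, so both objectives rank predictions identically.

```python
import math

# Hypothetical toy data: targets and two candidate sets of predictions.
y = [1.0, 2.0, 3.5, 4.0]
preds_a = [1.1, 1.9, 3.0, 4.2]
preds_b = [0.0, 2.5, 3.5, 5.0]
sigma = 1.0  # assumed fixed standard deviation of the Gaussian noise


def mse(y, preds):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)


def gaussian_nll(y, preds, sigma):
    """Negative log-likelihood of y under independent N(pred, sigma^2)."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (t - p) ** 2 / (2 * sigma ** 2) for t, p in zip(y, preds))


def nll_from_mse(y, preds, sigma):
    """The same NLL rebuilt from the MSE: positive scale plus a constant."""
    n = len(y)
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return n * mse(y, preds) / (2 * sigma ** 2) + n * const


for name, preds in [("a", preds_a), ("b", preds_b)]:
    print(name, mse(y, preds), gaussian_nll(y, preds, sigma), nll_from_mse(y, preds, sigma))
```

Because the scale n / (2 sigma^2) is positive and the constant is independent of the predictions, whichever predictions minimize the MSE also minimize the NLL.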

So “minimize the loss” is “do MLE” in fancy clothes.

A concrete example

Take a coin with heads probability \theta. For 9 heads and 4 tails, the likelihood is \theta^9 (1 - \theta)^4, and the curve peaks at \hat\theta = 9/13: the observed fraction of heads.

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
npx.set_np()

theta = np.arange(0, 1, 0.001)
p = theta**9 * (1 - theta)**4.

d2l.plot(theta, p, 'theta', 'likelihood')
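
The location of the peak in the plot can also be derived by hand. As a short worked derivation (writing \ell for the NLL, a symbol introduced here for convenience), take the negative log of the likelihood and set its derivative to zero:

```latex
\begin{aligned}
\ell(\theta) &= -\log\left(\theta^{9}(1-\theta)^{4}\right)
             = -9\log\theta - 4\log(1-\theta), \\
\frac{d\ell}{d\theta} &= -\frac{9}{\theta} + \frac{4}{1-\theta} = 0
  \;\Longrightarrow\; 9(1-\theta) = 4\theta
  \;\Longrightarrow\; \hat\theta = \frac{9}{13}.
\end{aligned}
```

The second derivative, 9/\theta^2 + 4/(1-\theta)^2, is positive on (0, 1), so this stationary point is the minimum of the NLL and hence the maximum of the likelihood.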

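Sums of logs are easier to work with than products: a product of many small probabilities underflows in floating point, while the sum of their logs stays finite; the gradient of a sum is a sum of per-example gradients, often in closed form; and because the NLL decomposes over examples, SGD can minimize it directly.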
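
The floating-point point is easy to see in isolation. The sketch below uses an assumed toy setting (2000 independent observations, each assigned probability 0.5) and plain Python rather than MXNet; it compares the raw likelihood with the NLL.

```python
import math

# Assumed toy setting: 2000 independent observations, each with probability 0.5.
probs = [0.5] * 2000

# Raw likelihood: multiply the probabilities together.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Negative log-likelihood: sum the logs instead.
nll = -sum(math.log(p) for p in probs)

print(likelihood)  # 0.5**2000 is below the smallest positive double, so this is 0.0
print(nll)         # finite, approximately 2000 * log(2)
```

Once the likelihood has collapsed to zero, it carries no information about which parameters are better, whereas the NLL remains usable.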

# Set up our data
n_H = 8675309
n_T = 256245

# Initialize our parameters
theta = np.array(0.5)
theta.attach_grad()

# Perform gradient descent
lr = 1e-9
for _ in range(100):  # loop index unused; avoids shadowing the builtin iter
    with autograd.record():
        loss = -(n_H * np.log(theta) + n_T * np.log(1 - theta))
    loss.backward()
    theta -= lr * theta.grad

# Check output
theta, n_H / (n_H + n_T)
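
For this model the gradient that autograd computes also has a closed form, -n_H / \theta + n_T / (1 - \theta). As a rough sketch that runs without MXNet, the same descent loop can be written with that derivative coded by hand; the function name `nll_grad` is made up for this illustration.

```python
# Same counts as the example above.
n_H = 8675309
n_T = 256245


def nll_grad(theta):
    """Closed-form derivative of -(n_H*log(theta) + n_T*log(1 - theta))."""
    return -n_H / theta + n_T / (1 - theta)


theta = 0.5
lr = 1e-9  # tiny because the gradient is on the order of 1e7 at theta = 0.5
for _ in range(100):
    theta -= lr * nll_grad(theta)

mle = n_H / (n_H + n_T)  # closed-form answer: the observed fraction of heads
print(theta, mle)
```

The very small learning rate compensates for the size of the counts: the NLL here is a sum over millions of observations, so its gradient is correspondingly large.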

Recap

MLE: maximize \sum_i \log p(x_i \mid \theta); equivalently, minimize NLL.
Connects optimization to probability (this chapter’s main topic): minimizing a loss amounts to fitting a probabilistic model.
Most “losses” in DL are NLLs of suitable conditional distributions.
MLE is consistent and asymptotically efficient for well-specified models, under standard regularity conditions.
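
Consistency can be illustrated, though not proved, with a small simulation. The sketch below assumes a hypothetical coin with true heads probability 0.7 and a fixed random seed; it shows only that the estimate tends to approach the true value as the sample grows, and says nothing about efficiency.

```python
import random

random.seed(0)    # fixed seed so the run is repeatable
true_theta = 0.7  # hypothetical true heads probability

for n in [10, 1000, 100000]:
    # Simulate n flips; True counts as 1 when summed.
    heads = sum(random.random() < true_theta for _ in range(n))
    theta_hat = heads / n  # Bernoulli MLE: observed fraction of heads
    print(n, theta_hat, abs(theta_hat - true_theta))
```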