%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf
theta = tf.range(0, 1, 0.001)
p = theta**9 * (1 - theta)**4.
d2l.plot(theta, p, 'theta', 'likelihood')

Maximum likelihood: pick the parameters that make the observed data most probable.
$$\hat\theta = \arg\max_\theta \prod_i p(x_i \mid \theta) = \arg\min_\theta \left(-\sum_i \log p(x_i \mid \theta)\right).$$
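As a quick sanity check of this equivalence, the sketch below searches a grid of candidate values for the coin example (9 heads, 4 tails) and compares the maximizer of the product with the minimizer of the negative log-likelihood. The names `flips`, `grid`, `likelihood`, and `nll` are illustrative choices, not part of the original code, and the grid resolution of 0.001 is an assumption.

```python
import math

# Illustrative data: 1 = heads, 0 = tails (9 heads, 4 tails).
flips = [1] * 9 + [0] * 4

# Candidate parameters strictly inside (0, 1) so the logs are finite.
grid = [i / 1000 for i in range(1, 1000)]

def likelihood(theta):
    """Product of the per-flip probabilities."""
    prod = 1.0
    for x in flips:
        prod *= theta if x == 1 else 1 - theta
    return prod

def nll(theta):
    """Negative sum of the per-flip log-probabilities."""
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in flips)

theta_max_lik = max(grid, key=likelihood)
theta_min_nll = min(grid, key=nll)
print(theta_max_lik, theta_min_nll)
```

Because the logarithm is strictly increasing, both searches should select the same grid point, the one closest to 9/13.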
The negative log-likelihood form is what every classification and regression loss in the book actually optimizes, so “minimize the loss” is “do MLE” in fancy clothes.
For 9 heads and 4 tails, the likelihood $\theta^9 (1 - \theta)^4$ peaks at $\hat\theta = 9/13 \approx 0.69$, the observed fraction of heads.
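The location of the peak can be worked out by hand. A short derivation, assuming independent flips with heads probability $\theta$, sets the derivative of the log-likelihood to zero:

$$
\begin{aligned}
\log L(\theta) &= 9 \log \theta + 4 \log (1 - \theta), \\
\frac{d}{d\theta} \log L(\theta) &= \frac{9}{\theta} - \frac{4}{1 - \theta} = 0
\;\Longrightarrow\; 9 (1 - \theta) = 4 \theta
\;\Longrightarrow\; \hat\theta = \frac{9}{13}.
\end{aligned}
$$

The second derivative, $-9/\theta^2 - 4/(1 - \theta)^2$, is negative on $(0, 1)$, so this stationary point is a maximum.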
Sums of logs are easier to work with than products: they avoid floating-point underflow, their gradients have simple closed forms, and SGD applies directly to the NLL because it is a sum over examples.
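The floating-point point can be illustrated with a small sketch in plain Python. The setup is hypothetical: `n = 2000` flips, each with probability `theta = 0.5`. Multiplying the probabilities one by one underflows to zero in double precision, while the equivalent sum of logs stays an ordinary finite number.

```python
import math

theta = 0.5   # probability of each observed outcome (illustrative)
n = 2000      # hypothetical number of observations

# Product form: repeated multiplication eventually underflows to 0.0.
likelihood = 1.0
for _ in range(n):
    likelihood *= theta

# Log form: the same quantity on a log scale is well within range.
log_likelihood = n * math.log(theta)

print(likelihood, log_likelihood)
```

Once the product reaches zero, every candidate parameter looks equally bad, so no optimizer can make progress on it. The log form keeps the information.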
# Set up our data
n_H = 8675309
n_T = 256245
# Initialize our parameters
theta = tf.Variable(tf.constant(0.5))
# Perform gradient descent
lr = 1e-9
for _ in range(100):
    with tf.GradientTape() as t:
        loss = -(n_H * tf.math.log(theta) + n_T * tf.math.log(1 - theta))
    theta.assign_sub(lr * t.gradient(loss, theta))
# Check output
theta, n_H / (n_H + n_T)

(<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=0.9713100790977478>,
 0.9713101437890875)
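For comparison, the same descent can be sketched without automatic differentiation, using the hand-derived gradient of the NLL, $-(n_H/\theta - n_T/(1-\theta))$. This is a plain-Python illustration in double precision rather than the TensorFlow code above; the variable names `grad` and `theta_closed_form` are illustrative.

```python
# Same data, starting point, learning rate, and step count as above.
n_H = 8675309
n_T = 256245
theta = 0.5
lr = 1e-9

for _ in range(100):
    # Closed-form derivative of -(n_H * log(theta) + n_T * log(1 - theta)).
    grad = -(n_H / theta - n_T / (1 - theta))
    theta -= lr * grad

# Analytic maximum likelihood estimate: the observed fraction of heads.
theta_closed_form = n_H / (n_H + n_T)
print(theta, theta_closed_form)
```

The tiny learning rate compensates for the size of the gradient: with millions of observations, the gradient at the starting point is on the order of $10^7$, so a step of `1e-9` times the gradient still moves `theta` by about 0.017.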