Probability and Statistics

Probability for Learning

Most of machine learning is inference under uncertainty:

  • Classifiers typically output a distribution over labels, not a single label.
  • Many standard losses, such as cross-entropy, are negative log-likelihoods.
  • Generalization, regularization, and Bayesian methods all rest on probability.
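
The claim that losses are negative log-likelihoods can be made concrete for a coin. The following is a minimal sketch, assuming a binary label and a model that outputs P(heads); the helper name bernoulli_nll is hypothetical and not part of d2l or TensorFlow. It computes the negative log-likelihood of an observed label under a Bernoulli distribution, which is the same quantity as binary cross-entropy for a single example.

```python
import math

def bernoulli_nll(p_heads, label):
    """Negative log-likelihood of `label` (1 = heads, 0 = tails)
    under a Bernoulli distribution with P(heads) = p_heads.
    Hypothetical helper, for illustration only."""
    likelihood = p_heads if label == 1 else 1.0 - p_heads
    return -math.log(likelihood)

# A confident, correct prediction gets a small loss ...
loss_good = bernoulli_nll(0.9, 1)
# ... while a confident, wrong prediction gets a large one.
loss_bad = bernoulli_nll(0.9, 0)
# For a fair coin, either outcome costs log(2).
loss_fair = bernoulli_nll(0.5, 1)
print(loss_good, loss_bad, loss_fair)
```

Minimizing this loss over observed labels is the same as maximizing their likelihood under the model.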

The chapter's running example is tossing a fair coin: as the number of tosses grows, the empirical frequencies converge to the true probability P(\text{heads}) = 0.5.

A fair coin, 100 tosses

The standard d2l prelude, plus the TensorFlow Probability distributions module, which provides the Multinomial distribution used shortly:

%matplotlib inline
from d2l import tensorflow as d2l
import random
import tensorflow as tf
from tensorflow_probability import distributions as tfd

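We can simulate coin flips with random.random(), which returns a uniform float in [0, 1), so comparing it with 0.5 yields heads with probability one half: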

num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])
heads, tails:  [46, 54]

The split is near 50/50 but not exactly — sampling has variance.

Sampling from a distribution

A cleaner abstraction: a Multinomial over the categories {heads, tails} with probabilities [0.5, 0.5]. One call returns the count vector for 100 tosses:

fair_probs = tf.ones(2) / 2
tfd.Multinomial(100, fair_probs).sample()
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([49., 51.], dtype=float32)>

Divide by the trial count to get empirical frequencies — estimates of P(\text{heads}) and P(\text{tails}):

tfd.Multinomial(100, fair_probs).sample() / 100
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.59, 0.41], dtype=float32)>
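
For intuition about what the Multinomial call computes, the same experiment can be sketched with the standard library alone. This is an illustrative equivalent, not how TensorFlow Probability implements sampling; the function name sample_counts and the fixed seed are assumptions made here for reproducibility.

```python
import random
from collections import Counter

def sample_counts(num_trials, probs, rng):
    """Draw `num_trials` categorical samples and return per-category counts.
    Hypothetical helper; mirrors the shape of a Multinomial count vector."""
    categories = range(len(probs))
    draws = rng.choices(categories, weights=probs, k=num_trials)
    tally = Counter(draws)
    # Counter returns 0 for categories that were never drawn
    return [tally[c] for c in categories]

rng = random.Random(0)  # seeded only so that reruns give the same draws
toss_counts = sample_counts(100, [0.5, 0.5], rng)
toss_freqs = [c / 100 for c in toss_counts]
print(toss_counts, toss_freqs)
```

The counts always sum to the number of trials, so the frequencies always sum to 1, just as with the tensor version.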

More tosses, tighter estimate

With 10 000 tosses, the empirical frequencies sit much closer to 0.5:

counts = tfd.Multinomial(10000, fair_probs).sample()
counts / 10000
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.5027, 0.4973], dtype=float32)>

This is the law of large numbers: as n \to \infty the empirical mean converges to the true mean.
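
One way to see this numerically is to repeat the experiment at increasing sample sizes and compare each estimate with 0.5. The sketch below uses only the standard library; the sample sizes and the seed are arbitrary choices for illustration, and any individual run can still fluctuate.

```python
import random

def heads_frequency(num_tosses, rng):
    """Fraction of heads in `num_tosses` fair-coin flips."""
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses

rng = random.Random(42)  # arbitrary seed, for reproducibility only
errors = {}
for n in (100, 10_000, 1_000_000):
    # Absolute gap between the empirical frequency and the true probability
    errors[n] = abs(heads_frequency(n, rng) - 0.5)
    print(f"n = {n:>9,}: |estimate - 0.5| = {errors[n]:.4f}")
```

The gap is not guaranteed to shrink at every step, but large gaps become increasingly unlikely as n grows.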

Convergence in pictures

Plot the running estimate of P(\text{heads}) and P(\text{tails}) vs. sample count — the curves zigzag toward 0.5:

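# Draw 10,000 single tosses; each row is a one-hot vector (heads, tails)
counts = tfd.Multinomial(1, fair_probs).sample(10000)
# Running totals of heads and tails after each toss
cum_counts = tf.cumsum(counts, axis=0)
# Normalize each row by the number of tosses so far to get running frequencies
estimates = cum_counts / tf.reduce_sum(cum_counts, axis=1, keepdims=True)
estimates = estimates.numpy()
d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label="P(coin=heads)")
d2l.plt.plot(estimates[:, 1], label="P(coin=tails)")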
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();

The standard deviation of the estimate shrinks like 1/\sqrt{n} (its variance shrinks like 1/n), so doubling accuracy means quadrupling the sample budget.
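
The scaling comes from the variance of a sample mean: for n independent tosses with P(heads) = p, the empirical frequency has variance p(1 - p)/n and hence standard deviation \sqrt{p(1 - p)/n}. The following is a minimal sketch under that independence assumption; standard_error and empirical_spread are hypothetical helper names, and the seed and repeat count are arbitrary.

```python
import math
import random
import statistics

def standard_error(p, n):
    """Theoretical standard deviation of the empirical frequency
    after n independent tosses with P(heads) = p."""
    return math.sqrt(p * (1 - p) / n)

def empirical_spread(n, repeats, rng):
    """Standard deviation of the heads frequency across `repeats`
    independent experiments of n fair tosses each."""
    freqs = [sum(rng.random() < 0.5 for _ in range(n)) / n
             for _ in range(repeats)]
    return statistics.pstdev(freqs)

rng = random.Random(0)  # arbitrary seed, for reproducibility only
for n in (100, 400, 1600):
    # Theory and simulation should roughly agree at each sample size
    print(n, standard_error(0.5, n), empirical_spread(n, 1000, rng))
```

Quadrupling n from 100 to 400 halves the theoretical value from 0.05 to 0.025, and the simulated spread should land close to the same numbers.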

Recap

  • A probability distribution assigns mass to events.
  • Sampling + counting = empirical frequencies.
  • The law of large numbers connects the two: estimates converge to the true probabilities, with typical error shrinking like O(1/\sqrt{n}).
  • The rest of the chapter formalizes random variables, expectations, and joint / conditional / marginal distributions — the vocabulary for everything that follows.