Probability and Statistics

Probability for Learning

Most of machine learning is inference under uncertainty:

  • Classifiers typically output a distribution over labels, not a single label.
  • Many common losses, such as cross-entropy, are negative log-likelihoods.
  • Generalization, regularization, and Bayesian methods all rest on probability.
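To make the link between losses and likelihoods concrete, here is a minimal sketch. The two-label probability vector and the true label are hypothetical values chosen for illustration, not output from any model in this chapter.

```python
import numpy as np

# Hypothetical model output: a distribution over two labels for one example.
probs = np.array([0.8, 0.2])
label = 0  # hypothetical index of the true label

# Negative log-likelihood of the true label under the predicted distribution.
# For a one-hot target this is the cross-entropy loss of that example.
nll = -np.log(probs[label])
print(nll)
```

The more probability the model puts on the true label, the closer this loss gets to zero.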

The chapter’s running example is tossing a fair coin: as the sample count grows, the empirical frequencies converge to the true P = 0.5.

A fair coin, 100 tosses

The standard d2l prelude, plus Python’s random module and NumPy, whose multinomial sampler we’ll use shortly:

%matplotlib inline
from d2l import jax as d2l
import random
import jax
from jax import numpy as jnp
import numpy as np

We can simulate coin flips with random.random(), which returns a uniform draw from [0, 1) and so exceeds 0.5 with probability one half:

num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])
heads, tails:  [52, 48]

The split is near 50/50 but not exactly — sampling has variance.
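One way to see that variance is to repeat the 100-toss experiment many times and look at the spread of the head counts. The sketch below assumes a fixed seed and 1000 repeats; both are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed (an arbitrary choice) so the run is repeatable
num_tosses = 100
num_repeats = 1000  # arbitrary number of repeated experiments

# Number of heads in each repeated 100-toss experiment.
head_counts = [
    sum(random.random() > 0.5 for _ in range(num_tosses))
    for _ in range(num_repeats)
]

mean = sum(head_counts) / num_repeats
std = (sum((h - mean) ** 2 for h in head_counts) / num_repeats) ** 0.5
print("mean heads:", mean, "std:", std)
```

For a fair coin the head count is Binomial(100, 0.5), so the mean should be close to 50 and the standard deviation close to \sqrt{100 \cdot 0.25} = 5. A 52/48 split is well within the usual range.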

Sampling from a distribution

A cleaner abstraction: a Multinomial over the categories {heads, tails} with probabilities [0.5, 0.5]. One call returns the count vector for 100 tosses:

fair_probs = [0.5, 0.5]
# jax.random does not have multinomial distribution implemented
np.random.multinomial(100, fair_probs)
array([52, 48])

Divide by the trial count to get empirical frequencies — estimates of P(\text{heads}) and P(\text{tails}):

np.random.multinomial(100, fair_probs) / 100
array([0.55, 0.45])
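The NumPy call above sidesteps jax.random entirely. As a sketch of staying inside JAX instead, each toss can be drawn as an independent Bernoulli variable with jax.random.bernoulli. The key value 0 and the convention that True means heads are assumptions made here for illustration.

```python
import jax
from jax import numpy as jnp

# JAX has no global random state: every sampler takes an explicit key.
key = jax.random.PRNGKey(0)  # arbitrary seed

num_tosses = 100
# One Bernoulli(0.5) draw per toss; True is read as heads (a labeling choice).
tosses = jax.random.bernoulli(key, p=0.5, shape=(num_tosses,))

heads = int(tosses.sum())
tails = num_tosses - heads
freqs = jnp.array([heads, tails]) / num_tosses
print(heads, tails, freqs)
```

Because the key is explicit, rerunning with the same key reproduces the same tosses. Fresh samples need a new key, for example from jax.random.split.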

More tosses, tighter estimate

With 10 000 tosses, the empirical frequencies sit much closer to 0.5:

counts = np.random.multinomial(10000, fair_probs).astype(np.float32)
counts / 10000
array([0.4983, 0.5017], dtype=float32)

This is the law of large numbers: as n \to \infty the empirical mean converges to the true mean.
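For the coin, the statement can be written out as follows. The indicator variables X_i and the estimate \hat{p}_n are notation introduced here for illustration, and the form shown is the weak law (convergence in probability).

```latex
% X_i = 1 if toss i is heads, 0 otherwise; i.i.d. with P(X_i = 1) = p.
\hat{p}_n = \frac{1}{n} \sum_{i=1}^{n} X_i ,
\qquad
\lim_{n \to \infty} P\bigl( \lvert \hat{p}_n - p \rvert > \varepsilon \bigr) = 0
\quad \text{for every } \varepsilon > 0 .
```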

Convergence in pictures

Plot the running estimate of P(\text{heads}) and P(\text{tails}) vs. sample count — the curves zigzag toward 0.5:

counts = np.random.multinomial(1, fair_probs, size=10000).astype(np.float32)
cum_counts = counts.cumsum(axis=0)
estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)
d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label="P(coin=heads)")
d2l.plt.plot(estimates[:, 1], label="P(coin=tails)")
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();
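The cumulative-sum step is easier to follow on a tiny input. The sketch below applies the same two lines to four hypothetical one-hot tosses (heads, tails, heads, heads) rather than to random samples.

```python
import numpy as np

# Four hypothetical one-hot tosses: column 0 = heads, column 1 = tails.
counts = np.array([[1, 0], [0, 1], [1, 0], [1, 0]], dtype=np.float32)

# Running totals of heads and tails after each toss.
cum_counts = counts.cumsum(axis=0)
# Divide each row by the number of tosses so far to get running frequencies.
estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)

# Heads fraction after each toss: 1, 1/2, 2/3, 3/4.
print(estimates[:, 0])
```

Each row of estimates sums to 1, which is why the heads and tails curves in the plot mirror each other around 0.5.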

The standard deviation of the estimate shrinks like 1/\sqrt{n} (its variance like 1/n), so doubling the accuracy means quadrupling the sample budget.
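The 1/\sqrt{n} scaling can be checked empirically. The sketch below is an illustration under assumed settings (seed 0, 2000 repeats, sample sizes 1000 and 4000). It measures the spread of the heads frequency at both sample sizes and compares them.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
repeats = 2000  # arbitrary number of repeated experiments per sample size

def std_error(n):
    """Empirical spread of the heads frequency over many n-toss experiments."""
    heads = rng.binomial(n, 0.5, size=repeats)
    return (heads / n).std()

se_small = std_error(1_000)
se_large = std_error(4_000)
ratio = se_small / se_large
print(se_small, se_large, ratio)
```

With four times as many tosses the spread should roughly halve, so the ratio should land near 2. The theoretical values are 0.5/\sqrt{n}, about 0.0158 and 0.0079 here.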

Recap

  • A probability distribution assigns mass to events.
  • Sampling + counting = empirical frequencies.
  • The law of large numbers connects the two: estimates converge to the true probabilities, and by the central limit theorem their typical error is of order 1/\sqrt{n}.
  • The rest of the chapter formalizes random variables, expectations, and joint / conditional / marginal distributions — the vocabulary for everything that follows.