A fair coin, 100 tosses

Probability and Statistics

Probability for Learning

Most of machine learning is inference under uncertainty:

  • Models output distributions over labels, not labels.
  • Many standard losses, such as cross-entropy, are negative log-likelihoods.
  • Generalization, regularization, and Bayesian methods all rest on probability.

The chapter’s running example is tossing a fair coin: as the sample count grows, the empirical frequencies converge to the true probability P(\text{heads}) = 0.5.

The standard d2l prelude (plus a multinomial distribution we’ll use shortly):

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.numpy.random import multinomial
import random
npx.set_np()

We can simulate coin flips with random.random():

num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])

The split is near 50/50 but not exactly — sampling has variance.
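
To put a number on that variance: the heads count in n independent fair tosses is binomial, with standard deviation \sqrt{n p (1 - p)}, which is 5 for n = 100. The sketch below checks this by repeating the 100-toss experiment many times using only the standard library. The helper name count_heads, the seed, and the repetition count are illustrative assumptions, not part of the original notebook.

```python
import math
import random
import statistics

def count_heads(num_tosses, rng):
    """Count heads in num_tosses fair tosses (hypothetical helper)."""
    return sum(rng.random() > 0.5 for _ in range(num_tosses))

rng = random.Random(0)   # seeded only so the run is repeatable
num_tosses = 100
num_repeats = 2000       # arbitrary; more repeats give a steadier estimate

head_counts = [count_heads(num_tosses, rng) for _ in range(num_repeats)]

# Binomial theory: sd = sqrt(n * p * (1 - p)) with p = 0.5
theoretical_sd = math.sqrt(num_tosses * 0.5 * 0.5)
empirical_mean = statistics.mean(head_counts)
empirical_sd = statistics.stdev(head_counts)

print("theoretical sd:", theoretical_sd)
print("empirical mean, sd:", empirical_mean, empirical_sd)
```

A single run landing at, say, 46 or 55 heads is therefore unremarkable: it is within about one standard deviation of 50.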

Sampling from a distribution

A cleaner abstraction: a Multinomial over the categories {heads, tails} with probabilities [0.5, 0.5]. One call returns the count vector for 100 tosses:

fair_probs = [0.5, 0.5]
multinomial(100, fair_probs)

Divide by the trial count to get empirical frequencies — estimates of P(\text{heads}) and P(\text{tails}):

multinomial(100, fair_probs) / 100
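
For intuition about what a multinomial draw amounts to, here is a minimal standard-library sketch: sample each toss as a category index, then count. It is not how MXNet implements multinomial; sample_counts is a hypothetical helper and the seed is arbitrary.

```python
import random
from collections import Counter

def sample_counts(num_trials, probs, rng):
    """Draw num_trials categorical samples and return per-category counts.

    Hypothetical stand-in for a single multinomial draw.
    """
    categories = list(range(len(probs)))
    draws = rng.choices(categories, weights=probs, k=num_trials)
    tally = Counter(draws)
    # Counter returns 0 for categories that never appeared
    return [tally[c] for c in categories]

rng = random.Random(42)  # seeded only so the run is repeatable
fair_probs = [0.5, 0.5]
counts = sample_counts(100, fair_probs, rng)
frequencies = [c / 100 for c in counts]
print(counts, frequencies)
```

The counts always sum to the number of trials, so the frequencies sum to 1 by construction.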

More tosses, tighter estimate

With 10 000 tosses, the empirical frequencies sit much closer to 0.5:

counts = multinomial(10000, fair_probs).astype(np.float32)
counts / 10000

This is the law of large numbers: as n \to \infty the empirical mean converges to the true mean.
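
One way to see why this holds for the coin, sketched under the assumption that tosses are independent. The indicator variables X_i and the symbol \hat{p}_n are notation introduced here for illustration.

```latex
% X_i = 1 if toss i lands heads, 0 otherwise; p = P(heads)
\hat{p}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\mathbb{E}[\hat{p}_n] = p, \qquad
\operatorname{Var}(\hat{p}_n) = \frac{p(1-p)}{n}

% Chebyshev's inequality then gives, for any \epsilon > 0,
P\bigl(|\hat{p}_n - p| \ge \epsilon\bigr)
  \le \frac{p(1-p)}{n\,\epsilon^2}
  \xrightarrow[n \to \infty]{} 0
```

For the fair coin with n = 10 000 and \epsilon = 0.02, the bound is 0.25 / (10 000 × 0.0004) = 0.0625. Chebyshev is a loose bound, so the actual probability of missing by that much is far smaller.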

Convergence in pictures

Plot the running estimate of P(\text{heads}) and P(\text{tails}) vs. sample count — the curves zigzag toward 0.5:

counts = multinomial(1, fair_probs, size=10000)
cum_counts = counts.astype(np.float32).cumsum(axis=0)
estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)
d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label='P(coin=heads)')
d2l.plt.plot(estimates[:, 1], label='P(coin=tails)')
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();

The standard error of the estimate shrinks like 1/\sqrt{n} (the variance shrinks like 1/n), so halving the error means quadrupling the sample budget.
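
A quick empirical check of the 1/\sqrt{n} scaling, as a sketch using only the standard library: repeat the experiment many times at n = 100, 400, and 1600, and compare the observed spread of the estimate with \sqrt{p(1 - p)/n} for p = 0.5. The helper names, the seed, and the repetition count are assumptions made for illustration.

```python
import math
import random
import statistics

def estimate_heads_prob(num_tosses, rng):
    """Empirical frequency of heads in num_tosses fair tosses (hypothetical helper)."""
    heads = sum(rng.random() > 0.5 for _ in range(num_tosses))
    return heads / num_tosses

def spread_of_estimate(num_tosses, num_repeats, rng):
    """Standard deviation of the estimate across repeated experiments."""
    estimates = [estimate_heads_prob(num_tosses, rng) for _ in range(num_repeats)]
    return statistics.stdev(estimates)

rng = random.Random(1)   # seeded only so the run is repeatable
num_repeats = 1000       # arbitrary; more repeats give a steadier comparison

observed_by_n = {}
predicted_by_n = {}
for n in (100, 400, 1600):
    observed_by_n[n] = spread_of_estimate(n, num_repeats, rng)
    predicted_by_n[n] = math.sqrt(0.5 * 0.5 / n)  # sqrt(p * (1 - p) / n)
    print(n, round(observed_by_n[n], 4), round(predicted_by_n[n], 4))
```

Each fourfold increase in n should roughly halve the observed spread, up to simulation noise.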

Recap

  • A probability distribution assigns mass to events.
  • Sampling + counting = empirical frequencies.
  • The law of large numbers connects the two: estimates converge to the true probabilities, with a typical error of order 1/\sqrt{n}.
  • The rest of the chapter formalizes random variables, expectations, and joint / conditional / marginal distributions — the vocabulary for everything that follows.