Distributions

Common Probability Distributions

A reference tour of the distributions used throughout the book — what they look like, when they apply, and how to sample / evaluate them in code.

Bernoulli — single coin flip; binary classification conditional.
Discrete uniform — equiprobable categories.
Continuous uniform — random initialization, dropout masks (in expectation).
Binomial — count of successes in n Bernoullis.
Poisson — counts of rare events, e.g. click counts behind a click-through rate (CTR).
Gaussian — by far the most-used; CLT, regression noise model, default prior.

Setup

Imports and plotting helpers are shared across the PMF, PDF, CDF, and sampling examples below.

%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from math import erf, factorial
import numpy as np

Bernoulli

P(X=1) = p, P(X=0) = 1-p. Mean p, variance p(1-p):

p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p])
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

x = np.arange(-1, 2, 0.01)

def F(x):
    # F is right-continuous: it jumps to 1 - p at x = 0 and reaches 1 at x = 1
    return 0 if x < 0 else 1 if x >= 1 else 1 - p

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

1*(np.random.rand(10, 10) < p)
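As a quick sanity check on the stated mean p and variance p(1-p), the sketch below draws many Bernoulli samples with the same thresholding trick and compares the empirical moments with the formulas. The generator `np.random.default_rng`, the seed, and the sample size are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility
p_true = 0.3                    # same success probability as in the plots above

# Bernoulli(p) samples: threshold uniform draws on [0, 1)
samples = (rng.random(100_000) < p_true).astype(int)

sample_mean = samples.mean()
sample_var = samples.var()

print(f"mean: {sample_mean:.3f}  (theory: {p_true:.3f})")
print(f"var:  {sample_var:.3f}  (theory: {p_true * (1 - p_true):.3f})")
```

With this many draws, both estimates should land within a few thousandths of the theoretical values.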

Discrete uniform

Equally likely categories. Maximum entropy on a finite set with no prior knowledge:

n = 5

d2l.plt.stem([i+1 for i in range(n)], n*[1 / n])
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

x = np.arange(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else np.floor(x) / n

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

# The upper bound of np.random.randint is exclusive, so use n + 1 to include n
np.random.randint(1, n + 1, size=(10, 10))
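The maximum-entropy claim can be checked numerically. The following sketch compares the Shannon entropy of the uniform distribution on five outcomes with two non-uniform alternatives; the `entropy` helper and the `skewed` probabilities are hypothetical examples, not taken from the text.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

n_outcomes = 5
uniform = np.full(n_outcomes, 1 / n_outcomes)
skewed = np.array([0.6, 0.1, 0.1, 0.1, 0.1])      # hypothetical non-uniform example
point_mass = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # all mass on one outcome

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
h_point = entropy(point_mass)

print(f"uniform:    {h_uniform:.4f}  (log n = {np.log(n_outcomes):.4f})")
print(f"skewed:     {h_skewed:.4f}")
print(f"point mass: {h_point:.4f}")
```

The uniform case attains log n, the upper bound for any distribution on n outcomes, while the point mass sits at the lower bound of zero.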

Continuous uniform

Density \frac{1}{b-a} on [a, b]. Source of pseudo-random samples for Monte Carlo and dropout:

a, b = 1, 3

x = np.arange(0, 4, 0.01)
p = (x > a)*(x < b)/(b - a)

d2l.plot(x, p, 'x', 'p.d.f.')

def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

(b - a) * np.random.rand(10, 10) + a
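To illustrate the Monte Carlo use mentioned above, here is a minimal sketch that estimates the integral of x^2 over [a, b] from uniform samples. The integrand, seed, and sample size are assumptions picked for the example; the exact value (b^3 - a^3)/3 is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
a, b = 1.0, 3.0
num_samples = 200_000

# Uniform samples on [a, b), built the same way as above
u = (b - a) * rng.random(num_samples) + a

def f(x):
    """Integrand for the demo (an arbitrary choice)."""
    return x**2

# E[f(U)] = (1 / (b - a)) * integral of f over [a, b], so rescale by (b - a)
estimate = (b - a) * f(u).mean()
exact = (b**3 - a**3) / 3

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Exact integral:       {exact:.4f}")
```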

Binomial

Sum of n iid Bernoullis. Bell-shaped for large n (Gaussian limit):

n, p = 10, 0.2

# Compute binomial coefficient
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = np.array([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

x = np.arange(-1, 11, 0.01)
cmf = np.cumsum(pmf)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')

np.random.binomial(n, p, size=(10, 10))
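The "sum of n iid Bernoullis" description can be verified by simulation. This sketch sums rows of Bernoulli draws and compares the resulting counts with the binomial mean np and variance np(1-p); the variable names, seed, and number of draws are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n_trials, p_success = 10, 0.2   # same parameters as the plots above
num_draws = 100_000

# Each row holds n_trials independent Bernoulli(p_success) outcomes
bernoulli = rng.random((num_draws, n_trials)) < p_success

# Summing a row gives one Binomial(n_trials, p_success) sample
counts = bernoulli.sum(axis=1)

mean_est = counts.mean()
var_est = counts.var()

print(f"mean: {mean_est:.3f}  (theory: {n_trials * p_success:.3f})")
print(f"var:  {var_est:.3f}  (theory: {n_trials * p_success * (1 - p_success):.3f})")
```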

Poisson

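Counts of rare events: P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} for k = 0, 1, 2, \ldots, with mean and variance both equal to \lambda. It approximates a binomial with large n and small p, in the limit n \to \infty with np \to \lambda.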
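A minimal sketch of the p.m.f. and of the binomial approximation follows. The rate `lam = 5.0`, the range of k, and the helper names are assumptions made for illustration; the code holds np = \lambda fixed while n grows and tracks the largest gap between the two p.m.f.s.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); zero outside 0..n."""
    if k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

lam = 5.0        # illustrative rate
ks = range(21)   # k = 0, ..., 20 covers almost all of the mass for lam = 5
poisson = [poisson_pmf(k, lam) for k in ks]

# Keep n * p = lam fixed and let n grow: the largest gap should shrink
gaps = {}
for n in [10, 100, 1000]:
    p = lam / n
    gaps[n] = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in ks)
    print(f"n = {n:4d}: max |binomial - Poisson| = {gaps[n]:.5f}")
```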

Poisson CDF

The cumulative distribution sums the probability of observing up to k events:

F(k) = P(X \le k) = \sum_{i=0}^{k} \frac{\lambda^i e^{-\lambda}}{i!}.
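One way to evaluate this CDF is to accumulate the p.m.f., mirroring the binomial example. The sketch below assumes `lam = 5.0` and truncates the support at 20 terms, both illustrative choices; beyond the truncation point it returns 1.0, which is an approximation that ignores a tail mass below 1e-6.

```python
from math import exp, factorial
import numpy as np

lam = 5.0            # illustrative rate
ks = np.arange(20)   # truncated support 0..19 (an assumption of this sketch)
pmf = np.array([lam**int(k) * exp(-lam) / factorial(int(k)) for k in ks])
cdf = np.cumsum(pmf)

def F(x):
    """Poisson CDF at a real x: zero below 0, a step function on the integers."""
    if x < 0:
        return 0.0
    k = int(np.floor(x))
    # Past the truncation point the remaining tail mass is treated as negligible
    return 1.0 if k >= len(cdf) else float(cdf[k])

for x in [-0.5, 0, 2.7, 5, 10]:
    print(f"F({x}) = {F(x):.4f}")
```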

Poisson samples

Sampling turns the distribution into count data: nonnegative integers with mean and variance both near \lambda.
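A short sampling sketch, assuming `lam = 5.0`, NumPy's `Generator.poisson`, and an arbitrary seed, shows both properties: the draws are nonnegative integers, and their empirical mean and variance sit close to \lambda.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
lam = 5.0                       # illustrative rate

# A small grid of samples, in the same shape as the other sampling examples
print(rng.poisson(lam, size=(10, 10)))

# A larger batch for checking the moments
samples = rng.poisson(lam, size=100_000)
sample_mean = samples.mean()
sample_var = samples.var()

print(f"mean: {sample_mean:.3f}, var: {sample_var:.3f}  (both should be near {lam})")
```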

Gaussian

\mathcal{N}(\mu, \sigma^2) — bell curve. CLT makes it the limit of many small contributions; that’s why it’s everywhere:

p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for idx, n in enumerate(ns):
    ks = range(n + 1)
    pmf = np.array([p**k * (1-p)**(n-k) * binom(n, k) for k in ks])
    d2l.plt.subplot(1, 4, idx + 1)
    # Standardize the support: subtract the mean n*p, divide by the standard deviation
    d2l.plt.stem([(k - n*p)/np.sqrt(n*p*(1 - p)) for k in ks], pmf)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()

mu, sigma = 0, 1

x = np.arange(-3, 3, 0.01)
p = 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')

Gaussian (cont.)

Changing \mu shifts the bell curve; changing \sigma spreads it. Samples concentrate near the mean and thin out in the tails.

def phi(x):
    return (1.0 + erf((x - mu) / (sigma * np.sqrt(2)))) / 2.0

d2l.plot(x, np.array([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')

np.random.normal(mu, sigma, size=(10, 10))
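To make "concentrate near the mean and thin out in the tails" concrete, this sketch measures the fraction of samples within 1, 2, and 3 standard deviations of the mean. The parameters `mu_demo` and `sigma_demo` are hypothetical; the fractions should not depend on them, since shifting and scaling only relabel the axis.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
mu_demo, sigma_demo = 2.0, 0.5   # hypothetical shift and spread

samples = rng.normal(mu_demo, sigma_demo, size=200_000)

# Distance from the mean, measured in standard deviations
z = np.abs(samples - mu_demo) / sigma_demo
within = {k: float(np.mean(z < k)) for k in (1, 2, 3)}

for k, frac in within.items():
    print(f"within {k} sigma: {frac:.4f}")
```

For a Gaussian these fractions are approximately 0.683, 0.954, and 0.997.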

Recap

A small toolkit covers most needs: Bernoulli, uniform (discrete/continuous), binomial, Poisson, Gaussian.
CLT makes the Gaussian central — sums of many small effects look Gaussian.
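Each distribution has a closed-form negative log-likelihood (NLL), which serves as a standard loss in deep learning: Bernoulli gives binary cross-entropy, and a Gaussian with fixed variance gives squared error.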
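The link between NLL and familiar losses can be sketched for two of the distributions above. The function names and the toy labels and predictions are hypothetical; the point is that the Bernoulli NLL is the binary cross-entropy, and the Gaussian NLL with \sigma = 1 is half the squared error plus a constant.

```python
import numpy as np

def bernoulli_nll(y, p):
    """-log P(y | p) for y in {0, 1}: the binary cross-entropy."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def gaussian_nll(y, mu, sigma=1.0):
    """-log N(y | mu, sigma^2): scaled squared error plus a constant."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

# Toy binary labels and predicted probabilities (made up for the demo)
y_bin = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.7])
bce = bernoulli_nll(y_bin, p_hat).mean()

# Toy regression targets and predicted means (made up for the demo)
y_real = np.array([1.0, 2.0, 3.0])
mu_hat = np.array([1.5, 2.0, 2.0])
nll = gaussian_nll(y_real, mu_hat)
half_sq_err = 0.5 * (y_real - mu_hat)**2
const = 0.5 * np.log(2 * np.pi)

print(f"binary cross-entropy: {bce:.4f}")
print("Gaussian NLL minus constant:", nll - const)
print("half squared error:         ", half_sq_err)
```

Minimizing either NLL over the model's predictions is therefore the same as minimizing the corresponding standard loss.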