Statistics

Estimator Quality

A primer on the language of estimators that ML borrows heavily from:

Estimator — a procedure that takes data and outputs a guess (e.g. sample mean, MLE).
Bias — \mathbb{E}[\hat\theta] - \theta. Systematic error.
Variance — \text{Var}(\hat\theta). Noise across datasets.
MSE = bias^2 + variance — the basic decomposition that explains overfitting and why regularization helps.

This deck makes the bias-variance tradeoff concrete.
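
The decomposition follows by adding and subtracting the expected estimate. A short derivation, assuming \theta is a fixed (non-random) parameter and every expectation is taken over datasets:

```latex
\begin{aligned}
\mathrm{MSE}(\hat\theta)
  &= \mathbb{E}\big[(\hat\theta - \theta)^2\big] \\
  &= \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta)^2\big] \\
  &= \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]
     + 2\,(\mathbb{E}[\hat\theta] - \theta)\,\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]
     + (\mathbb{E}[\hat\theta] - \theta)^2 \\
  &= \text{Var}(\hat\theta) + \big(\mathbb{E}[\hat\theta] - \theta\big)^2 .
\end{aligned}
```

The cross term vanishes because \mathbb{E}[\hat\theta - \mathbb{E}[\hat\theta]] = 0, leaving variance plus squared bias.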

Evaluating estimators

An estimator is judged by its sampling distribution: repeat the same experiment on fresh datasets and ask where the estimates center and how widely they vary.

from d2l import tensorflow as d2l
import tensorflow as tf

tf.pi = tf.acos(tf.zeros(1)) * 2  # define pi in TensorFlow

# Sample datapoints and create y coordinate
epsilon = 0.1
tf.random.set_seed(8675309)
xs = tf.random.normal((300,))

ys = tf.constant(
    [(tf.reduce_sum(tf.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2)) \
               / tf.sqrt(2*tf.pi*epsilon**2)) / tf.cast(
        tf.size(xs), dtype=tf.float32)).numpy() \
     for i in range(tf.size(xs))])

# Compute true density
xd = tf.range(tf.reduce_min(xs), tf.reduce_max(xs), 0.01)
yd = tf.exp(-xd**2/2) / tf.sqrt(2 * tf.pi)

# Plot the results
d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=tf.reduce_mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(tf.reduce_mean(xs).numpy()):.2f}')
d2l.plt.show()
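
The plot above uses a single dataset. To see a sampling distribution, the experiment can be repeated on fresh datasets, as in the minimal sketch below. It uses only the Python standard library rather than TensorFlow so that it stays self-contained; the helper name `sampling_distribution` and the trial count are illustrative assumptions, not part of the original deck.

```python
import random
import statistics


def sampling_distribution(n, trials, mu=0.0, sigma=1.0, seed=0):
    """Return `trials` sample means, each computed on a fresh dataset of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]


estimates = sampling_distribution(n=300, trials=2000)
center = statistics.fmean(estimates)    # where the estimates center
spread = statistics.pstdev(estimates)   # how widely they vary
theory = 1.0 / 300 ** 0.5               # sigma / sqrt(n) for the sample mean
print(f"center={center:.3f} spread={spread:.3f} theory={theory:.3f}")
```

The center should sit close to the true mean of 0, and the spread close to \sigma/\sqrt{n}.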

Empirical bias / variance

Define helpers for empirical bias and MSE, then draw one dataset from \mathcal{N}(1, 4^2) and estimate \theta with the sample mean:

# Statistical bias: mean of the estimates minus the true parameter
def stat_bias(true_theta, est_theta):
    return tf.reduce_mean(est_theta) - true_theta

# Mean squared error of `data` around the true parameter
def mse(data, true_theta):
    return tf.reduce_mean(tf.square(data - true_theta))

theta_true = 1
sigma = 4
sample_len = 10000
samples = tf.random.normal((sample_len, 1), theta_true, sigma)
theta_est = tf.reduce_mean(samples)
theta_est

<tf.Tensor: shape=(), dtype=float32, numpy=1.018127679824829>

Empirical bias / variance (cont.)

Treating each draw as an estimate of \theta, the empirical MSE (first cell) matches variance plus squared bias (second cell) up to floating-point rounding, which makes the bias-variance decomposition visible.

mse(samples, theta_true)

<tf.Tensor: shape=(), dtype=float32, numpy=15.6053466796875>

bias = stat_bias(theta_true, theta_est)
tf.square(tf.math.reduce_std(samples)) + tf.square(bias)

<tf.Tensor: shape=(), dtype=float32, numpy=15.605347633361816>
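
The same check can be run across many datasets with an estimator that is actually biased. The sketch below is an illustration under assumed settings (standard library only; `mle_variance` and `empirical_bias_variance_mse` are hypothetical helper names): it applies the divide-by-n variance estimator to repeated Gaussian datasets and compares empirical MSE with bias^2 + variance.

```python
import random
import statistics


def mle_variance(xs):
    """Variance estimate that divides by n (biased) rather than n - 1."""
    m = statistics.fmean(xs)
    return statistics.fmean((x - m) ** 2 for x in xs)


def empirical_bias_variance_mse(estimator, true_value, n, trials, sigma, seed=0):
    """Apply `estimator` to `trials` fresh N(0, sigma^2) datasets of size n."""
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_value
    variance = statistics.fmean((e - mean_est) ** 2 for e in estimates)
    mse = statistics.fmean((e - true_value) ** 2 for e in estimates)
    return bias, variance, mse


sigma, n = 2.0, 5
bias, variance, mse = empirical_bias_variance_mse(
    mle_variance, true_value=sigma ** 2, n=n, trials=20000, sigma=sigma)
print(f"bias={bias:.3f} (theory {-sigma ** 2 / n:.3f})")
print(f"variance={variance:.3f} mse={mse:.3f} "
      f"bias^2+variance={bias ** 2 + variance:.3f}")
```

For this estimator the theoretical bias is -\sigma^2/n, so the empirical bias should land near -0.8, while the identity MSE = bias^2 + variance holds for the empirical quantities up to rounding.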

A Gaussian example

Sample mean for \mathcal{N}(\mu, \sigma^2): unbiased, variance \sigma^2/n. The code below uses this to build an approximate 95% confidence interval for \mu from a single dataset:

# Number of samples
N = 1000

# Sample dataset
samples = tf.random.normal((N,), 0, 1)

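# Critical value for a 95% interval: 1.96 is the standard normal quantile,
# which Student's t approaches for large N (about 1.962 at N - 1 = 999 degrees of freedom)
t_star = 1.96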

# Construct interval
mu_hat = tf.reduce_mean(samples)
sigma_hat = tf.math.reduce_std(samples)
(mu_hat - t_star*sigma_hat/tf.sqrt(tf.constant(N, dtype=tf.float32)), \
 mu_hat + t_star*sigma_hat/tf.sqrt(tf.constant(N, dtype=tf.float32)))

(<tf.Tensor: shape=(), dtype=float32, numpy=-0.08490262180566788>,
 <tf.Tensor: shape=(), dtype=float32, numpy=0.043600596487522125>)
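
One interval says little on its own; the "95%" describes how often such intervals contain \mu across repeated datasets. A minimal coverage check, assuming standard-library Python, a hypothetical helper `covers`, and the n - 1 form of the standard deviation (the cell above uses `tf.math.reduce_std`, which divides by n):

```python
import random
import statistics


def covers(mu, sigma, n, rng, z=1.96):
    """Build one interval mu_hat +/- z * sigma_hat / sqrt(n); report if it contains mu."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    mu_hat = statistics.fmean(xs)
    # Sample standard deviation with the n - 1 denominator
    sigma_hat = (sum((x - mu_hat) ** 2 for x in xs) / (n - 1)) ** 0.5
    half_width = z * sigma_hat / n ** 0.5
    return mu_hat - half_width <= mu <= mu_hat + half_width


rng = random.Random(0)
trials = 2000
coverage = sum(covers(0.0, 1.0, 100, rng) for _ in range(trials)) / trials
print(f"empirical coverage: {coverage:.3f}")
```

The empirical coverage should be close to 0.95, typically slightly below it because 1.96 is the normal rather than the Student's t critical value.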

Recap

Estimator quality: MSE = bias^2 + variance.
Sample mean is BLUE for \mu: the best linear unbiased estimator whenever the noise is uncorrelated with zero mean and equal variance (Gauss-Markov); Gaussianity is not required.
Regularization can trade a small increase in bias for a larger reduction in variance, lowering MSE.
Same trade-off shows up everywhere: dropout, weight decay, ensembling.
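
As a toy illustration of that trade, consider shrinking the sample mean toward zero by a factor c. This is a stand-in for regularization chosen for simplicity, not a method from the deck, and `shrunk_mean_mse` is a hypothetical helper. For c times the sample mean, bias^2 = ((c - 1)\theta)^2 and variance = c^2 \sigma^2 / n, so the MSE can be computed exactly:

```python
def shrunk_mean_mse(c, theta, sigma, n):
    """Exact bias^2, variance and MSE of the estimator c * sample_mean."""
    bias_sq = ((c - 1.0) * theta) ** 2
    variance = c ** 2 * sigma ** 2 / n
    return bias_sq, variance, bias_sq + variance


# theta and sigma match the earlier simulation; n = 10 is an assumed small sample
theta, sigma, n = 1.0, 4.0, 10
for c in (1.0, 0.8, 0.5, 0.2):
    b2, var, mse = shrunk_mean_mse(c, theta, sigma, n)
    print(f"c={c:.1f} bias^2={b2:.3f} variance={var:.3f} mse={mse:.3f}")

# Setting d(MSE)/dc = 0 gives the MSE-minimizing factor. It depends on the
# unknown theta, so in practice c would be tuned (for example by validation).
best_c = theta ** 2 / (theta ** 2 + sigma ** 2 / n)
print(f"best c={best_c:.3f} mse={shrunk_mean_mse(best_c, theta, sigma, n)[2]:.3f}")
```

With c = 1 the estimator is unbiased but carries the full variance \sigma^2/n; moderate shrinkage adds bias yet lowers the total MSE in this setting.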