from d2l import torch as d2l
import numpy as np
from scipy.spatial import distance_matrix
d2l.set_figsize()Gaussian processes = a probability distribution over functions. Pick any finite set of input points \{x_1, \ldots, x_n\}; the values \{f(x_1), \ldots, f(x_n)\} are jointly Gaussian:
f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')).
Specified by a mean function m (often 0) and a covariance kernel k. Sampling from a GP at finitely many points = sampling from a multivariate Gaussian whose covariance matrix is filled in by k.
GPs are nonparametric — model capacity grows with the data. Closed-form Bayesian regression for free; principled uncertainty everywhere. The substrate of Bayesian optimization, active learning, and uncertainty quantification.
Pick a kernel, evaluate it at a grid → covariance matrix → multivariate normal sample. Each sample is a smooth function:
def lin_func(x, n_sample):
preds = np.zeros((n_sample, x.shape[0]))
for ii in range(n_sample):
w = np.random.normal(0, 1, 2)
y = w[0] + w[1] * x
preds[ii, :] = y
return preds
x_points = np.linspace(-5, 5, 50)
outs = lin_func(x_points, 10)
lw_bd = -2 * np.sqrt((1 + x_points ** 2))
up_bd = 2 * np.sqrt((1 + x_points ** 2))
d2l.plt.fill_between(x_points, lw_bd, up_bd, alpha=0.25)
d2l.plt.plot(x_points, np.zeros(len(x_points)), linewidth=4, color='black')
d2l.plt.plot(x_points, outs.T)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("f(x)", fontsize=20)
d2l.plt.show()k(x, x') = \sigma^2 \exp\!\left(-\frac{(x - x')^2}{2 \ell^2}\right).
Two hyperparameters: amplitude \sigma and length-scale \ell. Long \ell → smooth functions; short \ell → wiggly functions. Most-used kernel for general-purpose regression:
def rbfkernel(x1, x2, ls=4.):
dist = distance_matrix(np.expand_dims(x1, 1), np.expand_dims(x2, 1))
return np.exp(-(1. / ls**2 / 2) * (dist ** 2))
x_points = np.linspace(0, 5, 50)
meanvec = np.zeros(len(x_points))
covmat = rbfkernel(x_points,x_points, 1)
prior_samples= np.random.multivariate_normal(meanvec, covmat, size=5);
d2l.plt.plot(x_points, prior_samples.T, alpha=0.5)
d2l.plt.show()An infinitely wide one-hidden-layer Bayesian neural network becomes a Gaussian process:
f(x) = b + \sum_{i=1}^{J} v_i h(x; u_i).
With zero-mean independent weights, finite collections of function values are jointly Gaussian, with
m(x)=E[f(x)]=0, k(x,x')=E[f(x)f(x')].
RBF kernels are stationary: behavior depends only on x-x'. Neural network kernels are non-stationary, so the function class can change across input space.