Gaussian Process Priors

Gaussian processes = a probability distribution over functions. Pick any finite set of input points \{x_1, \ldots, x_n\}; the values \{f(x_1), \ldots, f(x_n)\} are jointly Gaussian:

f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')).

Specified by a mean function m (often 0) and a covariance kernel k. Sampling from a GP at finitely many points = sampling from a multivariate Gaussian whose covariance matrix is filled in by k.

GPs are nonparametric — model capacity grows with the data. Closed-form Bayesian regression for free; principled uncertainty everywhere. The substrate of Bayesian optimization, active learning, and uncertainty quantification.

Setup

from d2l import torch as d2l
import numpy as np
from scipy.spatial import distance_matrix

d2l.set_figsize()

Sampling from a simple GP

Pick a kernel, evaluate it at a grid → covariance matrix → multivariate normal sample. Each sample is a smooth function:

def lin_func(x, n_sample):
    preds = np.zeros((n_sample, x.shape[0]))
    for ii in range(n_sample):
        w = np.random.normal(0, 1, 2)
        y = w[0] + w[1] * x
        preds[ii, :] = y
    return preds

x_points = np.linspace(-5, 5, 50)
outs = lin_func(x_points, 10)
lw_bd = -2 * np.sqrt((1 + x_points ** 2))
up_bd = 2 * np.sqrt((1 + x_points ** 2))

d2l.plt.fill_between(x_points, lw_bd, up_bd, alpha=0.25)
d2l.plt.plot(x_points, np.zeros(len(x_points)), linewidth=4, color='black')
d2l.plt.plot(x_points, outs.T)
d2l.plt.xlabel("x", fontsize=20)
d2l.plt.ylabel("f(x)", fontsize=20)
d2l.plt.show()

The RBF kernel

k(x, x') = \sigma^2 \exp\!\left(-\frac{(x - x')^2}{2 \ell^2}\right).

Two hyperparameters: amplitude \sigma and length-scale \ell. Long \ell → smooth functions; short \ell → wiggly functions. Most-used kernel for general-purpose regression:

def rbfkernel(x1, x2, ls=4.):
    dist = distance_matrix(np.expand_dims(x1, 1), np.expand_dims(x2, 1))
    return np.exp(-(1. / ls**2 / 2) * (dist ** 2))

x_points = np.linspace(0, 5, 50)
meanvec = np.zeros(len(x_points))
covmat = rbfkernel(x_points,x_points, 1)

prior_samples= np.random.multivariate_normal(meanvec, covmat, size=5);
d2l.plt.plot(x_points, prior_samples.T, alpha=0.5)
d2l.plt.show()

Neural network kernels

An infinitely wide one-hidden-layer Bayesian neural network becomes a Gaussian process:

f(x) = b + \sum_{i=1}^{J} v_i h(x; u_i).

With zero-mean independent weights, finite collections of function values are jointly Gaussian, with

m(x)=E[f(x)]=0, k(x,x')=E[f(x)f(x')].

RBF kernels are stationary: behavior depends only on x-x'. Neural network kernels are non-stationary, so the function class can change across input space.

Recap

GP = distribution over functions defined by mean + kernel.
Finite-dim marginal = multivariate Gaussian; that’s what makes GPs tractable.
Kernel design = function-class design (smoothness, periodicity, ARD, deep kernels, …).
Posterior inference (next deck) is closed-form Gaussian arithmetic.