import jax
from jax import numpy as jnp
import numpy as np

def nansum(x):
    # Sum that treats nan entries (e.g. from 0 * log 0) as zero.
    return jnp.nansum(x)

def self_information(p):
    # Information content of an event with probability p, in bits.
    return -jnp.log2(jnp.array(p)).item()

self_information(1 / 64)
Information theory (Shannon, 1948) gives the right language for many things in deep learning:
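Cross-entropy loss = KL divergence between the true and predicted distributions, up to an additive constant (the entropy of the true distribution, which does not depend on the model).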
Rare events carry more information than common ones: an event with probability 1/64 carries -\log_2(1/64) = 6 bits. The log base only chooses the unit: bits for base 2, nats for base e.
6.0
Entropy: H(X) = -\sum_x p(x) \log p(x). It is maximal for the uniform distribution and zero for point masses:
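A minimal sketch of this definition in JAX, measured in bits. The helper name `entropy` and the example distributions are assumptions chosen for illustration; `jnp.nansum` drops the `0 * log2(0)` terms, which implements the convention 0 log 0 = 0.

```python
import jax.numpy as jnp

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = jnp.asarray(p, dtype=jnp.float32)
    # 0 * log2(0) evaluates to nan; nansum skips those terms.
    return jnp.nansum(-p * jnp.log2(p))

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # 2 bits, the maximum for 4 outcomes
h_point = entropy([1.0, 0.0, 0.0, 0.0])        # 0 bits, no uncertainty
h_example = entropy([0.1, 0.5, 0.1, 0.3])      # about 1.6855 bits (illustrative input)
h_example
```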
Array(1.6854753, dtype=float32)
Joint entropy H(X, Y) and conditional entropy H(X \mid Y) are linked by the chain rule H(X, Y) = H(X) + H(Y \mid X):
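The chain rule can be checked numerically on a small joint table. This is a sketch under assumptions: the table `p_xy` and the helper `entropy_bits` are hypothetical, with rows indexing x and columns indexing y.

```python
import jax.numpy as jnp

def entropy_bits(p):
    """Entropy in bits of any array of probabilities that sums to 1."""
    p = jnp.asarray(p, dtype=jnp.float32)
    return jnp.nansum(-p * jnp.log2(p))

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
p_xy = jnp.array([[0.1, 0.5],
                  [0.1, 0.3]])
p_x = p_xy.sum(axis=1)                       # marginal p(x)

h_x = entropy_bits(p_x)                      # H(X)
# H(Y | X) = -sum_{x,y} p(x, y) log2 p(y | x), with p(y | x) = p(x, y) / p(x)
p_y_given_x = p_xy / p_x[:, None]
h_y_given_x = jnp.nansum(-p_xy * jnp.log2(p_y_given_x))

h_xy = entropy_bits(p_xy)                    # H(X, Y), about 1.6855 bits here
chain_rule_gap = h_xy - (h_x + h_y_given_x)  # zero up to float32 rounding
h_xy
```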
Array(1.6854753, dtype=float32)
Mutual information I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) measures how much information X and Y share. It is symmetric, non-negative, and zero iff X and Y are independent:
Array(0.71946037, dtype=float32)
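The three properties can be probed with a short sketch. The helper `mutual_information` and all three joint tables are assumptions for illustration, not the inputs behind the value above; the marginals are computed from the joint table so that they are always consistent with it.

```python
import jax.numpy as jnp

def mutual_information(p_xy):
    """I(X; Y) in bits from a joint table (rows index x, columns index y)."""
    p_xy = jnp.asarray(p_xy, dtype=jnp.float32)
    p_x = p_xy.sum(axis=1, keepdims=True)    # shape (n_x, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # shape (1, n_y)
    # Cells with p(x, y) = 0 give 0 * log2(0) = nan and are dropped by nansum.
    return jnp.nansum(p_xy * jnp.log2(p_xy / (p_x * p_y)))

p_copy = jnp.array([[0.5, 0.0], [0.0, 0.5]])                        # Y is a copy of X
p_indep = jnp.outer(jnp.array([0.6, 0.4]), jnp.array([0.2, 0.8]))   # p(x) p(y)
p_mixed = jnp.array([[0.1, 0.5], [0.1, 0.3]])                       # weak dependence

mi_copy = mutual_information(p_copy)              # 1 bit: Y reveals X completely
mi_indep = mutual_information(p_indep)            # 0 up to rounding
mi_mixed = mutual_information(p_mixed)            # small but positive
mi_mixed_swapped = mutual_information(p_mixed.T)  # swapping X and Y changes nothing
```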
D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0. Asymmetric (not a metric); zero iff p = q:
Small distributions make the abstractions concrete: entropy grows with uncertainty, while KL is zero only when the distributions match.
(8675.0947265625, 8916.708984375, 2.7468957905922875)
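For discrete distributions the same behavior shows up on a much smaller scale. The sketch below assumes a hypothetical helper `kl_divergence` and two illustrative distributions, and it assumes q > 0 wherever p > 0 (otherwise the divergence is infinite).

```python
import jax.numpy as jnp

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p = jnp.asarray(p, dtype=jnp.float32)
    q = jnp.asarray(q, dtype=jnp.float32)
    # Terms with p(x) = 0 evaluate to nan and are dropped, matching 0 log 0 = 0.
    return jnp.nansum(p * jnp.log2(p / q))

p = [0.1, 0.5, 0.1, 0.3]        # illustrative distribution
q = [0.25, 0.25, 0.25, 0.25]    # uniform reference

kl_pq = kl_divergence(p, q)     # about 0.3145 bits
kl_qp = kl_divergence(q, p)     # a different positive value: KL is asymmetric
kl_pp = kl_divergence(p, p)     # 0: identical distributions
```

Against a uniform q over four outcomes, D_KL(p \| q) = \log_2 4 - H(p), so the first value is 2 bits minus the entropy of p.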
Entropy, cross-entropy, and KL differ by which distribution supplies the expectation and which log-probability is scored.
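Written as expectations under p, the three quantities are E_p[-\log p], E_p[-\log q], and E_p[\log p - \log q], so cross-entropy decomposes as H(p, q) = H(p) + D_{KL}(p \| q). A small numerical check, in nats, with hypothetical distributions `p` and `q`:

```python
import jax.numpy as jnp

p = jnp.array([0.1, 0.5, 0.1, 0.3])      # "true" distribution (hypothetical)
q = jnp.array([0.25, 0.25, 0.25, 0.25])  # model distribution (hypothetical)

entropy_p = -jnp.sum(p * jnp.log(p))         # E_p[-log p]
cross_entropy_pq = -jnp.sum(p * jnp.log(q))  # E_p[-log q]
kl_pq = jnp.sum(p * jnp.log(p / q))          # E_p[log p - log q]

# H(p, q) = H(p) + KL(p || q), so the gap is zero up to float32 rounding.
decomposition_gap = cross_entropy_pq - (entropy_p + kl_pq)
```

Because H(p) does not depend on q, minimizing cross-entropy over q is the same as minimizing the KL divergence.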
Multi-class classification: data distribution = one-hot on the true class; model = softmax. Cross-entropy = NLL of the true class:
\mathcal{L} = -\sum_i \log q(y_i \mid x_i).
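A minimal sketch of this loss in JAX. The predicted probabilities `probs` and the `labels` are assumed values for illustration, and the helper averages over examples instead of summing, which only rescales the formula above by 1/n.

```python
import jax
import jax.numpy as jnp

def cross_entropy(probs, labels):
    """Mean negative log-likelihood (nats) of the true classes."""
    # Pick q(y_i | x_i) for each example i.
    picked = probs[jnp.arange(labels.shape[0]), labels]
    return -jnp.log(picked).mean()

# Hypothetical softmax outputs for two examples over three classes.
probs = jnp.array([[0.3, 0.6, 0.1],
                   [0.2, 0.3, 0.5]])
labels = jnp.array([0, 2])

loss = cross_entropy(probs, labels)   # -(log 0.3 + log 0.5) / 2, about 0.9486

# Same value from the one-hot view: cross-entropy between one-hot targets and probs.
one_hot = jax.nn.one_hot(labels, probs.shape[1])
loss_one_hot = -(one_hot * jnp.log(probs)).sum(axis=1).mean()
loss
```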
Array(0.94856, dtype=float32)