from mxnet import np
from mxnet.gluon.metric import CrossEntropy
from mxnet.ndarray import nansum
import random
def self_information(p):
    # Self-information of an event with probability p, measured in bits.
    return -np.log2(p)

self_information(1 / 64)  # 6 bits, since 1/64 = 2**-6

Information theory (Shannon, 1948) gives the right language for many things in deep learning:
Cross-entropy loss equals the KL divergence between the true and predicted distributions plus the entropy of the true distribution, which is constant with respect to the model.
Rare events carry more information than common ones. The log base only chooses the unit: bits for base 2, nats for base e.
Entropy: H(X) = -\sum_x p(x) \log p(x). It is maximal for the uniform distribution and zero for a point mass.
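As a rough illustration of the two extremes, the sketch below computes entropy in bits using only Python's standard `math` module rather than MXNet, so that it runs on its own. The helper name `entropy_bits` and the three example distributions are assumptions made for illustration and are not part of the original code.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; outcomes with p == 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]     # maximal uncertainty over 4 outcomes
skewed = [0.7, 0.1, 0.1, 0.1]          # one outcome dominates
point_mass = [1.0, 0.0, 0.0, 0.0]      # no uncertainty at all

h_uniform = entropy_bits(uniform)      # log2(4) = 2 bits
h_skewed = entropy_bits(skewed)        # strictly between the two extremes
h_point = entropy_bits(point_mass)     # 0 bits

print(h_uniform, h_skewed, h_point)
```

Skipping zero-probability outcomes follows the usual convention that 0 log 0 is treated as 0.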
Joint entropy H(X, Y) and conditional entropy H(X \mid Y) are tied together by the chain rule H(X, Y) = H(X) + H(Y \mid X).
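The chain rule can be checked numerically on a small table. The following is a minimal sketch in plain Python; the joint distribution `joint` is a made-up example, and `entropy_bits` is a hypothetical helper rather than a function from the original code.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; outcomes with p == 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) with X, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x), obtained by summing the joint over y.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_xy = entropy_bits(joint.values())   # H(X, Y)
h_x = entropy_bits(p_x.values())      # H(X)

# H(Y | X) = sum_x p(x) * H(Y | X = x), built from the conditionals p(y | x).
h_y_given_x = 0.0
for x, px in p_x.items():
    conditional = [p / px for (xi, _), p in joint.items() if xi == x]
    h_y_given_x += px * entropy_bits(conditional)

# The two sides of the chain rule agree up to floating-point rounding.
print(h_xy, h_x + h_y_given_x)
```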
Mutual information: I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) measures how much X and Y share. It is symmetric, non-negative, and zero iff X and Y are independent.
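To see the two boundary cases, the sketch below evaluates I(X; Y) = H(X) + H(Y) - H(X, Y) for an independent pair and for a pair where Y is a copy of X. It uses only the standard library; the function names and both joint tables are illustrative assumptions.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; outcomes with p == 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def mutual_information_bits(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) from a dict mapping (x, y) to p(x, y)."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return (entropy_bits(p_x.values()) + entropy_bits(p_y.values())
            - entropy_bits(joint.values()))

# Two fair, independent bits: knowing X says nothing about Y.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Y is always equal to X: knowing X determines Y completely.
copied = {(0, 0): 0.5, (1, 1): 0.5}

mi_independent = mutual_information_bits(independent)  # 0 bits
mi_copied = mutual_information_bits(copied)            # 1 bit, i.e. H(X)

print(mi_independent, mi_copied)
```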
KL divergence: D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0. It is asymmetric (so not a metric) and zero iff p = q.
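The asymmetry is easy to observe on a three-outcome example. The sketch below is plain Python, with `kl_divergence_bits` as a hypothetical helper and `p`, `q` as made-up distributions; it assumes q(x) > 0 wherever p(x) > 0, since the divergence is infinite otherwise.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence_bits(p, q)   # expectation taken under p
kl_qp = kl_divergence_bits(q, p)   # expectation taken under q
kl_pp = kl_divergence_bits(p, p)   # identical distributions

# kl_pq and kl_qp are both positive but differ; kl_pp is zero.
print(kl_pq, kl_qp, kl_pp)
```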
Small distributions make the abstractions concrete: entropy grows with uncertainty, while KL is zero only when the distributions match.
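# Note: this seeds Python's built-in `random` module only; the samples below
# are drawn by MXNet's `np.random`, which keeps its own generator state.
random.seed(1)

nd_len = 10000
# Draw samples from N(0, 1), N(-1, 1), and N(1, 1).
p = np.random.normal(loc=0, scale=1, size=(nd_len, ))
q1 = np.random.normal(loc=-1, scale=1, size=(nd_len, ))
q2 = np.random.normal(loc=1, scale=1, size=(nd_len, ))

# Sort each sample in ascending order.
p = np.array(sorted(p.asnumpy()))
q1 = np.array(sorted(q1.asnumpy()))
q2 = np.array(sorted(q2.asnumpy()))

Entropy, cross-entropy, and KL differ by which distribution supplies the expectation and which log-probability is scored.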
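One way to make that distinction concrete is to compute all three quantities with a single scoring function. In the sketch below (standard library only), `cross_entropy_bits` is a hypothetical helper and `p`, `q` are illustrative distributions: the first argument supplies the expectation and the second supplies the log-probability being scored.

```python
import math

def cross_entropy_bits(p, q):
    """CE(p, q) = -sum_x p(x) log2 q(x): expectation under p, scores from q."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (illustrative)
q = [0.25, 0.25, 0.5]   # model distribution (illustrative)

entropy = cross_entropy_bits(p, p)         # H(p): p scores itself
cross_entropy = cross_entropy_bits(p, q)   # CE(p, q): p averages, q scores
kl_from_gap = cross_entropy - entropy      # D_KL(p || q) = CE(p, q) - H(p)

# KL computed directly from its definition, for comparison.
kl_direct = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(entropy, cross_entropy, kl_from_gap, kl_direct)
```

Because H(p) does not depend on q, minimizing cross-entropy over q is the same as minimizing the KL divergence.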
Multi-class classification: the data distribution is one-hot on the true class and the model is a softmax. Cross-entropy then reduces to the negative log-likelihood (NLL) of the true class:
\mathcal{L} = -\sum_i \log q(y_i \mid x_i).
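The loss above can be sketched in a few lines of plain Python. The names `softmax` and `nll_loss`, the logits, and the labels are all assumptions chosen for illustration; this is not the MXNet `CrossEntropy` metric imported earlier, and it reports the loss in nats (natural log).

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(batch_logits, labels):
    """Sum over examples of -log q(y_i | x_i), in nats."""
    return -sum(math.log(softmax(logits)[y])
                for logits, y in zip(batch_logits, labels))

# Hypothetical batch: two examples, three classes.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]

loss = nll_loss(batch_logits, labels)

# The same number from the cross-entropy form with one-hot targets:
# only the true-class term of -sum_k p_k log q_k survives.
one_hot_loss = 0.0
for logits, y in zip(batch_logits, labels):
    q = softmax(logits)
    one_hot = [1.0 if k == y else 0.0 for k in range(len(q))]
    one_hot_loss += -sum(pk * math.log(qk) for pk, qk in zip(one_hot, q) if pk > 0)

print(loss, one_hot_loss)
```

With all-equal logits over K classes the per-example loss is log K, which is a handy sanity check for an untrained classifier.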