Entropy
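H(X) = -\sum_x p(x) \log p(x), with the convention 0 \log 0 = 0. Over a finite alphabet of n outcomes, entropy is maximized by the uniform distribution, where H(X) = \log n, and it is zero exactly when X is a point mass.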
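As a minimal sketch of the definition (standard library only; the helper name `entropy` and the three example distributions are illustrative assumptions, not part of the original notes), the following computes entropy in bits and compares a uniform, a skewed, and a point-mass distribution over four outcomes:

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # sum p log(1/p) equals -sum p log p; zero-probability terms are skipped
    # (convention 0 log 0 = 0).
    return sum(px * math.log(1.0 / px, base) for px in p if px > 0)

# Illustrative distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
point_mass = [1.0, 0.0, 0.0, 0.0]

h_uniform = entropy(uniform)     # log2(4) = 2 bits, the maximum for 4 outcomes
h_skewed = entropy(skewed)       # strictly between 0 and 2 bits
h_point = entropy(point_mass)    # 0 bits: the outcome is certain

print(h_uniform, h_skewed, h_point)
```

Base 2 gives bits; passing `base=math.e` would give nats instead.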
Joint and conditional entropy
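The joint entropy is H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y), and the conditional entropy is H(Y \mid X) = -\sum_{x,y} p(x, y) \log p(y \mid x). They are linked by the chain rule H(X, Y) = H(X) + H(Y \mid X); by symmetry, H(X, Y) = H(Y) + H(X \mid Y) as well.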
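The chain rule can be checked numerically on a small table. This is only a sketch: the joint distribution `joint` below is a made-up example, and the helper `H` is a hypothetical name introduced here.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) with X in {0, 1} and Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}

# Marginal p(x), obtained by summing the joint over y.
p_x = {}
for (x, _y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_joint = H(joint.values())
h_x = H(p_x.values())

# H(Y | X) = -sum_{x,y} p(x, y) log p(y | x), with p(y | x) = p(x, y) / p(x).
h_y_given_x = sum(
    p * math.log2(p_x[x] / p) for (x, _y), p in joint.items() if p > 0
)

# The chain rule says this gap is zero up to floating-point error.
chain_gap = abs(h_joint - (h_x + h_y_given_x))
print(h_joint, h_x, h_y_given_x, chain_gap)
```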
KL divergence
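D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0. It is asymmetric in general, D_{KL}(p \| q) \ne D_{KL}(q \| p), so it is not a metric. It is zero iff p = q, and infinite if q(x) = 0 for some x with p(x) > 0.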
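A short sketch of these properties, using only the standard library. The function name `kl_divergence` and the distributions `p` and `q` are assumptions chosen for illustration, and the result is in nats because the natural logarithm is used.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions listed in the same order."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # a term with p(x) = 0 contributes nothing
        if qx == 0:
            return math.inf   # p puts mass where q has none
        total += px * math.log(px / qx)
    return total

# Illustrative distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

kl_pq = kl_divergence(p, q)   # positive
kl_qp = kl_divergence(q, p)   # positive, but a different value: KL is asymmetric
kl_pp = kl_divergence(p, p)   # zero when the distributions match

print(kl_pq, kl_qp, kl_pp)
```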
Examples
Small distributions make the abstractions concrete: entropy grows with uncertainty, while KL is zero only when the distributions match.
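One way to see both statements at once is a biased coin. The sketch below assumes Bernoulli distributions; the helper names `bernoulli_entropy` and `bernoulli_kl` and the parameter grid are illustrative choices, not taken from the original notes.

```python
import math

def bernoulli_entropy(theta):
    """Entropy in bits of a coin that lands heads with probability theta."""
    return sum(p * math.log2(1.0 / p) for p in (theta, 1.0 - theta) if p > 0)

def bernoulli_kl(a, b):
    """D_KL(Bernoulli(a) || Bernoulli(b)) in bits, assuming 0 < b < 1."""
    pairs = ((a, b), (1.0 - a, 1.0 - b))
    return sum(p * math.log2(p / q) for p, q in pairs if p > 0)

# Entropy rises as the coin moves from deterministic (0.0) to fair (0.5).
thetas = [0.0, 0.1, 0.3, 0.5]
entropies = [bernoulli_entropy(t) for t in thetas]

# KL from a fixed coin with theta = 0.3 to three candidates;
# only the matching candidate (0.3) gives zero.
candidates = [0.1, 0.3, 0.5]
kls = [bernoulli_kl(0.3, b) for b in candidates]

for t, h in zip(thetas, entropies):
    print(f"theta={t:.1f}  H={h:.4f} bits")
for b, d in zip(candidates, kls):
    print(f"q={b:.1f}  KL(0.3 || q)={d:.4f} bits")
```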
Cross-entropy in classification
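In multi-class classification, the target distribution for each example is one-hot on the true class, and the model distribution q(\cdot \mid x_i) is a softmax over the logits. The cross-entropy between the two then reduces to the negative log-likelihood (NLL) of the true class, summed over examples: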
\mathcal{L} = -\sum_i \log q(y_i \mid x_i).
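The loss above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the logits and labels are made up, and the helper names (`log_softmax`, `cross_entropy_loss`, `one_hot_cross_entropy`) are hypothetical rather than any particular library's API. It also checks that the NLL form agrees with the explicit one-hot cross-entropy.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: shift by the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy_loss(batch_logits, labels):
    """Sum over examples of -log q(y_i | x_i)."""
    return -sum(log_softmax(logits)[y] for logits, y in zip(batch_logits, labels))

def one_hot_cross_entropy(logits, y):
    """Cross-entropy between a one-hot target on class y and the softmax output."""
    log_q = log_softmax(logits)
    target = [1.0 if k == y else 0.0 for k in range(len(logits))]
    return -sum(t * lq for t, lq in zip(target, log_q))

# Hypothetical logits for 3 examples and 3 classes, with integer class labels.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [0.0, 0.0, 0.0]]
labels = [0, 2, 1]

loss = cross_entropy_loss(batch_logits, labels)
loss_one_hot = sum(
    one_hot_cross_entropy(logits, y) for logits, y in zip(batch_logits, labels)
)

print(loss, loss_one_hot)  # the two views agree up to floating-point error
```

Frameworks usually average this sum over the batch; the sketch keeps the plain sum to match the formula.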
Recap
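- Entropy is the expected surprise; KL divergence is the expected number of extra bits paid for coding samples from p with a code built for q; cross-entropy is their sum, H(p, q) = H(p) + D_{KL}(p \| q).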
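- Most deep-learning classification minimizes cross-entropy, which is the same as minimizing the KL divergence from the empirical distribution to the model, since the entropy of the empirical distribution does not depend on the model parameters.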
- Mutual information, I(X; Y) = H(X) - H(X \mid Y), appears in InfoNCE / contrastive learning and the information bottleneck (IB) principle, among other places.
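
The link between cross-entropy and KL in the recap can be verified numerically. The sketch below assumes a made-up empirical distribution `p_hat` and two hypothetical model distributions `q_a` and `q_b` with strictly positive entries; all names are illustrative.

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x) in nats, assuming q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """D_KL(p || q) in nats, under the same assumption on q."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical empirical label distribution and two candidate models.
p_hat = [0.6, 0.3, 0.1]
q_a = [0.5, 0.3, 0.2]
q_b = [0.2, 0.2, 0.6]

# Cross-entropy = entropy + KL, so each gap should be about zero.
gaps = [
    cross_entropy(p_hat, q) - (entropy(p_hat) + kl(p_hat, q)) for q in (q_a, q_b)
]

# Because H(p_hat) does not depend on the model, the difference in
# cross-entropy between two models equals the difference in KL.
ce_diff = cross_entropy(p_hat, q_a) - cross_entropy(p_hat, q_b)
kl_diff = kl(p_hat, q_a) - kl(p_hat, q_b)

print(gaps, ce_diff, kl_diff)
```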