Entropy
H(X) = -\sum_x p(x) \log p(x). Maximum at uniform distribution; zero at point masses:
<tf.Tensor: shape=(), dtype=float32, numpy=1.6854753494262695>
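The two extremes named above can be checked with a minimal sketch. It uses only the standard library and measures entropy in bits (base-2 logarithm). The base and the three example distributions are assumptions for illustration; they are not taken from the original cell.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p (probabilities summing to 1)."""
    # Outcomes with p(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    return sum(-px * math.log2(px) for px in p if px > 0)

# Hypothetical distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
point_mass = [1.0, 0.0, 0.0, 0.0]

print(entropy(uniform))     # the maximum for four outcomes, log2(4) bits
print(entropy(skewed))      # strictly between the two extremes
print(entropy(point_mass))  # no uncertainty, so the entropy vanishes
```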
Joint and conditional entropy
The joint entropy H(X, Y) and the conditional entropy H(X \mid Y) are linked by the chain rule H(X, Y) = H(X) + H(Y \mid X):
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.8321928, 0.8532826], dtype=float32)>
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.43903595, 0.42451128], dtype=float32)>
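The chain rule can be verified numerically on a small joint table. This is a sketch under stated assumptions: the 2x2 joint distribution is hypothetical, entropies are in bits, and H(Y \mid X) is computed directly from the conditional distributions rather than by rearranging the identity being checked.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y): rows index x, columns index y.
joint = [
    [0.30, 0.10],
    [0.20, 0.40],
]

p_x = [sum(row) for row in joint]                     # marginal p(x)
h_joint = entropy([p for row in joint for p in row])  # H(X, Y)
h_x = entropy(p_x)                                    # H(X)

# H(Y | X) = sum_x p(x) * H(Y | X = x), using p(y | x) = p(x, y) / p(x).
h_y_given_x = sum(
    px * entropy([pxy / px for pxy in row])
    for px, row in zip(p_x, joint)
    if px > 0
)

print(h_joint)            # H(X, Y)
print(h_x + h_y_given_x)  # H(X) + H(Y | X), equal up to floating-point rounding
```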
KL divergence
D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0. Asymmetric (not a metric); zero iff p = q.
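A short sketch makes the three properties visible: non-negativity, asymmetry, and zero divergence when the distributions coincide. The distributions `p` and `q` are hypothetical, the result is in bits, and the helper assumes q(x) > 0 wherever p(x) > 0 (otherwise the divergence is infinite).

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits. Assumes q(x) > 0 wherever p(x) > 0."""
    # Terms with p(x) = 0 are skipped (0 * log 0 = 0 convention).
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # also positive, but a different value: KL is asymmetric
print(kl_divergence(p, p))  # zero when the two distributions are identical
```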
Examples
Small distributions make the abstractions concrete: entropy grows with uncertainty, while KL is zero only when the distributions match.
(np.float32(8667.229), np.float32(8781.478), np.float32(1.3095415))
(np.float32(13625.918), np.float32(43.23966))
Cross-entropy in classification
Multi-class classification: data distribution = one-hot on the true class; model = softmax. Cross-entropy = NLL of the true class:
\mathcal{L} = -\sum_i \log q(y_i \mid x_i).
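The loss above can be sketched in a few lines. With a one-hot target, the full cross-entropy sum collapses to the negative log-probability of the true class, and the code checks this. The logits and labels are hypothetical, the loss is in nats (natural logarithm), and the helper names are illustrative rather than any library's API.

```python
import math

def log_softmax(logits):
    """Log-probabilities log q(k | x) from raw scores."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy_loss(batch_logits, labels):
    """L = -sum_i log q(y_i | x_i), summed over the batch, in nats."""
    return -sum(log_softmax(logits)[y] for logits, y in zip(batch_logits, labels))

def one_hot_cross_entropy(logits, label):
    """-sum_k t_k log q_k with t the one-hot vector on the true class."""
    target = [1.0 if k == label else 0.0 for k in range(len(logits))]
    return -sum(t * lq for t, lq in zip(target, log_softmax(logits)))

# Hypothetical model outputs for two examples with three classes each.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]

loss = cross_entropy_loss(batch_logits, labels)
check = sum(one_hot_cross_entropy(l, y) for l, y in zip(batch_logits, labels))
print(loss, check)  # the NLL form and the one-hot form agree
```

Working with log-probabilities directly, instead of taking the log of a softmax output, avoids underflow when one logit dominates.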
Recap
- Entropy: expected surprise; KL: extra bits paid for coding with q instead of p; cross-entropy: entropy + KL, i.e. H(p, q) = H(p) + D_{KL}(p \| q).
- Most DL classification = minimizing cross-entropy = minimizing KL to the empirical distribution.
- Mutual information appears in InfoNCE / contrastive learning and the information bottleneck (IB) principle, among other places.
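The first two recap points can be checked numerically. The sketch below confirms that cross-entropy equals entropy plus KL, and that for a one-hot target (zero entropy) cross-entropy and KL coincide, which is why minimizing one minimizes the other. The distributions are hypothetical and all quantities are in bits.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return sum(-px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x). Assumes q(x) > 0 wherever p(x) > 0."""
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits, under the same support assumption."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical data distribution p, model q, and a one-hot target.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]
one_hot = [0.0, 1.0, 0.0]

# Cross-entropy decomposes as entropy plus KL (equal up to rounding).
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# A one-hot target has zero entropy, so cross-entropy and KL agree.
print(cross_entropy(one_hot, q), kl_divergence(one_hot, q))
```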