Information Theory

Information Theory for Learning

Information theory (Shannon, 1948) supplies the vocabulary for several core quantities in deep learning:

Self-information I(x) = -\log p(x): the surprise of observing x.
Entropy H(X) = -\mathbb{E}[\log p(X)]: the expected surprise under a distribution.
Cross-entropy H(p, q) = -\mathbb{E}_{p}[\log q]: the quantity minimized when training a classifier.
KL divergence D_{KL}(p \| q) = H(p, q) - H(p): the “extra bits” needed to encode samples from p with a code optimized for q.
Mutual information I(X; Y) = H(X) - H(X \mid Y): how much knowing Y reduces uncertainty about X (and vice versa).

Cross-entropy loss equals the KL divergence between the true and predicted distributions plus H(p), a constant that does not depend on the model.
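The relation between cross-entropy, entropy, and KL can be checked numerically. The sketch below uses plain Python rather than MXNet; the helper names and the two three-outcome distributions `p` and `q` are made up for illustration.

```python
import math

def entropy_bits(p):
    # H(p) = -sum_x p(x) log2 p(x); zero-probability outcomes are skipped
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    # H(p, q) = -sum_x p(x) log2 q(x)
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    # D_KL(p || q) = sum_x p(x) log2(p(x) / q(x))
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # hypothetical "true" distribution
q = [0.4, 0.4, 0.2]     # hypothetical model prediction

gap = cross_entropy_bits(p, q) - entropy_bits(p)
print(gap, kl_bits(p, q))   # the two numbers agree up to rounding
```

Since H(p) is fixed by the data, lowering H(p, q) by changing q lowers D_{KL}(p \| q) by the same amount.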

Self-information

Rare events carry more information than common ones. The log base only chooses the unit: bits for base 2, nats for base e.

from mxnet import np
from mxnet.gluon.metric import CrossEntropy
from mxnet.ndarray import nansum
import random

def self_information(p):
    return -np.log2(p)

self_information(1 / 64)
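As a small sketch of the unit choice, the same event can be scored in bits or in nats. This version uses only the standard library, and the `base` parameter is an addition for illustration, not part of the function above.

```python
import math

def self_information(p, base=2.0):
    # base 2 gives bits, base e gives nats
    return -math.log(p) / math.log(base)

bits = self_information(1 / 64)
nats = self_information(1 / 64, base=math.e)
print(bits, nats)   # about 6 bits, about 4.159 nats

# Independent events multiply in probability, so their surprises add
both = self_information((1 / 64) * (1 / 4))
print(both, bits + self_information(1 / 4))
```

The two units differ only by the factor ln 2, so one nat is about 1.443 bits.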

Entropy

H(X) = -\sum_x p(x) \log p(x). It is largest for the uniform distribution and zero for a point mass:

def entropy(p):
    ent = -p * np.log2(p)
    # `nansum` skips NaN terms, so 0 * log2(0) contributes nothing
    out = nansum(ent.as_nd_ndarray())
    return out

entropy(np.array([0.1, 0.5, 0.1, 0.3]))
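The two extremes can be checked with a short standard-library sketch. The helper name `entropy_bits` and the uniform and point-mass inputs are illustrative choices; the third input is the distribution used above.

```python
import math

def entropy_bits(p):
    # Zero-probability outcomes are skipped, matching the 0 * log 0 = 0 convention
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

uniform = entropy_bits([0.25] * 4)                 # most uncertain over 4 outcomes
point_mass = entropy_bits([1.0, 0.0, 0.0, 0.0])    # no uncertainty at all
skewed = entropy_bits([0.1, 0.5, 0.1, 0.3])        # the distribution used above

print(uniform, skewed, point_mass)   # about 2, about 1.685, and 0 bits
```

With four outcomes the ceiling is log2(4) = 2 bits, and any departure from uniform lowers the entropy.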

Joint and conditional entropy

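The joint entropy H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y) measures the uncertainty of the pair. The conditional entropy H(Y \mid X) = -\sum_{x,y} p(x, y) \log p(y \mid x) measures what remains uncertain about Y once X is known. The chain rule links them: H(X, Y) = H(X) + H(Y \mid X).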

def joint_entropy(p_xy):
    joint_ent = -p_xy * np.log2(p_xy)
    # `nansum` skips NaN terms produced by zero-probability cells
    out = nansum(joint_ent.as_nd_ndarray())
    return out

joint_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]))

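def conditional_entropy(p_xy, p_x):
    # p_x broadcasts across rows, so each column of p_xy is one value of x
    p_y_given_x = p_xy / p_x
    cond_ent = -p_xy * np.log2(p_y_given_x)
    # `nansum` skips NaN terms produced by zero-probability cells
    out = nansum(cond_ent.as_nd_ndarray())
    return out

# The joint table must sum to 1 and p_x must equal its column sums
conditional_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]), np.array([0.2, 0.8]))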
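The chain rule can be verified on a small joint table. This is a standard-library sketch. The helper `H` is hypothetical, and the table is indexed as `p_xy[i][j] = P(X = i, Y = j)`, a layout chosen for this example only.

```python
import math

def H(probs):
    # Entropy in bits of a flat list of probabilities; zeros are skipped
    return sum(-p * math.log2(p) for p in probs if p > 0)

p_xy = [[0.1, 0.5],
        [0.1, 0.3]]   # rows index X, columns index Y (assumption of this sketch)

p_x = [sum(row) for row in p_xy]                  # marginal of X from row sums
h_joint = H([p for row in p_xy for p in row])     # H(X, Y)
h_x = H(p_x)                                      # H(X)
h_y_given_x = sum(-p * math.log2(p / p_x[i])      # H(Y | X)
                  for i, row in enumerate(p_xy)
                  for p in row if p > 0)

print(h_joint, h_x + h_y_given_x)   # equal up to rounding
```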

Mutual information

I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) measures the information X and Y share. It is symmetric, non-negative, and zero if and only if X and Y are independent:

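def mutual_information(p_xy, p_x, p_y):
    # p_x is a row over the columns of p_xy and p_y a column over its rows,
    # so p_x * p_y broadcasts to the table of products p(x) p(y)
    p = p_xy / (p_x * p_y)
    mutual = p_xy * np.log2(p)
    # `nansum` skips NaN terms produced by zero-probability cells
    out = nansum(mutual.as_nd_ndarray())
    return out

# Marginals are the column sums (p_x) and row sums (p_y) of the joint table
mutual_information(np.array([[0.1, 0.5], [0.1, 0.3]]),
                   np.array([0.2, 0.8]), np.array([[0.6], [0.4]]))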
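A standard-library sketch can confirm the entropy identity and the independence property. The function names are hypothetical, and the marginals are computed from the table itself rather than passed in, which is a design choice of this example.

```python
import math

def H(probs):
    # Entropy in bits; zeros are skipped
    return sum(-p * math.log2(p) for p in probs if p > 0)

def marginals(p_xy):
    p_rows = [sum(row) for row in p_xy]
    p_cols = [sum(col) for col in zip(*p_xy)]
    return p_rows, p_cols

def mutual_information(p_xy):
    # Direct definition: sum p(x, y) log2( p(x, y) / (p(x) p(y)) )
    p_rows, p_cols = marginals(p_xy)
    return sum(p * math.log2(p / (p_rows[i] * p_cols[j]))
               for i, row in enumerate(p_xy)
               for j, p in enumerate(row) if p > 0)

def mi_from_entropies(p_xy):
    # Identity: I(X; Y) = H(X) + H(Y) - H(X, Y)
    p_rows, p_cols = marginals(p_xy)
    return H(p_rows) + H(p_cols) - H([p for row in p_xy for p in row])

dependent = [[0.1, 0.5], [0.1, 0.3]]
independent = [[a * b for b in (0.2, 0.8)] for a in (0.6, 0.4)]  # product table

print(mutual_information(dependent), mi_from_entropies(dependent))
print(mutual_information(independent))   # zero up to rounding
```

For this table the mutual information is small (under 0.01 bits), because the joint is already close to the product of its marginals.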

KL divergence

D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0. Asymmetric (not a metric); zero iff p = q:

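def kl_divergence(p, q):
    kl = p * np.log2(p / q)
    # `nansum` skips NaN terms, e.g. from entries where p / q is negative
    out = nansum(kl.as_nd_ndarray())
    # abs() forces a non-negative result even if p and q are not normalized
    return out.abs().asscalar()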
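On properly normalized inputs, the listed properties are easy to see. The sketch below is a standard-library version without the `abs()` and `nansum` safeguards. The two-outcome distributions are invented for illustration, and the helper assumes `q` is positive wherever `p` is.

```python
import math

def kl_bits(p, q):
    # Assumes q[i] > 0 wherever p[i] > 0; zero-probability terms of p are skipped
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # hypothetical fair coin
q = [0.9, 0.1]   # hypothetical biased coin

print(kl_bits(p, p))   # 0.0, identical distributions
print(kl_bits(p, q))   # roughly 0.74 bits
print(kl_bits(q, p))   # roughly 0.53 bits, so the order of arguments matters
```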

Examples

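The experiment below draws 10,000 samples each from N(0, 1), N(-1, 1), and N(1, 1), sorts them, and passes the sorted arrays to kl_divergence as stand-ins for p, q1, and q2. Because q1 and q2 are shifted by the same amount in opposite directions, D_{KL}(p \| q1) and D_{KL}(p \| q2) should come out close to each other, whereas D_{KL}(q2 \| p) and D_{KL}(p \| q2) should differ noticeably, which illustrates the asymmetry.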

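import mxnet as mx

# Python's `random.seed` does not affect MXNet's generator, so seed MXNet itself
mx.random.seed(1)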

nd_len = 10000
p = np.random.normal(loc=0, scale=1, size=(nd_len, ))
q1 = np.random.normal(loc=-1, scale=1, size=(nd_len, ))
q2 = np.random.normal(loc=1, scale=1, size=(nd_len, ))

p = np.array(sorted(p.asnumpy()))
q1 = np.array(sorted(q1.asnumpy()))
q2 = np.array(sorted(q2.asnumpy()))

kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100

kl_pq1, kl_pq2, similar_percentage

kl_q2p = kl_divergence(q2, p)
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100

kl_q2p, differ_percentage

Formal definitions

Entropy, cross-entropy, and KL differ by which distribution supplies the expectation and which log-probability is scored.

def cross_entropy(y_hat, y):
    # Pick each row's predicted probability for its true class (natural log, so nats)
    ce = -np.log(y_hat[range(len(y_hat)), y])
    return ce.mean()

labels = np.array([0, 2])
preds = np.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])

cross_entropy(preds, labels)
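The arithmetic behind this call can be traced by hand. The standard-library sketch below reuses the same `preds` and `labels` and computes the loss two ways. The helpers `one_hot` and `cross_entropy_nats` are hypothetical names introduced here.

```python
import math

preds = [[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]
labels = [0, 2]

# Route 1: pick the probability of the true class in each row
picked = [row[y] for row, y in zip(preds, labels)]   # [0.3, 0.5]
nll = sum(-math.log(p) for p in picked) / len(picked)

# Route 2: full cross-entropy H(p, q) against a one-hot target p
def one_hot(y, k):
    return [1.0 if j == y else 0.0 for j in range(k)]

def cross_entropy_nats(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

ce = sum(cross_entropy_nats(one_hot(y, 3), row)
         for row, y in zip(preds, labels)) / len(labels)

print(nll, ce)   # both about 0.9486 nats
```

A one-hot target zeroes every term except the true class, which is why the two routes coincide.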

Cross-entropy in classification

Multi-class classification: data distribution = one-hot on the true class; model = softmax. Cross-entropy = NLL of the true class:

\mathcal{L} = -\sum_i \log q(y_i \mid x_i).

nll_loss = CrossEntropy()
# MX 2.0's CrossEntropy operates on np arrays directly (it calls
# `pred.to_device(label.device)` internally); no `as_nd_ndarray()` cast.
nll_loss.update([labels], [preds])
nll_loss.get()
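When the model outputs raw scores rather than probabilities, the softmax and the logarithm are usually fused. The following is a minimal standard-library sketch of that idea, not MXNet's implementation. The logits and the function names are made up for illustration.

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating so large scores do not overflow
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def nll(batch_logits, labels):
    # L = -sum_i log q(y_i | x_i), with q given by the softmax of the logits
    return -sum(log_softmax(z)[y] for z, y in zip(batch_logits, labels))

batch_logits = [[2.0, 0.5, -1.0], [1000.0, 0.0, 999.0]]   # hypothetical scores
labels = [0, 2]

loss = nll(batch_logits, labels)
probs = [math.exp(v) for v in log_softmax(batch_logits[0])]
print(loss, sum(probs))   # a finite loss; the probabilities sum to 1 up to rounding
```

Exponentiating 1000.0 directly would overflow a float, which is why the maximum is subtracted first.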

Recap

Entropy: expected surprise; KL: extra bits; cross-entropy: KL + entropy.
Most DL classification = minimizing cross-entropy = minimizing KL to the empirical distribution.
Mutual information appears in InfoNCE / contrastive learning, the information bottleneck (IB) principle, and elsewhere.