Information Theory

Information Theory for Learning

Information theory (Shannon, 1948) supplies precise definitions for several quantities that recur throughout deep learning:

Self-information I(x) = -\log p(x) — surprise of observing x.
Entropy H(X) = -\mathbb{E}[\log p(X)] — expected surprise of a distribution.
Cross-entropy H(p, q) = -\mathbb{E}_{p}[\log q] — what we minimize during classification.
KL divergence D_{KL}(p \| q) = H(p, q) - H(p) — “extra bits” needed to encode p using q.
Mutual information — how much knowing X reduces uncertainty about Y.

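Because H(p, q) = D_{KL}(p \| q) + H(p) and H(p) does not depend on the model, minimizing the cross-entropy loss is equivalent to minimizing the KL divergence between the true and predicted distributions.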
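
The decomposition of cross-entropy into KL divergence plus entropy can be checked numerically. The following is a minimal sketch in plain Python rather than TensorFlow; the helper names and the two distributions `p` and `q` are illustrative assumptions, not values from the text.

```python
import math

def entropy_bits(p):
    # H(p) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes.
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    # H(p, q) = -sum_x p(x) log2 q(x).
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    # D_KL(p || q) = sum_x p(x) log2(p(x) / q(x)).
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # hypothetical "true" distribution
q = [0.4, 0.4, 0.2]    # hypothetical model distribution

# The gap between H(p, q) and D_KL(p || q) + H(p) should vanish
# up to floating-point rounding.
gap = cross_entropy_bits(p, q) - (kl_bits(p, q) + entropy_bits(p))
print(abs(gap) < 1e-12)
```

Since `entropy_bits(p)` is fixed by the data, any change in `q` moves the cross-entropy and the KL divergence by the same amount.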

Self-information

Rare events carry more information than common ones. The log base only chooses the unit: bits for base 2, nats for base e.

import tensorflow as tf

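def log2(x):
    # Base-2 logarithm via change of base, so results are in bits.
    return tf.math.log(x) / tf.math.log(2.)

def nansum(x):
    # Sum over the last axis, treating NaN terms (such as 0 * log 0) as 0.
    return tf.reduce_sum(
        tf.where(tf.math.is_nan(x), tf.zeros_like(x), x), axis=-1)

def self_information(p):
    # Surprise of an event with probability `p`, in bits.
    return -log2(tf.constant(p)).numpy()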

self_information(1 / 64)

np.float32(6.0)
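
To make the remark about units concrete, the sketch below measures the same event in bits and in nats. It uses only the standard library, and the `base` parameter is an illustrative addition that is not part of the TensorFlow helper above.

```python
import math

def self_information(p, base=2.0):
    # I(x) = -log_base p(x); base 2 gives bits, base e gives nats.
    return -math.log(p) / math.log(base)

bits = self_information(1 / 64)          # 6 bits, since 2**-6 = 1/64
nats = self_information(1 / 64, math.e)  # the same event measured in nats

# The two units differ by the constant factor ln 2.
print(bits, nats, nats / bits)
```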

Entropy

H(X) = -\sum_x p(x) \log p(x). It is largest for the uniform distribution and zero for a point mass:

def entropy(p):
    return nansum(- p * log2(p))

entropy(tf.constant([0.1, 0.5, 0.1, 0.3]))

<tf.Tensor: shape=(), dtype=float32, numpy=1.6854753494262695>
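
The two extremes can be checked directly. This is a small standard-library sketch; `entropy_bits` is an assumed helper name, and the uniform and point-mass vectors are illustrative choices.

```python
import math

def entropy_bits(p):
    # H(p) in bits; outcomes with zero probability contribute nothing.
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25] * 4
skewed = [0.1, 0.5, 0.1, 0.3]      # same values as the example above
point_mass = [1.0, 0.0, 0.0, 0.0]

h_uniform = entropy_bits(uniform)   # log2(4) = 2 bits, the maximum for 4 outcomes
h_skewed = entropy_bits(skewed)     # about 1.685 bits
h_point = entropy_bits(point_mass)  # 0 bits: the outcome is certain

print(h_point < h_skewed < h_uniform)
```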

Joint and conditional entropy

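The joint entropy H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y) and the conditional entropy H(Y \mid X) = -\sum_{x,y} p(x, y) \log p(y \mid x) are tied together by the chain rule H(X, Y) = H(X) + H(Y \mid X):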

def joint_entropy(p_xy):
    joint_ent = -p_xy * log2(p_xy)
    # `nansum` reduces only the last axis, so a 2-D input yields one partial
    # sum per row; adding those entries gives the scalar H(X, Y).
    out = nansum(joint_ent)
    return out

joint_entropy(tf.constant([[0.1, 0.5], [0.1, 0.3]]))

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.8321928, 0.8532826], dtype=float32)>

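def conditional_entropy(p_xy, p_x):
    # `p_x` broadcasts along the last axis: column j is divided by p_x[j].
    p_y_given_x = p_xy/p_x
    cond_ent = -p_xy * log2(p_y_given_x)
    # `nansum` reduces only the last axis, giving one partial sum per row.
    out = nansum(cond_ent)
    return out

# Note: these joint entries sum to 1.1, so the input is not a normalized
# distribution; the output below reflects the values as given.
conditional_entropy(tf.constant([[0.1, 0.5], [0.2, 0.3]]),
                    tf.constant([0.2, 0.8]))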

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.43903595, 0.42451128], dtype=float32)>
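
The chain rule can be verified on a normalized joint table by deriving the marginal from the joint instead of passing it in separately. This is a hedged standard-library sketch: the convention that rows index x and columns index y is an assumption made for the example.

```python
import math

p_xy = [[0.1, 0.5], [0.1, 0.3]]  # rows index x, columns index y (assumed)

# Marginal p(x): sum the joint over y.
p_x = [sum(row) for row in p_xy]

# H(X, Y) = -sum_{x,y} p(x, y) log2 p(x, y).
h_xy = sum(-p * math.log2(p) for row in p_xy for p in row if p > 0)

# H(X) = -sum_x p(x) log2 p(x).
h_x = sum(-p * math.log2(p) for p in p_x if p > 0)

# H(Y | X) = -sum_{x,y} p(x, y) log2 p(y | x), with p(y | x) = p(x, y) / p(x).
h_y_given_x = sum(
    -p * math.log2(p / p_x[i])
    for i, row in enumerate(p_xy)
    for p in row
    if p > 0
)

# Chain rule: H(X, Y) = H(X) + H(Y | X), up to rounding.
print(abs(h_xy - (h_x + h_y_given_x)) < 1e-12)
```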

Mutual information

I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) — how much X and Y share. Symmetric, non-negative, zero iff independent:

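def mutual_information(p_xy, p_x, p_y):
    # For a correct result, `p_x * p_y` must broadcast to the outer product
    # p(x) p(y) laid out like `p_xy`.
    p = p_xy / (p_x * p_y)
    mutual = p_xy * log2(p)
    # `nansum` reduces only the last axis, giving one partial sum per row.
    out = nansum(mutual)
    return out

# Note: with these shapes `p_x * p_y` is an elementwise product of shape
# (1, 2) rather than an outer product, and [0.75, 0.25] is not a marginal
# of `p_xy`; the output below reflects the inputs as given.
mutual_information(tf.constant([[0.1, 0.5], [0.1, 0.3]]),
                   tf.constant([0.2, 0.8]), tf.constant([[0.75, 0.25]]))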

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.60246783, 0.1169925 ], dtype=float32)>
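
The stated properties (non-negative, symmetric, zero under independence) are easiest to see when both marginals are computed from the joint itself. The sketch below does that in plain Python; `mutual_information_bits` and the `independent` table are illustrative assumptions.

```python
import math

def mutual_information_bits(p_xy):
    # Derive both marginals from the joint so the inputs cannot disagree.
    p_x = [sum(row) for row in p_xy]        # sum over columns (y)
    p_y = [sum(col) for col in zip(*p_xy)]  # sum over rows (x)
    # I(X; Y) = sum_{x,y} p(x, y) log2( p(x, y) / (p(x) p(y)) ).
    return sum(
        p * math.log2(p / (p_x[i] * p_y[j]))
        for i, row in enumerate(p_xy)
        for j, p in enumerate(row)
        if p > 0
    )

dependent = [[0.1, 0.5], [0.1, 0.3]]
# Outer product of [0.4, 0.6] and [0.3, 0.7], so X and Y are independent.
independent = [[0.12, 0.28], [0.18, 0.42]]

mi_dep = mutual_information_bits(dependent)    # small but positive
mi_ind = mutual_information_bits(independent)  # zero up to rounding

# Transposing the table swaps the roles of X and Y.
mi_dep_swapped = mutual_information_bits([list(col) for col in zip(*dependent)])
print(mi_dep > 0, abs(mi_ind) < 1e-12, abs(mi_dep - mi_dep_swapped) < 1e-12)
```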

KL divergence

D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0. Asymmetric (not a metric); zero iff p = q:

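def kl_divergence(p, q):
    kl = p * log2(p / q)
    out = nansum(kl)
    # KL is non-negative for valid distributions; `tf.abs` only guards
    # against a negative sum when the inputs are not normalized probabilities.
    return tf.abs(out).numpy()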

Examples

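The example draws 10,000 samples each from N(0, 1), N(-1, 1), and N(1, 1), sorts them, and passes the sorted samples to `kl_divergence`. The inputs are raw samples rather than normalized probability vectors, so the numbers are only a rough illustration. Replacing q1 with q2 changes the result by about 1%, whereas swapping the argument order changes it by about 43%, which reflects the asymmetry of KL.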

tensor_len = 10000
p = tf.random.normal((tensor_len, ), 0, 1)
q1 = tf.random.normal((tensor_len, ), -1, 1)
q2 = tf.random.normal((tensor_len, ), 1, 1)

p = tf.sort(p)
q1 = tf.sort(q1)
q2 = tf.sort(q2)

kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100

kl_pq1, kl_pq2, similar_percentage

(np.float32(8667.229), np.float32(8781.478), np.float32(1.3095415))

kl_q2p = kl_divergence(q2, p)
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100

kl_q2p, differ_percentage

(np.float32(13625.918), np.float32(43.23966))
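
The same properties can be seen on proper probability vectors. This is a minimal standard-library sketch; the distributions `p` and `q` are made-up values chosen only for illustration.

```python
import math

def kl_bits(p, q):
    # D_KL(p || q) = sum_x p(x) log2(p(x) / q(x)); terms with p(x) = 0 vanish.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # hypothetical distribution
q = [0.8, 0.1, 0.1]    # hypothetical distribution with the same support

kl_pq = kl_bits(p, q)  # about 0.32 bits
kl_qp = kl_bits(q, p)  # about 0.28 bits, so the order of arguments matters
kl_pp = kl_bits(p, p)  # exactly 0: the distributions match

print(kl_pq > 0, kl_qp > 0, kl_pq != kl_qp, kl_pp == 0)
```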

Formal definitions

Entropy, cross-entropy, and KL differ by which distribution supplies the expectation and which log-probability is scored.

def cross_entropy(y_hat, y):
    # Use `tf.gather_nd` to pick the predicted probability of the true class
    # in each row, then average the negative log-probabilities (in nats).
    indices = [[i, j] for i, j in zip(range(len(y_hat)), y)]
    ce = -tf.math.log(tf.gather_nd(y_hat, indices=indices))
    return tf.reduce_mean(ce).numpy()

labels = tf.constant([0, 2])
preds = tf.constant([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])

cross_entropy(preds, labels)

np.float32(0.94856)

Cross-entropy in classification

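In multi-class classification the data distribution is one-hot on the true class and the model outputs a softmax distribution q. The cross-entropy then reduces to the negative log-likelihood (NLL) of the true class:

\mathcal{L} = -\sum_i \log q(y_i \mid x_i).

In practice the sum is usually divided by the number of examples, which rescales the loss without moving its minimizer.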

def nll_loss(y_hat, y):
    # Convert labels to one-hot vectors.
    y = tf.keras.utils.to_categorical(y, num_classes=y_hat.shape[1])
    # Rather than computing the NLL from its definition, reuse Keras's
    # cross-entropy loss, since the two coincide for one-hot targets.
    # `from_logits=True` applies a softmax internally, so `y_hat` must hold
    # log-probabilities (or logits), not probabilities.
    loss_fn = tf.keras.losses.CategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    return tf.reduce_mean(loss_fn(y, y_hat)).numpy()

loss = nll_loss(tf.math.log(preds), labels)
loss

np.float32(0.94856)
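
The two losses agree because a softmax applied to log-probabilities returns the original probabilities, so passing `tf.math.log(preds)` as logits leaves the predictions unchanged. The sketch below checks this in plain Python with the same `preds` and `labels`; the `softmax` helper is written out by hand for illustration.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

preds = [[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]
labels = [0, 2]

# softmax(log p) recovers p whenever p already sums to 1.
recovered = [softmax([math.log(p) for p in row]) for row in preds]

# Mean negative log-likelihood of the true classes, in nats.
nll = -sum(math.log(row[y]) for row, y in zip(preds, labels)) / len(labels)
print(round(nll, 5))  # 0.94856
```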

Recap

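Entropy is expected surprise, KL divergence is the extra bits paid for coding with the wrong distribution, and cross-entropy is their sum: H(p, q) = H(p) + D_{KL}(p \| q).
Most deep-learning classification minimizes cross-entropy, which is equivalent to minimizing the KL divergence between the empirical distribution and the model.
Mutual information appears in InfoNCE and other contrastive-learning objectives, and in the information bottleneck (IB) principle.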