Looking at the data

Naive Bayes

Naive Bayes Classification

Naive Bayes is one of the simplest probabilistic classifiers. Apply Bayes’ rule and factorize the likelihood over features:

P(y \mid \mathbf{x}) \propto P(y) \prod_i P(x_i \mid y).

The “naive” part is the assumption that features are conditionally independent given the class. This is wrong in general (neighboring pixels of an image are clearly correlated), but the model is fast to train, needs little data, and is a useful starting point.
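A toy calculation can make the factorization concrete. The sketch below uses two classes and two binary features; the class names, priors, and per-feature probabilities are made up for illustration and have nothing to do with MNIST.

```python
# Toy naive Bayes posterior for two classes and two binary features.
# All numbers below are invented for illustration.
prior = {"a": 0.6, "b": 0.4}
# p_on[y][i] = P(x_i = 1 | y)
p_on = {"a": [0.9, 0.2], "b": [0.3, 0.7]}

def posterior(x):
    """Return P(y | x) for a binary feature vector x under naive Bayes."""
    joint = {}
    for y in prior:
        score = prior[y]
        for x_i, p in zip(x, p_on[y]):
            # Bernoulli likelihood: p if the feature is on, 1 - p otherwise.
            score *= p if x_i == 1 else 1 - p
        joint[y] = score
    total = sum(joint.values())  # normalizer P(x)
    return {y: s / total for y, s in joint.items()}

post = posterior([1, 0])
```

Dividing by the sum over classes turns the proportionality in Bayes’ rule into an equality, so the returned values sum to 1.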

This deck applies it to MNIST digit classification with binarized pixels.

Setup + binary MNIST

Binarize pixels so each pixel can be modeled as a Bernoulli random variable conditioned on the digit class.
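As a minimal sketch of the binarization step, assuming 8-bit grayscale intensities in 0..255 and the same threshold of 128 that this deck's `transform` function uses, here is a plain-Python version on a hypothetical 2x3 image (the helper name `binarize` is not part of the deck):

```python
# Binarize 8-bit grayscale intensities (0..255) at the threshold 128.
def binarize(image):
    """Map each pixel to 0 (below 128) or 1 (128 and above)."""
    return [[pixel // 128 for pixel in row] for row in image]

# Hypothetical 2x3 image, not taken from MNIST.
tiny = [[0, 127, 128],
        [200, 255, 64]]
binary = binarize(tiny)
```

Integer division by 128 is equivalent to taking the floor of `pixel / 128` for values in this range.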

%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import gluon, np, npx
npx.set_np()
d2l.use_svg_display()

Inspect the binarized digits before fitting: the class templates are recognizable, but neighboring pixels are clearly dependent.

def transform(data, label):
    return np.floor(data.astype('float32') / 128).squeeze(axis=-1), label

# Apply the transform with `dataset.transform(...)`; the `transform=` argument
# of the dataset constructor is documented as deprecated in Gluon.
mnist_train = gluon.data.vision.MNIST(train=True).transform(transform)
mnist_test = gluon.data.vision.MNIST(train=False).transform(transform)

image, label = mnist_train[2]
image.shape, label

image.shape, image.dtype

Per-class pixel statistics

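For each class y and pixel i, estimate P(x_i = 1 \mid y) from the training set, using Laplace smoothing so that no estimate is exactly 0 or 1:

P(x_i = 1 \mid y) = \frac{n_{iy} + 1}{n_y + 2},

where n_{iy} is the number of training images of class y in which pixel i is on, and n_y is the number of training images of class y.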
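To see why the pseudo-counts matter, here is a small sketch. The helper `bernoulli_estimate` and the counts are hypothetical; the +1 / +2 form matches the smoothing used in this deck's training code.

```python
import math

def bernoulli_estimate(n_on, n_total, smooth=True):
    """Estimate P(x_i = 1 | y) from counts, optionally with add-one smoothing."""
    if smooth:
        # One pseudo-count for "on" and one for "off", hence +1 and +2.
        return (n_on + 1) / (n_total + 2)
    return n_on / n_total

# Hypothetical counts: a corner pixel that is never on among 5000 images.
raw = bernoulli_estimate(0, 5000, smooth=False)
smoothed = bernoulli_estimate(0, 5000)

# The smoothed estimate has a finite logarithm; math.log(raw) would raise
# ValueError because raw is exactly zero.
log_smoothed = math.log(smoothed)
```

Without smoothing, a single test image with that pixel on would get probability zero for the class, no matter what the other 783 pixels say.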

label, type(label), label.dtype

images, labels = mnist_train[10:38]
images.shape, labels.shape

d2l.show_images(images, 2, 9);

Training: just count

Training is counting, not gradient descent: estimate class priors and per-pixel likelihoods directly from the labeled examples.

X, Y = mnist_train[:]  # All training examples

n_y = np.zeros((10))
for y in range(10):
    n_y[y] = (Y == y).sum()
P_y = n_y / n_y.sum()
P_y

n_x = np.zeros((10, 28, 28))
for y in range(10):
    n_x[y] = np.array(X.asnumpy()[Y.asnumpy() == y].sum(axis=0))
P_xy = (n_x + 1) / (n_y + 2).reshape(10, 1, 1)

d2l.show_images(P_xy, 2, 5);

Training (cont.)

Training stores only class priors and per-class pixel probabilities; prediction multiplies those likelihood terms, usually in log-space.

def bayes_pred(x):
    x = np.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
    p_xy = P_xy * x + (1 - P_xy)*(1 - x)
    p_xy = p_xy.reshape(10, -1).prod(axis=1)  # p(x|y)
    return np.array(p_xy) * P_y

image, label = mnist_test[0]
bayes_pred(image)

a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784*math.log(a))

Predicting in log-space

Summing log-probabilities instead of multiplying probabilities avoids underflow:

log_P_xy = np.log(P_xy)
log_P_xy_neg = np.log(1 - P_xy)
log_P_y = np.log(P_y)

def bayes_pred_stable(x):
    x = np.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
    p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
    p_xy = p_xy.reshape(10, -1).sum(axis=1)  # log p(x|y)
    return p_xy + log_P_y

py = bayes_pred_stable(image)
py

# Convert label which is a scalar tensor of int32 dtype to a Python scalar
# integer for comparison
py.argmax(axis=0) == int(label)
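The values returned by `bayes_pred_stable` are unnormalized log-joint scores, which is enough for `argmax`. If actual posterior probabilities are wanted, one common approach is the log-sum-exp trick. The sketch below assumes plain Python lists and made-up scores; the helper name `log_softmax` is not part of the deck.

```python
import math

def log_softmax(log_scores):
    """Normalize unnormalized log-joint scores into log-posteriors."""
    m = max(log_scores)  # shift by the max so the largest term is exp(0) = 1
    log_norm = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [s - log_norm for s in log_scores]

# Hypothetical log-joint scores, so negative that exp() of each underflows to 0.
scores = [-800.0, -850.0, -805.0]
posterior = [math.exp(lp) for lp in log_softmax(scores)]
```

Exponentiating the raw scores directly would give 0.0 for every class here, so the normalization has to happen in log-space first.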

Evaluating

The accuracy is useful mostly as a sanity check: on images, the conditional-independence assumption leaves visible performance on the table.

def predict(X):
    return [bayes_pred_stable(x).argmax(axis=0).astype(np.int32) for x in X]

X, y = mnist_test[:18]
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);

X, y = mnist_test[:]
preds = np.array(predict(X), dtype=np.int32)
float((preds == y).sum()) / len(y)  # Accuracy on the test set
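A single accuracy number hides which digits get confused with which. As a rough sketch, misclassified (true, predicted) pairs can be tallied with a `Counter`. The labels and predictions below are hypothetical, not actual MNIST results, and `confusion_counts` is an illustrative helper.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs for the misclassified examples."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

# Hypothetical labels and predictions, not actual MNIST results.
y_true = [4, 9, 4, 7, 9, 3]
y_pred = [9, 9, 9, 7, 4, 3]
errors = confusion_counts(y_true, y_pred)
accuracy = 1 - sum(errors.values()) / len(y_true)
most_common_error = errors.most_common(1)[0]
```

To use this with the arrays in this deck, the labels and predictions would presumably first need converting to Python integers, for example `[int(v) for v in preds]`.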

Recap

Bayes’ rule + conditional independence = naive Bayes.
Training is one pass over the data: count and smooth.
Often a competitive baseline for text classification (sparse features, large vocabulary).
Weaker on images, where the independence assumption is badly violated, but a good teaching example for Bayesian classification.