%matplotlib inline
from d2l import tensorflow as d2l
import math
import tensorflow as tf
d2l.use_svg_display()
Naive Bayes: the simplest probabilistic classifier. Apply Bayes’ rule:
P(y \mid \mathbf{x}) \propto P(y) \prod_i P(x_i \mid y).
The “naive” part is the assumption that features are conditionally independent given the class. Wrong in general — pixels of an image are obviously correlated — but the model is fast, requires little data, and is a useful starting point.
This deck applies it to MNIST digit classification with binarized pixels.
Binarize pixels so each pixel can be modeled as a Bernoulli random variable conditioned on the digit class.
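Written out, the Bernoulli model for one pixel and the resulting log-likelihood of a whole image look as follows. The symbol p_{iy} is shorthand introduced here for P(x_i = 1 \mid y); it is not notation from the deck itself.

```latex
P(x_i \mid y) = p_{iy}^{\,x_i} \, (1 - p_{iy})^{\,1 - x_i},
\qquad p_{iy} = P(x_i = 1 \mid y), \quad x_i \in \{0, 1\}

\log P(\mathbf{x} \mid y) = \sum_i \Big[ x_i \log p_{iy} + (1 - x_i) \log (1 - p_{iy}) \Big]
```

The second line follows from the first by the conditional-independence assumption: the joint likelihood is a product over pixels, so its log is a sum.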
Inspect the binarized digits before fitting: the class templates are recognizable, but neighboring pixels are clearly dependent.
((train_images, train_labels), (
test_images, test_labels)) = tf.keras.datasets.mnist.load_data()
# Original MNIST pixel values range from 0 to 255 (the digits are stored as
# uint8). For this section, pixel values of 128 or greater (in the original
# image) are converted to 1 and values less than 128 are converted to 0.
# See sections 18.9.2 and 18.9.3 for why.
train_images = tf.floor(tf.constant(train_images / 128, dtype = tf.float32))
test_images = tf.floor(tf.constant(test_images / 128, dtype = tf.float32))
train_labels = tf.constant(train_labels, dtype = tf.int32)
test_labels = tf.constant(test_labels, dtype = tf.int32)
(TensorShape([28, 28]), np.int32(4))
For each class y and pixel i, estimate P(x_i = 1 \mid y) from the training set. With Laplace smoothing to avoid zeros:
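One common form of the smoothed estimate is add-one smoothing, sketched below. The pseudo-count of 1 per outcome is an assumption; the text does not state which constant is used. Here n_y is the number of training images with label y, and n_{iy} is the number of those images in which pixel i equals 1 (both symbols are introduced for this sketch).

```latex
\hat{P}(x_i = 1 \mid y) = \frac{n_{iy} + 1}{n_y + 2},
\qquad
\hat{P}(y) = \frac{n_y}{\sum_{y'} n_{y'}}
```

The 2 in the denominator comes from the two possible pixel values. A pixel that is never on for some class then gets probability 1 / (n_y + 2) rather than 0, so a single unseen pixel cannot veto a class.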
(np.int32(4), tf.int32)
(TensorShape([28, 28, 28]), TensorShape([28]))
Training is counting, not gradient descent: estimate class priors and per-pixel likelihoods directly from the labeled examples.
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([0.09871667, 0.11236667, 0.0993 , 0.10218333, 0.09736667,
0.09035 , 0.09863333, 0.10441667, 0.09751666, 0.09915 ],
dtype=float32)>
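The counting step can be sketched in a few lines. This is a minimal NumPy version under stated assumptions, not the deck's own TensorFlow cell: the function name `fit_bernoulli_nb`, the `alpha` pseudo-count parameter, and the toy data are all hypothetical, and add-one smoothing is assumed.

```python
import numpy as np

def fit_bernoulli_nb(X, y, num_classes, alpha=1.0):
    """Estimate class priors and per-pixel Bernoulli parameters by counting.

    X: (n, h, w) array of 0/1 pixels; y: (n,) integer labels.
    alpha: Laplace pseudo-count per outcome (hypothetical parameter name).
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y)
    # n_y[c] = number of training examples with label c
    n_y = np.bincount(y, minlength=num_classes).astype(np.float64)
    P_y = n_y / n_y.sum()
    # n_x[c] = per-pixel count of "on" pixels among examples of class c
    n_x = np.zeros((num_classes,) + X.shape[1:])
    for c in range(num_classes):
        n_x[c] = X[y == c].sum(axis=0)
    # Smoothed estimate of P(x_i = 1 | y); two outcomes, hence 2 * alpha
    P_xy = (n_x + alpha) / (n_y.reshape(-1, 1, 1) + 2 * alpha)
    return P_y, P_xy

# Toy 2x2 "images" (illustrative data only, not MNIST)
X_toy = np.array([[[1, 0], [0, 0]],
                  [[1, 1], [0, 0]],
                  [[0, 0], [1, 1]]])
y_toy = np.array([0, 0, 1])
P_y, P_xy = fit_bernoulli_nb(X_toy, y_toy, num_classes=2)
print(P_y)
print(P_xy[0])
```

On MNIST the same function would be called with the binarized training images and `num_classes=10`, giving a prior of shape `(10,)` and likelihoods of shape `(10, 28, 28)`.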
Training stores only class priors and per-class pixel probabilities; prediction combines those likelihood terms, usually by summing their logs.
<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>
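An all-zero score vector like this is what float32 underflow looks like when hundreds of probabilities are multiplied directly. The sketch below reproduces the effect in isolation; the value 0.5 for every pixel is an illustrative assumption, not a number from the fitted model.

```python
import numpy as np

# 784 per-pixel likelihoods (28 x 28), all set to 0.5 for illustration only
p = np.full(784, 0.5, dtype=np.float32)

direct = np.prod(p)          # 0.5**784 is far below the float32 range
log_sum = np.sum(np.log(p))  # the same quantity, kept in log-space

print(direct)   # collapses to zero
print(log_sum)  # a finite negative number, close to 784 * log(0.5)
```

Once every class score collapses to zero, the argmax carries no information, which is why the comparison has to be done on log-scores.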
Sums of logs instead of products of probabilities — avoids underflow:
log_P_xy = tf.math.log(P_xy)
log_P_xy_neg = tf.math.log(1 - P_xy)
log_P_y = tf.math.log(P_y)
def bayes_pred_stable(x):
    x = tf.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
    p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
    p_xy = tf.math.reduce_sum(tf.reshape(p_xy, (10, -1)), axis=1)  # log p(x|y)
    return p_xy + log_P_y
py = bayes_pred_stable(image)
py
<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([-268.9725 , -301.7044 , -245.19516 , -218.87387 , -193.45705 ,
-206.0909 , -292.52264 , -114.625656, -220.33133 , -163.17844 ],
dtype=float32)>
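These scores are unnormalized log-posteriors, log P(x | y) + log P(y). If actual probabilities are wanted, one option is a log-sum-exp normalization, sketched here in NumPy on the ten values printed above. The helper name `log_softmax` is introduced for this sketch.

```python
import numpy as np

# The ten log-scores from the output above, one per digit class
log_scores = np.array([-268.9725, -301.7044, -245.19516, -218.87387,
                       -193.45705, -206.0909, -292.52264, -114.625656,
                       -220.33133, -163.17844])

def log_softmax(s):
    """Normalize log-scores so that exp(result) sums to 1."""
    s = s - s.max()  # shift so the largest entry is 0; avoids underflow
    return s - np.log(np.exp(s).sum())

posterior = np.exp(log_softmax(log_scores))
pred = int(np.argmax(log_scores))  # normalization does not change the argmax

print(pred)
print(posterior[pred])
```

The gap between the best and second-best score is about 48 nats, so the normalized posterior puts essentially all of its mass on one class. Such extreme confidence is typical of naive Bayes and is partly an artifact of treating correlated pixels as independent evidence.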
The accuracy is useful mostly as a sanity check: on images, the conditional-independence assumption leaves visible performance on the table.
def predict(X):
    return [tf.argmax(
        bayes_pred_stable(x), axis=0, output_type = tf.int32).numpy()
            for x in X]
X = tf.stack([test_images[i] for i in range(10, 38)], axis=0)
y = tf.constant([test_labels[i].numpy() for i in range(10, 38)])
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
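To turn the visual check into a number, accuracy is just the fraction of matching predictions. A minimal sketch follows; the `accuracy` helper and the demo lists are hypothetical and are not results from this model.

```python
import numpy as np

def accuracy(preds, labels):
    """Fraction of predictions that equal the true labels."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    return float(np.mean(preds == labels))

# Made-up predictions and labels, for illustration only
preds_demo = [7, 2, 1, 0, 4]
labels_demo = [7, 2, 1, 0, 9]
acc = accuracy(preds_demo, labels_demo)
print(acc)
```

With the variables defined in this section, a call along the lines of `accuracy(predict(test_images), test_labels.numpy())` should give the test accuracy, assuming `predict` is applied to the full binarized test set.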