The Base Classification Model

Dive into Deep Learning · §3.3

One forward pass, read two ways: a loss to train on, an accuracy to report, and what to do when accuracy lies.

One forward pass, two readings

Motivation

A classifier scores the classes, then the picture forks:

train on a smooth loss that gradient descent can minimize;
report a hard accuracy, the number benchmarks care about.

We collect both, once, in a Classifier base class so every model in the book inherits them for free.

One forward pass produces the logits \mathbf{o}; the top branch softmaxes them and reads the differentiable loss, the bottom branch takes the \arg\max to a decision and counts it.

The Classifier base class

what every model inherits, what each supplies

Inherit the loop, supply the model

The base class

Classifier extends the d2l.Module scaffold from the regression chapter, adding classification defaults.

Inherited: a validation step (loss + accuracy) and a default optimizer.
Supplied by a subclass: its forward pass, and a loss only if plain cross-entropy will not do.

Same payoff as Module itself: write the model-specific part once, get the training and evaluation machinery for free.
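The division of labour can be sketched without any framework. This is a toy illustration, not the d2l implementation: ToyModule, ToyClassifier, and ThresholdModel are hypothetical names, and the batch is made up.

```python
class ToyModule:
    """Hypothetical stand-in for d2l.Module: calling the model runs forward."""
    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        raise NotImplementedError  # each concrete model must supply this


class ToyClassifier(ToyModule):
    """Hypothetical stand-in for Classifier: shared evaluation logic."""
    def accuracy(self, scores, labels):
        # argmax over each score list, then the fraction matching the labels
        preds = [max(range(len(s)), key=lambda k: s[k]) for s in scores]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)

    def validation_step(self, batch):
        inputs, labels = batch
        scores = [self(x) for x in inputs]
        return self.accuracy(scores, labels)


class ThresholdModel(ToyClassifier):
    """The only model-specific code: a forward pass returning two class scores."""
    def forward(self, x):
        return [-x, x]  # class 1 wins when x > 0


batch = ([-1.0, 2.0, 0.5], [0, 1, 0])  # made-up inputs and labels
val_acc = ThresholdModel().validation_step(batch)
print(val_acc)  # 2 of 3 predictions match
```

ThresholdModel writes no evaluation code of its own, yet it can be validated; that is the pattern the real base class provides at scale.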

Validation reports loss and accuracy

The override logs two curves per validation batch, where regression logged one:

import tensorflow as tf
from d2l import tensorflow as d2l

class Classifier(d2l.Module):
    """The base class of classification models."""
    def validation_step(self, batch):
        # All but the last batch element are inputs; the last is the label.
        Y_hat = self(*batch[:-1])
        self._report_val(Y_hat, batch)

    def _report_val(self, y_hat, batch):
        # Log validation loss and accuracy as two separate curves.
        self.plot('loss', self.loss(y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(y_hat, batch[-1]), train=False)

Averaging over num_val_batches weights every batch equally, so the result is slightly off when the last batch is short; we ignore that to keep the code simple.
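To see where the discrepancy comes from, here is a small sketch with made-up batch results. The batch sizes and hit counts are hypothetical and deliberately tiny, so the gap is far larger than it would be on a real validation set.

```python
# Hypothetical per-batch results: (number of examples, number correct).
batches = [(4, 4), (4, 3), (2, 0)]  # the last batch is short

# Mean of per-batch accuracies: every batch counts equally.
per_batch_acc = [correct / size for size, correct in batches]
mean_of_batches = sum(per_batch_acc) / len(batches)

# Exact accuracy: total correct over total examples.
total_correct = sum(correct for _, correct in batches)
total_examples = sum(size for size, _ in batches)
exact = total_correct / total_examples

print(mean_of_batches, exact)
```

The short batch is over-weighted in the first figure. With many full-size batches and one short one, the two numbers are nearly equal, which is why the simplification is harmless in practice.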

A default optimizer, installed once

configure_optimizers is a hook the Trainer calls at startup. We put plain minibatch SGD on Module itself, so no subclass repeats it (later chapters override to switch optimizers):

@d2l.add_to_class(d2l.Module)
def configure_optimizers(self):
    return tf.keras.optimizers.SGD(float(self.lr))

Accuracy

the hard-decision metric, in four lines

Why a classifier needs two numbers

Scores, loss, decision

The same logits \mathbf{o} feed two branches with different jobs.

Loss (top) softmaxes to probabilities and is differentiable, so it trains the model; it also keeps rewarding confidence even after the decision is already right.

Accuracy (bottom) is \arg\max then compare: a discrete count whose gradient is zero almost everywhere, so it cannot be optimized directly.

Logits (1.0, 2.2, 0.3): softmax \to (0.21, 0.69, 0.10) and cross-entropy \ell=0.37 on top; \arg\max=1 matches y=1, one correct, on the bottom.
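The numbers in this example can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library rather than the d2l or TensorFlow helpers; the softmax function here is written out for illustration.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.2, 0.3]
y = 1  # true class index

# Top branch: probabilities, then cross-entropy of the true class.
probs = softmax(logits)
loss = -math.log(probs[y])

# Bottom branch: hard decision, then a 0/1 hit.
pred = max(range(len(logits)), key=lambda k: logits[k])
correct = int(pred == y)

print([round(p, 2) for p in probs], round(loss, 2), pred, correct)
```

Both branches read the same logits list; only the loss depends on how large the winning probability is.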

Accuracy in four lines

argmax along the class axis, compare with the label element-wise, average the 0/1 hits:

@d2l.add_to_class(Classifier)
def accuracy(self, Y_hat, Y, averaged=True):
    """Compute the fraction of correct predictions."""
    Y_hat = d2l.reshape(Y_hat, (-1, Y_hat.shape[-1]))
    preds = d2l.astype(d2l.argmax(Y_hat, axis=1), Y.dtype)
    compare = d2l.astype(preds == d2l.reshape(Y, (-1,)), d2l.float32)
    return d2l.reduce_mean(compare) if averaged else compare

The astype matches dtypes before ==, since the comparison is type-sensitive.

Report both, for complementary reasons

Two classifiers can hit the same accuracy while one is confidently right and the other barely so.

Only the loss separates a correct-class probability of 0.51 from 0.99, which is why it, not accuracy, is what we optimize.
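A short sketch makes the gap concrete. The two probability lists are hypothetical: each entry is the probability a model assigns to the correct class on one example of a binary task.

```python
import math

# Hypothetical probabilities assigned to the correct class (binary task).
barely_right = [0.51, 0.51, 0.51]
confidently_right = [0.99, 0.99, 0.99]

def accuracy(p_correct):
    # In a binary task the decision is right whenever the correct class gets > 0.5.
    return sum(p > 0.5 for p in p_correct) / len(p_correct)

def mean_cross_entropy(p_correct):
    return sum(-math.log(p) for p in p_correct) / len(p_correct)

for name, ps in [("barely", barely_right), ("confident", confidently_right)]:
    print(name, accuracy(ps), round(mean_cross_entropy(ps), 4))
```

Both models score the same accuracy, while the mean cross-entropy differs by more than an order of magnitude.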

When the two disagree (accuracy flat while loss still drops), that is a diagnostic about optimization and calibration (how well predicted probabilities match empirical frequencies), not a bug.

Beyond Accuracy

when the headline number lies

99% accurate, perfectly useless

Screen n = 100,000 people for a disease carried by 1% of the population, so 1,000 of them are sick. A “classifier” that ignores its input and always says healthy scores

\textrm{accuracy} = 1 - \frac{\textrm{FP} + \textrm{FN}}{n} = 1 - \frac{0 + 1{,}000}{100{,}000} = \mathbf{0.99}, \qquad \textrm{recall} = \frac{\textrm{TP}}{\textrm{sick}} = \frac{0}{1{,}000} = \mathbf{0.0}.

Accuracy 0.99, recall 0.0: it finds not one sick patient. Accuracy weights every example equally, so under class imbalance it can award a near-perfect score to a model that never does its job.
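The same arithmetic can be reproduced by counting outcomes directly. This sketch builds the synthetic population from the example (100,000 people, 1,000 sick); the label encoding 1 = sick, 0 = healthy is an assumption made here.

```python
n, sick = 100_000, 1_000
labels = [1] * sick + [0] * (n - sick)  # 1 = sick, 0 = healthy
preds = [0] * n                          # ignores its input: always "healthy"

pairs = list(zip(preds, labels))
tp = sum(p == 1 and y == 1 for p, y in pairs)
fp = sum(p == 1 and y == 0 for p, y in pairs)
fn = sum(p == 0 and y == 1 for p, y in pairs)
tn = sum(p == 0 and y == 0 for p, y in pairs)

accuracy = (tp + tn) / n
recall = tp / (tp + fn)

print(tp, fp, fn, tn)
print(accuracy, recall)
```

All 1,000 errors land in the FN cell, which accuracy dilutes across the 99,000 easy negatives and recall does not.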

Precision and recall name the two failure modes

Break the counts down by predicted \times true: TP, FP, FN, TN. Two ratios summarize the two ways to fail:

\textrm{precision} = \frac{\textrm{TP}}{\textrm{TP} + \textrm{FP}} \qquad\qquad \textrm{recall} = \frac{\textrm{TP}}{\textrm{TP} + \textrm{FN}}

Precision: of those we flagged, how many were real? Recall: of the real positives, how many did we find? The always-healthy screener has recall 0 (precision undefined: it never flags).

One number when you must: the F1 score 2PR/(P{+}R), with P the precision and R the recall, high only when both are.
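The three ratios are easy to compute from raw counts. In the sketch below, precision_recall_f1 is a hypothetical helper; returning None for an undefined 0/0 ratio and 0.0 for F1 when both precision and recall are zero are conventions chosen here, not something the definitions dictate. The first set of counts is made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1); None marks an undefined 0/0 ratio."""
    precision = tp / (tp + fp) if tp + fp > 0 else None
    recall = tp / (tp + fn) if tp + fn > 0 else None
    if precision is None or recall is None:
        f1 = None
    elif precision + recall == 0:
        f1 = 0.0  # convention chosen for this sketch
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: a model that is precise but misses most positives.
cautious = precision_recall_f1(tp=80, fp=20, fn=120)

# The always-healthy screener: it never flags anyone.
always_healthy = precision_recall_f1(tp=0, fp=0, fn=1000)

print(cautious)
print(always_healthy)
```

For the made-up counts, precision is 0.8 and recall 0.4, and F1 sits closer to the weaker of the two, which is the point of using it as a single summary.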

The confusion matrix: every error, itemized

For q classes the same bookkeeping becomes a q \times q confusion matrix: entry (i, j) counts true class j predicted as class i.

The diagonal holds the correct decisions; accuracy is just its normalized trace (the fraction on the diagonal), one number where the matrix keeps q^2.
Every off-diagonal cell isolates one specific kind of error.
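A minimal sketch of this bookkeeping, using the convention stated above (rows index the predicted class, columns the true class). The function name confusion_matrix and the six labels and predictions are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, q):
    # C[i][j] counts examples of true class j that were predicted as class i.
    C = [[0] * q for _ in range(q)]
    for t, p in zip(y_true, y_pred):
        C[p][t] += 1
    return C

y_true = [0, 0, 1, 1, 2, 2]  # made-up labels
y_pred = [0, 1, 1, 1, 2, 0]  # made-up predictions

C = confusion_matrix(y_true, y_pred, q=3)
trace = sum(C[k][k] for k in range(3))
accuracy = trace / len(y_true)

for row in C:
    print(row)
print(accuracy)
```

Here C[1][0] = 1 records a class-0 example predicted as class 1, and C[0][2] = 1 a class-2 example predicted as class 0; accuracy keeps only the diagonal total of 4 out of 6.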

This object returns twice: in the softmax-from-scratch section we compute one for our Fashion-MNIST model and read which classes it confuses; in the distribution-shift section the very same matrix is inverted to correct label shift.

Recap

Wrap-up

Classifier(d2l.Module) adds a loss + accuracy validation step; a default SGD optimizer is installed on Module itself.
A new model supplies only forward (plus a custom loss when plain cross-entropy will not do), inheriting the whole loop.
Accuracy = fraction whose \arg\max matches the label: argmax → == y → mean. Discrete, so we train on the loss.

Under imbalance accuracy can lie: always-healthy scores 0.99 with recall 0.0.
Precision / recall split the failure modes; F1 compresses them.
The confusion matrix itemizes all q^2 outcomes, computed in the softmax-from-scratch section, inverted in the distribution-shift section.