Softmax Regression Implementation from Scratch

Dive into Deep Learning · §3.4

Softmax regression from scratch
The whole classifier, opened up: the softmax, the cross-entropy loss, and the training loop, each built by hand.

The same recipe, two new pieces

Motivation

Linear regression mapped inputs to one number. A classifier maps them to a distribution over classes. Two new parts do that:

Softmax turns raw scores (logits) into probabilities.
Cross-entropy is the loss that scores a distribution.

Everything else, the Module / Trainer scaffold, is reused from the regression chapter; Classifier just adds accuracy reporting.

The Softmax

from scores to a probability distribution

First, a reminder: sums along an axis

Softmax normalizes each row, so we need a per-row sum. axis=1 collapses the columns; keepdims holds the shape for broadcasting:

X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)

(<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[5., 7., 9.]], dtype=float32)>,
 <tf.Tensor: shape=(2, 1), dtype=float32, numpy=
 array([[ 6.],
        [15.]], dtype=float32)>)

axis=0 sums down columns, axis=1 sums across rows. keepdims=True keeps a length-1 axis so the result still broadcasts against X.
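
To see why the length-1 axis matters, here is a minimal sketch in plain NumPy (an assumption for illustration; the slides use the d2l/TensorFlow wrappers). It divides the same matrix by its row sums with and without keepdims:

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

row_sum_keep = X.sum(axis=1, keepdims=True)   # shape (2, 1)
row_sum_flat = X.sum(axis=1)                  # shape (2,)

# (2, 3) / (2, 1) broadcasts: each row is divided by its own total.
normalized = X / row_sum_keep

# (2, 3) / (2,) does not broadcast: trailing dimensions 3 and 2 disagree.
try:
    X / row_sum_flat
    flat_broadcasts = True
except ValueError:
    flat_broadcasts = False
```

With a square matrix the flat version would broadcast without error and divide by the wrong totals, which is the more dangerous case.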

Softmax: exponentiate, sum, divide

\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.

Three steps: exponentiate every score, sum across the class axis, divide each row by its total:

def softmax(X):
    X_exp = d2l.exp(X)
    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here

Naive exp overflows for large logits. Fine for teaching; never use it in production. The stable fix arrives in the concise-softmax-regression section.

The output is a real distribution

Feed in any matrix: every entry becomes non-negative and each row sums to 1, exactly what a probability distribution over classes requires:

X = d2l.rand((2, 5))
X_prob = softmax(X)
X_prob, d2l.reduce_sum(X_prob, 1)

(<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
 array([[0.21319394, 0.13474208, 0.20669897, 0.17184985, 0.27351508],
        [0.18383098, 0.17291924, 0.17053534, 0.20444588, 0.26826856]],
       dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.9999999, 1.       ], dtype=float32)>)

One logit of 1000, and the answer is NaN

The Softmax · numerical failure

A single logit of 1000 sends \exp to infinity in float32, so the max entry becomes \infty/\infty= NaN (the other entries become 1/\infty = 0), poisoning the row. The framework’s softmax shifts by the row maximum first and stays finite on the identical input:

z = tf.constant([1000., 0., 0.])
naive = tf.exp(z) / tf.reduce_sum(tf.exp(z))  # exp(1000) overflows -> nan
stable = tf.nn.softmax(z, axis=0)             # built-in shifts by the max first, so it stays finite
naive, stable

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([nan,  0.,  0.], dtype=float32)>,
 <tf.Tensor: shape=(3,), dtype=float32, numpy=array([1., 0., 0.], dtype=float32)>)

One NaN poisons every downstream gradient. The concise-softmax-regression section derives the fix (fuse softmax and log via log-sum-exp) and shows the frameworks already ship it.
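
The max shift mentioned above can be sketched in a few lines. This is an illustrative NumPy version, not the framework's actual kernel; the name stable_softmax is hypothetical. It relies on the identity softmax(x - c) = softmax(x) for any per-row constant c:

```python
import numpy as np

def stable_softmax(X):
    """Row-wise softmax that subtracts each row's max before exponentiating."""
    X = np.asarray(X, dtype=np.float32)
    shifted = X - X.max(axis=1, keepdims=True)   # largest entry becomes 0
    X_exp = np.exp(shifted)                      # every value lies in [0, 1]
    return X_exp / X_exp.sum(axis=1, keepdims=True)

z = np.array([[1000.0, 0.0, 0.0]], dtype=np.float32)
probs = stable_softmax(z)   # finite: the row that broke the naive version
```

After the shift the largest exponent is exp(0) = 1, so the denominator is at least 1 and overflow cannot occur.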

The Model

one linear layer, ten logits

Parameters: a 784×10 weight matrix

Each 28\times28 image flattens to a length-784 vector; with 10 classes the weights are a 784\times10 matrix W plus a length-10 bias b. Initialize W with Gaussian noise, b with zeros:

class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W = tf.random.normal((num_inputs, num_outputs), 0, sigma)
        self.b = tf.zeros(num_outputs)
        self.W = tf.Variable(self.W)
        self.b = tf.Variable(self.b)

Forward pass: flatten → linear → softmax

The model is one expression: reshape the batch to rows of 784, apply the affine map \mathbf{X}\mathbf{W}+\mathbf{b}, then softmax into per-class probabilities:

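@d2l.add_to_class(SoftmaxRegressionScratch)
def forward(self, X):
    # Flatten each image to a row of length num_inputs (784 here).
    X = d2l.reshape(X, (-1, self.W.shape[0]))
    return softmax(d2l.matmul(X, self.W) + self.b)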
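
The shapes in the forward pass can be traced with a small NumPy sketch. The batch of 4 random "images" and the fixed seed are assumptions for illustration only; the real model uses the TensorFlow variables defined earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random((4, 28, 28)).astype(np.float32)   # 4 fake grayscale images
W = rng.normal(0.0, 0.01, size=(784, 10)).astype(np.float32)
b = np.zeros(10, dtype=np.float32)

flat = batch.reshape(-1, W.shape[0])     # (4, 784)
logits = flat @ W + b                    # (4, 10); b broadcasts over rows
exp = np.exp(logits)
probs = exp / exp.sum(axis=1, keepdims=True)   # (4, 10), rows sum to 1
```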

Cross-Entropy Loss

the loss for a predicted distribution

The loss for distributions

For an integer label y, the loss on one example is just the negative log-probability the model assigned to the correct class:

\ell = -\log \hat{y}_{y}.

We pick out \hat{y}_y for every row with one batched gather (TensorFlow’s stand-in for fancy indexing), no Python loop. True labels 0 and 2 select 0.1 from the first row and 0.5 from the second:

y_hat = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = tf.constant([0, 2])
tf.gather(y_hat, y, batch_dims=1)

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.1, 0.5], dtype=float32)>

Minimizing cross-entropy maximizes the log-likelihood of the correct labels. It keeps rewarding higher confidence, nudging 0.51\to0.99 even after the decision is already right.
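
For comparison, a minimal NumPy sketch (NumPy is an assumption here, not the slides' framework) performs the same selection with integer-array indexing, then evaluates the per-example loss at the two confidence levels mentioned above:

```python
import numpy as np

y_hat = np.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = np.array([0, 2])

# Fancy indexing: row i, column y[i], for all i at once.
picked = y_hat[np.arange(len(y)), y]

# The per-example loss keeps shrinking as confidence grows.
loss_at_051 = -np.log(0.51)   # about 0.673
loss_at_099 = -np.log(0.99)   # about 0.010
```

Both probabilities already yield the correct argmax, yet the loss differs by a factor of more than 60, so the gradient keeps pushing.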

Average over the batch

Take the negative log of each selected probability, then average. A tiny clip keeps the log finite when a probability underflows to 0:

def cross_entropy(y_hat, y):
    p = tf.gather(y_hat, y, batch_dims=1)
    # Tiny clip to keep log finite when softmax outputs underflow to 0.
    return -tf.reduce_mean(tf.math.log(tf.maximum(p, 1e-12)))

cross_entropy(y_hat, y)

<tf.Tensor: shape=(), dtype=float32, numpy=1.497866153717041>

The clip only masks \log 0; it does not fix the upstream overflow, and it silently kills the gradient on any clamped entry. The cure is the concise-softmax-regression section’s fused loss.
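
As a preview of the idea, and only as a sketch under stated assumptions (NumPy, float64, a hypothetical function name; not the later section's or any framework's implementation), a fused loss works from logits and never forms the probabilities at all:

```python
import numpy as np

def cross_entropy_from_logits(logits, y):
    """Mean of -log softmax(logits)[i, y[i]], computed without forming softmax."""
    logits = np.asarray(logits, dtype=np.float64)
    m = logits.max(axis=1, keepdims=True)
    # log-sum-exp: log(sum_k exp(z_k)) = m + log(sum_k exp(z_k - m))
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    correct = logits[np.arange(len(y)), y]
    return np.mean(lse - correct)

logits = np.array([[1000.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
y = np.array([1, 2])
loss = cross_entropy_from_logits(logits, y)   # (1000 + log 3) / 2
```

The first row's true loss is 1000. The clipped version would cap that entry at -\log(10^{-12}) \approx 27.6, which is why the clip hides the problem rather than solving it.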

Register it as the loss

Attach cross_entropy as the model’s loss, and every reused training utility now knows how to optimize this classifier:

@d2l.add_to_class(SoftmaxRegressionScratch)
def loss(self, y_hat, y):
    return cross_entropy(y_hat, y)

Train & Predict

fit on Fashion-MNIST, then inspect mistakes

Train on Fashion-MNIST

Training

Ten epochs of minibatch SGD on Fashion-MNIST. The inherited Classifier runs the validation loop and plots train/validation loss alongside validation accuracy, no extra code:

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
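
What the Trainer does on each minibatch can be sketched by hand. The version below is a toy under loud assumptions: NumPy instead of TensorFlow, synthetic data instead of Fashion-MNIST, and a hand-derived gradient (softmax output minus the one-hot label) in place of automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 20, 3                      # toy sizes, not Fashion-MNIST
true_W = rng.normal(size=(d, k))
X = rng.normal(size=(n, d))
y = np.argmax(X @ true_W, axis=1)         # synthetic, linearly separable labels

W = rng.normal(0.0, 0.01, size=(d, k))
b = np.zeros(k)
lr, batch_size = 0.1, 64

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical safety
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mean_loss(X, y):
    p = softmax(X @ W + b)[np.arange(len(y)), y]
    return -np.mean(np.log(p))

loss_before = mean_loss(X, y)
for epoch in range(10):
    order = rng.permutation(n)            # reshuffle every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        probs = softmax(Xb @ W + b)
        # Gradient of the mean cross-entropy w.r.t. the logits:
        # (probs - one_hot(y)) / batch size.
        grad_logits = probs.copy()
        grad_logits[np.arange(len(yb)), yb] -= 1.0
        grad_logits /= len(yb)
        W -= lr * Xb.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
loss_after = mean_loss(X, y)
```

Comparing loss_before with loss_after shows the same downward trend that the training plot displays, on a much smaller problem.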

Predict on a fresh batch

Prediction

Take the argmax of the model’s outputs over a fresh validation batch, one predicted class per image:

X, y = next(iter(data.val_dataloader()))
preds = d2l.argmax(model(X), axis=1)
preds.shape

TensorShape([256])

Look at the mistakes

The interesting cases are the errors. Tile the misclassified images, each captioned true / predicted:

wrong = d2l.astype(preds, y.dtype) != y
X, y, preds = X[wrong], y[wrong], preds[wrong]
labels = [a+'\n'+b for a, b in zip(
    data.text_labels(y), data.text_labels(preds))]
data.visualize([X, y], labels=labels)

82%: a linear ceiling

Sweep the whole validation set and average the per-example correct flags returned by accuracy(..., averaged=False): the overall validation accuracy lands at roughly 82–83%, matching the validation curve above.
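
The sweep can be sketched as follows. The helper per_example_correct and the two tiny batches are hypothetical stand-ins for the d2l accuracy method and the real validation loader:

```python
import numpy as np

def per_example_correct(y_hat, y):
    """1.0 where the argmax prediction matches the label, else 0.0."""
    preds = np.argmax(y_hat, axis=1)
    return (preds == y).astype(np.float32)

# Two hypothetical validation batches of (probabilities, labels).
batches = [
    (np.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]]), np.array([2, 0])),
    (np.array([[0.7, 0.2, 0.1]]), np.array([0])),
]
flags = np.concatenate([per_example_correct(p, t) for p, t in batches])
accuracy = flags.mean()   # 2 of 3 examples correct
```

Concatenating the flags before averaging weighs every example equally, even when the last batch is smaller than the rest; averaging per-batch accuracies would not.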

That is the ceiling of a linear model on Fashion-MNIST, not a tuning artifact. The next slide shows where the remaining 17–18% lives.

The errors form two blocks, not a blur

Prediction · the confusion matrix

Accumulate a 10\times 10 count matrix over the validation set (the base-classification section’s confusion matrix), normalize each column, and the misses turn out to be anything but uniform:

Upper-body garments (t-shirt, pullover, dress, coat, shirt) trade errors almost exclusively among themselves; the shirt column is the most polluted of all, leaking into t-shirt, pullover, and coat.
Footwear (sandal, sneaker, ankle boot) forms a second, smaller cluster.
Trousers and bags are nearly pure diagonal: silhouette suffices.

Same outline, same mass distribution → indistinguishable to a model that can only weigh pixels linearly.
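
Accumulating and normalizing such a matrix might look like the sketch below. The orientation (rows are true classes, columns are predicted classes) is an assumption, since the slides do not state it, and the seven labels are made up:

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """counts[i, j] = number of examples with true class i predicted as class j."""
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(counts, (y_true, y_pred), 1)   # handles repeated (i, j) pairs
    return counts

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
counts = confusion_counts(y_true, y_pred, num_classes=3)

# Normalize each column so it sums to 1 (guarding against empty columns).
col_totals = counts.sum(axis=0, keepdims=True)
normalized = counts / np.maximum(col_totals, 1)
```

Under this orientation a column answers "when the model says class j, what was the image really?", so off-diagonal mass in a column measures how polluted that prediction is.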

Why a linear model caps out

The ceiling

A linear classifier draws straight decision boundaries: the softmax-regression section’s picture, now with a price tag. In pixel space shirts and pullovers overlap, and no hyperplane separates them.

The capacity of lines is finite: in the plane a line shatters any 3 points in general position but never the 4-point XOR pattern (the generalization-in-classification section makes this precise). A single hidden layer (the multilayer-perceptrons chapter) bends the boundary and pushes past the ceiling.
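
The XOR claim has a short proof, sketched here with the standard labeling (positive at (0,1) and (1,0), negative at (0,0) and (1,1)); the symbols w_1, w_2, b are introduced only for this argument. A separating line f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + b would need:

```latex
\begin{aligned}
f(0,1) &= w_2 + b > 0, & f(1,0) &= w_1 + b > 0
  &&\Rightarrow\; w_1 + w_2 + 2b > 0,\\
f(0,0) &= b < 0, & f(1,1) &= w_1 + w_2 + b < 0
  &&\Rightarrow\; w_1 + w_2 + 2b < 0.
\end{aligned}
```

The two sums contradict each other, so no choice of w_1, w_2, b works.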

Recap

Wrap-up

Softmax = exp, row-sum, divide → a probability distribution over classes.
Cross-entropy = -\log \hat{y}_{\text{true}}, averaged over the batch: the natural classification loss.
Model = flatten → one linear layer (784\times10) → softmax.

Training reuses the regression Trainer; Classifier adds accuracy reporting for free.
82–83% is the linear ceiling on Fashion-MNIST; the confusion matrix shows the errors in two blocks (upper-body garments, footwear), exactly where silhouette fails.
exp(1000) overflows to inf and inf/inf = NaN: the naive softmax is fragile and the clip merely hides it; the concise-softmax-regression section derives the real fix.