Softmax Regression Implementation from Scratch

Dive into Deep Learning · §3.4

Softmax regression from scratch
The whole classifier, opened up: the softmax, the cross-entropy loss, and the training loop, each built by hand.

The same recipe, two new pieces

Motivation

Linear regression mapped inputs to one number. A classifier maps them to a distribution over classes. Two new parts do that:

Softmax turns raw scores (logits) into probabilities.
Cross-entropy is the loss that scores a distribution.

Everything else, the Module / Trainer scaffold, is reused from the regression chapter; Classifier just adds accuracy reporting.

The Softmax

from scores to a probability distribution

First, a reminder: sums along an axis

Softmax normalizes each row, so we need a per-row sum. axis=1 collapses the columns; keepdims holds the shape for broadcasting:

import torch
from d2l import torch as d2l
X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)

(tensor([[5., 7., 9.]]),
 tensor([[ 6.],
         [15.]]))

axis=0 sums down columns, axis=1 sums across rows. keepdims=True keeps a length-1 axis so the result still broadcasts against X.

Softmax: exponentiate, sum, divide

\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.

Three steps: exponentiate every score, sum across the class axis, divide each row by its total:

def softmax(X):
    X_exp = d2l.exp(X)
    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here

Naive exp overflows for large logits. Fine for teaching; never use it in production. The stable fix arrives in the concise-softmax-regression section.

The output is a real distribution

Feed in any matrix: every entry becomes non-negative and each row sums to 1, exactly what a probability distribution over classes requires:

X = d2l.rand((2, 5))
X_prob = softmax(X)
X_prob, d2l.reduce_sum(X_prob, 1)

(tensor([[0.1487, 0.2466, 0.1683, 0.2644, 0.1720],
         [0.1916, 0.1833, 0.2432, 0.2470, 0.1350]]),
 tensor([1., 1.]))

One logit of 1000, and the answer is NaN

The Softmax · numerical failure

A single logit of 1000 sends \exp to infinity in float32, so the largest entry becomes \infty/\infty = NaN while the other entries become 1/\infty = 0, corrupting the row. The framework’s softmax subtracts the row maximum before exponentiating and stays finite on the identical input:

z = torch.tensor([1000., 0., 0.])
naive = torch.exp(z) / torch.exp(z).sum()  # exp(1000) overflows -> nan
stable = torch.softmax(z, dim=0)           # built-in subtracts the max first, so it stays finite
naive, stable

(tensor([nan, 0., 0.]), tensor([1., 0., 0.]))

One NaN poisons every downstream gradient. The concise-softmax-regression section derives the fix (fuse softmax and log via log-sum-exp) and shows the frameworks already ship it.
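
The max-shift works because softmax is unchanged when the same constant c is subtracted from every logit in a row: \exp(x_j - c) / \sum_k \exp(x_k - c) = \exp(x_j) / \sum_k \exp(x_k). Below is a minimal framework-free sketch of that idea, written in plain Python so it runs without torch. The name stable_softmax is hypothetical, and this is an illustration of the shift, not the framework's actual implementation:

```python
import math

def stable_softmax(row):
    """Softmax of one row of logits, shifted by the row maximum."""
    m = max(row)                            # largest logit in the row
    exps = [math.exp(x - m) for x in row]   # every exponent is <= 0, so each exp is <= 1
    total = sum(exps)                       # >= 1, because exp(m - m) = 1
    return [e / total for e in exps]

# The shifted logits are [0, -1000, -1000], so nothing overflows.
print(stable_softmax([1000.0, 0.0, 0.0]))
```

In plain Python the unshifted math.exp(1000.0) raises OverflowError instead of returning inf, but the cure is the same: after the shift the largest exponent is exactly 0.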

The Model

one linear layer, ten logits

Parameters: a 784×10 weight matrix

Each 28\times28 image flattens to a length-784 vector; with 10 classes the weights are a 784\times10 matrix W plus a length-10 bias b. Initialize W with Gaussian noise, b with zeros:

class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W = torch.normal(0, sigma, size=(num_inputs, num_outputs),
                              requires_grad=True)
        self.b = torch.zeros(num_outputs, requires_grad=True)

    def parameters(self):
        return [self.W, self.b]

Forward pass: flatten → linear → softmax

The model is one expression: reshape the batch to rows of 784, apply the affine map \mathbf{X}\mathbf{W}+\mathbf{b}, then softmax into per-class probabilities:

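@d2l.add_to_class(SoftmaxRegressionScratch)  # attach forward to the class defined above
def forward(self, X):
    X = d2l.reshape(X, (-1, self.W.shape[0]))  # flatten each image to a row of 784
    return softmax(d2l.matmul(X, self.W) + self.b)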

Cross-Entropy Loss

the loss for a predicted distribution

The loss for distributions

For an integer label y, the loss on one example is just the negative log-probability the model assigned to the correct class:

\ell = -\log \hat{y}_{y}.

We pick out \hat{y}_y for every row with fancy indexing, no Python loop. True labels 0 and 2 select entry 0 of the first row (0.1) and entry 2 of the second row (0.5):

y = d2l.tensor([0, 2])
y_hat = d2l.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y_hat[[0, 1], y]

tensor([0.1000, 0.5000])

Minimizing cross-entropy maximizes the log-likelihood of the correct labels. It keeps rewarding higher confidence, nudging 0.51\to0.99 even after the decision is already right.
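
To see why the loss keeps pushing after the prediction is already correct, the short check below evaluates -\log p at a few probabilities of the true class. The values 0.51, 0.90, and 0.99 are chosen here purely for illustration:

```python
import math

# Per-example cross-entropy as a function of the probability given to the true class.
probs = (0.51, 0.90, 0.99)
losses = [-math.log(p) for p in probs]

for p, loss in zip(probs, losses):
    print(f"p = {p:.2f}  loss = {loss:.4f}")
```

The loss falls from about 0.673 at p = 0.51 to about 0.010 at p = 0.99 and only reaches 0 at p = 1, so gradient descent always has something left to gain from more confidence.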

Average over the batch

Take the negative log of each selected probability, then average. A tiny clip keeps the log finite when a probability underflows to 0:

def cross_entropy(y_hat, y):
    # Tiny clip to keep log finite when softmax outputs underflow to 0.
    p = y_hat[list(range(len(y_hat))), y].clamp(min=1e-12)
    return -d2l.reduce_mean(d2l.log(p))

cross_entropy(y_hat, y)

tensor(1.4979)
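
As a check on the printed value: the two selected probabilities are 0.1 and 0.5, so with the natural logarithm the batch average works out by hand as

\ell = -\tfrac{1}{2}\left(\log 0.1 + \log 0.5\right) = \tfrac{1}{2}\left(2.3026 + 0.6931\right) \approx 1.4979.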

The clip only masks \log 0; it does not fix the upstream overflow, and it silently kills the gradient on any clamped entry. The cure is the concise-softmax-regression section’s fused loss.

Register it as the loss

Attach cross_entropy as the model’s loss, and every reused training utility now knows how to optimize this classifier:

@d2l.add_to_class(SoftmaxRegressionScratch)  # attach loss to the class defined above
def loss(self, y_hat, y):
    return cross_entropy(y_hat, y)

Train & Predict

fit on Fashion-MNIST, then inspect mistakes

Train on Fashion-MNIST

Training

Ten epochs of minibatch SGD on Fashion-MNIST. The inherited Classifier runs the validation loop and plots train/validation loss alongside validation accuracy, no extra code:

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
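
What the trainer does on each step can be sketched without the framework. The block below is a toy, full-batch version under stated assumptions: the six 2-D points, the three classes, the step count, and every function name are hypothetical, and the gradient uses the standard identity that the derivative of cross-entropy with respect to the logits is the softmax output minus the one-hot label. The real run instead draws minibatches of 256 images and gets its gradients from autograd.

```python
import math

# Toy data (hypothetical): six 2-D points in three well-separated classes.
X = [(2.0, 0.0), (3.0, 0.5), (-2.0, 0.0), (-3.0, -0.5), (0.0, 3.0), (0.5, 4.0)]
y = [0, 0, 1, 1, 2, 2]
num_inputs, num_classes, lr = 2, 3, 0.1

# Zeros instead of Gaussian noise, so the sketch is deterministic.
W = [[0.0] * num_classes for _ in range(num_inputs)]   # shape (2, 3)
b = [0.0] * num_classes

def softmax(o):
    m = max(o)                                  # shift by the max for stability
    e = [math.exp(v - m) for v in o]
    s = sum(e)
    return [v / s for v in e]

def forward(x):
    # logits o_k = sum_j x_j * W[j][k] + b_k, then softmax
    return softmax([sum(x[j] * W[j][k] for j in range(num_inputs)) + b[k]
                    for k in range(num_classes)])

def mean_loss():
    return -sum(math.log(forward(x)[t]) for x, t in zip(X, y)) / len(X)

def predict(x):
    p = forward(x)
    return p.index(max(p))                      # argmax over classes

loss_before = mean_loss()
for step in range(2000):
    gW = [[0.0] * num_classes for _ in range(num_inputs)]
    gb = [0.0] * num_classes
    for x, t in zip(X, y):
        p = forward(x)
        for k in range(num_classes):
            # d(mean loss)/d(logit_k) = (softmax_k - onehot_k) / batch size
            d = (p[k] - (1.0 if k == t else 0.0)) / len(X)
            gb[k] += d
            for j in range(num_inputs):
                gW[j][k] += d * x[j]
    # Gradient-descent update: parameter <- parameter - lr * gradient
    for k in range(num_classes):
        b[k] -= lr * gb[k]
        for j in range(num_inputs):
            W[j][k] -= lr * gW[j][k]
loss_after = mean_loss()

accuracy = sum(int(predict(x) == t) for x, t in zip(X, y)) / len(y)
print(loss_before, loss_after, accuracy)
```

With all-zero parameters every class gets probability 1/3, so the starting loss is \log 3 \approx 1.0986; the updates then drive it down until all six toy points are classified correctly.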

Predict on a fresh batch

Prediction

Take the argmax of the model’s outputs over a fresh validation batch, one predicted class per image:

X, y = next(iter(data.val_dataloader()))
with torch.no_grad():
    preds = d2l.argmax(model(X), axis=1)
preds.shape

torch.Size([256])

Look at the mistakes

The interesting cases are the errors. Tile the misclassified images, each captioned true / predicted:

wrong = d2l.astype(preds, y.dtype) != y
X, y, preds = X[wrong], y[wrong], preds[wrong]
labels = [a+'\n'+b for a, b in zip(
    data.text_labels(y), data.text_labels(preds))]
data.visualize([X, y], labels=labels)

82%: a linear ceiling

Sweep the whole validation set and average the per-example correct flags:

correct = []
for X_i, y_i in data.val_dataloader():
    with torch.no_grad():
        correct.append(model.accuracy(model(X_i), y_i, averaged=False))
print(f'Validation accuracy: {torch.cat(correct).mean():.3f}')

Validation accuracy: 0.823

Roughly 82–83% run to run: the ceiling of a linear model on Fashion-MNIST, not a tuning artifact. The next slide shows where the remaining 17–18% goes.

The errors form two blocks, not a blur

Prediction · the confusion matrix

Accumulate a 10\times 10 count matrix over the validation set (the base-classification section’s confusion matrix) and normalize each column:

Upper-body garments (t-shirt, pullover, dress, coat, shirt) trade errors almost exclusively among themselves; the shirt column is the most polluted of all.
Footwear (sandal, sneaker, ankle boot) forms a second cluster.
Trousers and bags are nearly pure diagonal: silhouette suffices.

Same outline, same mass distribution → indistinguishable to a model that can only weigh pixels linearly.
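
The accumulation behind that matrix can be sketched in a few lines. This is an illustrative, framework-free version on made-up labels for three classes. The helper names and the layout (rows index the true class, columns the predicted class) are assumptions made here, not necessarily the convention of the base-classification section:

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Counts C[i][j]: examples with true class i that were predicted as class j."""
    C = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        C[t][p] += 1
    return C

def normalize_columns(C):
    """Divide each column by its total so every non-empty column sums to 1."""
    n = len(C)
    totals = [sum(C[i][j] for i in range(n)) for j in range(n)]
    return [[C[i][j] / totals[j] if totals[j] else 0.0 for j in range(n)]
            for i in range(n)]

# Made-up labels (hypothetical): classes 0 and 1 get confused, class 2 is clean.
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 0, 1, 0, 1, 2, 2]

C = confusion_matrix(y_true, y_pred, 3)
N = normalize_columns(C)
print(C)   # raw counts
print(N)   # column j: which true classes end up predicted as j
```

Under this layout a polluted column is one with large off-diagonal entries, and a nearly pure diagonal column (class 2 in the toy data) plays the role of trousers and bags.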

Why a linear model caps out

The ceiling

A linear classifier draws straight decision boundaries: the softmax-regression section’s picture, now with a price tag. In pixel space shirts and pullovers overlap, and no hyperplane separates them.

The capacity of lines is finite: in the plane a line can realize every labeling of 3 non-collinear points but never the 4-point XOR pattern (the generalization-in-classification section makes this precise). A single hidden layer (the multilayer-perceptrons chapter) bends the boundary and pushes past the ceiling.
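
The XOR claim can be checked directly. Label (0,1) and (1,0) positive and (0,0) and (1,1) negative, and suppose some line w_1 x_1 + w_2 x_2 + b = 0 separated them. The two positive points would require

w_2 + b > 0, \quad w_1 + b > 0 \;\Longrightarrow\; w_1 + w_2 + 2b > 0,

while the two negative points would require

b < 0, \quad w_1 + w_2 + b < 0 \;\Longrightarrow\; w_1 + w_2 + 2b < 0.

Both cannot hold at once, so no line realizes this labeling.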

Recap

Wrap-up

Softmax = exp, row-sum, divide → a probability distribution over classes.
Cross-entropy = -\log \hat{y}_{\text{true}}, averaged over the batch: the natural classification loss.
Model = flatten → one linear layer (784\times10) → softmax.

Training reuses the regression Trainer; Classifier adds accuracy reporting for free.
82–83% is the linear ceiling on Fashion-MNIST; the confusion matrix shows the errors in two blocks (upper-body garments, footwear), exactly where silhouette fails.
exp(1000) overflows to \infty in float32, and \infty/\infty = NaN: the naive softmax is fragile and the clip merely hides it; the concise-softmax-regression section derives the real fix.