Softmax Regression Implementation from Scratch

Dive into Deep Learning · §3.4

Softmax regression from scratch
The whole classifier, opened up: the softmax, the cross-entropy loss, and the training loop, each built by hand.

The same recipe, two new pieces

Motivation

Linear regression mapped inputs to one number. A classifier maps them to a distribution over classes. Two new parts do that:

Softmax turns raw scores (logits) into probabilities.
Cross-entropy is the loss that scores a distribution.

Everything else, the Module / Trainer scaffold, is reused from the regression chapter; Classifier just adds accuracy reporting.

The Softmax

from scores to a probability distribution

First, a reminder: sums along an axis

Softmax normalizes each row, so we need a per-row sum. axis=1 collapses the columns; keepdims holds the shape for broadcasting:

X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)

(Array([[5., 7., 9.]], dtype=float32),
 Array([[ 6.],
        [15.]], dtype=float32))

axis=0 sums down columns, axis=1 sums across rows. keepdims=True keeps a length-1 axis so the result still broadcasts against X.
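To see why keepdims matters for the division that softmax performs, the sketch below compares the two result shapes. It uses plain jax.numpy instead of the d2l wrappers so that it runs on its own; the variable names kept, flat, and row_normalized are introduced here for illustration.

```python
import jax.numpy as jnp

X = jnp.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

kept = X.sum(axis=1, keepdims=True)   # shape (2, 1): one total per row
flat = X.sum(axis=1)                  # shape (2,): the reduced axis is gone

# (2, 3) / (2, 1) stretches the single column across all three columns,
# so every row is divided by its own total.
row_normalized = X / kept

print(kept.shape, flat.shape)
print(row_normalized.sum(axis=1))     # each row now sums to 1

# X / flat is not a row-wise division: a (2,) array is aligned with the
# last axis (length 3), so the shapes do not broadcast.
```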

Softmax: exponentiate, sum, divide

\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.

Three steps: exponentiate every score, sum across the class axis, divide each row by its total:

def softmax(X):
    X_exp = d2l.exp(X)
    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here

Naive exp overflows once a logit exceeds about 88.7 in float32. Fine for teaching; never use it in production. The stable fix arrives in the concise-softmax-regression section.

The output is a real distribution

Feed in any matrix: every entry becomes non-negative and each row sums to 1, exactly what a probability distribution over classes requires:

X = jax.random.uniform(d2l.get_key(), (2, 5))
X_prob = softmax(X)
X_prob, d2l.reduce_sum(X_prob, 1)

(Array([[0.15490438, 0.15702507, 0.27504665, 0.22082093, 0.19220296],
        [0.14286114, 0.14374252, 0.23495987, 0.13943408, 0.33900243]],      dtype=float32),
 Array([1., 1.], dtype=float32))

One logit of 1000, and the answer is NaN

The Softmax · numerical failure

A single logit of 1000 makes \exp overflow to \infty in float32, so the largest entry becomes \infty/\infty = NaN while the others become 1/\infty = 0, and the output is no longer a distribution. The framework’s softmax shifts by the row maximum first and stays finite on the identical input:

z = jnp.array([1000., 0., 0.])
naive = jnp.exp(z) / jnp.exp(z).sum()      # exp(1000) overflows -> nan
stable = jax.nn.softmax(z, axis=0)         # built-in subtracts the max before exponentiating
naive, stable

(Array([nan,  0.,  0.], dtype=float32), Array([1., 0., 0.], dtype=float32))

One NaN poisons every downstream gradient. The concise-softmax-regression section derives the fix (fuse softmax and log via log-sum-exp) and shows the frameworks already ship it.
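The max shift works because subtracting the same constant c from every logit in a row leaves softmax unchanged: the factor \exp(-c) cancels between numerator and denominator. A minimal sketch of that shift follows, assuming 2-D input with classes on axis 1. The name stable_softmax is introduced here for illustration, and this is not the fused softmax-plus-log loss that the concise-softmax-regression section derives.

```python
import jax
import jax.numpy as jnp

def stable_softmax(X):
    # Shift each row so its largest logit is 0; exp then never exceeds 1.
    X_shift = X - jnp.max(X, axis=1, keepdims=True)
    X_exp = jnp.exp(X_shift)
    return X_exp / jnp.sum(X_exp, axis=1, keepdims=True)

z = jnp.array([[1000.0, 0.0, 0.0]])
out = stable_softmax(z)
print(out)  # finite, unlike the naive version on the same input

# On moderate logits the shift changes nothing except rounding.
small = jnp.array([[1.0, 2.0, 3.0]])
naive_small = jnp.exp(small) / jnp.exp(small).sum(axis=1, keepdims=True)
print(jnp.allclose(stable_softmax(small), naive_small))
```

Because the largest shifted logit is exactly 0, the denominator is at least 1, so the division cannot produce \infty/\infty.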

The Model

one linear layer, ten logits

Parameters: a 784×10 weight matrix

Each 28\times28 image flattens to a length-784 vector; with 10 classes the weights are a 784\times10 matrix W plus a length-10 bias b. Initialize W with Gaussian noise, b with zeros:

class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01, rngs=None):
        super().__init__()
        self.save_hyperparameters(ignore=['rngs'])
        rngs = nnx.Rngs(d2l.get_key()) if rngs is None else rngs
        self.W = nnx.Param(
            rngs.params.normal((num_inputs, num_outputs)) * sigma)
        self.b = nnx.Param(jnp.zeros(num_outputs))

Forward pass: flatten → linear → softmax

The model is one expression: reshape the batch to rows of 784, apply the affine map \mathbf{X}\mathbf{W}+\mathbf{b}, then softmax into per-class probabilities:

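@d2l.add_to_class(SoftmaxRegressionScratch)  # attach forward to the class defined above
def forward(self, X):
    X = d2l.reshape(X, (-1, self.W.shape[0]))  # flatten each image to a row of 784
    return softmax(d2l.matmul(X, self.W) + self.b)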
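The shapes through the forward pass can be traced with a standalone sketch. Everything here is an assumption made for illustration: the batch of four random "images", the (batch, 28, 28, 1) layout, and the use of jax.nn.softmax in place of the hand-written softmax so that the block runs on its own.

```python
import jax
import jax.numpy as jnp

key_x, key_w = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.uniform(key_x, (4, 28, 28, 1))    # hypothetical batch of 4 images
W = jax.random.normal(key_w, (784, 10)) * 0.01   # same shape as the model's W
b = jnp.zeros(10)                                # same shape as the model's b

X_flat = X.reshape((-1, W.shape[0]))     # (4, 784): one row per image
logits = X_flat @ W + b                  # (4, 10): b broadcasts over the batch
probs = jax.nn.softmax(logits, axis=1)   # (4, 10): each row sums to 1

print(X_flat.shape, logits.shape, probs.shape)
```

The -1 in the reshape lets the batch size vary while the second dimension is pinned to the number of rows in W.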

Cross-Entropy Loss

the loss for a predicted distribution

The loss for distributions

For an integer label y, the loss on one example is just the negative log-probability the model assigned to the correct class:

\ell = -\log \hat{y}_{y}.

We pick out \hat{y}_y for every row with fancy indexing, no Python loop. True labels 0 and 2 select 0.1 from the first row and 0.5 from the second:

y = d2l.tensor([0, 2])
y_hat = d2l.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y_hat[[0, 1], y]

Array([0.1, 0.5], dtype=float32)
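The same selection can also be written with jnp.take_along_axis, which is the call the batch-averaged loss in this section uses. The sketch below checks that the two forms agree on the example above; picked_fancy and picked_take are names introduced here.

```python
import jax.numpy as jnp

y = jnp.array([0, 2])
y_hat = jnp.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])

# Fancy indexing: pair row i with column y[i].
rows = jnp.arange(y_hat.shape[0])
picked_fancy = y_hat[rows, y]

# take_along_axis: indices need the same rank as y_hat, hence the extra axis.
picked_take = jnp.take_along_axis(y_hat, y[:, None], axis=1).squeeze(-1)

print(picked_fancy, picked_take)
```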

Minimizing cross-entropy maximizes the log-likelihood of the correct labels. It keeps rewarding higher confidence, nudging 0.51\to0.99 even after the decision is already right.
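A quick calculation makes the "keeps rewarding confidence" point concrete. The three probabilities are illustrative values for the correct class, not results from the model.

```python
import jax.numpy as jnp

# Probability assigned to the correct class, from barely right to confident.
p = jnp.array([0.51, 0.9, 0.99])
loss = -jnp.log(p)

print(loss)  # roughly 0.673, 0.105, 0.010
```

All three predictions already pick the right class, yet the loss falls by a factor of about 67 between the first and the last, so gradient descent still has something to gain.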

Average over the batch

Take the negative log of each selected probability, then average. A tiny clip keeps the log finite when a probability underflows to 0:

def cross_entropy(y_hat, y):
    # Tiny clip to keep log finite when softmax outputs underflow to 0.
    p = jnp.clip(jnp.take_along_axis(y_hat, jnp.expand_dims(y, -1),
                                     axis=1).squeeze(-1), min=1e-12)
    return -d2l.reduce_mean(d2l.log(p))

cross_entropy(y_hat, y)

Array(1.4978662, dtype=float32)

The clip only masks \log 0; it does not fix the upstream overflow, and it silently kills the gradient on any clamped entry. The cure is the concise-softmax-regression section’s fused loss.
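The vanishing gradient on a clamped entry can be checked directly with jax.grad. This is a sketch on a single scalar probability; clipped_nll is a name introduced here, and the one-sided clamp is written with jnp.maximum, which is assumed to behave like the clip with only a lower bound.

```python
import jax
import jax.numpy as jnp

def clipped_nll(p):
    # Negative log of one probability, with a lower clamp at 1e-12.
    return -jnp.log(jnp.maximum(p, 1e-12))

grad_fn = jax.grad(clipped_nll)

g_inside = grad_fn(0.5)    # clamp inactive: derivative is -1/p = -2
g_clamped = grad_fn(0.0)   # clamp active: the output no longer depends on p

print(g_inside, g_clamped)
```

Once the clamp is active the loss is a constant in p, so no signal flows back to the parameters that produced that probability.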

Register it as the loss

Attach cross_entropy as the model’s loss, and every reused training utility now knows how to optimize this classifier:

@d2l.add_to_class(SoftmaxRegressionScratch)  # attach loss to the class defined above
def loss(self, y_hat, y):
    return cross_entropy(y_hat, y)

Train & Predict

fit on Fashion-MNIST, then inspect mistakes

Train on Fashion-MNIST

Training

Ten epochs of minibatch SGD on Fashion-MNIST. The inherited Classifier runs the validation loop and plots train/validation loss alongside validation accuracy, no extra code:

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
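For intuition about what one minibatch SGD update does to this model, here is a standalone sketch. It is not the code inside d2l.Trainer: loss_fn and sgd_step are hypothetical names, the data are random stand-ins for a batch of flattened images, and the loss uses jax.nn.log_softmax for numerical safety where the scratch version applies softmax and log separately.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, X, y):
    W, b = params
    log_probs = jax.nn.log_softmax(X @ W + b, axis=1)
    picked = jnp.take_along_axis(log_probs, y[:, None], axis=1).squeeze(-1)
    return -picked.mean()  # cross-entropy averaged over the minibatch

def sgd_step(params, X, y, lr):
    grads = jax.grad(loss_fn)(params, X, y)  # gradients w.r.t. (W, b)
    return tuple(p - lr * g for p, g in zip(params, grads))

X = jax.random.normal(jax.random.PRNGKey(0), (8, 784))  # stand-in minibatch
y = jnp.arange(8) % 10                                  # stand-in labels
params = (jnp.zeros((784, 10)), jnp.zeros(10))

before = loss_fn(params, X, y)   # log(10): all-zero weights predict uniformly
params = sgd_step(params, X, y, lr=0.1)
after = loss_fn(params, X, y)
print(before, after)
```

Training repeats this update over every minibatch in every epoch.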

Predict on a fresh batch

Prediction

Take the argmax of the model’s outputs over a fresh validation batch, one predicted class per image:

X, y = next(iter(data.val_dataloader()))
preds = d2l.argmax(model(X), axis=1)
preds.shape

(256,)

Look at the mistakes

The interesting cases are the errors. Tile the misclassified images, each captioned true / predicted:

wrong = d2l.astype(preds, y.dtype) != y
X, y, preds = X[wrong], y[wrong], preds[wrong]
labels = [a+'\n'+b for a, b in zip(
    data.text_labels(y), data.text_labels(preds))]
data.visualize([X, y], labels=labels)

82%: a linear ceiling

Sweep the whole validation set and average the per-example correct flags returned by accuracy(..., averaged=False): the overall validation accuracy lands at roughly 82–83%, matching the validation curve above.

That is the ceiling of a linear model on Fashion-MNIST, not a tuning artifact. The next slide shows where the missing 17–18% lives.
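The reason to collect per-example flags before averaging shows up whenever batches differ in size. In the sketch below, correct_flags is a hypothetical stand-in for accuracy(..., averaged=False), and the two tiny batches are made-up scores and labels.

```python
import jax.numpy as jnp

def correct_flags(scores, y):
    # 1.0 where the argmax class matches the label, else 0.0.
    preds = jnp.argmax(scores, axis=1)
    return (preds == y).astype(jnp.float32)

# Two made-up batches of unequal size (2 and 3 examples, 3 classes).
batches = [
    (jnp.array([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]]), jnp.array([0, 1])),
    (jnp.array([[0.1, 0.2, 3.0], [1.0, 0.9, 0.8], [0.0, 2.0, 1.0]]),
     jnp.array([2, 1, 1])),
]

flags = jnp.concatenate([correct_flags(s, y) for s, y in batches])
accuracy = flags.mean()                                   # 4 of 5 correct
mean_of_batch_means = jnp.mean(
    jnp.array([correct_flags(s, y).mean() for s, y in batches]))

print(accuracy, mean_of_batch_means)
```

Averaging the two batch accuracies would overweight the smaller batch, which is why the flags are pooled first.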

The errors form two blocks, not a blur

Prediction · the confusion matrix

Accumulate a 10\times 10 count matrix over the validation set (the base-classification section’s confusion matrix), normalize each column, and the misses turn out to be anything but uniform:

Upper-body garments (t-shirt, pullover, dress, coat, shirt) trade errors almost exclusively among themselves; the shirt column is the most polluted of all, leaking into t-shirt, pullover, and coat.
Footwear (sandal, sneaker, ankle boot) forms a second, smaller cluster.
Trousers and bags are nearly pure diagonal: silhouette suffices.

Same outline, same mass distribution → indistinguishable to a model that can only weigh pixels linearly.
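The count matrix can be accumulated with a single indexed update. The sketch below assumes a layout the text does not spell out: entry [i, j] counts examples predicted as class i whose true class is j, so that a normalized column is the distribution of predictions for one true class. The function name confusion_matrix and the six toy labels are illustrative.

```python
import jax.numpy as jnp

def confusion_matrix(preds, y, num_classes):
    # Assumed layout: rows index the predicted class, columns the true class.
    cm = jnp.zeros((num_classes, num_classes), dtype=jnp.int32)
    return cm.at[preds, y].add(1)   # repeated (pred, true) pairs accumulate

y = jnp.array([0, 0, 1, 2, 2, 2])       # toy true labels
preds = jnp.array([0, 1, 1, 2, 0, 2])   # toy predictions

cm = confusion_matrix(preds, y, num_classes=3)
col_totals = cm.sum(axis=0, keepdims=True)
cm_norm = cm / jnp.maximum(col_totals, 1)   # guard against empty classes

print(cm)
print(cm_norm)
```

Off-diagonal mass in a column then reads directly as "where examples of this class end up when the model gets them wrong".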

Why a linear model caps out

The ceiling

A linear classifier draws straight decision boundaries: the softmax-regression section’s picture, now with a price tag. In pixel space shirts and pullovers overlap, and no hyperplane separates them.

The capacity of lines is finite: in the plane a line shatters any 3 non-collinear points but never the 4-point XOR pattern (the generalization-in-classification section makes this precise). A single hidden layer (the multilayer-perceptrons chapter) bends the boundary and pushes past the ceiling.
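The XOR claim can be verified in a few lines. Suppose a linear score f(x_1, x_2) = w_1 x_1 + w_2 x_2 + b should be negative on (0,0) and (1,1) and positive on (1,0) and (0,1); the symbols w_1, w_2, b are introduced here for the argument.

```latex
\begin{aligned}
f(0,0) &= b < 0, \\
f(1,1) &= w_1 + w_2 + b < 0, \\
f(1,0) &= w_1 + b > 0, \\
f(0,1) &= w_2 + b > 0.
\end{aligned}
```

Adding the first two inequalities gives w_1 + w_2 + 2b < 0, while adding the last two gives w_1 + w_2 + 2b > 0. Both cannot hold, so no line realizes this labeling.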

Recap

Wrap-up

Softmax = exp, row-sum, divide → a probability distribution over classes.
Cross-entropy = -\log \hat{y}_{\text{true}}, averaged over the batch: the natural classification loss.
Model = flatten → one linear layer (784\times10) → softmax.

Training reuses the regression Trainer; Classifier adds accuracy reporting for free.
82–83% is the linear ceiling on Fashion-MNIST; the confusion matrix shows the errors in two blocks (upper-body garments, footwear), exactly where silhouette fails.
exp(1000) overflows to inf and inf/inf = NaN: the naive softmax is fragile and the clip merely hides it; the concise-softmax-regression section derives the real fix.