Softmax Regression Implementation from Scratch

The same recipe as linear regression, with two new pieces:

  1. Softmax turns logits into a probability distribution.
  2. Cross-entropy is the loss for distributions.

Wired into the same Module / Trainer scaffold from the regression chapter — Classifier adds accuracy reporting and we inherit the rest.

Sums along an axis

Quick reminder before defining softmax — sum along chosen axes:

from d2l import tensorflow as d2l
import tensorflow as tf
X = d2l.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
d2l.reduce_sum(X, 0, keepdims=True), d2l.reduce_sum(X, 1, keepdims=True)
(<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[5., 7., 9.]], dtype=float32)>,
 <tf.Tensor: shape=(2, 1), dtype=float32, numpy=
 array([[ 6.],
        [15.]], dtype=float32)>)
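Why keepdims matters for what follows: a minimal sketch, written in NumPy rather than the d2l/TensorFlow helpers so that it runs standalone (that substitution is an assumption of this example, not part of the chapter's code). Keeping the reduced axis as size 1 is what lets the row sums broadcast back against the original matrix.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# With keepdims=True the reduced axis stays as size 1: shape (2, 1).
row_sums = X.sum(axis=1, keepdims=True)

# Without it the axis is dropped: shape (2,).
row_sums_flat = X.sum(axis=1)

# (2, 3) / (2, 1) broadcasts per row, so each row is divided by its own sum.
normalized = X / row_sums

print(row_sums.shape, row_sums_flat.shape)
print(normalized.sum(axis=1))
```

Dividing X by row_sums_flat instead would raise a broadcasting error here, because shapes (2, 3) and (2,) do not line up.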

Softmax

\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}.

Three steps: exponentiate, sum across the class axis, divide.

def softmax(X):
    X_exp = d2l.exp(X)
    partition = d2l.reduce_sum(X_exp, 1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here

Result: every row is non-negative and sums to 1 — a valid probability distribution over classes:

X = d2l.rand((2, 5))
X_prob = softmax(X)
X_prob, d2l.reduce_sum(X_prob, 1)
(<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
 array([[0.18055958, 0.17420632, 0.24367459, 0.2057537 , 0.19580577],
        [0.16664788, 0.12496491, 0.14281595, 0.24873221, 0.31683904]],
       dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([1., 1.], dtype=float32)>)
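The softmax above exponentiates raw logits, which can overflow when logits are large. A common variant subtracts each row's maximum first; the sketch below shows that idea in NumPy. It is not the implementation used in this section, and the name stable_softmax is made up for illustration.

```python
import numpy as np

def stable_softmax(X):
    """Row-wise softmax that subtracts each row's max before exponentiating."""
    # Shifting by the row max leaves the result unchanged (the factor
    # exp(-max) cancels in the ratio) but keeps exp() from overflowing.
    shifted = X - X.max(axis=1, keepdims=True)
    X_exp = np.exp(shifted)
    return X_exp / X_exp.sum(axis=1, keepdims=True)

# The second row is the first row shifted by a constant, so both rows
# should map to the same probabilities.
logits = np.array([[1.0, 2.0, 3.0], [1000.0, 1001.0, 1002.0]])
probs = stable_softmax(logits)
print(probs)
```

Without the shift, exp(1000.0) overflows to infinity in floating point and the second row would come out as NaN.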

The model

Flatten each 28×28 image into a 784-vector, then apply one linear layer that outputs 10 logits, one per class:

class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.W = tf.random.normal((num_inputs, num_outputs), 0, sigma)
        self.b = tf.zeros(num_outputs)
        self.W = tf.Variable(self.W)
        self.b = tf.Variable(self.b)

The forward pass = flatten → linear → softmax:

@d2l.add_to_class(SoftmaxRegressionScratch)
def forward(self, X):
    X = d2l.reshape(X, (-1, self.W.shape[0]))
    return softmax(d2l.matmul(X, self.W) + self.b)
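To make the shapes in the forward pass concrete, here is a small NumPy trace of flatten, linear, softmax. The batch size of 4 and the (batch, 28, 28, 1) input layout are assumptions chosen for illustration; only the 784 and 10 come from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, num_inputs, num_outputs = 4, 784, 10
X = rng.random((batch_size, 28, 28, 1))        # hypothetical image batch
W = rng.normal(0, 0.01, (num_inputs, num_outputs))
b = np.zeros(num_outputs)

X_flat = X.reshape(-1, W.shape[0])             # (4, 784)
logits = X_flat @ W + b                        # (4, 10); b broadcasts over rows
exp = np.exp(logits)
probs = exp / exp.sum(axis=1, keepdims=True)   # (4, 10); each row sums to 1

print(X_flat.shape, logits.shape, probs.shape)
```

Using -1 in the reshape lets the batch dimension be inferred, so the same code works for any batch size.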

Cross-entropy loss

For label y (an integer class), the loss on one example is just

\ell = -\log \hat{y}_{y}

that is, the negative log of the predicted probability of the correct class. In the two examples below (3 classes each), a one-hot mask selects that probability from each row:

y_hat = tf.constant([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = tf.constant([0, 2])
tf.boolean_mask(y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.1, 0.5], dtype=float32)>

Implementing it

A one-hot boolean mask pulls out y_hat[i, y[i]] for each example; the loss is the mean of the negative logs over the batch:

def cross_entropy(y_hat, y):
    p = tf.boolean_mask(y_hat, tf.one_hot(y, depth=y_hat.shape[-1]))
    # Tiny clip to keep log finite when softmax outputs underflow to 0.
    return -tf.reduce_mean(tf.math.log(tf.maximum(p, 1e-12)))

cross_entropy(y_hat, y)
<tf.Tensor: shape=(), dtype=float32, numpy=1.497866153717041>

Attach it to the model as its loss:

@d2l.add_to_class(SoftmaxRegressionScratch)
def loss(self, y_hat, y):
    return cross_entropy(y_hat, y)
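The same selection can also be written with integer-array indexing. The sketch below does this in NumPy (a stand-in for the TensorFlow version, with the hypothetical name cross_entropy_np) and reproduces the loss value for the two-example batch used above.

```python
import numpy as np

def cross_entropy_np(y_hat, y):
    # y_hat[i, y[i]] for every row i, via integer-array indexing.
    p = y_hat[np.arange(len(y)), y]
    # Same clip as in the section's version, to keep the log finite.
    return -np.mean(np.log(np.maximum(p, 1e-12)))

y_hat = np.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = np.array([0, 2])

picked = y_hat[np.arange(len(y)), y]   # probabilities of the true classes
loss = cross_entropy_np(y_hat, y)      # (-log 0.1 - log 0.5) / 2
print(picked, loss)
```

The result agrees with the 1.4979 computed above, up to floating-point precision.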

Train

10 epochs on Fashion-MNIST. The base Classifier already handles the validation loop and accuracy reporting:

data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
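The internals of d2l.Trainer are not shown here. As a rough picture of what minibatch SGD does for this model, the following NumPy sketch trains softmax regression on synthetic data. The data, sizes, and hand-derived gradient (softmax output minus one-hot label, averaged over the batch) are all assumptions of this example; the actual trainer relies on automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical stand-in for Fashion-MNIST):
# labels come from a random linear rule, so a linear model can fit them.
n, d, k = 512, 20, 3
true_W = rng.normal(size=(d, k))
X = rng.normal(size=(n, d))
y = np.argmax(X @ true_W, axis=1)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(P, y):
    return -np.mean(np.log(np.maximum(P[np.arange(len(y)), y], 1e-12)))

W = rng.normal(0, 0.01, (d, k))
b = np.zeros(k)
lr, batch_size = 0.1, 64

initial_loss = cross_entropy(softmax(X @ W + b), y)
for epoch in range(10):
    perm = rng.permutation(n)              # reshuffle every epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        P = softmax(Xb @ W + b)
        # Gradient of the mean cross-entropy w.r.t. the logits:
        # (P - one_hot(y)) / batch_size
        G = P.copy()
        G[np.arange(len(yb)), yb] -= 1.0
        G /= len(yb)
        W -= lr * (Xb.T @ G)
        b -= lr * G.sum(axis=0)
final_loss = cross_entropy(softmax(X @ W + b), y)
print(initial_loss, final_loss)
```

With near-zero initial weights the starting loss is close to log(3), the loss of a uniform guess over three classes, and it drops as training proceeds.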

Predicting

Pull a fresh validation batch and look at predicted vs. true classes:

X, y = next(iter(data.val_dataloader()))
preds = d2l.argmax(model(X), axis=1)
preds.shape
TensorShape([256])
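Accuracy follows directly from these argmax predictions: it is the fraction of examples whose predicted class equals the label. A minimal NumPy sketch, reusing the two-example y_hat from the cross-entropy section (the helper name accuracy is chosen here for illustration and is not the d2l method):

```python
import numpy as np

def accuracy(y_hat, y):
    """Fraction of rows whose highest-probability class equals the label."""
    preds = np.argmax(y_hat, axis=1)
    return float(np.mean(preds == y))

y_hat = np.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = np.array([0, 2])

# Row 0 predicts class 2 (label 0, wrong); row 1 predicts class 2 (label 2, right).
print(accuracy(y_hat, y))  # 0.5
```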

Tile the misclassified images, each captioned with the true label on the first line and the predicted label on the second:

wrong = d2l.astype(preds, y.dtype) != y
X, y, preds = X[wrong], y[wrong], preds[wrong]
labels = [a+'\n'+b for a, b in zip(
    data.text_labels(y), data.text_labels(preds))]
data.visualize([X, y], labels=labels)

Linear models top out at roughly 83% accuracy on Fashion-MNIST: the easy classes come out right, while ambiguous pairs such as shirt vs. pullover are often confused.
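One way to see which pairs get confused is to count (true, predicted) pairs among the mistakes. The sketch below uses only the standard library; the label lists are made-up placeholders, not model output. In practice they could come from data.text_labels(y) and data.text_labels(preds).

```python
from collections import Counter

# Hypothetical true and predicted label names for a handful of examples.
true_labels = ["shirt", "pullover", "shirt", "coat", "sneaker", "shirt"]
pred_labels = ["pullover", "pullover", "t-shirt", "pullover", "sneaker", "pullover"]

# Count only the mistakes, keyed by (true, predicted).
confusions = Counter(
    (t, p) for t, p in zip(true_labels, pred_labels) if t != p
)

# Most frequent confusions first.
for (t, p), count in confusions.most_common():
    print(f"{t} -> {p}: {count}")
```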

Recap

  • Softmax = exp + row-sum normalization → probabilities.
  • Cross-entropy = -\log p_\text{correct}, the standard classification loss.
  • A 10-output linear layer + softmax + CE loss is the baseline classifier — anything fancier (MLPs, CNNs) just replaces the forward pass.