Linear Algebra

Linear Algebra Toolkit

The minimum linear-algebra vocabulary every chapter that follows assumes:

  • Scalars / vectors / matrices / tensors: ranks 0, 1, 2, and n.
  • Arithmetic: element-wise, with broadcasting.
  • Reductions: sum, mean, along chosen axes.
  • Products: dot, matrix-vector, matrix-matrix.
  • Norms: \ell_1, \ell_2, Frobenius.

Each piece comes with a one-liner of code so you can see the API.

Scalars

Scalars are rank-0 tensors — a single number with all the usual arithmetic operators:

import torch

x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y
(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

Vectors

A vector is a 1-D array of scalars:

x = torch.arange(3)
x
tensor([0, 1, 2])

Element access uses standard indexing:

x[2]
tensor(2)

Length and shape

The length of a vector is its number of elements:

len(x)
3

For higher-rank tensors len() is just shape[0]. Use .shape when you need every axis:

x.shape
torch.Size([3])
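To make the len() versus .shape distinction concrete, here is a minimal sketch on a rank-3 tensor. The tensor T and its sizes are illustrative choices, not taken from the examples above.

```python
import torch

# Hypothetical rank-3 tensor with axes of size 2, 3 and 4.
T = torch.zeros(2, 3, 4)

first_axis = len(T)      # size of axis 0 only
full_shape = T.shape     # size of every axis
n_elements = T.numel()   # total number of elements

print(first_axis, full_shape, n_elements)
```

len() only ever reports the first axis, so .shape (or .numel() for the total count) is the safer habit once tensors have more than one axis.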

Matrices

A matrix is a rank-2 tensor — m rows × n columns:

A = torch.arange(6).reshape(3, 2)
A
tensor([[0, 1],
        [2, 3],
        [4, 5]])

The transpose flips rows and columns; the same data, axes swapped:

A.T
tensor([[0, 2, 4],
        [1, 3, 5]])
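"Same data, axes swapped" can be checked directly. The sketch below assumes the usual PyTorch behavior that .T on a 2-D tensor returns a view rather than a copy; the variable names are illustrative.

```python
import torch

A = torch.arange(6).reshape(3, 2)
At = A.T                      # a view on A, not a copy

# Both tensors start at the same address in memory.
shares_memory = At.data_ptr() == A.data_ptr()

# Writing through the view is visible in the original:
# At[0, 2] and A[2, 0] are the same element.
At[0, 2] = 100
print(shares_memory, A[2, 0])
```

If an independent transposed copy is needed, A.T.clone() allocates new memory.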

Symmetric matrices

A matrix is symmetric when it equals its own transpose:

\mathbf{A} = \mathbf{A}^\top.

A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T
tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

Symmetric matrices appear often in practice: covariance and Gram matrices are symmetric by construction.
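The element-wise comparison above returns a whole matrix of booleans. When a single yes/no answer is wanted, torch.equal is one option. The sketch below also builds a small Gram matrix to show where symmetry comes from; X is an illustrative matrix, not data from the text.

```python
import torch

A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
is_symmetric = torch.equal(A, A.T)      # one Python bool

# A Gram matrix X^T X is symmetric by construction.
X = torch.arange(6, dtype=torch.float32).reshape(3, 2)
G = X.T @ X
gram_is_symmetric = torch.equal(G, G.T)

print(is_symmetric, gram_is_symmetric)
print(G)
```

With general floating-point data an exact comparison can fail through rounding, so torch.allclose(G, G.T) is the more forgiving check.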

Higher-rank tensors

The naming generalizes: a rank-n tensor has n axes. A rank-3 tensor is a stack of matrices, for example a single RGB image (height × width × channels). A batch of images adds a fourth axis: batch × height × width × channels in TensorFlow, batch × channels × height × width in PyTorch. Here is a stack of two 3 × 4 matrices:

torch.arange(24).reshape(2, 3, 4)
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

Element-wise arithmetic

Two tensors of the same shape combine element-wise:

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B
(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

The element-wise product of matrices is the Hadamard product \mathbf{A} \odot \mathbf{B}:

A * B
tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])

Scalar–tensor arithmetic

A scalar broadcasts to every element of a tensor:

a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))
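Scalar broadcasting is the simplest case of a more general rule: axes of size 1 are stretched to match the other operand. A minimal sketch, with col and row as illustrative names:

```python
import torch

col = torch.arange(3).reshape(3, 1)   # shape (3, 1)
row = torch.arange(2).reshape(1, 2)   # shape (1, 2)

# Each size-1 axis is stretched to match, giving shape (3, 2):
# total[i, j] = col[i, 0] + row[0, j]
total = col + row
print(total.shape)
print(total)
```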

Reductions: sum

The sum \sum_i x_i collapses every element into one scalar:

x = torch.arange(3, dtype=torch.float32)
x, x.sum()
(tensor([0., 1., 2.]), tensor(3.))

Same call works for any rank — it folds across all axes by default:

A.shape, A.sum()
(torch.Size([2, 3]), tensor(15.))

Reducing along an axis

To collapse only one or some axes, pass axis=:

A.shape, A.sum(axis=0).shape
(torch.Size([2, 3]), torch.Size([3]))

A.shape, A.sum(axis=1).shape
(torch.Size([2, 3]), torch.Size([2]))

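axis=0 sums over the rows, leaving one value per column (shape [3]); axis=1 sums over the columns, leaving one value per row (shape [2]). Either way the reduced axis disappears and the rank drops by one.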
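The shapes alone do not show which numbers get added together. A short sketch printing the values, using the same 2 × 3 matrix A as the element-wise examples (rebuilt here so the snippet stands alone):

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
# A = [[0., 1., 2.],
#      [3., 4., 5.]]

col_sums = A.sum(axis=0)   # one value per column: [3., 5., 7.]
row_sums = A.sum(axis=1)   # one value per row:    [3., 12.]

print(col_sums, row_sums)
```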

Reducing all axes

A list of axes reduces over each:

A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()
tensor(True)

axis=[0,1] is identical to the default sum() for a rank-2 tensor.

Mean

\bar x = \frac{1}{n} \sum_i x_i. Either built-in mean() or sum() / numel():

A.mean(), A.sum() / A.numel()
(tensor(2.5000), tensor(2.5000))

And along a single axis:

A.mean(axis=0), A.sum(axis=0) / A.shape[0]
(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))

Non-reducing sum (keepdims)

Set keepdims=True to preserve the reduced axis (size 1) so broadcasting still works:

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape
(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

Now A / sum_A divides every row by its sum — common normalization:

A / sum_A
tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
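The following sketch shows why keepdims matters for this normalization: without it, the row sums have shape (2,), which does not broadcast against a (2, 3) matrix. Variable names other than A are illustrative.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

row_sums = A.sum(axis=1, keepdims=True)   # shape (2, 1)
P = A / row_sums                          # each row now sums to 1

# Dropping keepdims gives shape (2,), which cannot be aligned
# with the trailing axis of size 3.
try:
    A / A.sum(axis=1)
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

print(P.sum(axis=1), broadcast_failed)
```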

Cumulative sum

cumsum(axis=k) keeps the axis but reports a running total — useful for time-series and prefix sums:

A.cumsum(axis=0)
tensor([[0., 1., 2.],
        [3., 5., 7.]])
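For comparison, a sketch of the running total along the other axis. The last entry of each running total should equal the plain sum along that axis, which gives a quick sanity check.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

running = A.cumsum(axis=1)   # running total within each row
# [[0., 1.,  3.],
#  [3., 7., 12.]]

# The final column of the running total is the row sum.
matches_sum = torch.equal(running[:, -1], A.sum(axis=1))
print(running, matches_sum)
```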

Dot product

\mathbf{x}^\top \mathbf{y} = \sum_i x_i y_i — element-wise multiply, then sum:

y = torch.ones(3, dtype=torch.float32)
x, y, torch.dot(x, y)
(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

Equivalently, multiply element-wise and then sum:

torch.sum(x * y)
tensor(3.)
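To spell out the definition, the sketch below computes the same dot product with an explicit Python loop and with the @ operator, which also works on two 1-D tensors. The loop is for illustration only; the vectorized calls are what you would use in practice.

```python
import torch

x = torch.arange(3, dtype=torch.float32)
y = torch.ones(3, dtype=torch.float32)

# The definition, term by term: sum_i x_i * y_i
loop_total = sum(xi * yi for xi, yi in zip(x.tolist(), y.tolist()))

# The @ operator on two 1-D tensors is also a dot product.
via_operator = x @ y

print(loop_total, via_operator, torch.dot(x, y))
```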

Matrix products

\mathbf{A}\mathbf{x} is a length-m vector — one dot product per row of A. The most ubiquitous operation in deep learning: a fully-connected layer’s forward pass.

A.shape, x.shape, torch.mv(A, x), A@x
(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))
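As a sketch of the fully-connected forward pass mentioned above: a layer with 3 inputs and 2 outputs is a matrix-vector product plus a bias. The weights W and bias b are made-up values for illustration, not a trained layer.

```python
import torch

# Hypothetical layer: 3 input features, 2 output units.
W = torch.arange(6, dtype=torch.float32).reshape(2, 3)   # weights
b = torch.tensor([1.0, -1.0])                             # bias
x = torch.arange(3, dtype=torch.float32)                  # input

out = W @ x + b   # [5., 14.] + [1., -1.] = [6., 13.]

# The same result, one dot product per row of W.
by_rows = torch.stack([torch.dot(W[i], x) for i in range(W.shape[0])]) + b

print(out, by_rows)
```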

For \mathbf{A} with m rows and \mathbf{B} with p columns, \mathbf{AB} is p matrix-vector products (one per column of \mathbf{B}) stitched into an m × p matrix; equivalently, m \cdot p row-by-column dot products:

B = torch.ones(3, 4)
torch.mm(A, B), A@B
(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]))
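The "stitched together" view can be checked by computing one matrix-vector product per column of B and stacking the results. This is a sketch for intuition, not how you would compute the product in practice. B is filled with distinct values here, an illustrative choice so the columns differ.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.arange(12, dtype=torch.float32).reshape(3, 4)

# One matrix-vector product per column of B ...
columns = [A @ B[:, j] for j in range(B.shape[1])]

# ... stacked side by side into a 2 x 4 matrix.
stitched = torch.stack(columns, dim=1)

print(stitched)
print(torch.allclose(stitched, A @ B))
```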

Norms

Three norms recur throughout: \ell_2 (Euclidean length, the workhorse of optimization), \ell_1, and Frobenius.

\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}, \qquad \|\mathbf{x}\|_1 = \sum_i |x_i|, \qquad \|\mathbf{X}\|_\text{F} = \sqrt{\sum_{i,j} x_{ij}^2}.

The \ell_2 norm of a vector:

u = torch.tensor([3.0, -4.0])
torch.norm(u)
tensor(5.)

\ell_1 is less sensitive to outliers and promotes sparsity:

torch.abs(u).sum()
tensor(7.)

For matrices, Frobenius is the \ell_2 of the flattened matrix:

torch.norm(torch.ones((4, 9)))
tensor(6.)
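The same three norms can also be written with torch.linalg.norm, which takes the order as an explicit ord argument. A minimal sketch, also checking that Frobenius matches the \ell_2 norm of the flattened matrix:

```python
import torch

u = torch.tensor([3.0, -4.0])
l2 = torch.linalg.norm(u)            # sqrt(9 + 16) = 5
l1 = torch.linalg.norm(u, ord=1)     # |3| + |-4| = 7

M = torch.ones(4, 9)
fro = torch.linalg.norm(M, ord='fro')        # sqrt(36) = 6
flat_l2 = torch.linalg.norm(M.flatten())     # same value

print(l2, l1, fro, flat_l2)
```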

Recap

  • Scalars / vectors / matrices / tensors are ranks 0 / 1 / 2 / n.
  • Element-wise ops, scalar broadcasting, Hadamard product (*).
  • Reductions: sum, mean, with axis= and keepdims=.
  • Products: dot, mv, mm / @.
  • Norms: \ell_1, \ell_2, Frobenius.

Most deep-learning math compiles down to this short list.