Linear Algebra

Dive into Deep Learning · §1.3

Every model in this book compiles down to a short list of operations:
vectors · matrices · products · norms · eigenvalues.

Five ideas carry every later chapter

Motivation

Objects: scalars, vectors, matrices, tensors (ranks 0, 1, 2, n).
Arithmetic: element-wise, plus scalar broadcasting.
Reductions: sum and mean, along chosen axes.
Products: dot, matrix–vector, matrix–matrix.
Norms & eigenvalues: how big, and which directions survive.

Rank = number of axes; shape = size per axis.

The objects

scalars, vectors, matrices, tensors (and one flip)

A vector is numbers with a geometry

The objects

A scalar is a rank-0 tensor: one number. Stack n of them and you get a vector: a data record and an arrow in \mathbb{R}^n, with a length and a direction:

import torch
x = torch.arange(3)
x

tensor([0, 1, 2])

Both readings matter: a row of a dataset is a vector, and so is the direction a training step moves the weights.

`.shape` answers the first question about any tensor

The objects

len counts a vector’s elements:

len(x)

3

.shape works at every rank, one size per axis:

x.shape

torch.Size([3])

The transpose flips a matrix across its diagonal

The objects

A matrix is a rank-2 tensor, m rows \times n columns:

A = torch.arange(6).reshape(3, 2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

\mathbf{A}^\top swaps the roles of rows and columns:

A.T

tensor([[0, 2, 4],
        [1, 3, 5]])

Symmetric matrices are their own transpose

The objects

\mathbf{A} = \mathbf{A}^\top: the flip changes nothing, and code can check it in one line:

A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
(A == A.T).all()

tensor(True)

Covariance and Gram matrices are symmetric, a structure that many methods (and this deck’s finale) exploit.
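The Gram-matrix claim can be checked the same way. A minimal sketch, assuming an arbitrary 4×3 data matrix X (illustrative values, not from the text):

```python
import torch

# Arbitrary 4x3 data matrix (illustrative values, not from the text)
X = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# Gram matrix of the columns: entry (i, j) is column i dotted with column j
G = X.T @ X

# Symmetry: (X^T X)^T = X^T X
is_symmetric = torch.allclose(G, G.T)
print(G.shape, is_symmetric)
```

Entry (i, j) and entry (j, i) are the same dot product with the factors swapped, which is why the check passes for any X.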

Rank n is just n axes

The objects

Stack matrices and the naming continues. A batch of images is rank-4 (N×C×H×W by default in PyTorch, N×H×W×C in TensorFlow); the example here stacks two 3×4 matrices into a rank-3 tensor:

torch.arange(24).reshape(2, 3, 4)
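A short sketch of how rank and shape read off in code. The image batch is hypothetical (8 RGB images of 32×32 pixels), chosen only to illustrate the two layouts:

```python
import torch

X = torch.arange(24).reshape(2, 3, 4)
print(X.ndim, X.shape)  # number of axes, then size per axis

# Hypothetical batch: 8 RGB images of 32x32 pixels, channels-first (N, C, H, W)
batch = torch.zeros(8, 3, 32, 32)

# Reordering the axes gives the channels-last layout (N, H, W, C)
channels_last = batch.permute(0, 2, 3, 1)
print(batch.ndim, batch.shape, channels_last.shape)
```

.ndim is the rank; permute reorders axes without changing any values.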

Arithmetic & reduction

element-wise ops · sums that drop or keep axes

Same shapes combine element-wise

Arithmetic

Two tensors of one shape combine entry by entry:

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

The element-wise product is the Hadamard product \mathbf{A} \odot \mathbf{B}:

A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])
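The comment on clone() above deserves a quick check: plain assignment shares memory, clone() does not. A minimal sketch (the names alias and copy are illustrative):

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
alias = A          # another name for the same underlying memory
copy = A.clone()   # new memory holding the same values

alias[0, 0] = 100.0  # writes through to A
copy[0, 1] = -1.0    # leaves A untouched

print(A[0, 0].item(), A[0, 1].item(), copy[0, 1].item())
```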

A scalar broadcasts to every element

Arithmetic

Adding or multiplying by a scalar touches each entry and leaves the shape alone:

a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))

A reduction folds many numbers into one

Reduction

sum() with no arguments collapses everything to a scalar:

x = torch.arange(3, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))

mean() is the same fold, divided by the count:

A.mean(), A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))

`axis=` chooses which dimension disappears

Reduction

The output loses exactly the axis you name:

A.shape, A.sum(axis=0).shape

(torch.Size([2, 3]), torch.Size([3]))

A.shape, A.sum(axis=1).shape

(torch.Size([2, 3]), torch.Size([2]))

axis=0 sums down each column, leaving one total per column; axis=1 sums across each row, leaving one total per row.
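Looking at the values, not just the shapes, makes the rule concrete. A small sketch on the same 2×3 matrix; naming both axes at once is included as an illustration that it matches the argument-free sum():

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

col_sums = A.sum(axis=0)    # collapses the row axis: one total per column
row_sums = A.sum(axis=1)    # collapses the column axis: one total per row
total = A.sum(axis=[0, 1])  # collapsing both axes is the same as A.sum()

print(col_sums, row_sums, total)
```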

`keepdims`: reduce, but stay broadcastable

Reduction

keepdims=True keeps the folded axis at size 1:

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

…so one broadcast division normalizes every row to sum to 1:

A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
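Why keepdims matters, as a sketch: without it the row sums have shape (2,), which does not line up against a (2, 3) matrix under the broadcasting rules. The variable names are illustrative:

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

flat_sums = A.sum(axis=1)                 # shape (2,)
kept_sums = A.sum(axis=1, keepdims=True)  # shape (2, 1)

try:
    A / flat_sums  # (2, 3) vs (2,): trailing sizes 3 and 2 do not match
    broadcast_ok = True
except RuntimeError:
    broadcast_ok = False

rows = A / kept_sums  # (2, 3) vs (2, 1): the size-1 axis stretches across columns
print(broadcast_ok, rows.sum(axis=1))
```

The size-1 axis is what lets each row be divided by its own total.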

Products

one idea at three sizes: dot · matrix–vector · matrix–matrix

The dot product multiplies, then sums

Products

\mathbf{x}^\top\mathbf{y} = \sum_i x_i y_i: multiply matching entries, add them up:

y = torch.ones(3, dtype = torch.float32)
x, y, torch.dot(x, y)

(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

With nonnegative weights summing to 1, the dot product is a weighted average.
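A sketch of the weighted-average reading, with hypothetical weights that are nonnegative and sum to 1:

```python
import torch

values = torch.tensor([0.0, 1.0, 2.0])
weights = torch.tensor([0.2, 0.3, 0.5])  # hypothetical: nonnegative, sum to 1

weighted_avg = torch.dot(weights, values)  # 0.2*0 + 0.3*1 + 0.5*2
by_hand = (weights * values).sum()         # multiply matching entries, then add

print(weighted_avg, by_hand)
```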

Normalized, the dot product is cos θ

Products · geometry

\cos\theta = \frac{\mathbf{x}^\top\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}.

+1 aligned · 0 perpendicular · -1 opposed: the dot product is deep learning’s favorite similarity measure.
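The three landmark values can be reproduced with a small helper. cosine is a hypothetical function written for this sketch, and it assumes nonzero inputs:

```python
import torch

def cosine(x, y):
    """Cosine of the angle between two nonzero 1-D tensors (hypothetical helper)."""
    return torch.dot(x, y) / (torch.norm(x) * torch.norm(y))

e1 = torch.tensor([1.0, 0.0])
aligned = cosine(e1, torch.tensor([2.0, 0.0]))        # same direction
perpendicular = cosine(e1, torch.tensor([0.0, 3.0]))  # right angle
opposed = cosine(e1, torch.tensor([-1.0, 0.0]))       # opposite direction
print(aligned, perpendicular, opposed)
```

Lengths cancel in the ratio, so only direction affects the result.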

Cauchy–Schwarz keeps the ratio in [−1, 1]

Products · geometry

Why can \cos\theta never escape [-1, 1]? That is the Cauchy–Schwarz inequality |\mathbf{x}^\top\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\| (proved in the geometry-and-linear-algebraic-operations section). One random pair checks both facts at once:

u, v = torch.randn(8), torch.randn(8)
cos_theta = torch.dot(u, v) / (torch.norm(u) * torch.norm(v))
torch.acos(cos_theta), torch.abs(torch.dot(u, v)) <= torch.norm(u) * torch.norm(v)

(tensor(1.5027), tensor(True))

An angle in [0, \pi], and the inequality holds on every draw.
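"Every draw" can be tested over many draws. A sketch under two assumptions that are not in the text: a fixed seed for repeatability, and a small tolerance because float32 rounding can disturb the comparison. It also clamps the ratio before acos, since for parallel vectors the computed value can land a hair outside [-1, 1]:

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the run is repeatable

holds = True
for _ in range(1000):
    u, v = torch.randn(8), torch.randn(8)
    lhs = torch.abs(torch.dot(u, v))
    rhs = torch.norm(u) * torch.norm(v)
    # Small tolerance: rounding can nudge lhs above rhs for near-parallel vectors
    holds = holds and bool(lhs <= rhs + 1e-6)

# Parallel vectors: clamp the ratio so rounding cannot turn acos into nan
w = torch.randn(8)
cos_theta = torch.dot(w, 2 * w) / (torch.norm(w) * torch.norm(2 * w))
angle = torch.acos(cos_theta.clamp(-1.0, 1.0))
print(holds, angle)
```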

Ax takes one dot product per row

Products

(\mathbf{A}\mathbf{x})_i = \mathbf{a}^\top_i \mathbf{x}, so a 2\times3 matrix maps a length-3 vector to a length-2 vector:

A.shape, x.shape, torch.mv(A, x), A@x

(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))

Every fully connected layer is built on this product (plus a bias and a nonlinearity); much more on that later.
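A minimal sketch of such a layer, assuming a bias vector b and ReLU as the nonlinearity. The parameter values are hypothetical, and the comparison with torch.nn.functional.linear is included only as a cross-check:

```python
import torch
import torch.nn.functional as F

# Hypothetical layer parameters: 3 inputs -> 2 outputs
W = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
b = torch.tensor([-6.0, 1.0])
x = torch.tensor([0.0, 1.0, 2.0])

pre_activation = W @ x + b           # matrix-vector product plus bias
output = torch.relu(pre_activation)  # element-wise nonlinearity

# F.linear computes x @ W.T + b, which equals W @ x + b for a single vector
same = torch.allclose(pre_activation, F.linear(x, W, b))
print(pre_activation, output, same)
```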

Matrices move vectors: here, a 90° turn

Products

Multiplication by \mathbf{A} \in \mathbb{R}^{m\times n} is a linear map \mathbb{R}^n \to \mathbb{R}^m. The rotation matrix \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} turns the plane by \theta; at \theta = 90° it sends \mathbf{e}_1 \mapsto \mathbf{e}_2 and \mathbf{e}_2 \mapsto -\mathbf{e}_1:

R = torch.tensor([[0.0, -1.0], [1.0, 0.0]])  # Rotation by 90 degrees
R @ torch.tensor([1.0, 0.0]), R @ torch.tensor([0.0, 1.0])

(tensor([0., 1.]), tensor([-1.,  0.]))
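The same matrix for any angle, as a sketch. rotation is a hypothetical helper built from the formula above; the checks confirm that a rotation preserves length and that four quarter turns return to the start:

```python
import math
import torch

def rotation(theta):
    """2x2 counterclockwise rotation by theta radians (hypothetical helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]])

R = rotation(math.pi / 2)
v = torch.tensor([3.0, 4.0])

rotated = R @ v  # close to [-4, 3]
same_length = torch.allclose(torch.norm(rotated), torch.norm(v))
full_turn = R @ R @ R @ R  # four quarter turns
back_to_start = torch.allclose(full_turn, torch.eye(2), atol=1e-6)
print(rotated, same_length, back_to_start)
```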

AB stitches m×n dot products into a matrix

Products

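For \mathbf{A} \in \mathbb{R}^{m\times k} and \mathbf{B} \in \mathbb{R}^{k\times n}, entry c_{ij} of \mathbf{C} = \mathbf{A}\mathbf{B} is row i of \mathbf{A} dotted with column j of \mathbf{B} (here m = 2, k = 3, n = 4):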

B = torch.ones(3, 4)
torch.mm(A, B), A@B

(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]))
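The entry rule can be verified directly. A sketch that recomputes one entry as a dot product and rebuilds the whole product from matrix-vector products, one per column of B (the indices i, j are arbitrary):

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.ones(3, 4)
C = A @ B

# One entry: row i of A dotted with column j of B
i, j = 1, 2
entry = torch.dot(A[i, :], B[:, j])

# Whole product: one matrix-vector product per column of B, stacked side by side
columns = torch.stack([A @ B[:, k] for k in range(B.shape[1])], dim=1)
print(C.shape, entry, torch.allclose(columns, C))
```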

Norms & eigenvalues

how long is a vector, and which directions a matrix keeps

Two rulers: ℓ₂ walks straight, ℓ₁ walks the grid

Norms

For \mathbf{u} = [3, -4] the Euclidean ruler reads 5; the taxicab ruler reads 7:

u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

torch.abs(u).sum()

tensor(7.)

\|\mathbf{x}\|_2 = \sqrt{\textstyle\sum_i x_i^2}, \qquad \|\mathbf{x}\|_1 = \textstyle\sum_i |x_i|.
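Both rulers are also available through one function. A sketch using torch.linalg.norm with its ord argument, checked against the formulas above:

```python
import torch

u = torch.tensor([3.0, -4.0])

l2 = torch.linalg.norm(u)          # default for a vector: Euclidean norm
l1 = torch.linalg.norm(u, ord=1)   # sum of absolute values

l2_by_hand = torch.sqrt((u ** 2).sum())
l1_by_hand = torch.abs(u).sum()
print(l2, l1, l2_by_hand, l1_by_hand)
```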

The three norm axioms are checkable facts

Norms

Positivity \|\mathbf{x}\| \ge 0 (zero only for \mathbf{x} = \mathbf{0}), homogeneity \|\alpha\mathbf{x}\| = |\alpha|\,\|\mathbf{x}\|, and the triangle inequality \|\mathbf{x}+\mathbf{y}\| \le \|\mathbf{x}\|+\|\mathbf{y}\|; the code checks the last two on random vectors:

u, v, alpha = torch.randn(6), torch.randn(6), -2.5
print(torch.norm(alpha * u), abs(alpha) * torch.norm(u))
print(torch.norm(u + v) <= torch.norm(u) + torch.norm(v))

tensor(2.5362) tensor(2.5362)
tensor(True)

For \ell_2, the triangle inequality is Cauchy–Schwarz in disguise: expand \|\mathbf{u}+\mathbf{v}\|^2 and bound the cross term (the geometry-and-linear-algebraic-operations section).
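Spelled out for the \ell_2 norm, that expansion is a standard two-step argument (written here in the deck's notation, with Cauchy–Schwarz supplying the inequality):

\|\mathbf{u}+\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + 2\,\mathbf{u}^\top\mathbf{v} + \|\mathbf{v}\|^2 \le \|\mathbf{u}\|^2 + 2\,\|\mathbf{u}\|\,\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\| + \|\mathbf{v}\|)^2.

Taking square roots of both ends gives \|\mathbf{u}+\mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|.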

Frobenius measures a matrix as one long vector

Norms

\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i,j} x_{ij}^2} is the \ell_2 norm of the flattened matrix. For the all-ones 4\times9: \sqrt{36} = 6:

torch.norm(torch.ones((4, 9)))

tensor(6.)

The spectral norm (how much \mathbf{X} can stretch a vector) needs the singular value decomposition; it arrives in the SVD-and-low-rank-approximation section.
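The "one long vector" reading of the Frobenius norm can be checked on a matrix whose entries differ. A sketch with an arbitrary 2×3 example:

```python
import torch

X = torch.arange(6, dtype=torch.float32).reshape(2, 3)

frobenius = torch.linalg.norm(X, ord="fro")  # Frobenius norm of the matrix
flattened = torch.linalg.norm(X.flatten())   # l2 norm of the same six numbers
by_hand = torch.sqrt((X ** 2).sum())         # sqrt(0 + 1 + 4 + 9 + 16 + 25)
print(frobenius, flattened, by_hand)
```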

Eigenvectors: the directions a matrix does not turn

Eigenvalues

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}.

Along an eigenvector, the matrix acts like a scalar: stretch (|\lambda|>1), shrink (|\lambda|<1), flip (\lambda<0), but never turn.

The symmetric matrix from the symmetric-matrices slide (stored as floats here) has three real eigenvalues:

S = torch.tensor([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])
torch.linalg.eigvalsh(S)  # Eigenvalues of a symmetric matrix are real

tensor([-2.2628, -0.5985,  8.8612])

Keep 8.8612 in mind.
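The defining equation can be checked for all three pairs at once. A sketch using torch.linalg.eigh, which returns the eigenvectors as columns alongside the eigenvalues:

```python
import torch

S = torch.tensor([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])

# eigh is for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = torch.linalg.eigh(S)

# Column k of `eigenvectors` pairs with eigenvalues[k]: S v_k = lambda_k v_k
residual = S @ eigenvectors - eigenvectors * eigenvalues
print(eigenvalues, residual.abs().max())
```

The residual is zero up to floating-point rounding: along each column, multiplying by S only rescales.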

Ten multiplications of a random vector find 8.8612

Eigenvalues · payoff

Multiply a random vector by \mathbf{S} ten times and measure how much the norm grows per step:

v = torch.randn(3)
for _ in range(10):
    prev, v = v, S @ v
torch.norm(v) / torch.norm(prev)

tensor(8.8612)

The growth factor converges to \max_i |\lambda_i| = 8.8612: for almost every starting vector (any with a component along the dominant eigenvector), the largest-magnitude eigenvalue soon dominates.

Deep networks multiply by dozens of matrices in a row. Whether signals and gradients explode or vanish is this experiment at scale; the analysis returns in the numerical-stability section.
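The loop above lets the norm grow by a factor near 8.86 per step, which would overflow float32 after a few dozen steps. A sketch of the usual remedy, rescaling to unit length each step (power iteration); the seed and the step count of 50 are arbitrary choices:

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the run is repeatable
S = torch.tensor([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])

v = torch.randn(3)
v = v / torch.norm(v)
for _ in range(50):
    w = S @ v
    growth = torch.norm(w)  # per-step growth factor, since ||v|| = 1
    v = w / growth          # rescale so the vector never overflows

largest = torch.linalg.eigvalsh(S).abs().max()
print(growth, largest)
```

The growth factor is unchanged by the rescaling; only the bookkeeping differs.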

Recap

Wrap-up

Objects: ranks 0–n; the transpose; symmetry.
Element-wise ops + scalar broadcasting; Hadamard \odot.
Reductions: sum/mean with axis=; keepdims= stays broadcastable.
Dot product \sum_i x_i y_i; normalized it is \cos\theta (Cauchy–Schwarz).

\mathbf{A}\mathbf{x}: one dot product per row (the layer primitive); \mathbf{A}\mathbf{B}: m \times n of them.
Norms: \ell_2 = 5 and \ell_1 = 7 for [3,-4]; Frobenius for matrices; three axioms, all checkable.
Eigenvectors are scaled, never turned, and \max|\lambda| = 8.8612 dominated repeated multiplication.

Next, calculus (the calculus section): every gradient there is built from these products. The full linear-algebra story continues in the linear algebra part of the math appendix.
