Linear Algebra

Dive into Deep Learning · §1.3

Every model in this book compiles down to a short list of objects and operations:
vectors · matrices · products · norms · eigenvalues.

Five ideas carry every later chapter

Motivation

Objects: scalars, vectors, matrices, tensors (ranks 0, 1, 2, n).
Arithmetic: element-wise, plus scalar broadcasting.
Reductions: sum and mean, along chosen axes.
Products: dot, matrix–vector, matrix–matrix.
Norms & eigenvalues: how big, and which directions survive.

Rank = number of axes; shape = size per axis.

The objects

scalars, vectors, matrices, tensors (and one flip)

A vector is numbers with a geometry

The objects

A scalar is a rank-0 tensor: one number. Stack n of them and you get a vector: a data record and an arrow in \mathbb{R}^n, with a length and a direction:

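from mxnet import np, npx  # assumed setup: the float32 outputs match MXNet's NumPy interface
npx.set_np()
x = np.arange(3)
x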

array([0., 1., 2.])

Both readings matter: a row of a dataset is a vector, and so is the direction a training step moves the weights.

`.shape` answers the first question about any tensor

The objects

len counts a vector’s elements:

len(x)

3

.shape works at every rank, one size per axis:

x.shape

(3,)

The transpose flips a matrix across its diagonal

The objects

A matrix is a rank-2 tensor, m rows \times n columns:

A = np.arange(6).reshape(3, 2)
A

array([[0., 1.],
       [2., 3.],
       [4., 5.]])

\mathbf{A}^\top swaps the roles of rows and columns:

A.T

array([[0., 2., 4.],
       [1., 3., 5.]])

Symmetric matrices are their own transpose

The objects

\mathbf{A} = \mathbf{A}^\top: the flip changes nothing, and code can check it in one line:

A = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
(A == A.T).all()

array(True)

Covariance and Gram matrices are symmetric, a structure that many methods (and this deck’s finale) exploit.
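The symmetry of a Gram matrix can be checked the same way. This is a minimal sketch assuming standard NumPy (`import numpy as np`) rather than the deck's own array setup; the data matrix `X` and the seed are made up for illustration.

```python
import numpy as np

# Hypothetical data matrix: 5 records, 3 features (values are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Gram matrix of the columns: entry (i, j) is column i dotted with column j.
G = X.T @ X

# Compare with a tolerance, since floating-point round-off can differ by a few bits.
is_symmetric = np.allclose(G, G.T)
print(G.shape, is_symmetric)
```

Entry (i, j) and entry (j, i) are the same dot product taken in the opposite order, which is why the check passes for any `X`.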

Rank n is just n axes

The objects

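Stack matrices and the naming continues: a stack of matrices is rank-3, and a batch of images is rank-4 (N×C×H×W in PyTorch, N×H×W×C in TensorFlow). A rank-3 example with shape (2, 3, 4):

np.arange(24).reshape(2, 3, 4)

array([[[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]],

       [[12., 13., 14., 15.],
        [16., 17., 18., 19.],
        [20., 21., 22., 23.]]])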

Arithmetic & reduction

element-wise ops · sums that drop or keep axes

Same shapes combine element-wise

Arithmetic

Two tensors of one shape combine entry by entry:

A = np.arange(6).reshape(2, 3)
B = A.copy()  # Assign a copy of A to B by allocating new memory
A, A + B

(array([[0., 1., 2.],
        [3., 4., 5.]]),
 array([[ 0.,  2.,  4.],
        [ 6.,  8., 10.]]))

The element-wise product is the Hadamard product \mathbf{A} \odot \mathbf{B}:

A * B

array([[ 0.,  1.,  4.],
       [ 9., 16., 25.]])
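The Hadamard product is easy to confuse with the matrix product, so a side-by-side sketch may help. It assumes standard NumPy, where `*` is element-wise and `@` is the matrix product; integer dtypes here will print differently from the float outputs in the deck.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
B = A.copy()

# Element-wise: shapes must match, and the result keeps that shape.
hadamard = A * B                 # shape (2, 3)

# Matrix product: inner sizes must match, so B is transposed first.
matmul = A @ B.T                 # (2, 3) @ (3, 2) gives shape (2, 2)

print(hadamard.shape, matmul.shape)
```

`A @ B` itself would raise an error here, because a (2, 3) matrix cannot be multiplied by another (2, 3) matrix.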

A scalar broadcasts to every element

Arithmetic

Adding or multiplying by a scalar touches each entry and leaves the shape alone:

a = 2
X = np.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(array([[[ 2.,  3.,  4.,  5.],
         [ 6.,  7.,  8.,  9.],
         [10., 11., 12., 13.]],
 
        [[14., 15., 16., 17.],
         [18., 19., 20., 21.],
         [22., 23., 24., 25.]]]),
 (2, 3, 4))

A reduction folds many numbers into one

Reduction

sum() with no arguments collapses everything to a scalar:

x = np.arange(3)
x, x.sum()

(array([0., 1., 2.]), array(3.))

mean() is the same fold, divided by the count:

A.mean(), A.sum() / A.size

(array(2.5), array(2.5))

`axis=` chooses which dimension disappears

Reduction

The output loses exactly the axis you name:

A.shape, A.sum(axis=0).shape

((2, 3), (3,))

A.shape, A.sum(axis=1).shape

((2, 3), (2,))

axis=0 sums down each column (the row axis disappears); axis=1 sums across each row (the column axis disappears).
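Looking at the values, not only the shapes, makes the axis rule concrete. A small sketch, assuming standard NumPy and the same 2×3 matrix as above:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

col_sums = A.sum(axis=0)    # axis 0 (rows) disappears: one sum per column
row_sums = A.sum(axis=1)    # axis 1 (columns) disappears: one sum per row
total = A.sum(axis=(0, 1))  # naming both axes is the full reduction

print(col_sums, row_sums, total)
```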

`keepdims`: reduce, but stay broadcastable

Reduction

keepdims=True keeps the folded axis at size 1:

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(array([[ 3.],
        [12.]]),
 (2, 1))

…so one broadcast division normalizes every row to sum to 1:

A / sum_A

array([[0.        , 0.33333334, 0.6666667 ],
       [0.25      , 0.33333334, 0.41666666]])
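To see why the size-1 axis matters, the sketch below repeats the normalization and then tries the same division without `keepdims`. It assumes standard NumPy broadcasting rules; the variable names are illustrative.

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)

kept = A.sum(axis=1, keepdims=True)   # shape (2, 1): broadcasts against (2, 3)
dropped = A.sum(axis=1)               # shape (2,): trailing sizes 3 and 2 clash

normalized = A / kept
row_totals = normalized.sum(axis=1)   # each row should now sum to 1

try:
    A / dropped
    broadcast_failed = False
except ValueError:
    # NumPy refuses to broadcast (2, 3) against (2,).
    broadcast_failed = True

print(row_totals, broadcast_failed)
```

For a square matrix the division without `keepdims` would not fail; it would silently divide along the wrong axis, which is the more dangerous case.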

Products

one idea at three sizes: dot · matrix–vector · matrix–matrix

The dot product multiplies, then sums

Products

\mathbf{x}^\top\mathbf{y} = \sum_i x_i y_i: multiply matching entries, add them up:

y = np.ones(3)
x, y, np.dot(x, y)

(array([0., 1., 2.]), array([1., 1., 1.]), array(3.))

With nonnegative weights summing to 1, the dot product is a weighted average.
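A quick numerical illustration of the weighted-average reading, assuming standard NumPy; the values and weights are invented for the example.

```python
import numpy as np

values = np.array([2.0, 4.0, 10.0])     # hypothetical quantities to average
weights = np.array([0.5, 0.25, 0.25])   # nonnegative and summing to 1

# 0.5 * 2 + 0.25 * 4 + 0.25 * 10
weighted_avg = np.dot(weights, values)

# NumPy's own weighted average, for comparison.
reference = np.average(values, weights=weights)

print(weighted_avg, reference)
```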

Normalized, the dot product is cos θ

Products · geometry

\cos\theta = \frac{\mathbf{x}^\top\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}.

+1 aligned · 0 perpendicular · -1 opposed: the dot product is deep learning’s favorite similarity measure.
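The three landmark values can be reproduced with a small helper. This is a sketch assuming standard NumPy; `cosine` is a hypothetical function name and it expects nonzero vectors.

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between two nonzero vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0])

aligned = cosine(x, 3 * x)                         # same direction
perpendicular = cosine(x, np.array([-2.0, 1.0]))   # dot product is zero
opposed = cosine(x, -x)                            # opposite direction

print(aligned, perpendicular, opposed)
```

Scaling either vector by a positive number leaves the result unchanged, which is why the measure compares directions and ignores lengths.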

Cauchy–Schwarz keeps the ratio in [−1, 1]

Products · geometry

Why can \cos\theta never escape [-1, 1]? That is the Cauchy–Schwarz inequality |\mathbf{x}^\top\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\| (proved in the geometry-and-linear-algebraic-operations section). One random pair checks both facts at once:

u, v = np.random.normal(size=8), np.random.normal(size=8)
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
np.arccos(cos_theta), np.abs(np.dot(u, v)) <= np.linalg.norm(u) * np.linalg.norm(v)

(array(1.5436045), array(True))

An angle in [0, \pi], and the inequality holds on every draw.

Ax takes one dot product per row

Products

(\mathbf{A}\mathbf{x})_i = \mathbf{a}^\top_i \mathbf{x}, so a 2\times3 matrix maps a length-3 vector to a length-2 vector:

A.shape, x.shape, np.dot(A, x)

((2, 3), (3,), array([ 5., 14.]))

Every fully connected layer computes exactly this, plus a bias and a nonlinearity; much more on that later.
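The row-by-row reading of the matrix–vector product can be spelled out directly. A minimal sketch assuming standard NumPy, using the same A and x as above:

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)   # rows [0, 1, 2] and [3, 4, 5]
x = np.arange(3, dtype=float)                 # [0, 1, 2]

# One dot product per row of A.
by_rows = np.array([np.dot(row, x) for row in A])

# The built-in matrix-vector product gives the same two numbers.
direct = A @ x

print(by_rows, direct)
```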

Matrices move vectors: here, a 90° turn

Products

Multiplication by \mathbf{A} \in \mathbb{R}^{m\times n} is a linear map \mathbb{R}^n \to \mathbb{R}^m. The rotation matrix \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} turns the plane by \theta; at \theta = 90° it sends \mathbf{e}_1 \mapsto \mathbf{e}_2 and \mathbf{e}_2 \mapsto -\mathbf{e}_1:

R = np.array([[0.0, -1.0], [1.0, 0.0]])  # Rotation by 90 degrees
np.dot(R, np.array([1.0, 0.0])), np.dot(R, np.array([0.0, 1.0]))

(array([0., 1.]), array([-1.,  0.]))
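The same construction works for any angle. The sketch below, assuming standard NumPy, builds the rotation matrix from the formula above (the helper name `rotation` is hypothetical) and checks two properties: a rotation leaves lengths unchanged, and two 45° turns compose into the 90° turn.

```python
import numpy as np

def rotation(theta):
    """2x2 matrix that rotates the plane counterclockwise by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

R45 = rotation(np.pi / 4)

v = np.array([3.0, 4.0])   # length 5
w = R45 @ v                # rotated copy, still length 5

# Applying the 45-degree map twice is one matrix product.
R90 = R45 @ R45

print(np.linalg.norm(w), np.round(R90, 6))
```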

AB stitches m×n dot products into a matrix

Products

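For \mathbf{A} \in \mathbb{R}^{m\times k} and \mathbf{B} \in \mathbb{R}^{k\times n}, entry c_{ij} is row i of \mathbf{A} dotted with column j of \mathbf{B}, so \mathbf{C} = \mathbf{A}\mathbf{B} holds m \times n dot products. Here m = 2, k = 3, n = 4: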

B = np.ones(shape=(3, 4))
np.dot(A, B)

array([[ 3.,  3.,  3.,  3.],
       [12., 12., 12., 12.]])
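Written as explicit loops, the entry-by-entry definition looks like the sketch below. It assumes standard NumPy and is meant only to mirror the formula; in practice the built-in product is the one to use.

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.ones((3, 4))

m, n = A.shape[0], B.shape[1]
C = np.empty((m, n))
for i in range(m):
    for j in range(n):
        # Row i of A dotted with column j of B.
        C[i, j] = np.dot(A[i, :], B[:, j])

print(C)
```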

Norms & eigenvalues

how long is a vector, and which directions a matrix keeps

Two rulers: ℓ₂ walks straight, ℓ₁ walks the grid

Norms

For \mathbf{u} = [3, -4] the Euclidean ruler reads 5; the taxicab ruler reads 7:

u = np.array([3, -4])
np.linalg.norm(u)

array(5.)

np.abs(u).sum()

array(7.)

\|\mathbf{x}\|_2 = \sqrt{\textstyle\sum_i x_i^2}, \qquad \|\mathbf{x}\|_1 = \textstyle\sum_i |x_i|.
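Both formulas translate line for line into code. A sketch assuming standard NumPy, where `np.linalg.norm` accepts an `ord` argument to select the norm:

```python
import numpy as np

u = np.array([3.0, -4.0])

l2 = np.sqrt(np.sum(u ** 2))   # sqrt(9 + 16)
l1 = np.sum(np.abs(u))         # 3 + 4

l2_builtin = np.linalg.norm(u)          # for vectors the default is the l2 norm
l1_builtin = np.linalg.norm(u, ord=1)   # ord=1 selects the l1 norm

print(l2, l1)
```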

The three norm axioms are checkable facts

Norms

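A norm satisfies three axioms: nonnegativity \|\mathbf{x}\| \ge 0 (with equality only for \mathbf{x} = \mathbf{0}), homogeneity \|\alpha\mathbf{x}\| = |\alpha|\,\|\mathbf{x}\|, and the triangle inequality \|\mathbf{x}+\mathbf{y}\| \le \|\mathbf{x}\|+\|\mathbf{y}\|. The last two, checked on random vectors: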

u, v, alpha = np.random.normal(size=6), np.random.normal(size=6), -2.5
print(np.linalg.norm(alpha * u), abs(alpha) * np.linalg.norm(u))
print(np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v))

3.647475 3.6474748
True

For \ell_2, the triangle inequality is Cauchy–Schwarz in disguise: expand \|\mathbf{u}+\mathbf{v}\|^2 and bound the cross term (the geometry-and-linear-algebraic-operations section).
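The expansion mentioned above can be written out in one line of algebra. This is a sketch of the standard argument, with Cauchy–Schwarz used to bound the cross term \mathbf{u}^\top\mathbf{v}:

```latex
\begin{aligned}
\|\mathbf{u}+\mathbf{v}\|_2^2
  &= \|\mathbf{u}\|_2^2 + 2\,\mathbf{u}^\top\mathbf{v} + \|\mathbf{v}\|_2^2 \\
  &\le \|\mathbf{u}\|_2^2 + 2\,\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2 + \|\mathbf{v}\|_2^2
   = \bigl(\|\mathbf{u}\|_2 + \|\mathbf{v}\|_2\bigr)^2 .
\end{aligned}
```

Both sides are nonnegative, so taking square roots gives the triangle inequality.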

Frobenius measures a matrix as one long vector

Norms

\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i,j} x_{ij}^2} is the \ell_2 norm of the flattened matrix. For the all-ones 4\times9: \sqrt{36} = 6:

np.linalg.norm(np.ones((4, 9)))

array(6.)

The spectral norm (how much \mathbf{X} can stretch a vector) needs the singular value decomposition; it arrives in the SVD-and-low-rank-approximation section.
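The "one long vector" reading of the Frobenius norm can be checked by flattening the matrix first. A sketch assuming standard NumPy, where `np.linalg.norm` on a 2-D array defaults to the Frobenius norm:

```python
import numpy as np

X = np.ones((4, 9))

frobenius = np.linalg.norm(X)              # default for 2-D input is Frobenius
flattened_l2 = np.linalg.norm(X.ravel())   # l2 norm of the 36 entries as one vector

print(frobenius, flattened_l2)
```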

Eigenvectors: the directions a matrix does not turn

Eigenvalues

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}.

Along an eigenvector, the matrix acts like a scalar: stretch (|\lambda|>1), shrink (|\lambda|<1), flip (\lambda<0), but never turn.

The symmetric matrix from the symmetric-matrices slide has three real eigenvalues:

S = np.array([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])
np.linalg.eigvalsh(S)  # Eigenvalues of a symmetric matrix are real

array([-2.2627723, -0.5984747,  8.861247 ])

Keep 8.8612 in mind.
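The defining equation \mathbf{A}\mathbf{v} = \lambda\mathbf{v} can be verified for the top eigenpair. The sketch assumes standard NumPy and uses `np.linalg.eigh`, which returns eigenvectors as well as eigenvalues for a symmetric matrix.

```python
import numpy as np

S = np.array([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])

# eigh is for symmetric (Hermitian) matrices; eigenvalues come back in
# ascending order, and column k of `vecs` pairs with vals[k].
vals, vecs = np.linalg.eigh(S)

lam, v = vals[-1], vecs[:, -1]   # largest eigenvalue and its eigenvector

# If v is an eigenvector, S @ v and lam * v are the same vector.
residual = np.linalg.norm(S @ v - lam * v)

print(lam, residual)
```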

Ten multiplications of a random vector find 8.8612

Eigenvalues · payoff

Multiply a random vector by \mathbf{S} ten times and measure how much the norm grows per step:

v = np.random.normal(size=3)
for _ in range(10):
    prev, v = v, np.dot(S, v)
np.linalg.norm(v) / np.linalg.norm(prev)

array(8.861246)

The growth factor converges to \max_i |\lambda_i| = 8.8612: from almost any starting vector (any with a component along the top eigenvector), the eigenvalue of largest magnitude soon dominates.
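Without rescaling, the norm grows by a factor of about 8.86 per step and would eventually overflow, so a common variant of this experiment (power iteration) renormalizes the vector after every multiplication. The sketch below assumes standard NumPy; the seed and the step count of 50 are arbitrary choices.

```python
import numpy as np

S = np.array([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
v = rng.normal(size=3)

for _ in range(50):
    w = S @ v
    growth = np.linalg.norm(w) / np.linalg.norm(v)   # per-step growth factor
    v = w / np.linalg.norm(w)                        # rescale so the norm stays 1

print(growth)
```

After the loop, `v` points along the dominant eigenvector (up to sign) and `growth` approximates the largest eigenvalue magnitude.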

Deep networks multiply by dozens of matrices in a row. Whether signals and gradients explode or vanish is this experiment at scale; the analysis returns in the numerical-stability section.

Recap

Wrap-up

Objects: ranks 0–n; the transpose; symmetry.
Element-wise ops + scalar broadcasting; Hadamard \odot.
Reductions: sum/mean with axis=; keepdims= stays broadcastable.
Dot product \sum_i x_i y_i; normalized it is \cos\theta (Cauchy–Schwarz).

\mathbf{A}\mathbf{x}: one dot product per row (the layer primitive); \mathbf{A}\mathbf{B}: m \times n of them.
Norms: \ell_2 = 5 and \ell_1 = 7 for [3,-4]; Frobenius for matrices; three axioms, all checkable.
Eigenvectors are scaled, never turned, and \max|\lambda| = 8.8612 dominated repeated multiplication.

Next, calculus (the calculus section): every gradient there is built from these products. The full linear-algebra story continues in the linear algebra part of the math appendix.
