Linear Algebra

Dive into Deep Learning · §1.3

Every model in this book compiles down to a short list of objects and operations:
vectors · matrices · products · norms · eigenvalues.

Five ideas carry every later chapter

Motivation

Objects: scalars, vectors, matrices, tensors (ranks 0, 1, 2, n).
Arithmetic: element-wise, plus scalar broadcasting.
Reductions: sum and mean, along chosen axes.
Products: dot, matrix–vector, matrix–matrix.
Norms & eigenvalues: how big, and which directions survive.

Rank = number of axes; shape = size per axis.

The objects

scalars, vectors, matrices, tensors (and one flip)

A vector is numbers with a geometry

The objects

A scalar is a rank-0 tensor: one number. Stack n of them and you get a vector, which is both a data record and an arrow in \mathbb{R}^n with a length and a direction:

import jax
import jax.numpy as jnp

x = jnp.arange(3)
x

Array([0, 1, 2], dtype=int32)

Both readings matter: a row of a dataset is a vector, and so is the direction a training step moves the weights.

`.shape` answers the first question about any tensor

The objects

len counts a vector’s elements:

len(x)

3

.shape works at every rank, one size per axis:

x.shape

(3,)

The transpose flips a matrix across its diagonal

The objects

A matrix is a rank-2 tensor, m rows \times n columns:

A = jnp.arange(6).reshape(3, 2)
A

Array([[0, 1],
       [2, 3],
       [4, 5]], dtype=int32)

\mathbf{A}^\top swaps the roles of rows and columns:

A.T

Array([[0, 2, 4],
       [1, 3, 5]], dtype=int32)

Symmetric matrices are their own transpose

The objects

\mathbf{A} = \mathbf{A}^\top: the flip changes nothing, and code can check it in one line:

A = jnp.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
(A == A.T).all()

Array(True, dtype=bool)

Covariance and Gram matrices are symmetric, a structure that many methods (and this deck’s finale) exploit.
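The Gram-matrix claim can be checked the same way. This is a minimal sketch, assuming a made-up random data matrix `X` (the name, shape, and seed are illustrative, not from the deck): \mathbf{X}^\top\mathbf{X} equals its own transpose up to floating-point rounding.

```python
import jax
import jax.numpy as jnp

# Hypothetical data matrix: 5 records with 3 features each.
X = jax.random.normal(jax.random.key(0), (5, 3))

# Gram matrix of the columns: entry (i, j) is column i dotted with column j.
G = jnp.matmul(X.T, X)

# Compare with a tolerance, since float arithmetic may differ in the last bits.
is_symmetric = bool(jnp.allclose(G, G.T, atol=1e-5))
print(G.shape, is_symmetric)
```

Swapping i and j only swaps the two factors inside each dot product, which is why the symmetry holds for any `X`.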

Rank n is just n axes

The objects

Stack matrices and the naming continues. The array below is rank-3; a batch of images is rank-4 (N×C×H×W in PyTorch, N×H×W×C in TensorFlow):

jnp.arange(24).reshape(2, 3, 4)
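As a small sketch of the naming scheme, `.ndim` reports the rank and `.shape` the size per axis. The variable names here are illustrative only.

```python
import jax.numpy as jnp

scalar = jnp.array(3.0)                   # rank 0: no axes
vector = jnp.arange(3)                    # rank 1: one axis
matrix = jnp.arange(6).reshape(3, 2)      # rank 2: rows and columns
tensor = jnp.arange(24).reshape(2, 3, 4)  # rank 3: a stack of two 3x4 matrices

objects = (scalar, vector, matrix, tensor)
ranks = [t.ndim for t in objects]
shapes = [t.shape for t in objects]
print(ranks)   # [0, 1, 2, 3]
print(shapes)  # [(), (3,), (3, 2), (2, 3, 4)]
```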

Arithmetic & reduction

element-wise ops · sums that drop or keep axes

Same shapes combine element-wise

Arithmetic

Two tensors of one shape combine entry by entry:

A = jnp.arange(6, dtype=jnp.float32).reshape(2, 3)
B = A
A, A + B

(Array([[0., 1., 2.],
        [3., 4., 5.]], dtype=float32),
 Array([[ 0.,  2.,  4.],
        [ 6.,  8., 10.]], dtype=float32))

The element-wise product is the Hadamard product \mathbf{A} \odot \mathbf{B}:

A * B

Array([[ 0.,  1.,  4.],
       [ 9., 16., 25.]], dtype=float32)

A scalar broadcasts to every element

Arithmetic

Adding or multiplying by a scalar touches each entry and leaves the shape alone:

a = 2
X = jnp.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(Array([[[ 2,  3,  4,  5],
         [ 6,  7,  8,  9],
         [10, 11, 12, 13]],
 
        [[14, 15, 16, 17],
         [18, 19, 20, 21],
         [22, 23, 24, 25]]], dtype=int32),
 (2, 3, 4))

A reduction folds many numbers into one

Reduction

sum() with no arguments collapses everything to a scalar:

x = jnp.arange(3, dtype=jnp.float32)
x, x.sum()

(Array([0., 1., 2.], dtype=float32), Array(3., dtype=float32))

mean() is the same fold, divided by the count:

A.mean(), A.sum() / A.size

(Array(2.5, dtype=float32), Array(2.5, dtype=float32))

`axis=` chooses which dimension disappears

Reduction

The output loses exactly the axis you name:

A.shape, A.sum(axis=0).shape

((2, 3), (3,))

A.shape, A.sum(axis=1).shape

((2, 3), (2,))

axis=0 sums down each column (the row axis disappears); axis=1 sums across each row (the column axis disappears).
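The shapes above say which axis vanished; the values say what was added up. A minimal sketch on the same 2×3 matrix (rebuilt here so the block runs on its own):

```python
import jax.numpy as jnp

A = jnp.arange(6, dtype=jnp.float32).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

col_sums = A.sum(axis=0)     # one sum per column: [3., 5., 7.]
row_sums = A.sum(axis=1)     # one sum per row:    [3., 12.]
total = A.sum(axis=(0, 1))   # naming both axes is the same as A.sum(): 15.
print(col_sums, row_sums, total)
```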

`keepdims`: reduce, but stay broadcastable

Reduction

keepdims=True keeps the folded axis at size 1:

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(Array([[ 3.],
        [12.]], dtype=float32),
 (2, 1))

…so one broadcast division normalizes every row to sum to 1:

A / sum_A

Array([[0.        , 0.33333334, 0.6666667 ],
       [0.25      , 0.33333334, 0.4166667 ]], dtype=float32)
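A quick sanity check on that normalization, as a sketch. It also shows one equivalent route when `keepdims` was not used: re-inserting the size-1 axis by indexing with `None`.

```python
import jax.numpy as jnp

A = jnp.arange(6, dtype=jnp.float32).reshape(2, 3)

# keepdims=True leaves shape (2, 1), which broadcasts against (2, 3).
P = A / A.sum(axis=1, keepdims=True)
row_totals = P.sum(axis=1)  # each row now sums to 1 (up to rounding)

# Without keepdims the sums have shape (2,), which does not line up with
# the trailing axis of size 3, so the size-1 axis is restored explicitly.
flat_sums = A.sum(axis=1)
P_alt = A / flat_sums[:, None]

print(row_totals, bool(jnp.allclose(P, P_alt)))
```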

Products

one idea at three sizes: dot · matrix–vector · matrix–matrix

The dot product multiplies, then sums

Products

\mathbf{x}^\top\mathbf{y} = \sum_i x_i y_i: multiply matching entries, add them up:

y = jnp.ones(3, dtype = jnp.float32)
x, y, jnp.dot(x, y)

(Array([0., 1., 2.], dtype=float32),
 Array([1., 1., 1.], dtype=float32),
 Array(3., dtype=float32))

With nonnegative weights summing to 1, the dot product is a weighted average.
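A short sketch of the weighted-average reading, with made-up values and weights chosen for illustration:

```python
import jax.numpy as jnp

values = jnp.array([1.0, 2.0, 4.0])
weights = jnp.array([0.2, 0.3, 0.5])  # nonnegative, summing to 1

# 0.2*1 + 0.3*2 + 0.5*4 = 2.8
weighted_avg = jnp.dot(weights, values)

# Uniform weights 1/n recover the ordinary mean.
uniform = jnp.full(3, 1.0 / 3.0)
plain_mean = jnp.dot(uniform, values)
print(weighted_avg, plain_mean, values.mean())
```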

Normalized, the dot product is cos θ

Products · geometry

\cos\theta = \frac{\mathbf{x}^\top\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}.

+1 aligned · 0 perpendicular · -1 opposed: the dot product is deep learning’s favorite similarity measure.
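The three landmark values can be reproduced directly. This is a minimal sketch; `cosine_similarity` is a helper defined here for illustration, not a library function.

```python
import jax.numpy as jnp

def cosine_similarity(x, y):
    """Dot product divided by the product of the two Euclidean norms."""
    return jnp.dot(x, y) / (jnp.linalg.norm(x) * jnp.linalg.norm(y))

x = jnp.array([1.0, 2.0])
aligned = cosine_similarity(x, 3.0 * x)                       # same direction
perpendicular = cosine_similarity(x, jnp.array([-2.0, 1.0]))  # dot product is 0
opposed = cosine_similarity(x, -x)                            # reversed direction
print(aligned, perpendicular, opposed)
```

Scaling either vector by a positive constant leaves the result unchanged, which is what makes this a measure of direction rather than size.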

Cauchy–Schwarz keeps the ratio in [−1, 1]

Products · geometry

Why can \cos\theta never escape [-1, 1]? That is the Cauchy–Schwarz inequality |\mathbf{x}^\top\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\| (proved in the geometry-and-linear-algebraic-operations section). One random pair checks both facts at once:

u, v = jax.random.normal(jax.random.key(0), (2, 8))
cos_theta = jnp.dot(u, v) / (jnp.linalg.norm(u) * jnp.linalg.norm(v))
jnp.arccos(cos_theta), jnp.abs(jnp.dot(u, v)) <= jnp.linalg.norm(u) * jnp.linalg.norm(v)

(Array(1.7202967, dtype=float32), Array(True, dtype=bool))

An angle in [0, \pi], and the inequality holds on every draw.
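One draw is only one data point. As a sketch, the same check can be run on a batch of random pairs; the batch size of 1000 and the seed are arbitrary choices for this example.

```python
import jax
import jax.numpy as jnp

key_u, key_v = jax.random.split(jax.random.key(0))
U = jax.random.normal(key_u, (1000, 8))  # 1000 vectors in R^8, one per row
V = jax.random.normal(key_v, (1000, 8))

dots = jnp.sum(U * V, axis=1)  # row-wise dot products
bounds = jnp.linalg.norm(U, axis=1) * jnp.linalg.norm(V, axis=1)
cosines = dots / bounds

all_hold = bool((jnp.abs(dots) <= bounds).all())
print(all_hold, float(cosines.min()), float(cosines.max()))
```

A numerical check like this illustrates the inequality but does not prove it; the proof is the one the deck defers to the later section.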

Ax takes one dot product per row

Products

(\mathbf{A}\mathbf{x})_i = \mathbf{a}^\top_i \mathbf{x}, so a 2\times3 matrix maps a length-3 vector to a length-2 vector:

A.shape, x.shape, jnp.matmul(A, x)

((2, 3), (3,), Array([ 5., 14.], dtype=float32))

Every fully-connected layer computes exactly this (plus a bias and a nonlinearity); much more on that later.
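To make the layer connection concrete, here is a hedged sketch of one fully-connected layer. The function `dense`, the bias `b`, and the choice of ReLU are assumptions for illustration; the deck covers real layers in later chapters.

```python
import jax
import jax.numpy as jnp

def dense(W, b, x):
    """Illustrative layer: one dot product per row of W, plus bias, then ReLU."""
    return jax.nn.relu(jnp.matmul(W, x) + b)

W = jnp.arange(6, dtype=jnp.float32).reshape(2, 3)  # same values as A above
b = jnp.array([-6.0, 1.0])                          # made-up bias
x = jnp.arange(3, dtype=jnp.float32)

# Wx = [5, 14]; adding b gives [-1, 15]; ReLU clips the negative entry to 0.
h = dense(W, b, x)
print(h)
```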

Matrices move vectors: here, a 90° turn

Products

Multiplication by \mathbf{A} \in \mathbb{R}^{m\times n} is a linear map \mathbb{R}^n \to \mathbb{R}^m. The rotation matrix \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} turns the plane by \theta; at \theta = 90° it sends \mathbf{e}_1 \mapsto \mathbf{e}_2 and \mathbf{e}_2 \mapsto -\mathbf{e}_1:

R = jnp.array([[0.0, -1.0], [1.0, 0.0]])  # Rotation by 90 degrees
jnp.matmul(R, jnp.array([1.0, 0.0])), jnp.matmul(R, jnp.array([0.0, 1.0]))

(Array([0., 1.], dtype=float32), Array([-1.,  0.], dtype=float32))
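The 90° matrix is one instance of the general formula. A minimal sketch, with `rotation` as an illustrative helper, builds the matrix for any angle and checks that a rotation leaves lengths unchanged:

```python
import jax.numpy as jnp

def rotation(theta):
    """2x2 matrix that turns the plane counterclockwise by theta radians."""
    c, s = jnp.cos(theta), jnp.sin(theta)
    return jnp.array([[c, -s], [s, c]])

e1 = jnp.array([1.0, 0.0])
# In float32, cos(pi/2) is tiny but not exactly 0, so this is e2 only approximately.
rotated = jnp.matmul(rotation(jnp.pi / 2), e1)

v = jnp.array([3.0, -4.0])
turned = jnp.matmul(rotation(0.7), v)  # arbitrary angle
print(rotated, jnp.linalg.norm(v), jnp.linalg.norm(turned))
```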

AB stitches m×n dot products into a matrix

Products

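With \mathbf{A} \in \mathbb{R}^{m\times k} and \mathbf{B} \in \mathbb{R}^{k\times n}, entry c_{ij} of \mathbf{C} = \mathbf{A}\mathbf{B} is row i of \mathbf{A} dotted with column j of \mathbf{B}, for m \times n dot products in all: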

B = jnp.ones((3, 4))
jnp.matmul(A, B)

Array([[ 3.,  3.,  3.,  3.],
       [12., 12., 12., 12.]], dtype=float32)
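The entry-by-entry definition can be written out literally and compared with `jnp.matmul`. This sketch is for reading, not speed; the explicit Python loops are far slower than the built-in product.

```python
import jax.numpy as jnp

A = jnp.arange(6, dtype=jnp.float32).reshape(2, 3)
B = jnp.ones((3, 4))

# Entry (i, j): row i of A dotted with column j of B.
C_manual = jnp.array([
    [jnp.dot(A[i, :], B[:, j]) for j in range(B.shape[1])]
    for i in range(A.shape[0])
])

same = bool(jnp.allclose(C_manual, jnp.matmul(A, B)))
print(C_manual.shape, same)
```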

Norms & eigenvalues

how long is a vector, and which directions a matrix keeps

Two rulers: ℓ₂ walks straight, ℓ₁ walks the grid

Norms

For \mathbf{u} = [3, -4] the Euclidean ruler reads 5; the taxicab ruler reads 7:

u = jnp.array([3.0, -4.0])
jnp.linalg.norm(u)

Array(5., dtype=float32)

jnp.linalg.norm(u, ord=1) # same as jnp.abs(u).sum()

Array(7., dtype=float32)

\|\mathbf{x}\|_2 = \sqrt{\textstyle\sum_i x_i^2}, \qquad \|\mathbf{x}\|_1 = \textstyle\sum_i |x_i|.
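Both formulas are short enough to compute by hand in code. A minimal sketch that evaluates them from the definitions and compares with `jnp.linalg.norm`:

```python
import jax.numpy as jnp

u = jnp.array([3.0, -4.0])

l2 = jnp.sqrt(jnp.sum(u ** 2))  # sqrt(9 + 16) = 5
l1 = jnp.sum(jnp.abs(u))        # 3 + 4 = 7

print(l2, jnp.linalg.norm(u))
print(l1, jnp.linalg.norm(u, ord=1))
```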

The three norm axioms are checkable facts

Norms

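A norm satisfies three axioms: homogeneity \|\alpha\mathbf{x}\| = |\alpha|\,\|\mathbf{x}\|, the triangle inequality \|\mathbf{x}+\mathbf{y}\| \le \|\mathbf{x}\|+\|\mathbf{y}\|, and nonnegativity \|\mathbf{x}\| \ge 0 with equality only at \mathbf{x} = \mathbf{0}. The first two, checked on random vectors: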

u, v = jax.random.normal(jax.random.key(1), (2, 6))
alpha = -2.5
print(jnp.linalg.norm(alpha * u), abs(alpha) * jnp.linalg.norm(u))
print(jnp.linalg.norm(u + v) <= jnp.linalg.norm(u) + jnp.linalg.norm(v))

3.259593 3.259593
True

For \ell_2, the triangle inequality is Cauchy–Schwarz in disguise: expand \|\mathbf{u}+\mathbf{v}\|^2 and bound the cross term (the geometry-and-linear-algebraic-operations section).
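Written out, the expansion takes three lines. This is the standard argument, sketched here for the \ell_2 norm only; Cauchy–Schwarz is used in the second line to bound the cross term.

```latex
\begin{aligned}
\|\mathbf{u}+\mathbf{v}\|_2^2
  &= \|\mathbf{u}\|_2^2 + 2\,\mathbf{u}^\top\mathbf{v} + \|\mathbf{v}\|_2^2 \\
  &\le \|\mathbf{u}\|_2^2 + 2\,\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2 + \|\mathbf{v}\|_2^2 \\
  &= \bigl(\|\mathbf{u}\|_2 + \|\mathbf{v}\|_2\bigr)^2 .
\end{aligned}
```

Taking square roots of both ends gives \|\mathbf{u}+\mathbf{v}\|_2 \le \|\mathbf{u}\|_2 + \|\mathbf{v}\|_2.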

Frobenius measures a matrix as one long vector

Norms

\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i,j} x_{ij}^2} is the \ell_2 norm of the flattened matrix. For the all-ones 4\times9: \sqrt{36} = 6:

jnp.linalg.norm(jnp.ones((4, 9)))

Array(6., dtype=float32)
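The "one long vector" reading can be checked on a matrix that is not all ones. A minimal sketch with an arbitrary random matrix (shape and seed are illustrative):

```python
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.key(0), (4, 9))

fro = jnp.linalg.norm(X)                   # default for a 2-D input: Frobenius
flat_l2 = jnp.linalg.norm(X.ravel())       # l2 norm of the 36 entries in a row
by_hand = jnp.sqrt(jnp.sum(X ** 2))        # the definition itself

print(fro, flat_l2, by_hand)
```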

The spectral norm (how much \mathbf{X} can stretch a vector) needs the singular value decomposition; it arrives in the SVD-and-low-rank-approximation section.

Eigenvectors: the directions a matrix does not turn

Eigenvalues

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}.

Along an eigenvector, the matrix acts like a scalar: stretch (|\lambda|>1), shrink (|\lambda|<1), flip (\lambda<0), but never turn.

The symmetric matrix from the earlier symmetry check has three real eigenvalues:

S = jnp.array([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])
jnp.linalg.eigvalsh(S)  # Eigenvalues of a symmetric matrix are real

Array([-2.262772, -0.598474,  8.861247], dtype=float32)

Keep 8.8612 in mind.
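The defining equation can be verified for the top eigenvalue. This sketch uses `jnp.linalg.eigh`, which for a symmetric matrix returns eigenvalues in ascending order together with eigenvectors as columns:

```python
import jax.numpy as jnp

S = jnp.array([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])

eigvals, eigvecs = jnp.linalg.eigh(S)
lam = eigvals[-1]    # largest eigenvalue, about 8.8612
v = eigvecs[:, -1]   # its eigenvector (unit length)

# S v and lam * v should agree up to float32 rounding.
residual = jnp.linalg.norm(jnp.matmul(S, v) - lam * v)
print(lam, residual)
```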

Ten multiplications of a random vector find 8.8612

Eigenvalues · payoff

Multiply a random vector by \mathbf{S} ten times and measure how much the norm grows per step:

v = jax.random.normal(jax.random.key(2), (3,))
for _ in range(10):
    prev, v = v, jnp.matmul(S, v)
jnp.linalg.norm(v) / jnp.linalg.norm(prev)

Array(8.861246, dtype=float32)

The growth factor converges to \max_i |\lambda_i| = 8.8612: from almost any starting vector (any with a nonzero component along the top eigenvector), the largest eigenvalue soon dominates.
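The experiment above is the core of power iteration. A hedged sketch of a slightly more careful variant follows: it renormalizes at every step so the vector cannot overflow, and reads the eigenvalue off the Rayleigh quotient. The function name and the step count of 50 are choices made for this example.

```python
import jax
import jax.numpy as jnp

def power_iteration(M, v, num_steps=50):
    """Repeatedly apply M and renormalize; return (eigenvalue estimate, direction)."""
    v = v / jnp.linalg.norm(v)
    for _ in range(num_steps):
        w = jnp.matmul(M, v)
        v = w / jnp.linalg.norm(w)  # keep unit length so values stay bounded
    # Rayleigh quotient v^T M v for a unit vector v.
    return jnp.dot(v, jnp.matmul(M, v)), v

S = jnp.array([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])
v0 = jax.random.normal(jax.random.key(2), (3,))
lam, v = power_iteration(S, v0)
print(lam)
```

The error shrinks by roughly the ratio of the second-largest to the largest eigenvalue magnitude at each step, which for this matrix is about 2.26 / 8.86, so a few dozen steps are ample.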

Deep networks multiply by dozens of matrices in a row. Whether signals and gradients explode or vanish is this experiment at scale; the analysis returns in the numerical-stability section.

Recap

Wrap-up

Objects: ranks 0–n; the transpose; symmetry.
Element-wise ops + scalar broadcasting; Hadamard \odot.
Reductions: sum/mean with axis=; keepdims= stays broadcastable.
Dot product \sum_i x_i y_i; normalized it is \cos\theta (Cauchy–Schwarz).

\mathbf{A}\mathbf{x}: one dot product per row (the layer primitive); \mathbf{A}\mathbf{B}: m \times n of them.
Norms: \ell_2 = 5 and \ell_1 = 7 for [3,-4]; Frobenius for matrices; three axioms, all checkable.
Eigenvectors are scaled, never turned, and \max|\lambda| = 8.8612 dominated repeated multiplication.

Next, calculus (the calculus section): every gradient there is built from these products. The full linear-algebra story continues in the linear algebra part of the math appendix.
