Linear Algebra

Dive into Deep Learning · §1.3

Every model in this book compiles down to a short list of operations:
vectors · matrices · products · norms · eigenvalues.

Five ideas carry every later chapter

Motivation

Objects: scalars, vectors, matrices, tensors (ranks 0, 1, 2, n).
Arithmetic: element-wise, plus scalar broadcasting.
Reductions: sum and mean, along chosen axes.
Products: dot, matrix–vector, matrix–matrix.
Norms & eigenvalues: how big, and which directions survive.

Rank = number of axes (the tensor’s order, not the matrix rank of linear algebra); shape = size per axis.

The objects

scalars, vectors, matrices, tensors (and one flip)

A vector is numbers with a geometry

The objects

A scalar is a rank-0 tensor: one number. Stack n of them and you get a vector: a data record and an arrow in \mathbb{R}^n, with a length and a direction:

x = tf.range(3)
x

<tf.Tensor: shape=(3,), dtype=int32, numpy=array([0, 1, 2], dtype=int32)>

Both readings matter: a row of a dataset is a vector, and so is the direction a training step moves the weights.

`.shape` answers the first question about any tensor

The objects

len counts a vector’s elements:

len(x)

3

.shape works at every rank, one size per axis:

x.shape

TensorShape([3])

The transpose flips a matrix across its diagonal

The objects

A matrix is a rank-2 tensor, m rows \times n columns:

A = tf.reshape(tf.range(6), (3, 2))
A

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[0, 1],
       [2, 3],
       [4, 5]], dtype=int32)>

\mathbf{A}^\top swaps the roles of rows and columns:

tf.transpose(A)

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[0, 2, 4],
       [1, 3, 5]], dtype=int32)>

Symmetric matrices are their own transpose

The objects

\mathbf{A} = \mathbf{A}^\top: the flip changes nothing, and code can check it in one line:

A = tf.constant([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
tf.reduce_all(A == tf.transpose(A))

<tf.Tensor: shape=(), dtype=bool, numpy=True>

Covariance and Gram matrices are symmetric, a structure that many methods (and this deck’s finale) exploit.
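To see what the one-line check is doing, here is a minimal sketch in plain Python rather than TensorFlow. The helpers `transpose` and `is_symmetric` are hypothetical names introduced for illustration. The last lines build a small Gram matrix \mathbf{A}^\top\mathbf{A} to show why such matrices come out symmetric:

```python
def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def is_symmetric(M):
    """A matrix is symmetric when it equals its own transpose."""
    return M == transpose(M)

A = [[0, 1], [2, 3], [4, 5]]             # 3 x 2
S = [[1, 2, 3], [2, 0, 4], [3, 4, 5]]    # 3 x 3, symmetric

print(transpose(A))     # [[0, 2, 4], [1, 3, 5]]
print(is_symmetric(S))  # True
print(is_symmetric(A))  # False: a 3 x 2 matrix cannot equal its 2 x 3 transpose

# Gram matrix: entry (i, j) is column i of A dotted with column j of A.
cols = transpose(A)
gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
print(gram)                # [[20, 26], [26, 35]]
print(is_symmetric(gram))  # True, because a dot product ignores argument order
```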

Rank n is just n axes

The objects

Stack matrices and the naming continues: a rank-n tensor has n axes. A batch of images is rank-4 (N×C×H×W in PyTorch, N×H×W×C in TF); the example below is rank-3, two 3×4 matrices stacked:

tf.reshape(tf.range(24), (2, 3, 4))
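As a rough illustration of "rank is the number of axes", the sketch below rebuilds the same 2×3×4 block as nested Python lists and reads off its shape. The helper `shape` is a hypothetical name, and it assumes regular, non-empty nesting:

```python
def shape(nested):
    """Return the size along each axis of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]  # descend one axis
    return tuple(dims)

# The integers 0..23 laid out as 2 blocks of 3 rows of 4 entries.
X = [[[12 * i + 4 * j + k for k in range(4)] for j in range(3)] for i in range(2)]

print(shape(X))       # (2, 3, 4): one size per axis
print(len(shape(X)))  # 3: the rank
print(X[1][2][3])     # 23: the last entry, one index per axis
```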

Arithmetic & reduction

element-wise ops · sums that drop or keep axes

Same shapes combine element-wise

Arithmetic

Two tensors of one shape combine entry by entry:

A = tf.reshape(tf.range(6, dtype=tf.float32), (2, 3))
B = A  # B refers to the same tensor as A; no new memory is allocated
A, A + B

(<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
 array([[0., 1., 2.],
        [3., 4., 5.]], dtype=float32)>,
 <tf.Tensor: shape=(2, 3), dtype=float32, numpy=
 array([[ 0.,  2.,  4.],
        [ 6.,  8., 10.]], dtype=float32)>)

The element-wise product is the Hadamard product \mathbf{A} \odot \mathbf{B}:

A * B

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ 0.,  1.,  4.],
       [ 9., 16., 25.]], dtype=float32)>

A scalar broadcasts to every element

Arithmetic

Adding or multiplying by a scalar touches each entry and leaves the shape alone:

a = 2
X = tf.reshape(tf.range(24), (2, 3, 4))
a + X, (a * X).shape

(<tf.Tensor: shape=(2, 3, 4), dtype=int32, numpy=
 array([[[ 2,  3,  4,  5],
         [ 6,  7,  8,  9],
         [10, 11, 12, 13]],
 
        [[14, 15, 16, 17],
         [18, 19, 20, 21],
         [22, 23, 24, 25]]], dtype=int32)>,
 TensorShape([2, 3, 4]))

A reduction folds many numbers into one

Reduction

tf.reduce_sum with no axis argument collapses everything to a scalar:

x = tf.range(3, dtype=tf.float32)
x, tf.reduce_sum(x)

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([0., 1., 2.], dtype=float32)>,
 <tf.Tensor: shape=(), dtype=float32, numpy=3.0>)

tf.reduce_mean is the same fold, divided by the number of elements:

tf.reduce_mean(A), tf.reduce_sum(A) / tf.size(A).numpy()

(<tf.Tensor: shape=(), dtype=float32, numpy=2.5>,
 <tf.Tensor: shape=(), dtype=float32, numpy=2.5>)

`axis=` chooses which dimension disappears

Reduction

The output loses exactly the axis you name:

A.shape, tf.reduce_sum(A, axis=0).shape

(TensorShape([2, 3]), TensorShape([3]))

A.shape, tf.reduce_sum(A, axis=1).shape

(TensorShape([2, 3]), TensorShape([2]))

axis=0 collapses the rows, leaving one sum per column; axis=1 collapses the columns, leaving one sum per row.
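The same two reductions can be spelled out by hand. This is a plain-Python sketch on the 2×3 matrix used above, not the TensorFlow implementation; the variable names are illustrative:

```python
A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]

# axis=0: the row axis disappears, leaving one sum per column (length 3).
sum_axis0 = [sum(col) for col in zip(*A)]

# axis=1: the column axis disappears, leaving one sum per row (length 2).
sum_axis1 = [sum(row) for row in A]

print(sum_axis0)  # [3.0, 5.0, 7.0]
print(sum_axis1)  # [3.0, 12.0]
```

Either way the grand total is the same 15.0; only the order of folding differs.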

`keepdims`: reduce, but stay broadcastable

Reduction

keepdims=True keeps the folded axis at size 1:

sum_A = tf.reduce_sum(A, axis=1, keepdims=True)
sum_A, sum_A.shape

(<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
 array([[ 3.],
        [12.]], dtype=float32)>,
 TensorShape([2, 1]))

…so one broadcast division normalizes every row to sum to 1:

A / sum_A

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.        , 0.33333334, 0.6666667 ],
       [0.25      , 0.33333334, 0.41666666]], dtype=float32)>
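A hand-rolled version of the same normalization, as a sketch in plain Python. Keeping each row sum wrapped in its own one-element list mimics the (2, 1) shape that keepdims=True produces; the names `row_sums` and `normalized` are illustrative:

```python
A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]

# One sum per row, kept as a column: shape (2, 1), like keepdims=True.
row_sums = [[sum(row)] for row in A]

# Divide every entry by the sum of its own row.
normalized = [[a / s[0] for a in row] for row, s in zip(A, row_sums)]

print(row_sums)                        # [[3.0], [12.0]]
print([sum(r) for r in normalized])    # each row now sums to 1 (up to rounding)
```

Without the kept axis the sums would have shape (2,), which does not line up against (2, 3) under the usual broadcasting rules, since broadcasting matches trailing axes.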

Products

one idea at three sizes: dot · matrix–vector · matrix–matrix

The dot product multiplies, then sums

Products

\mathbf{x}^\top\mathbf{y} = \sum_i x_i y_i: multiply matching entries, add them up:

y = tf.ones(3, dtype=tf.float32)
x, y, tf.tensordot(x, y, axes=1)

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([0., 1., 2.], dtype=float32)>,
 <tf.Tensor: shape=(3,), dtype=float32, numpy=array([1., 1., 1.], dtype=float32)>,
 <tf.Tensor: shape=(), dtype=float32, numpy=3.0>)

With nonnegative weights summing to 1, the dot product is a weighted average.
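A small sketch of the weighted-average reading, in plain Python. The helper `dot` and the particular weights are assumptions chosen for illustration; any nonnegative weights that sum to 1 would do:

```python
def dot(x, y):
    """Multiply matching entries, then add them up."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))

values = [0.0, 1.0, 2.0]
ones = [1.0, 1.0, 1.0]
weights = [0.5, 0.25, 0.25]   # hypothetical weights: nonnegative, sum to 1

print(dot(values, ones))      # 3.0: a plain sum
print(dot(values, weights))   # 0.75: a weighted average of the values
```

Because the weights sum to 1, the result always lies between the smallest and largest value.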

Normalized, the dot product is cos θ

Products · geometry

\cos\theta = \frac{\mathbf{x}^\top\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}.

+1 aligned · 0 perpendicular · -1 opposed: this normalized dot product, the cosine similarity, is deep learning’s favorite similarity measure.
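The three landmark values can be reproduced with a few lines of plain Python. The function name `cosine` is hypothetical, and the zero-vector guard is an added assumption, since the formula divides by both norms:

```python
import math

def cosine(x, y):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine is undefined for the zero vector")
    return dot / (norm_x * norm_y)

print(cosine([1.0, 0.0], [2.0, 0.0]))    # 1.0: same direction, length ignored
print(cosine([1.0, 0.0], [0.0, 3.0]))    # 0.0: perpendicular
print(cosine([1.0, 0.0], [-4.0, 0.0]))   # -1.0: opposite direction
```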

Cauchy–Schwarz keeps the ratio in [−1, 1]

Products · geometry

Why can \cos\theta never escape [-1, 1]? That is the Cauchy–Schwarz inequality |\mathbf{x}^\top\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\| (proved in the geometry-and-linear-algebraic-operations section). One random pair checks both facts at once:

u, v = tf.random.normal((8,)), tf.random.normal((8,))
cos_theta = tf.tensordot(u, v, axes=1) / (tf.norm(u) * tf.norm(v))
tf.acos(cos_theta), tf.abs(tf.tensordot(u, v, axes=1)) <= tf.norm(u) * tf.norm(v)

(<tf.Tensor: shape=(), dtype=float32, numpy=1.7644957304000854>,
 <tf.Tensor: shape=(), dtype=bool, numpy=True>)

An angle in [0, \pi], and the inequality holds on every draw.
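One draw is anecdotal, so the sketch below repeats the check over many random pairs using only the standard library. The seed, the number of draws, and the tiny rounding tolerance are all arbitrary choices made for this illustration; a numerical check like this is evidence, not a proof:

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

violations = 0
for _ in range(1000):
    u = [random.gauss(0.0, 1.0) for _ in range(8)]
    v = [random.gauss(0.0, 1.0) for _ in range(8)]
    # The small tolerance absorbs floating-point rounding only.
    if abs(dot(u, v)) > norm(u) * norm(v) + 1e-12:
        violations += 1

print(violations)  # 0
```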

Ax takes one dot product per row

Products

(\mathbf{A}\mathbf{x})_i = \mathbf{a}^\top_i \mathbf{x}, so a 2\times3 matrix maps a length-3 vector to a length-2 vector:

A.shape, x.shape, tf.linalg.matvec(A, x)

(TensorShape([2, 3]),
 TensorShape([3]),
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([ 5., 14.], dtype=float32)>)

Every fully-connected layer computes exactly this (plus a bias and a nonlinearity); much more on that later.
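To make the layer connection concrete, here is a rough sketch of one fully-connected layer in plain Python. The bias values and the choice of ReLU as the nonlinearity are assumptions made for illustration; `matvec` and `relu` are hypothetical helper names:

```python
def matvec(A, x):
    """One dot product per row of A."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu(z):
    """Element-wise max(0, z), one common nonlinearity."""
    return [max(0.0, v) for v in z]

A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]    # weights: 2 outputs, 3 inputs
x = [0.0, 1.0, 2.0]      # input vector
b = [-10.0, 1.0]         # hypothetical bias, one entry per output

pre_activation = [s + bi for s, bi in zip(matvec(A, x), b)]

print(matvec(A, x))          # [5.0, 14.0], matching the output above
print(relu(pre_activation))  # [0.0, 15.0]
```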

Matrices move vectors: here, a 90° turn

Products

Multiplication by \mathbf{A} \in \mathbb{R}^{m\times n} is a linear map \mathbb{R}^n \to \mathbb{R}^m. The rotation matrix \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} turns the plane by \theta; at \theta = 90° it sends \mathbf{e}_1 \mapsto \mathbf{e}_2 and \mathbf{e}_2 \mapsto -\mathbf{e}_1:

R = tf.constant([[0.0, -1.0], [1.0, 0.0]])  # Rotation by 90 degrees
tf.linalg.matvec(R, tf.constant([1.0, 0.0])), tf.linalg.matvec(R, tf.constant([0.0, 1.0]))

(<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0., 1.], dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([-1.,  0.], dtype=float32)>)
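For angles other than 90°, the matrix can be built from the formula directly. The sketch below uses plain Python; `rotation` and `matvec` are hypothetical helpers, and results are compared approximately because cos(π/2) is not exactly zero in floating point:

```python
import math

def rotation(theta):
    """2 x 2 matrix that turns the plane counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

R = rotation(math.pi / 2)
e1, e2 = [1.0, 0.0], [0.0, 1.0]
print([round(c, 6) for c in matvec(R, e1)])  # approximately e2
print([round(c, 6) for c in matvec(R, e2)])  # approximately -e1

# A rotation changes direction but not length.
v = [3.0, -4.0]
turned = matvec(rotation(0.7), v)
print(math.hypot(*v), math.hypot(*turned))   # both approximately 5
```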

AB stitches m×n dot products into a matrix

Products

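For \mathbf{A} \in \mathbb{R}^{m\times k} and \mathbf{B} \in \mathbb{R}^{k\times n}, entry c_{ij} of \mathbf{C} = \mathbf{A}\mathbf{B} is row i of \mathbf{A} dotted with column j of \mathbf{B}, so \mathbf{C} is m \times n: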

B = tf.ones((3, 4), tf.float32)
tf.matmul(A, B)

<tf.Tensor: shape=(2, 4), dtype=float32, numpy=
array([[ 3.,  3.,  3.,  3.],
       [12., 12., 12., 12.]], dtype=float32)>
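Written out by hand, the product is a double loop over rows of the left factor and columns of the right one. This is an illustrative plain-Python sketch, not how TensorFlow computes it; `matmul` here is a local helper that happens to share the name:

```python
def matmul(A, B):
    """Entry (i, j) is row i of A dotted with column j of B."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    cols = list(zip(*B))  # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]                    # 2 x 3
B = [[1.0] * 4 for _ in range(3)]        # 3 x 4, all ones

C = matmul(A, B)
print(C)                   # [[3.0, 3.0, 3.0, 3.0], [12.0, 12.0, 12.0, 12.0]]
print(len(C), len(C[0]))   # 2 4: eight dot products in all
```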

Norms & eigenvalues

how long is a vector, and which directions a matrix keeps

Two rulers: ℓ₂ walks straight, ℓ₁ walks the grid

Norms

For \mathbf{u} = [3, -4] the Euclidean ruler reads 5; the taxicab ruler reads 7:

u = tf.constant([3.0, -4.0])
tf.norm(u)

<tf.Tensor: shape=(), dtype=float32, numpy=5.0>

tf.reduce_sum(tf.abs(u))

<tf.Tensor: shape=(), dtype=float32, numpy=7.0>

\|\mathbf{x}\|_2 = \sqrt{\textstyle\sum_i x_i^2}, \qquad \|\mathbf{x}\|_1 = \textstyle\sum_i |x_i|.
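Both formulas translate almost word for word into plain Python. The function names `l2_norm` and `l1_norm` are illustrative:

```python
import math

def l2_norm(x):
    """Square, sum, square root: straight-line length."""
    return math.sqrt(sum(v * v for v in x))

def l1_norm(x):
    """Sum of absolute values: distance walked along the grid."""
    return sum(abs(v) for v in x)

u = [3.0, -4.0]
print(l2_norm(u))  # 5.0
print(l1_norm(u))  # 7.0
```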

The three norm axioms are checkable facts

Norms

Non-negativity \|\mathbf{x}\| \ge 0 (zero only for \mathbf{x} = \mathbf{0}), homogeneity \|\alpha\mathbf{x}\| = |\alpha|\,\|\mathbf{x}\|, and the triangle inequality \|\mathbf{x}+\mathbf{y}\| \le \|\mathbf{x}\|+\|\mathbf{y}\|. The last two, holding on random vectors:

u, v, alpha = tf.random.normal((6,)), tf.random.normal((6,)), -2.5
print(tf.norm(alpha * u), abs(alpha) * tf.norm(u))
print(tf.norm(u + v) <= tf.norm(u) + tf.norm(v))

tf.Tensor(6.149077, shape=(), dtype=float32) tf.Tensor(6.149077, shape=(), dtype=float32)
tf.Tensor(True, shape=(), dtype=bool)

For \ell_2, the triangle inequality is Cauchy–Schwarz in disguise: expand \|\mathbf{u}+\mathbf{v}\|^2 and bound the cross term (the geometry-and-linear-algebraic-operations section).
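The expansion mentioned above takes three lines. This is the standard argument, written in the deck's notation; the only inequality used is Cauchy–Schwarz, applied to the cross term:

```latex
\begin{aligned}
\|\mathbf{u}+\mathbf{v}\|^2
  &= \|\mathbf{u}\|^2 + 2\,\mathbf{u}^\top\mathbf{v} + \|\mathbf{v}\|^2 \\
  &\le \|\mathbf{u}\|^2 + 2\,\|\mathbf{u}\|\,\|\mathbf{v}\| + \|\mathbf{v}\|^2 \\
  &= \left(\|\mathbf{u}\| + \|\mathbf{v}\|\right)^2 .
\end{aligned}
```

Both sides are nonnegative, so taking square roots gives \|\mathbf{u}+\mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|.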

Frobenius measures a matrix as one long vector

Norms

\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i,j} x_{ij}^2} is the \ell_2 norm of the flattened matrix. For the all-ones 4\times9: \sqrt{36} = 6:

tf.norm(tf.ones((4, 9)))

<tf.Tensor: shape=(), dtype=float32, numpy=6.0>

The spectral norm (how much \mathbf{X} can stretch a vector) needs the singular value decomposition; it arrives in the SVD-and-low-rank-approximation section.
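The "one long vector" reading can be checked by literally flattening the matrix first. A plain-Python sketch on the same all-ones 4×9 example, with illustrative variable names:

```python
import math

X = [[1.0] * 9 for _ in range(4)]        # all-ones 4 x 9 matrix

flat = [x for row in X for x in row]     # flatten to one vector of 36 entries
frobenius = math.sqrt(sum(x * x for x in flat))

print(len(flat))   # 36
print(frobenius)   # 6.0
```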

Eigenvectors: the directions a matrix does not turn

Eigenvalues

\mathbf{A}\mathbf{v} = \lambda\mathbf{v}.

Along an eigenvector, the matrix acts like a scalar: stretch (|\lambda|>1), shrink (|\lambda|<1), flip (\lambda<0), but never turn.

The symmetric matrix from the symmetry slide has three real eigenvalues:

S = tf.constant([[1.0, 2, 3], [2, 0, 4], [3, 4, 5]])
tf.linalg.eigvalsh(S)  # Eigenvalues of a symmetric matrix are real

<tf.Tensor: shape=(3,), dtype=float32, numpy=array([-2.2627726, -0.5984748,  8.861248 ], dtype=float32)>

Keep 8.8612 in mind.

Ten multiplications of a random vector find 8.8612

Eigenvalues · payoff

Multiply a random vector by \mathbf{S} ten times and measure how much the norm grows per step:

v = tf.random.normal((3,))
for _ in range(10):
    prev, v = v, tf.linalg.matvec(S, v)
tf.norm(v) / tf.norm(prev)

<tf.Tensor: shape=(), dtype=float32, numpy=8.861246109008789>

The growth factor converges to \max_i |\lambda_i| = 8.8612: for almost any starting vector (any with a nonzero component along the top eigenvector), the largest-magnitude eigenvalue soon dominates.

Deep networks multiply by dozens of matrices in a row. Whether signals and gradients explode or vanish is this experiment at scale; the analysis returns in the numerical-stability section.
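The experiment above is commonly called power iteration. Run for many more steps, the raw vector would overflow, so a practical variant rescales to unit length before each multiplication. The sketch below does this in plain Python; the starting vector, the step count, and the helper names are arbitrary choices for illustration:

```python
import math

S = [[1.0, 2.0, 3.0],
     [2.0, 0.0, 4.0],
     [3.0, 4.0, 5.0]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def growth_factor(M, v, steps=30):
    """Apply M repeatedly, rescaling to unit length so nothing overflows."""
    growth = 0.0
    for _ in range(steps):
        n = norm(v)
        v = [x / n for x in v]   # unit-length input
        v = matvec(M, v)
        growth = norm(v)         # how much this step stretched the vector
    return growth

print(round(growth_factor(S, [1.0, 1.0, 1.0]), 4))  # approximately 8.8612
```

Rescaling changes only the length of the vector, not its direction, so the per-step growth factor is the same quantity as the norm ratio measured above.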

Recap

Wrap-up

Objects: ranks 0–n; the transpose; symmetry.
Element-wise ops + scalar broadcasting; Hadamard \odot.
Reductions: sum/mean with axis=; keepdims= stays broadcastable.
Dot product \sum_i x_i y_i; normalized it is \cos\theta (Cauchy–Schwarz).
\mathbf{A}\mathbf{x}: one dot product per row (the layer primitive); \mathbf{A}\mathbf{B}: m \times n of them.
Norms: \ell_2 = 5 and \ell_1 = 7 for [3,-4]; Frobenius for matrices; three axioms, all checkable.
Eigenvectors are scaled, never turned, and \max|\lambda| = 8.8612 dominated repeated multiplication.

Next, calculus (the calculus section): every gradient there is built from these products. The full linear-algebra story continues in the linear algebra part of the math appendix.
