Dot products and angles

Geometry and Linear Algebraic Operations

Geometry of Linear Algebra

This section builds the geometric intuition behind the linear algebra used throughout the book. There are two common ways to view a vector \mathbf{v}:

A position — a point in space.
A direction — an arrow from the origin.

Most of deep learning works in the second view. From it we get dot products (similarity), angles, projections, hyperplanes (decision boundaries), and determinants (volume changes).

Vectors as geometry

The same array can name a point or a displacement. Deep learning mostly uses the displacement view: directions, lengths, and angles.

v = [1, 7, 0, 1]

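The dot product ties algebra to geometry: \mathbf{u}^\top \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta, where \theta is the angle between \mathbf{u} and \mathbf{v}. Dividing by both norms gives the cosine similarity, \cos\theta = \mathbf{u}^\top \mathbf{v} / (\|\mathbf{u}\| \|\mathbf{v}\|), which is a normalized dot product. Dot-product similarity of this kind underlies kernel methods, attention, and contrastive learning: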

%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import gluon, np, npx
npx.set_np()

def angle(v, w):
    return np.arccos(v.dot(w) / (np.linalg.norm(v) * np.linalg.norm(w)))

angle(np.array([0, 1, 2]), np.array([2, 3, 4]))
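As a small sanity check on the formula above, the following sketch computes the cosine similarity directly. It uses plain NumPy under the alias onp (chosen here so it does not shadow MXNet's np); the function name cosine_similarity and the example vectors are illustrative assumptions, not part of the original code.

```python
import numpy as onp  # plain NumPy; alias chosen to avoid clashing with MXNet's np


def cosine_similarity(u, v):
    """Return cos(theta) = u.v / (||u|| ||v||) for two nonzero vectors."""
    return u.dot(v) / (onp.linalg.norm(u) * onp.linalg.norm(v))


u = onp.array([1.0, 0.0])
v = onp.array([0.0, 2.0])  # orthogonal to u
w = onp.array([3.0, 0.0])  # same direction as u, different length

cos_uv = cosine_similarity(u, v)  # orthogonal vectors have cosine 0
cos_uw = cosine_similarity(u, w)  # parallel vectors have cosine 1; length is ignored

# Clip before arccos so that round-off cannot push the cosine outside [-1, 1].
theta_uv = onp.arccos(onp.clip(cos_uv, -1.0, 1.0))
print(cos_uv, cos_uw, theta_uv)
```

The clip is a defensive choice: for nearly parallel vectors the computed ratio can land slightly above 1, and arccos would then return nan.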

Hyperplanes as classifiers

A hyperplane is the set \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} = b\}. A linear classifier splits space with one: the sign of \mathbf{w}^\top \mathbf{x} - b gives the prediction. Much of deep learning amounts to learning features good enough that a hyperplane separates the classes:

# Load in the dataset
train = gluon.data.vision.FashionMNIST(train=True)
test = gluon.data.vision.FashionMNIST(train=False)

# Cast everything to float32 up front so that the means and dot products
# computed below all see matching dtypes.
X_train_0 = np.stack([x[0] for x in train if x[1] == 0]).astype('float32')
X_train_1 = np.stack([x[0] for x in train if x[1] == 1]).astype('float32')
X_test = np.stack(
    [x[0] for x in test if x[1] == 0 or x[1] == 1]).astype('float32')
y_test = np.stack(
    [x[1] for x in test if x[1] == 0 or x[1] == 1]).astype('float32')

# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)

# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()

Hyperplanes (cont.)

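Changing the direction of \mathbf{w} rotates the boundary; changing b shifts it along \mathbf{w}. The signed distance from a point \mathbf{x} to the boundary is (\mathbf{w}^\top \mathbf{x} - b) / \|\mathbf{w}\|, and for a correctly classified point its magnitude is the margin.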
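To make the roles of \mathbf{w} and b concrete, here is a minimal sketch of the signed distance to a hyperplane in two dimensions. It assumes plain NumPy under the alias onp; the helper signed_distance and the toy points are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as onp  # plain NumPy; alias chosen to avoid clashing with MXNet's np


def signed_distance(X, w, b):
    """Signed distance of each row of X to the hyperplane {x : w.x = b}."""
    return (X.dot(w) - b) / onp.linalg.norm(w)


w = onp.array([3.0, 4.0])  # normal vector with norm 5
b = 5.0
X = onp.array([[3.0, 4.0],   # w.x = 25, positive side
               [1.0, 0.5],   # w.x = 5, exactly on the hyperplane
               [0.0, 0.0]])  # w.x = 0, negative side

dist = signed_distance(X, w, b)
labels = (dist > 0).astype(int)  # the sign of w.x - b gives the predicted class

# Raising b by ||w|| = 5 moves the boundary one unit along w,
# so every signed distance drops by exactly one.
shifted = signed_distance(X, w, b + 5.0)
print(dist, labels, shifted)
```

Dividing by the norm of \mathbf{w} is what makes the number a geometric distance; without it, rescaling \mathbf{w} and b together would change the value without moving the boundary.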

# Plot average trousers
d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()

# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
predictions = X_test.reshape(2000, -1).dot(w.flatten()) > -1500000

# Accuracy
np.mean(predictions.astype(y_test.dtype) == y_test, dtype=np.float64)
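The threshold above was chosen by eye. One common alternative, sketched here on synthetic data rather than Fashion-MNIST, is to place the hyperplane through the midpoint of the two class means. The Gaussian blobs, the seed, and all variable names are assumptions made for illustration, and plain NumPy is imported as onp so it does not shadow MXNet's np.

```python
import numpy as onp  # plain NumPy; alias chosen to avoid clashing with MXNet's np

rng = onp.random.default_rng(0)

# Synthetic stand-in for two image classes: two well-separated blobs in 2D.
X0 = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(200, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(200, 2))

mean_0, mean_1 = X0.mean(axis=0), X1.mean(axis=0)
w = mean_1 - mean_0               # normal vector pointing from class 0 to class 1
b = w.dot((mean_0 + mean_1) / 2)  # boundary passes through the midpoint of the means

X = onp.concatenate([X0, X1])
y = onp.concatenate([onp.zeros(200), onp.ones(200)])

predictions = X.dot(w) > b        # True means "class 1"
accuracy = onp.mean(predictions == y.astype(bool))
print(accuracy)
```

The midpoint rule treats both classes symmetrically, which is reasonable when they are balanced and similarly spread out. When they are not, a threshold tuned on held-out data may work better.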

Invertibility and determinant

A square matrix is invertible if and only if it does not collapse volume to zero, that is, if and only if its determinant is nonzero. The determinant is the signed factor by which the matrix scales volume:

M = np.array([[1, 2], [1, 4]])
M_inv = np.array([[2, -1], [-0.5, 0.5]])
M_inv.dot(M)

# Note: from here on `np` refers to plain NumPy and shadows MXNet's `np`
import numpy as np
np.linalg.det(np.array([[1, -1], [2, 3]]))
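The link between determinant and invertibility can be checked numerically. The sketch below reuses the matrix M from above and adds a singular matrix S, which is an illustrative assumption; it relies only on numpy.linalg.det and numpy.linalg.inv from plain NumPy, imported as onp.

```python
import numpy as onp  # plain NumPy under an explicit alias

M = onp.array([[1.0, 2.0], [1.0, 4.0]])
det_M = onp.linalg.det(M)          # 1*4 - 2*1 = 2: areas are doubled
M_inv = onp.linalg.inv(M)
det_M_inv = onp.linalg.det(M_inv)  # the inverse undoes the scaling: 1 / det_M

# A singular matrix: the second column is twice the first, so the unit
# square is flattened onto a line and its area collapses to zero.
S = onp.array([[1.0, 2.0], [2.0, 4.0]])
det_S = onp.linalg.det(S)

print(det_M, det_M_inv, det_S)
```

In floating point a determinant is rarely exactly zero, so in practice a tolerance (or the condition number) is a safer invertibility test than comparing with 0.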

In code

Translate all of this into NumPy-style code, starting with tensors of different ranks:

# Define tensors
B = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])

# Print out the shapes
A.shape, B.shape, v.shape

# Reimplement the matrix-vector product with einsum
np.einsum("ij, j -> i", A, v), A.dot(v)

In code (cont.)

These final snippets use einsum for a general tensor contraction, first with a subscript string and then with the equivalent integer-index lists.

np.einsum("ijk, il, j -> kl", B, A, v)

np.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])
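To see what the subscript string means, the contraction can be written out by hand. The sketch below rebuilds the same B, A, and v with plain NumPy (imported as onp) and compares einsum against explicit loops and against a two-step version. The names expected, Bv, and stepwise are introduced here for illustration only.

```python
import numpy as onp  # plain NumPy under an explicit alias

B = onp.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = onp.array([[1, 2], [3, 4]])
v = onp.array([1, 2])

result = onp.einsum("ijk, il, j -> kl", B, A, v)

# Explicit loops: out[k, l] = sum over i, j of B[i, j, k] * A[i, l] * v[j].
# Indices i and j are absent from the output, so they are summed over.
expected = onp.zeros((3, 2), dtype=B.dtype)
for i in range(2):
    for j in range(2):
        for k in range(3):
            for l in range(2):
                expected[k, l] += B[i, j, k] * A[i, l] * v[j]

# The same contraction in two familiar steps: sum out j, then sum out i.
Bv = onp.einsum("ijk, j -> ik", B, v)  # shape (2, 3)
stepwise = Bv.T.dot(A)                 # shape (3, 2)

print(result)
```

The loop version is slow but makes the rule visible: repeated indices are multiplied together, and any index missing from the output is summed.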

Recap

Vectors are directions; dot products measure alignment (cosine similarity once normalized); matrices are linear maps; the determinant is the signed volume scale factor.
Hyperplanes are the decision-boundary primitive of every linear classifier and every linear layer.
These geometric pictures remain useful all the way up to attention and high-dimensional embeddings.