Dot products and angles

Geometry and Linear Algebraic Operations

Geometry of Linear Algebra

This section builds the geometric intuitions behind the linear algebra used throughout the book. A vector \mathbf{v} can be read in two ways:

A position — a point in space.
A direction — an arrow from the origin.

Most of deep learning works in the second view. From it we get dot products (similarity), angles, projections, hyperplanes (decision boundaries), and determinants (volume changes).

Vectors as geometry

The same array can name a point or a displacement. Deep learning mostly uses the displacement view: directions, lengths, and angles.

v = [1, 7, 0, 1]
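As a small illustration of the displacement view, the sketch below treats two arrays as points and their difference as the arrow between them. The names `p`, `q`, and `d` and their values are illustrative assumptions, not taken from the text above.

```python
from jax import numpy as jnp

# Two points in the plane (hypothetical values chosen for easy arithmetic)
p = jnp.array([1.0, 2.0])
q = jnp.array([4.0, 6.0])

# The displacement from p to q is itself a vector: [3, 4]
d = q - p

# Length and unit direction describe the arrow regardless of where it starts
length = jnp.linalg.norm(d)   # sqrt(3^2 + 4^2) = 5
direction = d / length        # approximately [0.6, 0.8]

print(d, length, direction)
```

The same array `d` could equally be read as the point (3, 4); only the interpretation changes, not the data.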

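The dot product satisfies \mathbf{u}^\top \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta, where \theta is the angle between \mathbf{u} and \mathbf{v}. Cosine similarity is the normalized dot product, \cos\theta = \mathbf{u}^\top \mathbf{v} / (\|\mathbf{u}\| \|\mathbf{v}\|). Dot-product similarity of this kind underlies kernel methods, attention, and contrastive learning: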

%matplotlib inline
from d2l import jax as d2l
from IPython import display
import jax
from jax import numpy as jnp
import numpy as np

def angle(v, w):
    return jnp.arccos(jnp.dot(v, w) / (jnp.linalg.norm(v) * jnp.linalg.norm(w)))

angle(jnp.array([0, 1, 2], dtype=jnp.float32), jnp.array([2.0, 3, 4]))

Array(0.41899002, dtype=float32)
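To make the link between the angle and cosine similarity concrete, here is a minimal sketch that skips the `arccos` and returns the normalized dot product directly. The helper name `cosine_similarity` and the test vectors are assumptions made for illustration.

```python
from jax import numpy as jnp

def cosine_similarity(u, v):
    """Dot product of u and v after scaling both to unit length."""
    return jnp.dot(u, v) / (jnp.linalg.norm(u) * jnp.linalg.norm(v))

u = jnp.array([1.0, 0.0])

# Rescaling a vector does not change the cosine, only its direction matters
same = cosine_similarity(u, jnp.array([3.0, 0.0]))        # same direction: 1
orthogonal = cosine_similarity(u, jnp.array([0.0, 2.0]))  # perpendicular: 0
opposite = cosine_similarity(u, jnp.array([-5.0, 0.0]))   # opposite direction: -1

print(same, orthogonal, opposite)
```

When the cosine is passed on to `jnp.arccos`, floating-point rounding can push it slightly outside [-1, 1] for nearly parallel vectors, so clipping the value first is a reasonable precaution.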

Hyperplanes as classifiers

A hyperplane is the set \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} = b\}. A linear classifier splits space with one: the sign of \mathbf{w}^\top \mathbf{x} - b gives the prediction. Much of deep learning amounts to “learn good features so a hyperplane works”:

# Load in the dataset
import tensorflow as tf
((train_images, train_labels), (
    test_images, test_labels)) = tf.keras.datasets.fashion_mnist.load_data()

X_train_0 = jnp.array(train_images[train_labels == 0], dtype=jnp.float32) * 256
X_train_1 = jnp.array(train_images[train_labels == 1], dtype=jnp.float32) * 256
X_test = jnp.array(
    test_images[(test_labels == 0) | (test_labels == 1)], dtype=jnp.float32) * 256
y_test = jnp.array(
    test_labels[(test_labels == 0) | (test_labels == 1)], dtype=jnp.float32)

# Compute averages
ave_0 = jnp.mean(X_train_0, axis=0)
ave_1 = jnp.mean(X_train_1, axis=0)

# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(np.array(ave_0.reshape(28, 28)), cmap='Greys')
d2l.plt.show()

Hyperplanes (cont.)

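Changing the direction of \mathbf{w} rotates the boundary; changing b shifts it along \mathbf{w}. The normalized distance (\mathbf{w}^\top \mathbf{x} - b) / \|\mathbf{w}\| from a point to the boundary is its margin.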
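The distance to the boundary can be sketched in a few lines. The function name `signed_distance` and the numbers below are hypothetical, chosen so that the arithmetic is easy to check by hand.

```python
from jax import numpy as jnp

def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane {x : w.x = b}."""
    return (jnp.dot(w, x) - b) / jnp.linalg.norm(w)

w = jnp.array([3.0, 4.0])   # normal vector with norm 5
b = 5.0

on_plane = signed_distance(jnp.array([3.0, -1.0]), w, b)  # (9 - 4 - 5) / 5 = 0
positive = signed_distance(jnp.array([3.0, 4.0]), w, b)   # (25 - 5) / 5 = 4
negative = signed_distance(jnp.array([0.0, 0.0]), w, b)   # (0 - 5) / 5 = -1

# Scaling w and b together describes the same hyperplane, so the distance is unchanged
rescaled = signed_distance(jnp.array([3.0, 4.0]), 2 * w, 2 * b)

print(on_plane, positive, negative, rescaled)
```

The sign says which side of the boundary a point lies on, and the magnitude says how far from it.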

# Plot average trousers
d2l.plt.imshow(np.array(ave_1.reshape(28, 28)), cmap='Greys')
d2l.plt.show()

# Print test set accuracy with eyeballed threshold
w = ave_1 - ave_0
predictions = X_test.reshape(2000, -1) @ w.flatten() > -1500000

# Accuracy
jnp.mean((predictions.astype(y_test.dtype) == y_test).astype(jnp.float32))

Array(0.892, dtype=float32)
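The threshold above was picked by eye. One common alternative, shown here only as a sketch and not as the method used in the text, is to place the boundary halfway between the two class means. The data below is synthetic (two Gaussian blobs drawn with `jax.random`), so every value in it is an assumption made for illustration.

```python
import jax
from jax import numpy as jnp

# Synthetic two-class data: unit-variance blobs around (-2, -2) and (2, 2)
key0, key1 = jax.random.split(jax.random.PRNGKey(0))
X0 = jax.random.normal(key0, (200, 2)) + jnp.array([-2.0, -2.0])
X1 = jax.random.normal(key1, (200, 2)) + jnp.array([2.0, 2.0])

# Difference of class means gives the normal vector, as in the example above
ave_0, ave_1 = X0.mean(axis=0), X1.mean(axis=0)
w = ave_1 - ave_0

# Threshold: projection of the midpoint between the two means onto w
threshold = jnp.dot(w, (ave_0 + ave_1) / 2)

X = jnp.concatenate([X0, X1])
y = jnp.concatenate([jnp.zeros(200), jnp.ones(200)])
predictions = (X @ w > threshold).astype(jnp.float32)
accuracy = jnp.mean((predictions == y).astype(jnp.float32))

print(threshold, accuracy)
```

This sketch scores the classifier on the same points used to compute the means, so the accuracy is optimistic; with real data the threshold would be computed on the training split and evaluated on held-out examples.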

Invertibility and determinant

A square matrix is invertible if and only if it does not collapse volume to zero, that is, if its determinant is nonzero. The determinant measures the signed factor by which the matrix scales volume:

M = jnp.array([[1, 2], [1, 4]], dtype=jnp.float32)
M_inv = jnp.array([[2, -1], [-0.5, 0.5]])
M_inv @ M

Array([[1., 0.],
       [0., 1.]], dtype=float32)

jnp.linalg.det(jnp.array([[1, -1], [2, 3]], dtype=jnp.float32))

Array(5., dtype=float32)
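The connection between the determinant and invertibility can be checked numerically. The sketch below reuses the matrix `M` from above and adds a singular matrix `S`, which is an illustrative assumption rather than part of the original example.

```python
from jax import numpy as jnp

M = jnp.array([[1, 2], [1, 4]], dtype=jnp.float32)
det_M = jnp.linalg.det(M)      # 1*4 - 2*1 = 2, so areas double
M_inv = jnp.linalg.inv(M)      # should match the hand-written inverse above
det_M_inv = jnp.linalg.det(M_inv)  # the inverse undoes the scaling: 1 / 2

# The second row is twice the first, so S squashes the plane onto a line
S = jnp.array([[1, 2], [2, 4]], dtype=jnp.float32)
det_S = jnp.linalg.det(S)      # 1*4 - 2*2 = 0, so S has no inverse

print(det_M, det_M_inv, det_S)
print(M_inv)
```

Because these are floating-point computations, results should be compared with a tolerance rather than tested for exact equality, in particular when checking whether a determinant is zero.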

In code

In JAX, tensor operations like these are written with jnp arrays and einsum:

# Define tensors
B = jnp.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = jnp.array([[1, 2], [3, 4]])
v = jnp.array([1, 2])

# Print out the shapes
A.shape, B.shape, v.shape

((2, 2), (2, 2, 3), (2,))

# Reimplement matrix multiplication
jnp.einsum("ij, j -> i", A, v), A @ v

(Array([ 5, 11], dtype=int32), Array([ 5, 11], dtype=int32))
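The same index notation covers several other familiar operations. The sketch below lists a few, reusing `A` and `v` from above; the variable names on the left are illustrative.

```python
from jax import numpy as jnp

A = jnp.array([[1, 2], [3, 4]])
v = jnp.array([1, 2])

# A repeated index that is missing from the output is summed over
dot = jnp.einsum("i, i ->", v, v)          # 1*1 + 2*2 = 5
trace = jnp.einsum("ii ->", A)             # 1 + 4 = 5
matmul = jnp.einsum("ij, jk -> ik", A, A)  # same as A @ A

# Indices that all appear in the output are only rearranged or combined
outer = jnp.einsum("i, j -> ij", v, v)     # same as jnp.outer(v, v)
transpose = jnp.einsum("ij -> ji", A)      # same as A.T

print(dot, trace)
print(matmul)
```

Reading the subscript string as "which indices survive" is often the quickest way to predict the output shape.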

In code (cont.)

These final snippets express a more general tensor contraction with einsum, first in string notation and then in the equivalent index-list form.

jnp.einsum("ijk, il, j -> kl", B, A, v)

Array([[ 90, 126],
       [102, 144],
       [114, 162]], dtype=int32)

jnp.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])

Array([[ 90, 126],
       [102, 144],
       [114, 162]], dtype=int32)
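To see what the three-operand contraction does, it can be split into two smaller steps. This is a sketch for checking the result by hand; the intermediate name `Bv` is an assumption and not part of the original code.

```python
from jax import numpy as jnp

B = jnp.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = jnp.array([[1, 2], [3, 4]])
v = jnp.array([1, 2])

# Step 1: sum over j between B (axis 1) and v, leaving indices (i, k), shape (2, 3)
Bv = jnp.einsum("ijk, j -> ik", B, v)   # [[9, 12, 15], [27, 30, 33]]

# Step 2: sum over i between Bv and A, leaving indices (k, l), shape (3, 2)
stepwise = Bv.T @ A

# The single einsum call performs both sums at once
direct = jnp.einsum("ijk, il, j -> kl", B, A, v)

print(stepwise)
print(jnp.array_equal(stepwise, direct))
```

Splitting a contraction this way is also a practical debugging aid when the shape of an einsum result is not what was expected.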

Recap

Vectors act as directions; the normalized dot product is cosine similarity; matrices are linear maps; the determinant is the signed volume scale factor.
Hyperplanes are the decision-boundary primitive of every linear classifier and every linear layer.
These geometric pictures remain useful all the way up to attention and high-dimensional embeddings.