Dot products and angles

Geometry and Linear Algebraic Operations

Geometry of Linear Algebra

The geometric intuitions behind the linear algebra used throughout the book. Two viewpoints on a vector \mathbf{v}:

  • A position — a point in space.
  • A direction — an arrow from the origin.

Most of deep learning works in the second view. From it we get dot products (similarity), angles, projections, hyperplanes (decision boundaries), and determinants (volume changes).

Vectors as geometry

The same array can name a point or a displacement. Deep learning mostly uses the displacement view: directions, lengths, and angles.

v = [1, 7, 0, 1]
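
A minimal sketch of the two views, assuming NumPy and made-up points `p` and `q` (neither is in the original): subtracting two positions gives a displacement, which has a length and a direction.

```python
import numpy as np

# Two points in the plane (hypothetical values chosen for illustration)
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Position view: p and q are locations.
# Displacement view: the arrow from p to q is the coordinate-wise difference.
d = q - p

# Length (Euclidean norm) and unit direction of the displacement
length = np.linalg.norm(d)
direction = d / length

print(d, length, direction)
```

The unit vector `direction` keeps only the orientation of `d`, which is the part that angles and cosine similarity depend on.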

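\mathbf{v}^\top \mathbf{w} = \|\mathbf{v}\| \|\mathbf{w}\| \cos\theta, so \theta = \arccos\left(\frac{\mathbf{v}^\top \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|}\right). Cosine similarity is the normalized dot product \cos\theta, the similarity measure behind kernel methods, attention, and contrastive learning. The function below computes the angle: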

%matplotlib inline
from d2l import tensorflow as d2l
from IPython import display
import tensorflow as tf

def angle(v, w):
    return tf.acos(tf.tensordot(v, w, axes=1) / (tf.norm(v) * tf.norm(w)))

angle(tf.constant([0, 1, 2], dtype=tf.float32), tf.constant([2.0, 3, 4]))
<tf.Tensor: shape=(), dtype=float32, numpy=0.41899001598358154>
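
The same dot product also gives projections, which the overview mentions but the code above does not show. The sketch below assumes NumPy instead of TensorFlow, and the helper names `cosine_similarity` and `project` are illustrative, not part of any library.

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product divided by the product of the norms."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def project(v, w):
    """Orthogonal projection of v onto the line spanned by w."""
    return (np.dot(v, w) / np.dot(w, w)) * w

v = np.array([0.0, 1.0, 2.0])
w = np.array([2.0, 3.0, 4.0])

cos = cosine_similarity(v, w)
# Clipping guards against rounding that lands just outside [-1, 1]
theta = np.arccos(np.clip(cos, -1.0, 1.0))

p = project(v, w)   # the part of v that points along w
residual = v - p    # the remainder, orthogonal to w

print(theta, p, np.dot(residual, w))
```

The angle agrees with the TensorFlow result above up to floating-point precision, and the residual has zero dot product with `w`, which is what makes the projection orthogonal.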

Hyperplanes as classifiers

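A hyperplane is the set \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} = b\}. A linear classifier splits space with one: the sign of \mathbf{w}^\top \mathbf{x} - b gives the prediction. Much of deep learning amounts to learning features good enough that a hyperplane separates the classes. The example below builds such a classifier for Fashion-MNIST t-shirts (label 0) and trousers (label 1), starting from the average image of each class: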

# Load in the dataset
((train_images, train_labels), (
    test_images, test_labels)) = tf.keras.datasets.fashion_mnist.load_data()


# Select classes 0 and 1 with boolean masks and scale the pixel values
X_train_0 = tf.cast(train_images[train_labels == 0], dtype=tf.float32) * 256
X_train_1 = tf.cast(train_images[train_labels == 1], dtype=tf.float32) * 256
test_mask = (test_labels == 0) | (test_labels == 1)
X_test = tf.cast(test_images[test_mask], dtype=tf.float32) * 256
y_test = tf.cast(test_labels[test_mask], dtype=tf.float32)

# Compute averages
ave_0 = tf.reduce_mean(X_train_0, axis=0)
ave_1 = tf.reduce_mean(X_train_1, axis=0)
# Plot average t-shirt
d2l.set_figsize()
d2l.plt.imshow(tf.reshape(ave_0, (28, 28)), cmap='Greys')
d2l.plt.show()

Hyperplanes (cont.)

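Changing the direction of \mathbf{w} rotates the boundary; changing b shifts it along \mathbf{w}. The signed distance from a point \mathbf{x} to the boundary, (\mathbf{w}^\top \mathbf{x} - b) / \|\mathbf{w}\|, is its margin. In the Fashion-MNIST example, \mathbf{w} is the difference of the two class averages and b is a threshold chosen by eye.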
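
The distance to the boundary can be sketched as follows, assuming NumPy and a hypothetical 2D hyperplane; `signed_distance`, `w`, `b`, and the sample points are illustrative and not taken from the original.

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane {x : w.x = b}."""
    return (np.dot(x, w) - b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])   # normal vector (hypothetical), norm 5
b = 5.0                    # offset (hypothetical)

on_plane = np.array([3.0, -1.0])   # 3*3 + 4*(-1) = 5, so it lies on the boundary
positive = np.array([3.0, 4.0])    # 9 + 16 = 25 > 5, on the side w points to
negative = np.array([0.0, 0.0])    # 0 < 5, on the opposite side

for x in (on_plane, positive, negative):
    print(x, signed_distance(x, w, b))
```

The sign says which side of the boundary a point is on and the magnitude says how far away it is. Rescaling `w` and `b` by the same positive factor leaves both unchanged.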

# Plot average trousers
d2l.plt.imshow(tf.reshape(ave_1, (28, 28)), cmap='Greys')
d2l.plt.show()

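# Print test set accuracy with eyeballed threshold
# Flatten the 28x28 difference of averages into a weight vector
w = tf.reshape(ave_1 - ave_0, [-1])
# One score per test image: dot product of the flattened image with w
scores = tf.linalg.matvec(tf.reshape(X_test, (-1, 28 * 28)), w)
# The threshold was picked by eye; it depends on the pixel scaling
# and may need re-tuning
predictions = scores > -1500000

# Accuracy
tf.reduce_mean(
    tf.cast(tf.cast(predictions, y_test.dtype) == y_test, tf.float32))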
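
The mean-difference classifier above can also be tried end to end on synthetic data. This is a sketch under stated assumptions: NumPy instead of TensorFlow, Gaussian blobs standing in for the two image classes, and a threshold placed at the midpoint of the two class means rather than chosen by eye. The midpoint rule is a swapped-in choice for illustration, not the method used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two classes (assumption: Gaussian blobs,
# not Fashion-MNIST images)
X0 = rng.normal(loc=-3.0, scale=1.0, size=(200, 2))   # class 0
X1 = rng.normal(loc=3.0, scale=1.0, size=(200, 2))    # class 1

# Split each class into train and test halves
X0_train, X0_test = X0[:100], X0[100:]
X1_train, X1_test = X1[:100], X1[100:]

# Normal vector: difference of the class means
mean_0 = X0_train.mean(axis=0)
mean_1 = X1_train.mean(axis=0)
w = mean_1 - mean_0

# Assumed threshold: the boundary passes through the midpoint of the means
b = np.dot(w, (mean_0 + mean_1) / 2)

X_test = np.concatenate([X0_test, X1_test])
y_test = np.concatenate([np.zeros(100), np.ones(100)])

scores = X_test @ w   # one dot product per test point, shape (200,)
predictions = (scores > b).astype(y_test.dtype)
accuracy = np.mean(predictions == y_test)
print(accuracy)
```

The key shape detail is that `scores` holds one number per test point, so `predictions` and `y_test` line up element by element when compared.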

Invertibility and determinant

A square matrix is invertible iff it does not collapse volume to zero, that is, iff its determinant is nonzero. The determinant is the signed factor by which the matrix scales volume:

M = tf.constant([[1, 2], [1, 4]], dtype=tf.float32)
M_inv = tf.constant([[2, -1], [-0.5, 0.5]])
tf.matmul(M_inv, M)
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1., 0.],
       [0., 1.]], dtype=float32)>
tf.linalg.det(tf.constant([[1, -1], [2, 3]], dtype=tf.float32))
<tf.Tensor: shape=(), dtype=float32, numpy=5.0>
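
A short sketch of the link between determinant and invertibility, assuming NumPy. `M` is the matrix from the example above; the singular matrix `S` is a hypothetical addition.

```python
import numpy as np

M = np.array([[1.0, 2.0], [1.0, 4.0]])

det_M = np.linalg.det(M)   # area scale factor of the unit square under M
M_inv = np.linalg.inv(M)   # exists because det_M is nonzero

# A singular matrix: the second column is twice the first,
# so the whole plane is squashed onto a line and area scales by zero
S = np.array([[1.0, 2.0], [2.0, 4.0]])
det_S = np.linalg.det(S)

# Swapping the columns of M flips orientation, so the sign of the
# determinant flips while its magnitude stays the same
det_swapped = np.linalg.det(M[:, ::-1])

print(det_M, det_S, det_swapped)
print(M_inv)
```

Here `M_inv` reproduces the hand-written inverse used above, and the determinant of the inverse is the reciprocal of `det_M`, as expected for a map that undoes the volume scaling.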

In code

Translate all of this into TensorFlow, starting with tensor shapes and a matrix-vector product written with einsum:

# Define tensors
B = tf.constant([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = tf.constant([[1, 2], [3, 4]])
v = tf.constant([1, 2])

# Print out the shapes
A.shape, B.shape, v.shape
(TensorShape([2, 2]), TensorShape([2, 2, 3]), TensorShape([2]))
# Reimplement matrix multiplication
tf.einsum("ij, j -> i", A, v), tf.matmul(A, tf.reshape(v, (2, 1)))
(<tf.Tensor: shape=(2,), dtype=int32, numpy=array([ 5, 11], dtype=int32)>,
 <tf.Tensor: shape=(2, 1), dtype=int32, numpy=
 array([[ 5],
        [11]], dtype=int32)>)

In code (cont.)

This final snippet uses Einstein summation to write a more involved tensor contraction in one line: indices that appear in the inputs but not in the output (here i and j) are summed over.

tf.einsum("ijk, il, j -> kl", B, A, v)
<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[ 90, 126],
       [102, 144],
       [114, 162]], dtype=int32)>
# tf.einsum takes the equation as a string; the integer-sublist form
# accepted by np.einsum is not supported in TensorFlow.
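
To make the index bookkeeping concrete, the contraction above can be cross-checked against a two-step version and against explicit loops. The sketch assumes NumPy in place of TensorFlow and reuses the same `A`, `B`, and `v`.

```python
import numpy as np

B = np.arange(1, 13).reshape(2, 2, 3)   # same values as B above
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])

# One-shot contraction: sum over i and j, keep k and l
direct = np.einsum("ijk,il,j->kl", B, A, v)

# Same contraction in two steps
C = np.einsum("ijk,j->ik", B, v)   # C[i, k] = sum_j B[i, j, k] * v[j]
stepwise = C.T @ A                 # result[k, l] = sum_i C[i, k] * A[i, l]

# Same contraction with explicit loops over every index
loops = np.zeros((3, 2), dtype=int)
for i in range(2):
    for j in range(2):
        for k in range(3):
            for l in range(2):
                loops[k, l] += B[i, j, k] * A[i, l] * v[j]

print(direct)
```

All three give the same 3x2 array as the TensorFlow output above. The loop version is slow but shows exactly which indices are summed and which survive.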

Recap

  • Vectors as directions; normalized dot products = cosine similarity; matrices = linear maps; determinant = signed volume scale.
  • Hyperplanes are the decision-boundary primitive of every linear classifier and every linear layer.
  • These geometric pictures keep being useful all the way up to attention and high-dim embeddings.