Automatic Differentiation

Dive into Deep Learning · §1.5

From the chain rule to backpropagation: the engine that differentiates a whole network for you.

Record the forward pass; replay it in reverse

Motivation

Hand-deriving gradients for a million-parameter network is hopeless. Instead, the framework records each operation during the forward pass, then replays that record in reverse, mechanically applying the chain rule from the calculus section, to obtain the gradient with respect to every input at once.

Every training step in this book is one forward pass and one backward pass over this graph.

The mechanics

record forward · sweep backward

A function with a known answer: ∇y = 4x

Mechanics

Differentiate y = 2\,\mathbf{x}^\top\mathbf{x} w.r.t. the vector \mathbf{x}. The analytic answer, \nabla_\mathbf{x} y = 4\mathbf{x}, is our sanity check: autograd must reproduce it exactly.

import jax.numpy as jnp

x = jnp.arange(4.0)
x

Array([0., 1., 2., 3.], dtype=float32)
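
The analytic answer can be checked by hand before trusting any code. A short derivation, writing the inner product as a sum over components (the index i is introduced here only for this derivation):

```latex
y = 2\,\mathbf{x}^\top\mathbf{x} = 2\sum_{i} x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4\,x_i
\quad\Longrightarrow\quad
\nabla_\mathbf{x} y = 4\,\mathbf{x}.
```

For \mathbf{x} = (0, 1, 2, 3) this predicts y = 2(0 + 1 + 4 + 9) = 28 and \nabla_\mathbf{x} y = (0, 4, 8, 12).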

No setup: grad transforms the function

JAX is functional: there is nothing to attach to x, no flag to set and no gradient buffer to allocate. You write the function, and grad transforms it into a new function that computes its derivative. The forward pass is an ordinary call:

y = lambda x: 2 * jnp.dot(x, x)
y(x)

Array(28., dtype=float32)

One backward call returns the whole gradient

One call sweeps the graph in reverse, and the result equals the promised 4\mathbf{x}, at every coordinate:

from jax import grad
# The `grad` transform returns a Python function that
# computes the gradient of the original function
x_grad = grad(y)(x)
x_grad

Array([ 0.,  4.,  8., 12.], dtype=float32)

x_grad == 4 * x

Array([ True,  True,  True,  True], dtype=bool)
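
Training loops usually want the loss value and its gradient together. A minimal sketch, assuming the same function and input as above, using jax.value_and_grad to obtain both from one call and jnp.allclose for a tolerance-based comparison (exact == happens to work here, but floating-point results are generally safer to compare with a tolerance):

```python
import jax
import jax.numpy as jnp


def y(x):
    # Same function as in the text: y = 2 * x^T x
    return 2 * jnp.dot(x, x)


x = jnp.arange(4.0)

# value_and_grad returns a function that yields (y(x), gradient of y at x)
value, x_grad = jax.value_and_grad(y)(x)

# Compare against the analytic gradient 4x with a tolerance
matches = bool(jnp.allclose(x_grad, 4 * x))
print(value, x_grad, matches)
```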

That reverse sweep is the calculus section’s chain rule, run from output to input.

Working with gradients

accumulation · non-scalar outputs · detaching · inference

Each gradient starts fresh

Gradients

Each grad call returns a freshly computed array, so nothing carries over from the previous gradient and there is no buffer to reset:

y = lambda x: x.sum()
grad(y)(x)

Array([1., 1., 1., 1.], dtype=float32)
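
To make the "no buffer" point concrete, the sketch below computes two gradients back to back and then accumulates them by hand. It assumes the same x as above; the names g_dot, g_sum and g_total are illustrative only.

```python
import jax.numpy as jnp
from jax import grad

x = jnp.arange(4.0)

# Two independent gradient computations on the same input
g_dot = grad(lambda x: 2 * jnp.dot(x, x))(x)  # 4x
g_sum = grad(lambda x: x.sum())(x)            # vector of ones

# The second call did not add to or overwrite the first result.
# If accumulation is wanted, it is an explicit sum of arrays.
g_total = g_dot + g_sum
print(g_dot, g_sum, g_total)
```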

Vector outputs: differentiate their sum

Gradients are defined for a scalar loss, and JAX's grad accepts only scalar-valued functions. For a vector y, we differentiate the sum of its components (a vector–Jacobian product with a vector of ones), which is exactly what a per-example batch loss needs:

y = lambda x: x * x
# grad is only defined for scalar output functions
grad(lambda x: y(x).sum())(x)

Array([0., 2., 4., 6.], dtype=float32)
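
The "sum, then differentiate" recipe is one particular vector–Jacobian product. As a hedged illustration, the sketch below spells that out with jax.vjp and a cotangent of ones, and uses jax.jacrev to materialize the full Jacobian for comparison. The function and input match the example above.

```python
import jax
import jax.numpy as jnp


def y(x):
    # Elementwise square: a vector-valued function
    return x * x


x = jnp.arange(4.0)

# Route 1: reduce to a scalar, then take the gradient
g_sum = jax.grad(lambda x: y(x).sum())(x)

# Route 2: a vector-Jacobian product with a cotangent of ones
out, vjp_fn = jax.vjp(y, x)
(g_vjp,) = vjp_fn(jnp.ones_like(out))

# The full Jacobian, for reference; for an elementwise square it is diag(2x)
J = jax.jacrev(y)(x)
print(g_sum, g_vjp)
print(J)
```

Both routes give 2x; the Jacobian is only worth forming when the individual rows are actually needed.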

detach freezes a value: ∂z/∂x = u, not 3x²

Sometimes a value should count as a constant, so that gradients do not flow through it. In JAX the tool is jax.lax.stop_gradient (detach in PyTorch): it blocks the backward flow through the value, so z = u \cdot x differentiates to u, not to 3x^2:

import jax

y = lambda x: x * x
# jax.lax primitives are Python wrappers around XLA operations
u = jax.lax.stop_gradient(y(x))
z = lambda x: u * x

grad(lambda x: z(x).sum())(x) == u

Array([ True,  True,  True,  True], dtype=bool)
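
In the snippet above, u is computed outside the differentiated function, so grad already treats it as a constant. The effect of stop_gradient shows more clearly when the value is computed inside the function being differentiated. A minimal sketch under that assumption (z_full and z_detached are hypothetical names introduced for the comparison):

```python
import jax
import jax.numpy as jnp

x = jnp.arange(4.0)


def z_full(x):
    u = x * x                 # gradient flows through u
    return (u * x).sum()      # sum of x^3, so the gradient is 3x^2


def z_detached(x):
    u = jax.lax.stop_gradient(x * x)  # u is treated as a constant
    return (u * x).sum()              # the gradient is u = x^2


g_full = jax.grad(z_full)(x)
g_detached = jax.grad(z_detached)(x)
print(g_full, g_detached)
```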

Inference skips the bookkeeping

When we only need the value (prediction, evaluation, manual updates), we skip the gradient bookkeeping. In JAX there is nothing to switch off: an ordinary call records no graph, so inference pays nothing extra. This is the default mode for inference throughout the book:

# No graph is built unless we ask for it via a transform like `grad`
y = 2 * jnp.dot(x, x)
y

Array(28., dtype=float32)

Dynamic graphs

the graph is whatever actually ran

The graph records what actually ran

Dynamic graphs

Autograd never sees your ifs and whiles; it records whichever ops executed. This function’s loop count and branch both depend on its input:

def f(a):
    b = a * 2
    while jnp.linalg.norm(b) < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Branch or loop, the gradient is exact: f(a)/a

Each call realizes a concrete graph that backward can walk. Whichever branch ran, f scaled its input by some constant, so f(a) = k\,a and the gradient must equal f(a)/a. It does:

from jax import random
a = random.normal(random.key(1), ())
d = f(a)
d_grad = grad(f)(a)

d_grad == d / a

Array(True, dtype=bool)
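
The same check can be repeated with fixed inputs chosen to exercise both branches. The sketch below redefines f so it is self-contained, picks a = 3.0 (positive branch) and a = -0.5 (negative branch) as illustrative values, and compares with jnp.allclose rather than exact equality, since f(a)/a is a floating-point division.

```python
import jax.numpy as jnp
from jax import grad


def f(a):
    b = a * 2
    # Loop count depends on the input's magnitude
    while jnp.linalg.norm(b) < 1000:
        b = b * 2
    # Branch depends on the input's sign
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c


results = {}
for a0 in (3.0, -0.5):
    a = jnp.array(a0)
    d = f(a)
    d_grad = grad(f)(a)
    # f is linear along the path that ran, so the slope is f(a) / a
    results[a0] = (float(d), float(d_grad), bool(jnp.allclose(d_grad, d / a)))

print(results)
```

For a = 3.0 the loop doubles eight times and the slope is 512; for a = -0.5 it doubles ten times, takes the other branch, and the slope is 204800.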

Reverse mode: the whole gradient for one extra pass

Beyond · payoff

A counting argument settles which way to sweep. With n inputs and m outputs, the full derivative matrix costs m reverse sweeps or n forward sweeps, each sweep priced at roughly one function evaluation.

A training loss has m = 1 and n in the millions: one reverse sweep delivers every parameter’s gradient for the cost of about one extra forward pass. Forward mode wins in the opposite regime (few inputs, many outputs) and is the usual choice for Hessian–vector products, where it is applied on top of a reverse-mode gradient.
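
Both sweep directions are available as JAX transforms, which makes the counting argument easy to try. A small sketch, assuming a hypothetical map g from 3 inputs to 2 outputs and a hypothetical scalar function loss: jax.jacrev and jax.jacfwd build the same Jacobian from opposite directions, jax.jvp performs a single forward sweep, and composing jvp with grad gives a Hessian–vector product.

```python
import jax
import jax.numpy as jnp


def g(x):
    # Hypothetical example map from R^3 to R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])


x = jnp.array([1.0, 2.0, 3.0])

# Same 2x3 Jacobian, built conceptually from m = 2 reverse sweeps
# or n = 3 forward sweeps
J_rev = jax.jacrev(g)(x)
J_fwd = jax.jacfwd(g)(x)

# One forward sweep is one Jacobian-vector product J @ v
v = jnp.array([1.0, 0.0, 0.0])
_, Jv = jax.jvp(g, (x,), (v,))


def loss(x):
    # Hypothetical scalar function; its Hessian is diag(6x)
    return jnp.sum(x ** 3)


# Hessian-vector product: forward mode applied to the reverse-mode gradient
hvp = jax.jvp(jax.grad(loss), (x,), (v,))[1]
print(J_rev, Jv, hvp)
```

The Hessian–vector product never forms the full Hessian, which is what makes it affordable when n is large.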

Differentiate the derivative: f″(2) = 12

Beyond

The gradient is itself a function on the graph, so we can differentiate it again. For f(x) = x^3 we have f'(x) = 3x^2 and f''(x) = 6x, so at x = 2 both f'(2) = 12 and f''(2) = 12. The two values agree only by coincidence, and autograd reproduces both:

f = lambda x: x ** 3
dy = grad(f)(2.0)            # 3x^2 = 12
d2y = grad(grad(f))(2.0)    # 6x  = 12
dy, d2y

(Array(12., dtype=float32, weak_type=True),
 Array(12., dtype=float32, weak_type=True))
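
Nesting grad works for scalar inputs; for a vector input the second derivative is a matrix. As a hedged illustration, the sketch below uses jax.hessian on the earlier function y = 2 x^T x, whose Hessian should be 4I, and nests grad a third time on x^3 to get the constant 6.

```python
import jax
import jax.numpy as jnp


def y(x):
    # Same quadratic as earlier in the section
    return 2 * jnp.dot(x, x)


x = jnp.arange(4.0)

# Second derivatives of a scalar function of a vector: a 4x4 matrix
H = jax.hessian(y)(x)

# Third derivative of x^3 is the constant 6
d3 = jax.grad(jax.grad(jax.grad(lambda x: x ** 3)))(2.0)
print(H, d3)
```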

Recap

Wrap-up

Record forward, sweep backward: the chain rule, automated and verified against 4\mathbf{x}.
One reverse sweep = the whole gradient (m{=}1, n in millions).
detach / no-grad keep values out of the graph.

Mind per-framework gradient handling (PyTorch accumulates).
The graph is built at runtime; control flow needs no special handling.
Higher-order derivatives: differentiate the gradient again.

Backpropagation through real networks gets its full treatment in the backpropagation section; forward vs. reverse mode is derived in the matrix-calculus-and-automatic-differentiation section.