Automatic Differentiation

Dive into Deep Learning · §1.5

From the chain rule to backpropagation: the engine that differentiates a whole network for you.

Record the forward pass; replay it in reverse

Motivation

Hand-deriving gradients for a million-parameter network is hopeless. Instead, the framework records each operation as the forward pass runs, then replays the record in reverse, mechanically applying the chain rule from the calculus section. One reverse replay yields the gradient of the output with respect to every input at once.

Every training step in this book is one forward pass and one backward pass over this graph.

The mechanics

record forward · sweep backward

A function with a known answer: ∇y = 4x

Mechanics

Differentiate $y = 2\,\mathbf{x}^\top\mathbf{x}$ w.r.t. the vector $\mathbf{x}$. The analytic answer, $\nabla_\mathbf{x} y = 4\mathbf{x}$, is our sanity check: autograd must reproduce it exactly.

from mxnet import autograd, np, npx
npx.set_np()  # enable NumPy-compatible semantics in MXNet
x = np.arange(4.0)
x

array([0., 1., 2., 3.])
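The analytic answer quoted above can be checked coordinate by coordinate. The short derivation below uses only the definition of the dot product; the index names $i$ and $j$ are introduced here for illustration.

```latex
y = 2\,\mathbf{x}^\top\mathbf{x} = 2\sum_{i} x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_j} = 4\,x_j
\quad\Longrightarrow\quad
\nabla_{\mathbf{x}}\, y = 4\,\mathbf{x}.
```

For $\mathbf{x} = (0, 1, 2, 3)$ this gives $y = 2\,(0 + 1 + 4 + 9) = 28$ and $\nabla_{\mathbf{x}} y = (0, 4, 8, 12)$.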

Track x, and the graph builds itself

First tell the framework to track x (reserve a slot for its gradient), then run the forward pass; y is now the root of a recorded graph:

# We allocate memory for a tensor's gradient by invoking `attach_grad`
x.attach_grad()
# After we calculate a gradient taken with respect to `x`, we will be able to
# access it via the `grad` attribute, whose values are initialized with 0s
x.grad

array([0., 0., 0., 0.])

# Our code is inside an `autograd.record` scope to build the computational
# graph
with autograd.record():
    y = 2 * np.dot(x, x)
y

array(28.)

One backward call returns the whole gradient

One call sweeps the graph in reverse, and the result equals the promised $4\mathbf{x}$ at every coordinate:

y.backward()
x.grad

array([ 0.,  4.,  8., 12.])

x.grad == 4 * x

array([ True,  True,  True,  True])

That reverse sweep is the calculus section’s chain rule, run from output to input.
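To make the reverse sweep concrete, here is a minimal sketch of a scalar reverse-mode engine in plain Python. It is not how MXNet implements autograd; the `Var` class and the `backward` function are hypothetical names introduced only for illustration, and only addition and multiplication are supported. Each operation records its parents together with the local derivative, and the backward sweep multiplies those local derivatives along every path from output to input.

```python
# Illustrative sketch only: a tiny scalar reverse-mode engine, not MXNet's implementation.

class Var:
    """A scalar value that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Sweep the recorded graph from output to inputs, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        # Post-order depth-first search: parents are appended before their children.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reversed post-order visits each node only after all of its consumers.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


# Forward pass: y = 2 * dot(x, x) for x = [0, 1, 2, 3]
xs = [Var(float(i)) for i in range(4)]
dot = Var(0.0)
for xi in xs:
    dot = dot + xi * xi
y = Var(2.0) * dot

backward(y)
grads = [xi.grad for xi in xs]
print(y.value, grads)
```

With the same input as the section's example, the sketch reaches the same value and the same gradient $4\mathbf{x}$, which is a useful check that the sweep order and the local derivatives are right.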

Working with gradients

accumulation · non-scalar outputs · detaching · inference

Each gradient starts fresh

Gradients

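In MXNet, each backward call overwrites the stored gradient by default, so there is no buffer to reset between computations. Other frameworks differ: PyTorch adds to the existing gradient unless it is zeroed first. Here the gradient of x.sum() replaces the earlier 4x: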

with autograd.record():
    y = x.sum()
y.backward()
x.grad  # Overwritten by the newly calculated gradient

array([1., 1., 1., 1.])
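The difference between overwriting and accumulating can be sketched without any framework. The `GradSlot` class below is hypothetical and exists only for this illustration; its policy names `"write"` and `"add"` are chosen to echo the two behaviours described above, and the two gradients fed in are the ones computed so far in this section.

```python
# Illustrative sketch only: GradSlot is a hypothetical stand-in for a gradient buffer.

class GradSlot:
    """Gradient buffer with a configurable update policy."""

    def __init__(self, size, policy="write"):
        if policy not in ("write", "add"):
            raise ValueError("policy must be 'write' or 'add'")
        self.values = [0.0] * size
        self.policy = policy

    def store(self, new_grad):
        if self.policy == "write":
            # Overwrite: the previous gradient is discarded.
            self.values = list(new_grad)
        else:
            # Accumulate: the new gradient is added to what is already stored.
            self.values = [old + new for old, new in zip(self.values, new_grad)]


x = [0.0, 1.0, 2.0, 3.0]
grad_dot = [4.0 * xi for xi in x]  # gradient of 2 * dot(x, x)
grad_sum = [1.0] * len(x)          # gradient of x.sum()

overwrite = GradSlot(len(x), policy="write")
accumulate = GradSlot(len(x), policy="add")
for g in (grad_dot, grad_sum):
    overwrite.store(g)
    accumulate.store(g)

print(overwrite.values)   # only the latest gradient survives
print(accumulate.values)  # both gradients summed
```

Under the accumulating policy a training loop has to clear the buffer between steps, which is why the choice of policy matters when porting code between frameworks.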

Vector outputs: differentiate their sum

Gradients are defined for a scalar loss. For a vector y, MXNet's backward differentiates the sum of its components, which is a vector–Jacobian product with a vector of ones. That is exactly what a batch of per-example losses needs:

with autograd.record():
    y = x * x
y.backward()
x.grad  # Equals the gradient of y = sum(x * x)

array([0., 2., 4., 6.])
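The vector–Jacobian product can be spelled out by hand for this small case. The sketch below builds the Jacobian of the elementwise square explicitly and multiplies it by a vector of ones; real engines never materialize the Jacobian, so this is only a check of the arithmetic. The helper names `jacobian_of_square` and `vjp` are hypothetical.

```python
# Illustrative sketch only: explicit Jacobian for y = x * x (elementwise).

def jacobian_of_square(x):
    """J[i][j] = d y_i / d x_j, which is 2 * x_i on the diagonal and 0 elsewhere."""
    n = len(x)
    return [[2.0 * x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def vjp(v, jac):
    """Vector-Jacobian product v^T J, one entry per input coordinate."""
    n_out, n_in = len(jac), len(jac[0])
    return [sum(v[i] * jac[i][j] for i in range(n_out)) for j in range(n_in)]


x = [0.0, 1.0, 2.0, 3.0]
ones = [1.0] * len(x)

# Weighting every output by 1 is the same as differentiating sum(y).
grad_of_sum = vjp(ones, jacobian_of_square(x))
print(grad_of_sum)
```

Replacing `ones` with another weight vector gives the gradient of the corresponding weighted sum of outputs, which is the general form of the same operation.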

detach freezes a value: ∂z/∂x = u, not 3x²

Sometimes a value should count as a constant, so that gradients do not flow through it. detach (stop_gradient in some frameworks) returns the same value, cut off from the computation that produced it, so $z = u \cdot x$ differentiates to $u$, not to $3x^2$:

with autograd.record():
    y = x * x
    u = y.detach()
    z = u * x
z.backward()
x.grad == u

array([ True,  True,  True,  True])
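A framework-free way to see the two answers is a finite-difference check on a single coordinate. In the sketch below, `u_frozen` plays the role of the detached value: it is computed once and then treated as a plain number. The function names and the sample point `x0 = 3.0` are assumptions made for this illustration.

```python
# Illustrative sketch only: numeric check of d z / d x with and without a frozen u.

def numeric_grad(fn, x, eps=1e-6):
    """Central finite-difference estimate of the derivative of fn at x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


x0 = 3.0
u_frozen = x0 * x0  # like u = y.detach(): a constant that no longer depends on x


def z_detached(x):
    return u_frozen * x   # u is held fixed, so d z / d x = u


def z_attached(x):
    return (x * x) * x    # u still depends on x, so d z / d x = 3 * x**2


g_detached = numeric_grad(z_detached, x0)
g_attached = numeric_grad(z_attached, x0)
print(g_detached, u_frozen)      # close to u = 9
print(g_attached, 3 * x0 ** 2)   # close to 3 * x0**2 = 27
```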

Inference skips the bookkeeping

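When we only need the value (prediction, evaluation, manual parameter updates), we skip recording and pay nothing for it. Code outside a record scope is never recorded, and inside one, autograd.pause() suspends recording for the enclosed block. Unrecorded execution is the default mode for inference throughout the book: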

with autograd.record():
    with autograd.pause():
        y = 2 * np.dot(x, x)  # not recorded: no gradient will flow through y
y

array(28.)

Dynamic graphs

the graph is whatever actually ran

The graph records what actually ran

Autograd never sees your ifs and whiles; it records whichever ops executed. This function’s loop count and branch both depend on its input:

def f(a):
    b = a * 2
    while np.linalg.norm(b) < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Branch or loop, the gradient is exact: f(a)/a

Each call realizes a concrete graph that backward can walk. Whichever branch ran, f only multiplied its input by constants, so $f(a) = k\,a$ for some $k$ that is fixed for that particular run, and the gradient must equal $k = f(a)/a$. It does:

a = np.random.normal()
a.attach_grad()
with autograd.record():
    d = f(a)
d.backward()

a.grad == d / a

array(True)
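The same identity can be checked without autograd by comparing a finite-difference slope with f(a)/a. The sketch below is a scalar port of f in plain Python, with `abs` standing in for the norm; the two sample inputs are hand-picked assumptions chosen so that each branch runs once and so that neither sits near a point where the loop count changes.

```python
# Illustrative sketch only: scalar port of f, checked by finite differences.

def f(a):
    """Loop count and branch both depend on the input value."""
    b = a * 2
    while abs(b) < 1000:  # abs() stands in for np.linalg.norm on a scalar
        b = b * 2
    if b > 0:
        c = b
    else:
        c = 100 * b
    return c


def numeric_grad(fn, a, eps=1e-6):
    """Central finite-difference estimate of the derivative of fn at a."""
    return (fn(a + eps) - fn(a - eps)) / (2 * eps)


# a = 0 is excluded: the loop would never terminate for it.
checks = []
for a in (0.7, -1.3):
    slope = numeric_grad(f, a)
    ratio = f(a) / a
    checks.append((a, slope, ratio))
    print(f"a={a:+.1f}  finite difference={slope:.3f}  f(a)/a={ratio:.3f}")
```

The two inputs produce different constants k because they take different paths through the loop and the branch, yet in both cases the slope matches f(a)/a.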

Reverse mode: the whole gradient for one extra pass

Beyond · payoff

A counting argument settles which way to sweep. With n inputs and m outputs, the full derivative matrix (the Jacobian) costs m reverse sweeps or n forward sweeps, each sweep priced at a small constant multiple of one function evaluation.

A training loss has m = 1 and n in the millions: one reverse sweep delivers every parameter’s gradient for about the cost of one extra forward pass, where forward mode would need n sweeps. Forward mode wins the opposite regime (few inputs, many outputs) and Hessian–vector products.
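The counting argument can be made tangible with a small forward-mode sketch based on dual numbers. Each forward sweep carries a derivative along one input direction, so recovering the full gradient of a scalar function of n inputs takes n sweeps. The `Dual` class and the helper names are hypothetical and support only the two operations this example needs.

```python
# Illustrative sketch only: forward-mode differentiation with dual numbers.

class Dual:
    """A value paired with its derivative along one chosen input direction."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # Product rule applied to the tangent part.
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)


def y_fn(xs):
    """y = 2 * dot(x, x), written with Dual arithmetic."""
    total = Dual(0.0)
    for xi in xs:
        total = total + xi * xi
    return Dual(2.0) * total


def forward_mode_gradient(point):
    """Recover the gradient one coordinate at a time, counting the sweeps used."""
    grad, sweeps = [], 0
    for i in range(len(point)):
        # Seed a tangent of 1 on input i only: one sweep yields one partial derivative.
        xs = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        grad.append(y_fn(xs).tangent)
        sweeps += 1
    return grad, sweeps


grad, sweeps = forward_mode_gradient([0.0, 1.0, 2.0, 3.0])
print(grad, sweeps)
```

Four inputs cost four forward sweeps here, whereas a single reverse sweep produces the same four partial derivatives; with millions of parameters that gap decides the choice.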

Recap

Wrap-up

Record forward, sweep backward: the chain rule, automated and verified against $4\mathbf{x}$.
One reverse sweep = the whole gradient (m = 1, n in the millions).
detach and autograd.pause (no-grad) keep values out of the graph.

Gradient handling differs by framework: MXNet overwrites by default, PyTorch accumulates.
The graph is built at runtime; control flow needs no special handling.
Higher-order derivatives: differentiate the gradient again.

Backpropagation through real networks gets its full treatment in the backpropagation section; forward vs. reverse mode is derived in the matrix-calculus-and-automatic-differentiation section.