Automatic Differentiation

Dive into Deep Learning · §1.5

From the chain rule to backpropagation: the engine that differentiates a whole network for you.

Record the forward pass; replay it in reverse

Motivation

Hand-deriving gradients for a million-parameter network is hopeless. Instead, the framework records each operation as the forward pass runs, then replays the record in reverse, mechanically applying the chain rule from the calculus section, to obtain the gradient of the output with respect to every input at once.

Every training step in this book is one forward pass and one backward pass over this graph.
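
To make "record forward, replay in reverse" concrete, here is a minimal pure-Python sketch of the idea. The `Var` class and its fields are hypothetical names invented for illustration; this is not how PyTorch is implemented, only the same bookkeeping at toy scale.

```python
# Toy reverse-mode sketch (hypothetical class, not PyTorch internals).
# Each Var remembers its parents and the local derivative w.r.t. each parent.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Post-order traversal; reversed, it visits each node before its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad  # chain rule, accumulated


# y = 2 * (x0*x0 + x1*x1), so dy/dx_i = 4 * x_i
x0, x1, two = Var(1.0), Var(3.0), Var(2.0)
y = two * (x0 * x0 + x1 * x1)
y.backward()
print(y.value, x0.grad, x1.grad)  # 20.0 4.0 12.0
```

The forward pass builds the record as a side effect of ordinary arithmetic; `backward` then needs only that record, not the source code of the function.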

The mechanics

record forward · sweep backward

A function with a known answer: ∇y = 4x

Mechanics

Differentiate y = 2\,\mathbf{x}^\top\mathbf{x} w.r.t. the vector \mathbf{x}. The analytic answer, \nabla_\mathbf{x} y = 4\mathbf{x}, is our sanity check: autograd must reproduce it exactly.

import torch
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

Track x, and the graph builds itself

First tell the framework to track x (reserve a slot for its gradient), then run the forward pass; y is now the root of a recorded graph:

# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad  # The gradient is None by default

y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

One backward call returns the whole gradient

One call sweeps the graph in reverse, and the result equals the promised 4\mathbf{x}, at every coordinate:

y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

x.grad == 4 * x

tensor([True, True, True, True])

That reverse sweep is the calculus section’s chain rule, run from output to input.
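
For this example the sweep can be written out by hand. The intermediate name s is introduced here only for illustration; it stands for the dot product that the forward pass computed before the multiplication by 2.

```latex
\begin{aligned}
s &= \mathbf{x}^\top\mathbf{x}, \qquad y = 2s \\
\frac{\partial y}{\partial s} &= 2, \qquad \nabla_\mathbf{x} s = 2\mathbf{x} \\
\nabla_\mathbf{x} y &= \frac{\partial y}{\partial s}\,\nabla_\mathbf{x} s = 2 \cdot 2\mathbf{x} = 4\mathbf{x}
\end{aligned}
```

With \mathbf{x} = (0, 1, 2, 3) this gives (0, 4, 8, 12), matching x.grad above.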

Working with gradients

accumulation · non-scalar outputs · detaching · inference

Gradients accumulate: reset first

Gradients

PyTorch adds each new gradient into x.grad rather than replacing it (handy for summing losses). So zero it before a fresh computation:

x.grad.zero_()  # Reset the gradient
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])

Forgetting .zero_() between iterations is a classic training bug.
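
A small sketch of what that bug looks like, assuming the same four-element x as above. The variable names `first` and `accumulated` are introduced here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Two backward passes with no reset: the second gradient is added to the first.
x.sum().backward()
first = x.grad.clone()
x.sum().backward()
accumulated = x.grad.clone()
print(first)        # tensor([1., 1., 1., 1.])
print(accumulated)  # tensor([2., 2., 2., 2.])

# Resetting first restores the single-pass gradient.
x.grad.zero_()
x.sum().backward()
print(x.grad)       # tensor([1., 1., 1., 1.])
```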

Vector outputs: differentiate their sum

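Gradients are defined for a scalar loss. For a vector y, PyTorch's backward raises an error unless we pass a gradient vector v, and then computes the vector–Jacobian product of v with the Jacobian of y. Passing a vector of ones differentiates the sum of the components, exactly what a per-example batch loss needs: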

x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y)))  # Equivalently: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
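
The vector passed to backward need not be all ones. The sketch below uses a hypothetical weight vector `v` and `torch.autograd.grad`, which returns the product instead of writing it into x.grad; it also shows that a non-scalar backward with no argument is rejected.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x

# Without a gradient argument, backward on a vector output is an error.
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True

# Vector-Jacobian product with an arbitrary (hypothetical) weight vector v.
# The Jacobian of x * x is diag(2x), so the result is 2 * x * v.
y = x * x
v = torch.tensor([1.0, 0.0, 2.0, 0.5])
(vjp,) = torch.autograd.grad(y, x, grad_outputs=v)
print(vjp)  # tensor([0., 0., 8., 3.])
```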

detach freezes a value: ∂z/∂x = u, not 3x²

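Sometimes a value should count as a constant, so that gradients do not flow through it. detach (stop_gradient in some other frameworks) returns a tensor with the same values but no recorded history. With u = y.detach(), z = u \cdot x differentiates to u; without the detach, z = x^3 would differentiate to 3x^2: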

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])
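
For contrast, a short sketch that runs both versions side by side. The names `grad_detached` and `grad_full` are introduced here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
u = y.detach()  # same values as y, but treated as a constant

# With detach: z = u * x, so dz/dx = u = x^2
(u * x).sum().backward()
grad_detached = x.grad.clone()

# Without detach: z = y * x = x^3, so dz/dx = 3x^2
x.grad.zero_()
(y * x).sum().backward()
grad_full = x.grad.clone()

print(u.requires_grad)  # False
print(grad_detached)    # tensor([0., 1., 4., 9.])
print(grad_full)        # tensor([ 0.,  3., 12., 27.])
```

Detaching affects only u; y keeps its own history, which is why the second backward call can still reach x through it.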

Inference skips the bookkeeping

When we only need the value (prediction, evaluation, manual updates), we turn recording off and pay nothing for it. This is the default mode for inference throughout the book:

with torch.no_grad():
    y = 2 * torch.dot(x, x)
y.requires_grad  # False: nothing was recorded, so y has no graph

False
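
The "manual updates" case might look like the sketch below: one hand-written gradient-descent step on a toy loss. The parameter `w`, the loss, and the learning rate `lr` are hypothetical values chosen for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1  # hypothetical learning rate

loss = (w ** 2).sum()  # gradient is 2w
loss.backward()

with torch.no_grad():
    w -= lr * w.grad   # in-place update; not recorded in any graph
w.grad.zero_()         # reset before the next step

print(w)  # values 0.8 and -1.6; w still has requires_grad=True
```

Without the no_grad block, PyTorch would refuse the in-place update on a tensor that is being tracked.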

Dynamic graphs

the graph is whatever actually ran

The graph records what actually ran

Dynamic graphs

Autograd never sees your ifs and whiles; it records whichever ops executed. This function’s loop count and branch both depend on its input:

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Branch or loop, the gradient is exact: f(a)/a

Each call realizes a concrete graph that backward can walk. Whichever branch ran and however many times the loop iterated, f multiplied its input by a constant k fixed by that execution path, so f(a) = k\,a near a and the gradient must equal k = f(a)/a. It does:

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

a.grad == d / a

tensor(True)
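
To see different execution paths produce different graphs, the sketch below repeats f and evaluates it at three fixed inputs chosen here for illustration (a random input works too, but fixed ones make the slopes predictable). An input of exactly 0 is avoided because the loop would never terminate.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

slopes = []
for value in (0.5, -3.0, 2000.0):
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    # The recorded slope k matches f(a) / a on every path.
    assert a.grad == d / a
    slopes.append(a.grad.item())

# 0.5: ten loop doublings, positive branch; -3.0: eight doublings, then the
# 100x branch; 2000.0: the loop body never runs.
print(slopes)  # [2048.0, 51200.0, 2.0]
```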

Reverse mode: the whole gradient for one extra pass

Beyond · payoff

A counting argument settles which way to sweep. With n inputs and m outputs, the full derivative matrix costs m reverse sweeps or n forward sweeps, each sweep priced at roughly one function evaluation.

A training loss has m = 1 and n in the millions: one reverse sweep delivers every parameter's gradient for roughly the cost of one extra forward pass (a small constant multiple in practice). Forward mode wins in the opposite regime (few inputs, many outputs) and is also useful as the outer derivative in Hessian–vector products.
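
The counting argument can be checked on a small case. The sketch below assumes a hypothetical map g from three inputs to two outputs and builds its full Jacobian with m = 2 reverse sweeps, one per output, each seeded with a one-hot vector.

```python
import torch

def g(x):
    # Hypothetical map from R^3 to R^2, chosen so the Jacobian is easy to read
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = g(x)

rows = []
for i in range(len(y)):
    e = torch.zeros_like(y)
    e[i] = 1.0  # select output i
    # One reverse sweep yields one row of the Jacobian.
    (row,) = torch.autograd.grad(y, x, grad_outputs=e, retain_graph=True)
    rows.append(row)
jac = torch.stack(rows)

# Analytic Jacobian: [[x1, x0, 0], [0, 1, 2*x2]]
print(jac)  # rows [2., 1., 0.] and [0., 1., 6.]
```

With a scalar loss the loop runs once, which is the m = 1 case that training relies on.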

Differentiate the derivative: f″(2) = 12

Beyond

The gradient is itself a function on the graph, so we can differentiate it. For f(x) = x^3 we have f'(x) = 3x^2 and f''(x) = 6x, so at x = 2 both f'(2) and f''(2) equal 12. The two agree only by coincidence, and autograd reproduces both:

x3 = torch.tensor(2.0, requires_grad=True)
dy = torch.autograd.grad(x3 ** 3, x3, create_graph=True)[0]  # 3x^2 = 12
d2y = torch.autograd.grad(dy, x3)[0]                          # 6x  = 12
dy, d2y

(tensor(12., grad_fn=<MulBackward0>), tensor(12.))
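
The same trick extends to vectors. As a sketch, the code below computes a Hessian–vector product by differentiating the dot product of the gradient with a fixed direction. Note that it uses two reverse sweeps (double backward), not forward mode; the function and the direction `v` are hypothetical choices for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.0, -1.0])  # hypothetical direction

y = (x ** 3).sum()
# First sweep: gradient 3x^2, kept on the graph so it can be differentiated.
(g,) = torch.autograd.grad(y, x, create_graph=True)
# Second sweep: d(g . v)/dx = H v, where H = diag(6x) for this function.
(hv,) = torch.autograd.grad(torch.dot(g, v), x)

print(hv)  # tensor([  6.,   0., -18.])
```

The full Hessian is never formed; only its action on v is computed.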

Recap

Wrap-up

Record forward, sweep backward: the chain rule, automated and verified against 4\mathbf{x}.
One reverse sweep = the whole gradient (m{=}1, n in millions).
detach / no-grad keep values out of the graph.

Mind per-framework gradient handling (PyTorch accumulates).
The graph is built at runtime; control flow needs no special handling.
Higher-order derivatives: differentiate the gradient again.

Backpropagation through real networks gets its full treatment in the backpropagation section; forward vs. reverse mode is derived in the matrix-calculus-and-automatic-differentiation section.