Automatic Differentiation
Hand-deriving gradients for a 100-million-parameter network is a non-starter. Every modern framework ships an automatic differentiation engine that:
Records each operation onto a computational graph.
Walks the graph in reverse to apply the chain rule.
Returns the gradient with respect to every input you asked about — typically the model parameters.
This chapter teaches the API; the rest of the book leans on it.
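To make the record, reverse-walk, and return steps concrete, here is a minimal pure-Python sketch of a reverse-mode engine for scalars. The names Var and backward are hypothetical and invented for this illustration; this is not how TensorFlow is implemented, only the same idea at toy scale.

```python
class Var:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__


def backward(output):
    """Visit nodes from the output back to the inputs, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        # Post-order DFS: every input is appended before the node that uses it.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


# y = 2 * sum(x_i^2), evaluated at the input [0, 1, 2, 3]
xs = [Var(float(i)) for i in range(4)]
y = 2 * sum((xi * xi for xi in xs), Var(0.0))
backward(y)
grads = [xi.grad for xi in xs]
print(y.value, grads)
```

For the input [0, 1, 2, 3] this sketch yields y = 28 and the gradient [0, 4, 8, 12], matching the analytic result 4x for this function.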
A worked example
We’ll differentiate
y = 2\,\mathbf{x}^\top \mathbf{x}
with respect to the column vector \mathbf{x}. The analytic gradient is \nabla_\mathbf{x} y = 4\mathbf{x}, a useful sanity-check target.
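The analytic gradient can be checked by hand by writing the product as a sum over components. This is a short derivation and does not depend on any framework:
y = 2\,\mathbf{x}^\top \mathbf{x} = 2 \sum_{i} x_i^2, \qquad \frac{\partial y}{\partial x_j} = 4 x_j \quad\Longrightarrow\quad \nabla_\mathbf{x} y = 4\mathbf{x}
For \mathbf{x} = (0, 1, 2, 3)^\top this gives y = 2(0 + 1 + 4 + 9) = 28 and \nabla_\mathbf{x} y = (0, 4, 8, 12)^\top.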
import tensorflow as tf
x = tf.range(4, dtype=tf.float32)
x
<tf.Tensor: shape=(4,), dtype=float32, numpy=array([0., 1., 2., 3.], dtype=float32)>
Tracking gradients
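We tell the framework to track operations on x. In TensorFlow this means wrapping it in a variable, x = tf.Variable(x), which a tf.GradientTape watches automatically; the gradient is returned by the tape rather than stored on x.
Then run the forward pass. y is built from x, so the tape records the dependency: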
# Record all computations onto a tape
with tf.GradientTape() as t:
    y = 2 * tf.tensordot(x, x, axes=1)
y
<tf.Tensor: shape=(), dtype=float32, numpy=28.0>
Backward pass
A single call walks the recorded graph backwards:
x_grad = t.gradient(y, x)
x_grad
<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 0., 4., 8., 12.], dtype=float32)>
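t.gradient returns the gradient as a new tensor; TensorFlow does not store it in an attribute such as x.grad. Compare with the analytic answer, 4\mathbf{x}, by evaluating x_grad == 4 * x: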
<tf.Tensor: shape=(4,), dtype=bool, numpy=array([ True, True, True, True])>
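A second, framework-independent way to sanity-check a gradient is a central finite difference. The sketch below uses plain Python lists; y_fn and numerical_gradient are hypothetical helper names, and the step size h is an arbitrary choice for this illustration.

```python
def y_fn(x):
    """y = 2 * x^T x for a plain Python list x."""
    return 2.0 * sum(xi * xi for xi in x)


def numerical_gradient(fn, x, h=1e-4):
    """Central-difference estimate of the gradient of fn at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((fn(up) - fn(down)) / (2 * h))
    return grad


x = [0.0, 1.0, 2.0, 3.0]
numeric = numerical_gradient(y_fn, x)
analytic = [4.0 * xi for xi in x]

# Largest disagreement between the estimate and the analytic gradient 4x
max_error = max(abs(n - a) for n, a in zip(numeric, analytic))
print(max_error)
```

Because y is quadratic, the central difference is exact up to floating-point rounding, so max_error should be tiny. For general functions the estimate carries a truncation error that shrinks with h.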
Resetting & re-using
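Some frameworks (PyTorch, for example) accumulate gradients by default and need an explicit reset such as .zero_() before a fresh computation. With tf.GradientTape, each tape records its own forward pass and t.gradient returns a freshly computed gradient, so no reset is needed: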
with tf.GradientTape() as t:
    y = tf.reduce_sum(x)
t.gradient(y, x)  # Overwritten by the newly calculated gradient
<tf.Tensor: shape=(4,), dtype=float32, numpy=array([1., 1., 1., 1.], dtype=float32)>
For non-scalar y, t.gradient sums the gradients of all output elements, which equals the gradient of tf.reduce_sum(y). To weight the elements differently, pass the weights as output_gradients:
with tf.GradientTape() as t:
    y = x * x
t.gradient(y, x)  # Same as y = tf.reduce_sum(x * x)
<tf.Tensor: shape=(4,), dtype=float32, numpy=array([0., 2., 4., 6.], dtype=float32)>
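Summing the per-element gradients is a vector-Jacobian product with an all-ones vector, and supplying weights replaces the ones with other values. The pure-Python sketch below illustrates this for y = x * x using a finite-difference Jacobian. The helpers jacobian and vjp and the example weights are hypothetical; a real engine does not materialize the Jacobian like this.

```python
def elementwise_square(x):
    return [xi * xi for xi in x]


def jacobian(fn, x, h=1e-4):
    """Central-difference Jacobian: J[i][j] = d fn(x)[i] / d x[j]."""
    n_out = len(fn(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        up = list(x)
        up[j] += h
        down = list(x)
        down[j] -= h
        f_up, f_down = fn(up), fn(down)
        for i in range(n_out):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return J


def vjp(J, v):
    """Vector-Jacobian product v^T J."""
    return [sum(v[i] * J[i][j] for i in range(len(v)))
            for j in range(len(J[0]))]


x = [0.0, 1.0, 2.0, 3.0]
J = jacobian(elementwise_square, x)

# All-ones weights: the same as differentiating sum(y), i.e. 2x
summed = vjp(J, [1.0, 1.0, 1.0, 1.0])

# Hypothetical non-uniform weights on the four outputs
weighted = vjp(J, [1.0, 0.5, 0.0, 2.0])
print(summed, weighted)
```

Here the Jacobian is diagonal with entries 2x, so the all-ones product is approximately [0, 2, 4, 6] and the weighted product scales each entry by its weight.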
Detaching from the graph
Sometimes we want a value treated as a constant in the backward pass — e.g., the auxiliary u below should not propagate gradients into x:
# Set persistent=True to preserve the compute graph.
# This lets us run t.gradient more than once
with tf.GradientTape(persistent=True) as t:
    y = x * x
    u = tf.stop_gradient(y)
    z = u * x
x_grad = t.gradient(z, x)
x_grad == u
<tf.Tensor: shape=(4,), dtype=bool, numpy=array([ True, True, True, True])>
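tf.stop_gradient (detach() in PyTorch, lax.stop_gradient in JAX) blocks only the path through u. The computation of y itself was still recorded, so we can differentiate y with respect to x, reusing the persistent tape: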
t.gradient(y, x) == 2 * x
<tf.Tensor: shape=(4,), dtype=bool, numpy=array([ True, True, True, True])>
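The effect of treating u as a constant can also be seen numerically. In this pure-Python sketch, dz_dx is a hypothetical helper that estimates the derivative of z = u * x by central differences, either freezing u at its current value (the detached case) or letting u = x * x vary with x.

```python
def dz_dx(x, stop_gradient, h=1e-4):
    """Central-difference derivative of z = u * x at the point x."""
    u_frozen = x * x  # value of u at the evaluation point

    def z(t):
        # Detached: u stays fixed while t moves. Attached: u follows t.
        u = u_frozen if stop_gradient else t * t
        return u * t

    return (z(x + h) - z(x - h)) / (2 * h)


xs = [0.0, 1.0, 2.0, 3.0]
detached = [dz_dx(x, stop_gradient=True) for x in xs]   # approximately x^2
attached = [dz_dx(x, stop_gradient=False) for x in xs]  # approximately 3x^2
print(detached, attached)
```

With u frozen the derivative is u itself, that is x^2. Without the stop, z = x^3 and the derivative is 3x^2.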
Gradients through control flow
Autograd doesn’t care about Python ifs and whiles — it records whichever ops actually executed. Here’s a function whose behavior depends on its input:
def f(a):
    b = a * 2
    while tf.norm(b) < 1000:
        b = b * 2
    if tf.reduce_sum(b) > 0:
        c = b
    else:
        c = 100 * b
    return c
The number of while iterations and the branch taken both depend on the value of a.
…it just works
Run the function on a random scalar and ask for the gradient:
a = tf.Variable(tf.random.normal(shape=()))
with tf.GradientTape() as t:
    d = f(a)
d_grad = t.gradient(d, a)
d_grad
<tf.Tensor: shape=(), dtype=float32, numpy=409600.0>
The gradient is correct even though the path through the function is data-dependent. Along whichever branch ran, f(a) is a constant multiple of a, so f'(a) = f(a) / a. Evaluating d_grad == d / a confirms it:
<tf.Tensor: shape=(), dtype=bool, numpy=True>
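The piecewise-linear structure can be checked without any framework. The sketch below is a plain-float analogue of f (abs stands in for the norm of a scalar) and compares a central-difference slope with f(a) / a. The helper slope and the sample inputs are assumptions chosen for illustration. Inputs are kept away from 0, where the loop would never terminate, and away from the points where the iteration count changes.

```python
import math


def f(a):
    """Plain-float analogue of the control-flow example."""
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    return b if b > 0 else 100 * b


def slope(a, h=1e-6):
    """Central-difference estimate of f'(a)."""
    return (f(a + h) - f(a - h)) / (2 * h)


# One positive and one negative input, so both branches are exercised
samples = [1.5, -0.3]
results = [(a, slope(a), f(a) / a) for a in samples]
for a, numeric, ratio in results:
    print(a, numeric, ratio)
```

For a = 1.5 the loop ends at b = 1024a, so both quantities are about 1024. For a = -0.3 it ends at b = 4096a and the else branch multiplies by 100, giving about 409600.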
Recap
Mark inputs as needing gradients.
Run the forward pass — the engine records ops.
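t.gradient (backward() or grad() in other frameworks) walks the graph in reverse via the chain rule.
Each t.gradient call returns a fresh gradient; frameworks that accumulate gradients need a reset between iterations.
Use stop_gradient (detach in PyTorch) to treat a value as a constant in the backward pass.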
Works through arbitrary Python control flow.