%matplotlib inline
from d2l import jax as d2l
import jax
from jax import numpy as jnp
import numpy as np
Plain gradient descent is not what trains deep nets (SGD and its descendants do), but every issue those methods hit shows up here first, in cleaner form: learning-rate sensitivity, divergence, local minima, poor conditioning, and second-order corrections.
The rule:
x \leftarrow x - \eta \nabla f(x).
A first-order Taylor expansion shows that for small enough \eta, this decreases f locally. The art is picking \eta.
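The Taylor argument can be written out explicitly. This is a sketch for the one-dimensional case; the perturbation symbol \epsilon is introduced here only for the derivation. Expand f around x, then substitute the step \epsilon = -\eta f'(x):

```latex
\begin{aligned}
f(x + \epsilon) &= f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2), \\
f(x - \eta f'(x)) &= f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x)).
\end{aligned}
```

Whenever f'(x) \neq 0, the term \eta f'^2(x) is strictly positive, so for sufficiently small \eta the first-order decrease outweighs the remainder and the update does not increase f.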
Setup and define f, f':
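A minimal sketch of this setup in plain Python, assuming the objective is f(x) = x^2. That choice is an assumption, although it is consistent with the values reported for the three learning rates in the runs that follow. The helper name `gd` and its default arguments are illustrative.

```python
def f(x):
    """Objective function (assumed here: f(x) = x^2)."""
    return x ** 2


def f_grad(x):
    """Derivative of the assumed objective: f'(x) = 2x."""
    return 2 * x


def gd(eta, f_grad, x0=10.0, steps=10):
    """Run `steps` gradient descent updates from x0 and return all iterates."""
    x = x0
    results = [x]
    for _ in range(steps):
        x -= eta * f_grad(x)  # the update rule x <- x - eta * f'(x)
        results.append(float(x))
    print(f'epoch {steps}, x: {x:f}')
    return results
```

With this objective each update multiplies x by (1 - 2\eta), so after 10 steps from x = 10 the iterate is 10 (1 - 2\eta)^{10}. Calling `gd(0.2, f_grad)`, `gd(0.05, f_grad)` and `gd(1.1, f_grad)` covers the three regimes discussed next.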
Start at x = 10, \eta = 0.2, 10 steps. Converges to 0:
epoch 10, x: 0.060466
\eta = 0.05: progress is slow, and after 10 steps x is still far from the minimum:
epoch 10, x: 3.486784
\eta = 1.1: the \mathcal{O}(\eta^2 f'^2) Taylor remainder dominates and the iterates diverge:
epoch 10, x: 61.917364
f(x) = x \cos(cx) has infinitely many local minima. Which one GD reaches depends on the starting point and the learning rate; a large step can jump out of one basin and land in another:
epoch 10, x: -1.528165
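A run of this kind can be sketched in plain Python. The constant c = 0.15\pi is the one used for this objective later in the section. The learning rate \eta = 2 and the starting point x = 10 are assumptions, chosen because they land close to the value reported above.

```python
import math

c = 0.15 * math.pi  # same constant used for x * cos(c x) later in the section


def f(x):
    """Objective with infinitely many local minima."""
    return x * math.cos(c * x)


def f_grad(x):
    """Derivative of x * cos(c x)."""
    return math.cos(c * x) - c * x * math.sin(c * x)


def gd(eta, f_grad, x0=10.0, steps=10):
    """Plain gradient descent; returns the final iterate."""
    x = x0
    for _ in range(steps):
        x -= eta * f_grad(x)
    return x


x_final = gd(2.0, f_grad)  # eta = 2.0 is an assumed, deliberately large step
print(f'epoch 10, x: {x_final:f}')
```

The first step alone moves the iterate from x = 10 to below x = 1, and the run then stays in the basin on the negative side of the origin instead of descending into a minimum near the starting point.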
Same rule on vectors:
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}).
Demo on f(x_1, x_2) = x_1^2 + 2 x_2^2, which is anisotropic: the x_2 direction is steeper.
def train_2d(trainer, steps=20, f_grad=None):
    """Optimize a 2D objective function with a customized trainer."""
    # `s1` and `s2` are internal state variables that will be used in
    # Momentum, Adagrad, RMSProp
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        if f_grad:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)
        else:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')
    return results

def show_trace_2d(f, results):
    """Show the trace of 2D variables during optimization."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = d2l.meshgrid(d2l.arange(-5.5, 1.0, 0.1),
                          d2l.arange(-3.0, 1.0, 0.1))
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')

def f_2d(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)

def gd_2d(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)

eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))
epoch 20, x1: -0.057646, x2: -0.000073
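Because the gradient of this objective is separable, each update simply scales every coordinate by a constant factor: 1 - 2\eta for x_1 and 1 - 4\eta for x_2. The sketch below checks that closed form against the run above; the helper name `closed_form` is hypothetical, and the starting point (-5, -2) is the one used in `train_2d`.

```python
def closed_form(eta, steps=20, x1_0=-5.0, x2_0=-2.0):
    """Iterates of GD on f(x1, x2) = x1^2 + 2 x2^2 after `steps` updates."""
    x1 = x1_0 * (1 - 2 * eta) ** steps  # df/dx1 = 2 x1, so the factor is 1 - 2 eta
    x2 = x2_0 * (1 - 4 * eta) ** steps  # df/dx2 = 4 x2, so the factor is 1 - 4 eta
    return x1, x2


x1, x2 = closed_form(0.1)
print(f'epoch 20, x1: {x1:f}, x2: {x2:f}')
# prints: epoch 20, x1: -0.057646, x2: -0.000073
```

With \eta = 0.1 the factors are 0.8 and 0.6, so the steeper x_2 coordinate shrinks faster. Stability requires |1 - 4\eta| < 1, that is \eta < 0.5, so the steepest direction caps the usable learning rate while the flattest direction sets the pace of convergence.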
Newton's method uses the Hessian to choose the step automatically. Minimizing the second-order Taylor expansion of f around \mathbf{x} gives the update:
\mathbf{x} \leftarrow \mathbf{x} - [\nabla^2 f(\mathbf{x})]^{-1} \nabla f(\mathbf{x}).
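The update follows from a short derivation, sketched here with a perturbation \boldsymbol{\epsilon} (a symbol introduced only for this step). Expand f to second order, then set the gradient of the quadratic model with respect to \boldsymbol{\epsilon} to zero:

```latex
\begin{aligned}
f(\mathbf{x} + \boldsymbol{\epsilon}) &= f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3), \\
\nabla f(\mathbf{x}) + \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} = 0 \quad &\Rightarrow \quad \boldsymbol{\epsilon} = -[\nabla^2 f(\mathbf{x})]^{-1} \nabla f(\mathbf{x}).
\end{aligned}
```

This stationary point is a minimum of the quadratic model only when \nabla^2 f(\mathbf{x}) is positive definite. For an exactly quadratic f the remainder vanishes and a single step reaches the minimum; otherwise the step is repeated.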
For f(x) = \cosh(cx), which is convex but not quadratic, Newton's method reaches the global minimum at x = 0 within the ten iterations run below:
c = d2l.tensor(0.5)

def f(x):  # Objective function
    return d2l.cosh(c * x)

def f_grad(x):  # Gradient of the objective function
    return c * d2l.sinh(c * x)

def f_hess(x):  # Hessian of the objective function
    return c**2 * d2l.cosh(c * x)

def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print(f'epoch 10, x: {float(x):f}')
    return results

show_trace(newton(), f)
epoch 10, x: 0.000000
f(x) = x \cos(cx): without a positive-definite Hessian (i.e. local convexity), Newton's method can fail. The update only seeks a stationary point of the second-order model, so it happily converges to a maximum if that is where the model points:
c = d2l.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * d2l.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return d2l.cos(c * x) - c * x * d2l.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return - 2 * c * d2l.sin(c * x) - x * c**2 * d2l.cos(c * x)

show_trace(newton(), f)
epoch 10, x: 26.834133
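Whether the endpoint is a maximum can be checked from the sign of the second derivative there. The sketch below repeats the same Newton iteration in plain Python, using `math` instead of `d2l` so that it is self-contained, and then classifies the final iterate; the variable names `x_final` and `kind` are illustrative.

```python
import math

c = 0.15 * math.pi


def f_grad(x):
    """Derivative of x * cos(c x)."""
    return math.cos(c * x) - c * x * math.sin(c * x)


def f_hess(x):
    """Second derivative of x * cos(c x)."""
    return -2 * c * math.sin(c * x) - x * c ** 2 * math.cos(c * x)


def newton(x0=10.0, steps=10, eta=1.0):
    """Newton iteration x <- x - eta * f'(x) / f''(x); returns the final iterate."""
    x = x0
    for _ in range(steps):
        x -= eta * f_grad(x) / f_hess(x)
    return x


x_final = newton()
# Negative curvature at a stationary point means a local maximum.
kind = 'local maximum' if f_hess(x_final) < 0 else 'local minimum'
print(f"x: {x_final:f}, f'(x): {f_grad(x_final):.1e}, f''(x): {f_hess(x_final):.3f}, {kind}")
```

The derivative is essentially zero at the endpoint while the second derivative is negative, so the iteration has converged to a stationary point that is a local maximum of f.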