Q-learning training loop

Q-Learning

Value iteration needs the full MDP. Real RL has only samples from the environment. Q-learning (Watkins, 1989) replaces the Bellman backup with a sampled, off-policy update:

Q(s, a) \leftarrow Q(s, a) + \alpha\Big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\Big].

Three ingredients make this work:

  • Exploration (ε-greedy): pick a random action with probability ε so that every (s, a) pair keeps being sampled.
  • Self-correction: overestimates of bad actions are driven down as the agent keeps visiting them.
  • Off-policy: the bootstrap uses \max_{a'} regardless of which action the agent actually takes next, so the learned Q approximates the optimal action-value function Q^*, not the value of the behavior policy.
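A minimal sketch of a single update from one sampled transition, to make the rule above concrete. The helper name `q_update` and the 2-state, 2-action table are hypothetical and chosen only for illustration. Note that the action the agent takes next does not appear anywhere in the function, which is what makes the update off-policy.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap with the greedy value at s', whatever the agent does next.
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical table: rows are states, columns are actions.
Q = np.array([[0.0, 0.0],
              [1.0, 2.0]])

# One transition (s=0, a=1, r=0.5, s'=1): only Q[0, 1] changes.
delta = q_update(Q, s=0, a=1, r=0.5, s_next=1, alpha=0.5, gamma=0.9)
print(delta, Q[0, 1])
```

Here the target is 0.5 + 0.9 · 2.0, so the entry moves halfway from 0 toward roughly 2.3; every other entry of the table is left untouched.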

TD target and TD error

The update is easiest to read as a prediction-correction step:

\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\textrm{TD target}} \quad\text{vs.}\quad \underbrace{Q(s,a)}_{\textrm{current estimate}}.

Their difference is the temporal-difference error:

\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a).

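Q-learning then moves the estimate a fraction \alpha of the way toward the target, which amounts to an exponential moving average of TD targets:

Q(s,a) \leftarrow Q(s,a) + \alpha \delta = (1-\alpha)\, Q(s,a) + \alpha \Big[r + \gamma \max_{a'} Q(s',a')\Big].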

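Large \alpha learns quickly but can be noisy; small \alpha is stable but slow. The standard tabular convergence guarantee assumes a decaying step size (\sum_t \alpha_t = \infty, \sum_t \alpha_t^2 < \infty) and that every (s, a) pair is visited infinitely often.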
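A small sketch of the step-size tradeoff, under a simplifying assumption that is not part of the deck: the TD targets for one table entry are treated as i.i.d. noisy samples around a fixed value of 1.0. In real Q-learning the targets drift as Q changes, so this only illustrates the averaging behavior. The helper `run` and the noise scale are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed for illustration: noisy TD targets for a single (s, a) entry.
targets = 1.0 + rng.normal(scale=0.5, size=2000)

def run(alpha_fn):
    """Feed the targets through the update with step size alpha_fn(n)."""
    q = 0.0
    for n, y in enumerate(targets, start=1):
        alpha = alpha_fn(n)
        q = (1 - alpha) * q + alpha * y  # identical to q + alpha * (y - q)
    return q

q_const = run(lambda n: 0.9)      # constant step: tracks the latest targets
q_decay = run(lambda n: 1.0 / n)  # 1/n step: running mean of all targets

print(q_const, q_decay, targets.mean())
```

With the 1/n schedule the estimate equals the sample mean of all targets seen so far, so the noise averages out. With a constant \alpha = 0.9 the estimate is dominated by the most recent target and keeps fluctuating.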

Exploration is not optional

If the agent always exploits the current table, early random errors can lock in a bad policy. ε-greedy avoids that failure:

\pi_\epsilon(a \mid s) = \begin{cases} \textrm{uniformly random action}, & \textrm{with probability } \epsilon \\ \arg\max_{a'} Q(s,a'), & \textrm{with probability } 1-\epsilon. \end{cases}

The practical tradeoff:

  • high ε: broader coverage, slower apparent improvement;
  • low ε: faster exploitation, higher risk of missing good actions;
  • annealed ε: explore early, exploit once estimates are useful.
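The ε-greedy rule above can be written out as explicit action probabilities. This is a sketch that assumes the random branch samples uniformly over all actions, including the greedy one, and that ties in Q are broken toward the lowest action index (the behavior of `np.argmax`). The function name `epsilon_greedy_probs` and the Q-values are hypothetical.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of epsilon-greedy for one state's Q-values."""
    num_actions = len(q_row)
    # Exploration mass epsilon is spread uniformly over all actions.
    probs = np.full(num_actions, epsilon / num_actions)
    # The remaining mass 1 - epsilon goes to the greedy action.
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

probs = epsilon_greedy_probs(np.array([0.1, 0.7, 0.3, 0.2]), epsilon=0.2)
print(probs)
```

Under these assumptions the greedy action is chosen with probability 1 - ε + ε/|A| rather than 1 - ε, and each other action with probability ε/|A|, which is why a low ε leaves non-greedy actions nearly unvisited.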

Frozen Lake setup

Same gridworld as the value iteration deck, which makes it easy to check that Q-learning approaches the answer model-based DP found, but without knowing P in advance:

%matplotlib inline
import numpy as np
import random
from d2l import torch as d2l

seed = 0  # Random number generator seed
gamma = 0.95  # Discount factor
num_iters = 256  # Number of training episodes
alpha = 0.9  # Learning rate
# Anneal epsilon linearly from `epsilon_start` to `epsilon_end` across the
# `num_iters` episodes: explore broadly early, exploit once Q is informative.
epsilon_start = 0.9
epsilon_end = 0.05
random.seed(seed)  # Set the random seed
np.random.seed(seed)

# Now set up the environment
env_info = d2l.make_env('FrozenLake-v1', seed=seed)

Sample episodes; each transition gives one supervised-looking target for one table entry. The update touches only Q(s_t,a_t), but bootstrapping lets information propagate backward through later visits:

def e_greedy(env, Q, s, epsilon):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q
    if random.random() < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q[s,:])


def q_learning(env_info, gamma, num_iters, alpha, epsilon_start, epsilon_end):
    env_desc = env_info['desc']  # 2D array specifying what each grid item means
    env = env_info['env']  # Environment object exposing reset() and step()
    num_states = env_info['num_states']
    num_actions = env_info['num_actions']

    Q  = np.zeros((num_states, num_actions))
    V  = np.zeros((num_iters + 1, num_states))
    pi = np.zeros((num_iters + 1, num_states))

    for k in range(1, num_iters + 1):
        # Linearly anneal epsilon over episodes
        epsilon = epsilon_start + (epsilon_end - epsilon_start) * (
            (k - 1) / max(1, num_iters - 1))
        # Reset environment
        state, _ = env.reset()
        done = False
        while not done:
            # Select an action with epsilon-greedy, then act in the environment
            action = e_greedy(env, Q, state, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-update: drop the bootstrap term only at true terminal states.
            # A time-limit truncation is not terminal, so it still bootstraps.
            if terminated:
                y = reward
            else:
                y = reward + gamma * np.max(Q[next_state,:])
            Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])

            # Move to the next state
            state = next_state
        # Record the greedy value and greedy action per state (visualization only)
        for s in range(num_states):
            V[k,s]  = np.max(Q[s,:])
            pi[k,s] = np.argmax(Q[s,:])
    d2l.show_Q_function_progress(env_desc, V[:-1], pi[:-1])

q_learning(env_info=env_info, gamma=gamma, num_iters=num_iters, alpha=alpha,
           epsilon_start=epsilon_start, epsilon_end=epsilon_end)
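For readers without the `d2l` helpers, here is a standalone sketch of the same loop on a tiny deterministic chain, where the optimal Q function is known in closed form and can be compared directly. Everything in it is an assumption made for illustration: the chain environment, the function names, and the hyperparameters are hypothetical, and the behavior policy is swapped for a uniformly random one (ε = 1 throughout) instead of annealed ε-greedy, to show that the learned Q still matches the optimal values rather than those of the random policy.

```python
import numpy as np

def chain_step(state, action, num_states):
    """Hypothetical deterministic chain: action 0 moves left, 1 moves right."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    terminated = next_state == num_states - 1  # rightmost state is the goal
    reward = 1.0 if terminated else 0.0
    return next_state, reward, terminated

def q_learning_chain(num_states=5, gamma=0.9, alpha=0.5,
                     num_episodes=500, max_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, 2))
    for _ in range(num_episodes):
        state = 0
        for _ in range(max_steps):  # the step cap plays the role of truncation
            action = int(rng.integers(2))  # uniformly random behavior policy
            next_state, reward, terminated = chain_step(
                state, action, num_states)
            # Bootstrap unless the goal (a true terminal state) was reached.
            if terminated:
                target = reward
            else:
                target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
            if terminated:
                break
    return Q

def optimal_q_chain(num_states=5, gamma=0.9):
    """Closed-form Q*: the goal reward discounted by the steps still needed."""
    Q_star = np.zeros((num_states, 2))
    for s in range(num_states - 1):
        Q_star[s, 1] = gamma ** (num_states - 2 - s)  # step right now
        s_left = max(s - 1, 0)                        # state after stepping left
        Q_star[s, 0] = gamma ** (num_states - 1 - s_left)
    return Q_star

Q = q_learning_chain()
Q_star = optimal_q_chain()
print(np.round(Q, 3))
print(np.round(Q_star, 3))
```

Because the transitions are deterministic, a constant \alpha is enough here. The two printed tables should agree closely, and the greedy action in every non-goal state is "right", even though the agent never acted greedily while learning.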

Recap

  • Q-learning = sampled, model-free version of value iteration.
  • Uses ε-greedy exploration + off-policy bootstrap.
  • Converges to the optimal Q function, given a suitably decaying learning rate and enough exploration to keep visiting every (s, a), even without knowing the dynamics.
  • DQN (deep Q-learning, Mnih et al. 2015) replaces the table with a neural network — same update, learnable representation, plus a few tricks (replay buffer, target network) for stability.