Value Iteration

For a known MDP — known transition probabilities P and rewards r — we can compute the optimal policy without any learning, via dynamic programming.

Two key concepts:

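Value function V^\pi(s): expected discounted return from state s under policy \pi, V^\pi(s) = \mathbb{E}\big[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\big].
Action-value function Q^\pi(s, a): expected discounted return from state s when the first action is a and \pi is followed afterwards, Q^\pi(s, a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s').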
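
To make the discounted return concrete, here is a minimal sketch that evaluates \sum_t \gamma^t r_t for one sampled reward sequence; V^\pi(s) is the expectation of this quantity over trajectories that start in s. The function name and the three-step episode are illustrative assumptions, not part of the original code.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    # Accumulate backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical episode: no reward for two steps, then reward 1 at t = 2.
rewards = [0.0, 0.0, 1.0]
G = discounted_return(rewards, gamma=0.9)
print(G)  # approximately 0.9**2 = 0.81
```

With sparse rewards like this, the later the reward arrives, the more it is shrunk by \gamma.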

The Bellman optimality equation:

V^*(s) = \max_a \big[r(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s')\big].

Value iteration turns this fixed-point equation into an algorithm: repeatedly apply the right-hand side as an update until convergence.
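
As a rough sketch of "apply the right-hand side until convergence", the snippet below runs the update on a tiny hand-made MDP stored as NumPy arrays. The arrays P and R, the 2-state model, and the stopping tolerance are all assumptions chosen for illustration; they are not the Frozen Lake model used later.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustration only).
# P[s, a, s2] = probability of moving from s to s2 under action a.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0
    [[0.0, 1.0], [0.0, 1.0]],   # state 1 is absorbing
])
# R[s, a] = expected immediate reward r(s, a).
R = np.array([
    [0.0, 1.0],
    [0.0, 0.0],
])
gamma = 0.9

def bellman_backup(V):
    """Right-hand side of the Bellman optimality equation for every state."""
    Q = R + gamma * (P @ V)      # Q[s, a] = r(s,a) + gamma * sum_s' P * V
    return Q.max(axis=1)

V = np.zeros(2)
for sweep in range(1000):
    V_new = bellman_backup(V)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-10:            # stop once a sweep barely changes V
        break

print(V)
```

In this toy model the fixed point can be checked by hand: state 1 is worth 0, and state 0 satisfies V = 1 + 0.09 V, so V = 1/0.91.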

What the Bellman backup does

For each state, value iteration asks one local question:

\text{if I take action } a \text{ now, what is my immediate reward plus discounted future value?}

Then it keeps the best action:

V_{k+1}(s) = \max_a \left\{r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)V_k(s')\right\}.

Interpretation:

\sum_{s'} P(s'\mid s,a)V_k(s') averages over stochastic next states.
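\gamma controls how far the planning horizon effectively reaches: a reward t steps ahead is weighted by \gamma^t.
The max over actions is what separates this backup from policy evaluation: rather than averaging over a fixed policy's actions, it acts greedily with respect to V_k, which is the policy-improvement step.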
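
The three pieces above can be traced through a single backup for one state. This is a sketch with made-up numbers: the two-action model, the rewards, and the current estimates V_k are assumptions for illustration only.

```python
import numpy as np

gamma = 0.9
V_k = np.array([0.0, 1.0])          # current estimates for states 0 and 1

# Hypothetical model for one state s with two actions (illustration only).
reward = np.array([0.0, 0.2])       # r(s, a) for a = 0, 1
P_next = np.array([
    [0.5, 0.5],                     # a = 0: next state 0 or 1, equally likely
    [1.0, 0.0],                     # a = 1: always go to state 0
])

expected_next = P_next @ V_k        # sum_s' P(s'|s,a) V_k(s') per action
q = reward + gamma * expected_next  # one-step lookahead value per action
V_next = q.max()                    # the backup keeps the best action
best_action = int(q.argmax())

print(q, V_next, best_action)
```

Action 0 has no immediate reward but a 50% chance of reaching the valuable state, giving 0.9 * 0.5 = 0.45; action 1 pays 0.2 now and leads nowhere useful. The backup therefore prefers action 0.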

Frozen Lake setup

Frozen Lake is a 4×4 gridworld: start S, frozen cells F, holes H, and goal G. The reward is sparse — only the goal pays — so values propagate backward from the goal over repeated Bellman backups.

%matplotlib inline
import numpy as np
import random
from d2l import torch as d2l

seed = 0  # Random number generator seed
gamma = 0.95  # Discount factor
num_iters = 10  # Number of iterations
random.seed(seed)  # Set the random seed to ensure results can be reproduced
np.random.seed(seed)

# Now set up the environment
env_info = d2l.make_env('FrozenLake-v1', seed=seed)
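
The value iteration loop reads the model through env_info['mdp'], which, going by the comment in that code, maps each (state, action) pair to a list of (probability, next state, reward, done) tuples. The sketch below builds one entry in that shape by hand to show how such a list is consumed. The numbers and the index constants are assumptions for illustration, not values returned by d2l.make_env.

```python
# Hypothetical transition table in the (prob, next_state, reward, done) layout.
PROB, NEXT, REWARD, DONE = 0, 1, 2, 3   # assumed tuple positions

mdp = {
    # One (state, action) pair with three equally likely outcomes.
    (14, 2): [
        (1 / 3, 10, 0.0, False),
        (1 / 3, 14, 0.0, False),
        (1 / 3, 15, 1.0, True),         # only this outcome pays a reward
    ],
}

outcomes = mdp[(14, 2)]
total_prob = sum(o[PROB] for o in outcomes)
expected_reward = sum(o[PROB] * o[REWARD] for o in outcomes)
print(total_prob, expected_reward)
```

Summing probability times reward over the list gives the expected immediate reward r(s,a) that appears in the Bellman equations.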

Value iteration loop

Iterate V \leftarrow \max_a [r + \gamma P V] until convergence (the code below runs a fixed num_iters sweeps rather than testing for convergence) and extract the greedy policy \pi(s) = \arg\max_a [r + \gamma P V]:

def value_iteration(env_info, gamma, num_iters):
    env_desc = env_info['desc']  # 2D array describing what each grid cell is
    prob_idx = env_info['trans_prob_idx']
    nextstate_idx = env_info['nextstate_idx']
    reward_idx = env_info['reward_idx']
    num_states = env_info['num_states']
    num_actions = env_info['num_actions']
    mdp = env_info['mdp']

    V  = np.zeros((num_iters + 1, num_states))
    Q  = np.zeros((num_iters + 1, num_states, num_actions))
    pi = np.zeros((num_iters + 1, num_states))

    for k in range(1, num_iters + 1):
        for s in range(num_states):
            for a in range(num_actions):
                # Accumulate Q_k(s,a) = \sum_{s'} p(s'\mid s,a) [r + \gamma v_{k-1}(s')]
                for pxrds in mdp[(s,a)]:
                    # mdp[(s,a)]: [(p1,next1,r1,d1),(p2,next2,r2,d2),..]
                    pr = pxrds[prob_idx]  # p(s'\mid s,a)
                    nextstate = pxrds[nextstate_idx]  # Next state
                    reward = pxrds[reward_idx]  # Reward
                    Q[k,s,a] += pr * (reward + gamma * V[k - 1, nextstate])
            # Record max value and max action
            V[k,s] = np.max(Q[k,s,:])
            pi[k,s] = np.argmax(Q[k,s,:])
    d2l.show_value_function_progress(env_desc, V[:-1], pi[:-1])

value_iteration(env_info=env_info, gamma=gamma, num_iters=num_iters)

Recap

Value iteration is dynamic programming on the Bellman optimality equation.
Requires known transitions and rewards, so it cannot be applied directly when the dynamics are not given.
Converges geometrically with rate \gamma: \|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty.
A starting point for RL more broadly: learning algorithms such as Q-learning (next deck) replace the known P, r with sampled experience.
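
The convergence rate in the recap rests on a standard property: one Bellman backup brings any two value vectors closer together by a factor of at least \gamma in the max norm. The sketch below checks this numerically on a randomly generated MDP; the model, its sizes, and the two test vectors are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 5, 3, 0.95

# Random hypothetical MDP: each P[s, a, :] is a probability distribution.
P = rng.random((num_states, num_actions, num_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((num_states, num_actions))

def bellman_backup(V):
    """Apply one Bellman optimality backup to every state."""
    return (R + gamma * (P @ V)).max(axis=1)

# Two arbitrary value vectors and their images under the backup.
U = rng.normal(size=num_states)
W = rng.normal(size=num_states)
before = np.max(np.abs(U - W))
after = np.max(np.abs(bellman_backup(U) - bellman_backup(W)))

print(after / before)   # expected to be at most gamma
```

Taking W = V^* (which the backup leaves unchanged) and applying the inequality k times gives the \gamma^k error bound.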