Markov Decision Process (MDP)

Markov Decision Processes

Reinforcement learning starts with an agent-environment loop. At time t:

s_t \xrightarrow{\text{agent}} a_t \xrightarrow{\text{environment}} (r_t, s_{t+1}).
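The loop above can be sketched in a few lines of Python. This is only an illustration: `TwoStateEnv`, its reward rule, and `run_episode` are hypothetical names invented for this sketch, not part of any library.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment: states 0 and 1, actions 0 (stay) and 1 (switch)."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Switching flips the state; staying keeps it.
        next_state = 1 - self.state if action == 1 else self.state
        # Assumed reward rule: 1 for landing in state 1, otherwise 0.
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return reward, next_state


def run_episode(env, policy, num_steps):
    """Roll out s_t -> a_t -> (r_t, s_{t+1}) and record every transition."""
    trajectory = []
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)             # agent chooses a_t from s_t
        r, s_next = env.step(a)   # environment returns (r_t, s_{t+1})
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


rng = random.Random(0)
trajectory = run_episode(TwoStateEnv(), lambda s: rng.choice([0, 1]), num_steps=5)
for transition in trajectory:
    print(transition)
```

Each recorded tuple (s, a, r, s') is exactly the kind of sample a learning agent gets to see when the model is unknown.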

The hard part is delayed consequence: an action can look bad immediately but set up large future reward, or look good now and lead into a dead end.

A Markov decision process packages the assumptions:

\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma).

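Here \mathcal{S} is the state space, \mathcal{A} the action space, P(s'\mid s,a) the transition probability, r the reward function, and \gamma the discount factor.

The Markov assumption says that, given the current state and action, the earlier history adds nothing for predicting the next state (and likewise the reward):

P(s_{t+1}\mid s_t,a_t,s_{t-1},a_{t-1},\dots,s_0,a_0)=P(s_{t+1}\mid s_t,a_t).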
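For a small finite problem, the tuple can be stored as plain tables. The sketch below is one possible layout, not a standard API: the class name `TabularMDP`, the indexing `P[s][a][s_next]`, a reward that depends only on (s, a), and the restriction \gamma < 1 are all assumptions made for illustration. Note that `sample_step` reads only the current state and action, which is the Markov assumption in code form.

```python
import random
from dataclasses import dataclass


@dataclass
class TabularMDP:
    num_states: int
    num_actions: int
    P: list       # P[s][a][s_next] = probability of s_next given (s, a)
    r: list       # r[s][a] = reward for taking a in s (assumed form)
    gamma: float  # discount factor

    def validate(self):
        # Assumed here: gamma in [0, 1) for the infinite-horizon setting.
        assert 0.0 <= self.gamma < 1.0
        for s in range(self.num_states):
            for a in range(self.num_actions):
                row = self.P[s][a]
                assert len(row) == self.num_states
                assert all(p >= 0.0 for p in row)
                assert abs(sum(row) - 1.0) < 1e-9  # each row is a distribution

    def sample_step(self, s, a, rng):
        # The next state is drawn from P(. | s, a); no earlier history is used.
        s_next = rng.choices(range(self.num_states), weights=self.P[s][a])[0]
        return self.r[s][a], s_next


# Hypothetical two-state, two-action example.
mdp = TabularMDP(
    num_states=2,
    num_actions=2,
    P=[[[0.9, 0.1], [0.2, 0.8]],
       [[0.0, 1.0], [0.7, 0.3]]],
    r=[[0.0, -1.0],
       [2.0, 0.0]],
    gamma=0.9,
)
mdp.validate()
reward, next_state = mdp.sample_step(1, 0, random.Random(0))
```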

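A policy \pi(a\mid s) maps each state to a distribution over actions. The agent's objective is to find a policy that maximizes expected discounted return over trajectories \tau=(s_0,a_0,r_0,s_1,\dots) generated by following \pi: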

J(\pi)=\mathbb{E}_{\tau\sim\pi} \left[\sum_{t=0}^{\infty}\gamma^t r_t\right].
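To make the sum inside the expectation concrete, here is a minimal sketch that computes the discounted return of a single, finite reward sequence. The two reward sequences are invented for illustration: one takes a small reward immediately, the other pays a cost first and collects a larger reward later. J(\pi) would be the average of such returns over many trajectories sampled from \pi.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9
greedy_now = [1.0, 0.0, 0.0, 0.0, 0.0]      # looks good immediately
invest_first = [-1.0, 0.0, 0.0, 0.0, 10.0]  # looks bad immediately

return_greedy = discounted_return(greedy_now, gamma)
return_invest = discounted_return(invest_first, gamma)
print(return_greedy, return_invest)
```

With \gamma = 0.9 the second sequence has the higher return (about 5.56 against 1.0), even though its first reward is negative.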

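Discounting does two jobs: it keeps the infinite sum finite when rewards are bounded and \gamma<1, and it sets how far ahead the agent effectively looks.

Small \gamma makes the agent myopic; \gamma near 1 makes distant rewards count almost as much as immediate ones.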
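A rough way to see the effect of \gamma is to compute two numbers. If every reward is at most r_max in magnitude, the geometric series bounds the return by r_max / (1 - \gamma). The weight \gamma^t on a reward t steps away also tells us how soon the future stops mattering. The helper names and the 0.01 threshold below are arbitrary choices for this sketch.

```python
def max_discounted_return(r_max, gamma):
    """Upper bound on |sum_t gamma^t r_t| when |r_t| <= r_max and gamma < 1."""
    return r_max / (1.0 - gamma)


def steps_until_weight_below(gamma, threshold=0.01):
    """Smallest t with gamma^t < threshold (an informal 'horizon')."""
    t = 0
    weight = 1.0
    while weight >= threshold:
        weight *= gamma
        t += 1
    return t


for gamma in (0.5, 0.9, 0.99):
    bound = max_discounted_return(1.0, gamma)
    horizon = steps_until_weight_below(gamma)
    print(f"gamma={gamma}: return bound {bound:.1f}, weight < 0.01 after {horizon} steps")
```

Under these choices the weight drops below 0.01 after 7 steps for \gamma = 0.5, 44 steps for \gamma = 0.9, and 459 steps for \gamma = 0.99.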

Two regimes lead to the next two decks:

Known MDP: P and r are available. Dynamic programming can back up values over all next states exactly.
Unknown MDP: the agent only observes samples (s,a,r,s'). Learning replaces exact expectations with experience.
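The difference between the two regimes shows up in a single backup. With a known model, the expectation over next states is a weighted sum. With samples only, it is replaced by an average over observed next states. The sketch below compares the two for one fixed state-action pair; all numbers (reward, transition probabilities, current values) are made up for illustration.

```python
import random


def exact_backup(reward, next_state_probs, V, gamma):
    """Known model: r + gamma * sum_{s'} P(s'|s,a) V(s')."""
    return reward + gamma * sum(p * v for p, v in zip(next_state_probs, V))


def sampled_backup(reward, next_state_probs, V, gamma, num_samples, rng):
    """Unknown model: average r + gamma * V(s') over sampled next states.

    next_state_probs is used here only to simulate the environment;
    the estimate itself never reads the probabilities.
    """
    states = range(len(V))
    total = 0.0
    for _ in range(num_samples):
        s_next = rng.choices(states, weights=next_state_probs)[0]
        total += reward + gamma * V[s_next]
    return total / num_samples


gamma = 0.9
reward = 0.5                   # r(s, a) for one hypothetical (s, a)
next_state_probs = [0.2, 0.8]  # P(s' | s, a)
V = [1.0, 5.0]                 # current value estimates

exact = exact_backup(reward, next_state_probs, V, gamma)
sampled = sampled_backup(reward, next_state_probs, V, gamma, 20000, random.Random(0))
print(exact, sampled)
```

The exact backup is 0.5 + 0.9 * (0.2 * 1 + 0.8 * 5) = 4.28; the sampled estimate approaches it as the number of samples grows.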

Value iteration is the clean model-known baseline. Q-learning is the sampled, model-free analog.
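As a preview, here is a minimal side-by-side sketch of both methods on a hypothetical two-state MDP. The MDP, the function names, and the hyperparameters are assumptions chosen for illustration. The transitions are deterministic so that a constant step size converges cleanly; with stochastic transitions a decaying step size would normally be needed. In state 0, action 1 costs -1 now but leads to state 1, where action 0 pays 3 per step.

```python
import random

# P[s][a][s_next] and r[s][a] for the hypothetical MDP.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
r = [[1.0, -1.0],
     [3.0, 0.0]]
GAMMA = 0.9


def value_iteration(P, r, gamma, tol=1e-10):
    """Known model: repeat exact Bellman optimality backups until V stops changing."""
    n_s, n_a = len(P), len(P[0])
    V = [0.0] * n_s
    while True:
        Q = [[r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V_new = [max(Q[s]) for s in range(n_s)]
        if max(abs(V_new[s] - V[s]) for s in range(n_s)) < tol:
            return Q
        V = V_new


def q_learning(P, r, gamma, num_steps=20000, alpha=0.5, seed=0):
    """Unknown model: update Q from sampled transitions (s, a, r, s')."""
    rng = random.Random(seed)
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    s = 0
    for _ in range(num_steps):
        a = rng.randrange(n_a)  # uniform random exploration
        # P and r act as the environment simulator here; the learner
        # only sees the sampled reward and next state.
        s_next = rng.choices(range(n_s), weights=P[s][a])[0]
        target = r[s][a] + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
    return Q


def greedy_policy(Q):
    return [max(range(len(q)), key=lambda a: q[a]) for q in Q]


Q_vi = value_iteration(P, r, GAMMA)
Q_ql = q_learning(P, r, GAMMA)
print(greedy_policy(Q_vi), greedy_policy(Q_ql))
```

On this toy problem both methods should recover the same greedy policy, [1, 0]: accept the immediate cost in state 0 to reach state 1, then stay there.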