Reinforcement learning starts with an agent-environment loop. At time t:
s_t \xrightarrow{\text{agent chooses}} a_t \xrightarrow{\text{environment returns}} (r_t, s_{t+1}).
The hard part is delayed consequence: an action can look bad immediately but set up large future reward, or look good now and lead into a dead end.
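The loop and the delayed-consequence problem can be sketched in a few lines of Python. The `TwoPathEnv` class, its state and action names, and the reward values are hypothetical, chosen only so that one action looks good immediately and the other pays off later; this is an illustration, not a standard environment API.

```python
class TwoPathEnv:
    """Hypothetical toy environment with a tempting path and a patient path."""

    def reset(self):
        self.state = "start"
        return self.state

    def step(self, action):
        """Apply the action and return (reward, next_state, done)."""
        if self.state == "start":
            if action == "grab":
                # Looks good now, but leads into a dead end.
                self.state = "dead_end"
                return 1.0, self.state, True
            # Looks bad now, but sets up a larger reward on the next step.
            self.state = "hallway"
            return -1.0, self.state, False
        # From the hallway, any action reaches the goal.
        self.state = "goal"
        return 10.0, self.state, True


def run_episode(env, policy):
    """Run the agent-environment loop once and collect the rewards r_0, r_1, ..."""
    s = env.reset()
    rewards = []
    done = False
    while not done:
        a = policy(s)              # agent chooses a_t from s_t
        r, s, done = env.step(a)   # environment returns (r_t, s_{t+1})
        rewards.append(r)
    return rewards


greedy_now = run_episode(TwoPathEnv(), lambda s: "grab")
patient = run_episode(TwoPathEnv(), lambda s: "wait")
```

In this sketch `grab` collects 1.0 and ends the episode, while `wait` pays -1.0 first and 10.0 afterwards, so the action that looks worse at the first step yields the larger total.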
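A Markov decision process packages the assumptions into a tuple of state space, action space, transition kernel P(s'\mid s,a), reward function r, and discount factor \gamma: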
\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma).
The Markov assumption says that the current state, together with the chosen action, contains everything from the history that is relevant for predicting the next state and reward.
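One way to make the tuple concrete is a small container for a finite MDP. The class layout, the dictionary encoding of P and r, and the two-state example below are assumptions made for illustration, not a fixed convention. Note that sampling the next state needs only the current state and action, which is the Markov assumption in code form.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    P: dict       # P[(s, a)] -> {next_state: probability}
    r: dict       # r[(s, a)] -> immediate reward
    gamma: float  # discount factor

    def validate(self):
        """Check the discount range and that each P[(s, a)] sums to 1."""
        assert 0.0 <= self.gamma < 1.0
        for s in self.states:
            for a in self.actions:
                total = sum(self.P[(s, a)].values())
                assert abs(total - 1.0) < 1e-9, (s, a, total)

    def sample_next_state(self, s, a, rng):
        """Sample s' from P(. | s, a); no earlier history is consulted."""
        next_states = list(self.P[(s, a)])
        weights = [self.P[(s, a)][s2] for s2 in next_states]
        return rng.choices(next_states, weights=weights, k=1)[0]


# Hypothetical two-state example.
toy = MDP(
    states=["A", "B"],
    actions=["stay", "move"],
    P={
        ("A", "stay"): {"A": 1.0},
        ("A", "move"): {"A": 0.2, "B": 0.8},
        ("B", "stay"): {"B": 1.0},
        ("B", "move"): {"A": 1.0},
    },
    r={("A", "stay"): 0.0, ("A", "move"): -1.0,
       ("B", "stay"): 2.0, ("B", "move"): 0.0},
    gamma=0.9,
)
toy.validate()
```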
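A policy \pi(a\mid s) maps each state to a probability distribution over actions. It is scored by its expected discounted return over trajectories \tau=(s_0,a_0,r_0,s_1,\dots) generated by following \pi, which the agent tries to maximize: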
J(\pi)=\mathbb{E}_{\tau\sim\pi} \left[\sum_{t=0}^{\infty}\gamma^t r_t\right].
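As a rough illustration of J(\pi), the sketch below computes the discounted return of one finite reward sequence and then averages sampled returns to approximate the expectation. It assumes episodes terminate, so rewards after the last step count as zero. The reward sequences and the coin-flip policy in `sample_rewards` are made up for the example.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    g = 0.0
    # Work backwards so each step is g_t = r_t + gamma * g_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def sample_rewards(rng):
    """Hypothetical stochastic policy: each of two reward sequences with probability 0.5."""
    return [1.0] if rng.random() < 0.5 else [-1.0, 10.0]


def estimate_J(gamma, num_episodes=10_000, seed=0):
    """Monte Carlo estimate of the expected discounted return."""
    rng = random.Random(seed)
    total = sum(
        discounted_return(sample_rewards(rng), gamma)
        for _ in range(num_episodes)
    )
    return total / num_episodes


single = discounted_return([-1.0, 10.0], 0.9)  # -1 + 0.9 * 10
j_hat = estimate_J(0.9)
```

With \gamma = 0.9 the two sequences have returns 1 and 8, so the exact expectation under this toy policy is 4.5, and the sampled average should land close to it.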
Discounting does two jobs:
Small \gamma makes the agent myopic; \gamma near 1 makes distant future rewards matter almost as much as immediate ones.
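To put numbers on this, the sketch below prints the weight \gamma^{t} given to a reward 10 steps ahead, alongside 1/(1-\gamma), the sum of all discount weights for 0 \le \gamma < 1. That sum is often used as a rule-of-thumb "effective horizon", and it also bounds the return by R_max/(1-\gamma) when rewards are bounded by R_max. The function names are illustrative.

```python
def discount_weight(gamma, t):
    """Weight gamma**t applied to a reward received t steps in the future."""
    return gamma ** t


def effective_horizon(gamma):
    """1 / (1 - gamma): the sum of all discount weights, for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


for gamma in (0.5, 0.9, 0.99):
    print(gamma, discount_weight(gamma, 10), effective_horizon(gamma))
```

At \gamma = 0.5 a reward 10 steps away is weighted below 0.001, while at \gamma = 0.99 it still keeps more than 90% of its value.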
Two regimes lead to the next two decks:
Value iteration is the clean model-known baseline. Q-learning is the sampled, model-free analog.
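As a preview, here is a minimal sketch of both methods on a hypothetical two-state chain with deterministic transitions. The `MODEL` table, the hyperparameters, and the epsilon-greedy exploration are assumptions for illustration; the versions in the later decks may differ. Value iteration reads the model directly, while Q-learning uses it only as a simulator to draw samples.

```python
import random

GAMMA = 0.9
TERMINAL = 2
ACTIONS = ("left", "right")
# Hypothetical deterministic chain: MODEL[(s, a)] = (reward, next_state).
MODEL = {
    (0, "left"): (0.0, 0),
    (0, "right"): (0.0, 1),
    (1, "left"): (0.0, 0),
    (1, "right"): (1.0, TERMINAL),
}


def value_iteration(tol=1e-12):
    """Model-known: repeat the Bellman optimality backup until values stop changing."""
    V = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
    while True:
        delta = 0.0
        for s in (0, 1):
            best = max(
                MODEL[(s, a)][0] + GAMMA * V[MODEL[(s, a)][1]] for a in ACTIONS
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def q_learning(episodes=2000, alpha=0.5, epsilon=0.2, seed=0):
    """Model-free: update Q from sampled transitions (s, a, r, s')."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)                        # explore
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])  # exploit
            r, s2 = MODEL[(s, a)]  # the learner only sees this sample
            if s2 == TERMINAL:
                target = r
            else:
                target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q


V = value_iteration()
Q = q_learning()
```

On this chain the greedy values from Q-learning, max over a of Q[(s, a)], should approach the value-iteration result V[s].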