Lecture 2

Tabular MDP Planning

Formalizing sequential decision-making with Markov Decision Processes and solving them with dynamic programming.


Review: Markov Reward Processes

Before introducing decisions and actions, we consolidate the machinery for Markov Reward Processes (MRPs) from Lecture 1. An MRP models an environment where an agent passively receives rewards as it transitions between states according to fixed dynamics—there are no choices to make. MRPs are the foundation on which the full MDP framework is built.

Return and Value Function

The horizon $H$ is the number of time steps in each episode. It may be finite or infinite. The return $G_t$ is the discounted sum of rewards received from time step $t$ onward:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{H-1} r_{t+H-1}$$

The state value function $V(s)$ for an MRP is the expected return starting from state $s$:

$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}\!\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{H-1} r_{t+H-1} \mid s_t = s\right]$$

Bellman Equation for MRPs

The Markov property gives a recursive decomposition: the value of a state splits into the immediate reward plus the discounted value of successor states.

Bellman Equation (MRP)

For every state $s \in S$:

$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')$$

The first term $R(s)$ captures the immediate reward; the second term captures the expected discounted future reward, weighted by the transition probabilities.

Matrix Form and Analytic Solution

For a finite-state MRP with $N = |S|$ states, the Bellman equation can be written in matrix form. Let $\mathbf{V}$ be the column vector of state values, $\mathbf{R}$ the column vector of rewards, and $\mathbf{P}$ the $N \times N$ transition matrix. Then:

$$\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}$$

Rearranging yields a closed-form solution:

$$(\mathbf{I} - \gamma \mathbf{P})\,\mathbf{V} = \mathbf{R} \quad\Longrightarrow\quad \mathbf{V} = (\mathbf{I} - \gamma \mathbf{P})^{-1}\,\mathbf{R}$$

This direct solve requires inverting an $N \times N$ matrix, costing $O(N^3)$. For large state spaces this is prohibitive, motivating iterative methods.
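As a quick illustration, the analytic solve is a few lines of NumPy. The 3-state MRP below uses made-up numbers (not from the lecture); `np.linalg.solve` is preferred over forming the inverse explicitly:

```python
import numpy as np

# Toy 3-state MRP (illustrative numbers, not from the lecture).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # P[s, s'], each row sums to 1
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Solve (I - gamma P) V = R directly, without computing the inverse.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
```

The absorbing third state illustrates the geometric series: it collects reward $1$ forever, so its value is $1/(1-\gamma) = 10$.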

Iterative Algorithm for MRP Value Computation

Dynamic programming provides a scalable alternative. We initialize all values to zero and repeatedly apply the Bellman equation as an update rule until convergence:

Algorithm: Iterative Policy Evaluation for MRP
  1. Initialize $V_0(s) = 0$ for all $s \in S$
  2. For $k = 1, 2, \ldots$ until convergence:
  3. For all $s \in S$: $$V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V_{k-1}(s')$$

Computational cost: $O(|S|^2)$ per iteration.

Each iteration applies one "Bellman backup" across all states, pulling value information one step further back in time. This procedure converges to the true value function $V$ when $\gamma < 1$ or when the process terminates with probability 1.
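The same toy MRP can be evaluated iteratively. This sketch (illustrative numbers, with an assumed convergence tolerance) applies the Bellman backup to all states until the values stop changing:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

V = np.zeros(len(R))                      # V_0(s) = 0 for all s
for k in range(10_000):
    V_new = R + gamma * P @ V             # one Bellman backup, O(|S|^2)
    if np.max(np.abs(V_new - V)) < 1e-10: # stop when values stabilize
        V = V_new
        break
    V = V_new
```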

Markov Decision Processes

A Markov Decision Process extends the MRP by introducing actions—the agent now chooses what to do in each state, and both the transitions and rewards may depend on that choice. MDPs are the standard mathematical framework for sequential decision-making under uncertainty.

Definition — Markov Decision Process

An MDP is a tuple $(S, A, P, R, \gamma)$ where:

  • $S$ is a finite set of states
  • $A$ is a finite set of actions
  • $P$ is the transition model: $P(s_{t+1} = s' \mid s_t = s,\, a_t = a)$ gives the probability of reaching state $s'$ after taking action $a$ in state $s$
  • $R$ is the reward function: $R(s, a) = \mathbb{E}[r_t \mid s_t = s,\, a_t = a]$
  • $\gamma \in [0, 1]$ is the discount factor

Note: the reward function is sometimes defined as a function of state alone, or of the $(s, a, s')$ tuple. In this course we primarily use $R(s, a)$.

Key Insight

A large discount factor $\gamma$ (close to 1) means the agent cares about long-term rewards almost as much as immediate ones. A small $\gamma$ (close to 0) makes the agent myopic, heavily prioritizing short-term payoffs.

Example: Mars Rover MDP

Example — Mars Rover

Consider a Mars rover that can occupy one of seven discrete locations $s_1, s_2, \ldots, s_7$ arranged in a line. The rover has two deterministic actions: move left ($a_1$) and move right ($a_2$). At the boundaries, the rover stays in place.

The reward structure is: $+1$ in state $s_1$, $+10$ in state $s_7$, and $0$ in all other states (regardless of the action taken). The challenge for the rover is to figure out how to navigate to high-reward locations while balancing the discounted value of reaching the far-right state $s_7$ against the smaller but closer reward at $s_1$.

With 7 states and 2 actions, the number of possible deterministic policies is $|A|^{|S|} = 2^7 = 128$. Even for this tiny problem, exhaustive search over policies becomes tedious—motivating the algorithmic approaches we develop below.
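The deterministic rover described above can be encoded directly as arrays. The `P[a, s, s']` / `R[s, a]` layout is just one convention; note that a later example uses a stochastic variant of the dynamics at $s_6$, which this sketch omits:

```python
import numpy as np

n_states, n_actions = 7, 2        # states s1..s7 -> indices 0..6
LEFT, RIGHT = 0, 1                # a1 = move left, a2 = move right

# Deterministic dynamics: move one step, staying put at the boundaries.
P = np.zeros((n_actions, n_states, n_states))
for s in range(n_states):
    P[LEFT, s, max(s - 1, 0)] = 1.0
    P[RIGHT, s, min(s + 1, n_states - 1)] = 1.0

# Rewards depend only on the state: +1 in s1, +10 in s7, 0 elsewhere.
R = np.zeros((n_states, n_actions))
R[0, :] = 1.0
R[6, :] = 10.0
```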

Policies

A policy is the agent's strategy: it specifies which action to take (or a distribution over actions) in each state. Policies can be deterministic (mapping each state to a single action) or stochastic (mapping each state to a probability distribution over actions).

Definition — Policy

A policy $\pi$ is a conditional distribution over actions given states:

$$\pi(a \mid s) = P(a_t = a \mid s_t = s)$$

For a deterministic policy, $\pi(s)$ returns a single action directly. For a stochastic policy, $\pi(a \mid s)$ gives the probability of choosing action $a$ in state $s$.

MDP + Policy = MRP

A crucial observation is that once we fix a policy $\pi$ in an MDP, the resulting system is just an MRP. The action randomness introduced by $\pi$ can be "folded in" to the dynamics and reward:

$$R^\pi(s) = \sum_{a \in A} \pi(a \mid s)\, R(s, a)$$

$$P^\pi(s' \mid s) = \sum_{a \in A} \pi(a \mid s)\, P(s' \mid s, a)$$

Key Insight

Since an MDP with a fixed policy reduces to an MRP $(S, R^\pi, P^\pi, \gamma)$, we can reuse all of our MRP evaluation techniques (matrix inversion, iterative dynamic programming) to compute the value of any given policy. This reduction is the conceptual linchpin of policy evaluation.
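This reduction is mechanical to implement. A minimal sketch, using an assumed `P[a, s, s']` / `R[s, a]` array layout and a toy two-state MDP:

```python
import numpy as np

def mrp_of_policy(P, R, pi):
    """Fold policy pi[s, a] into MDP (P[a, s, s'], R[s, a]) -> induced MRP."""
    R_pi = (pi * R).sum(axis=1)            # R^pi(s) = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,ast->st', pi, P)  # P^pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    return P_pi, R_pi

# Toy 2-state, 2-action MDP (made-up numbers).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # action 0
              [[0.5, 0.5], [0.5, 0.5]]])  # action 1
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)                 # uniformly random policy

P_pi, R_pi = mrp_of_policy(P, R, pi)
```

The returned pair $(P^\pi, R^\pi)$ can be handed directly to either MRP evaluation method (matrix solve or iterative backups).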

Value Functions for MDPs

State Value Function

The state value function $V^\pi(s)$ measures the expected discounted return when starting in state $s$ and following policy $\pi$ thereafter. Since an MDP under a fixed policy is an MRP, the Bellman equation carries over directly:

Bellman Expectation Equation — State Value
$$V^\pi(s) = \sum_{a \in A} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s') \right]$$

This says: the value of state $s$ under policy $\pi$ is the expected immediate reward plus the discounted expected value of the next state, where the expectation is over both the action chosen by $\pi$ and the stochastic transition.

For a deterministic policy $\pi(s) = a$, the outer sum collapses and the equation simplifies to:

$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, V^\pi(s')$$

Action Value Function (Q-function)

While $V^\pi(s)$ tells us how good it is to be in a state, we often need to know how good it is to take a specific action in that state and then follow the policy. This is the state-action value function or Q-function.

Definition — State-Action Value Function
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s')$$

$Q^\pi(s, a)$ is the expected return from taking action $a$ in state $s$ and then following policy $\pi$ from the next state onward.

The relationship between $V^\pi$ and $Q^\pi$ is immediate: $V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$. The Q-function will prove especially useful for policy improvement, since comparing $Q^\pi(s, a)$ across actions tells us which action is best in each state.
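A sketch of this computation on an assumed toy MDP: we evaluate $V^\pi$ exactly via the induced MRP, recover $Q^\pi$, and check the identity $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    # Q^pi(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V^pi(s')
    return R + gamma * np.einsum('ast,t->sa', P, V)

# Toy 2-state, 2-action MDP with a uniform policy (illustrative numbers).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: swap states
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
gamma = 0.9

# Evaluate V^pi exactly through the induced MRP, then recover Q^pi.
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
Q_pi = q_from_v(P, R, V_pi, gamma)
```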

Policy Evaluation

Policy evaluation is the problem of computing $V^\pi$ for a given policy $\pi$. Since the MDP under $\pi$ is an MRP, we can apply the same iterative dynamic programming approach: start with $V_0(s) = 0$ everywhere and repeatedly apply the Bellman expectation equation as an update rule.

Algorithm: Iterative Policy Evaluation (MDP)
  1. Initialize $V_0(s) = 0$ for all $s \in S$
  2. For $k = 1, 2, \ldots$ until convergence:
  3. For all $s \in S$: $$V_k^\pi(s) = \sum_{a \in A} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_{k-1}^\pi(s') \right]$$

Each update is called a "Bellman backup" for policy $\pi$. For a deterministic policy, the update simplifies to $V_k^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, V_{k-1}^\pi(s')$.

Example — One Iteration on the Mars Rover

Consider the Mars rover with policy $\pi(s) = a_1$ (always move left) for all states. The dynamics at state $s_6$ are: $P(s_6 \mid s_6, a_1) = 0.5$ and $P(s_7 \mid s_6, a_1) = 0.5$. Given $\gamma = 0.5$ and $V_k = [1,\, 0,\, 0,\, 0,\, 0,\, 0,\, 10]$:

$$V_{k+1}(s_6) = R(s_6, a_1) + \gamma \sum_{s'} P(s' \mid s_6, a_1)\, V_k(s')$$ $$= 0 + 0.5 \times (0.5 \times 0 + 0.5 \times 10) = 0 + 0.5 \times 5 = 2.5$$

The value of $s_6$ increases from 0 to 2.5 as information about the high reward at $s_7$ propagates one step backward through the Bellman backup.
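The arithmetic of this backup is easy to check numerically:

```python
import numpy as np

gamma = 0.5
V_k = np.array([1.0, 0, 0, 0, 0, 0, 10])        # current value estimates
p_next = np.array([0, 0, 0, 0, 0, 0.5, 0.5])    # P(. | s6, a1) from the example
reward = 0.0                                     # R(s6, a1)

V_next_s6 = reward + gamma * p_next @ V_k        # Bellman backup at s6
```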

Optimal Policies and Value Functions

The central goal of MDP planning is to find an optimal policy $\pi^*$—one that achieves the highest possible value in every state simultaneously:

$$\pi^* = \arg\max_{\pi} V^\pi(s) \quad \text{for all } s \in S$$

A fundamental result in MDP theory guarantees the following properties for finite state-action spaces with $\gamma < 1$ (or episodic tasks):

  • The optimal value function $V^*$ exists and is unique.
  • There always exists an optimal policy that is deterministic and stationary (though it need not be unique).
  • Any policy that is greedy with respect to $V^*$ is optimal.

Bellman Optimality Equation

The optimal value function satisfies its own recursive characterization. Instead of averaging over actions according to a policy, we maximize:

Bellman Optimality Equation
$$V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \right]$$

Equivalently, for the optimal Q-function:

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s')$$

with $V^*(s) = \max_{a} Q^*(s, a)$. Given $Q^*$, the optimal policy is simply $\pi^*(s) = \arg\max_{a} Q^*(s, a)$.

Unlike the Bellman expectation equation (which is linear in $V^\pi$), the Bellman optimality equation is nonlinear due to the $\max$ operator. This means we cannot solve it with a single matrix inversion. Instead, we turn to iterative algorithms: policy iteration and value iteration.

Policy Iteration

Policy iteration (PI) alternates between two steps: (1) fully evaluate the current policy to obtain its value function, and (2) improve the policy by acting greedily with respect to that value function. This evaluate-then-improve loop is guaranteed to converge to the optimal policy in a finite number of iterations.

Algorithm: Policy Iteration
  1. Initialize $\pi_0(s)$ arbitrarily for all $s \in S$. Set $i = 0$.
  2. Repeat until $\pi_{i+1} = \pi_i$ (policy has converged):
  3. Policy Evaluation: Compute $V^{\pi_i}$ by solving the Bellman expectation equation for policy $\pi_i$ (via matrix inversion or iterative updates).
  4. Policy Improvement: For all $s \in S$ and $a \in A$, compute $$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi_i}(s')$$ then set $$\pi_{i+1}(s) = \arg\max_{a} Q^{\pi_i}(s, a)$$
  5. Set $i = i + 1$.
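Putting the pieces together, here is a sketch of policy iteration with exact evaluation via a linear solve, run on the deterministic rover variant (an assumption: the lecture's example also uses stochastic transitions at $s_6$, omitted here):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P[a, s, s'] transition model, R[s, a] rewards.
    Returns a deterministic optimal policy (action index per state) and V*."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma P^pi) V = R^pi exactly.
        P_pi = P[pi, np.arange(n_states)]          # row s is P(. | s, pi(s))
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^pi.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):             # policy stopped changing
            return pi, V
        pi = pi_new

# Deterministic 7-state rover (sketch): action 0 = left, action 1 = right.
n_states = 7
P = np.zeros((2, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n_states - 1)] = 1.0
R = np.zeros((n_states, 2))
R[0, :], R[6, :] = 1.0, 10.0

pi_star, V_star = policy_iteration(P, R, gamma=0.5)
```

With $\gamma = 0.5$, the optimal policy heads left from $s_1$ and $s_2$ (toward the $+1$ reward) and right everywhere else (toward the $+10$ reward).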

Monotonic Improvement Guarantee

The key theoretical property of policy iteration is that each improvement step produces a policy that is at least as good as the previous one—and strictly better if the previous policy was suboptimal.

Theorem — Monotonic Policy Improvement

Let $\pi_{i+1}$ be the greedy policy obtained from $V^{\pi_i}$. Then for all $s \in S$:

$$V^{\pi_{i+1}}(s) \geq V^{\pi_i}(s)$$

with strict inequality for at least one state if $\pi_i$ is not already optimal.

Proof of Monotonic Improvement

The intuition proceeds in two stages. First, the greedy step guarantees a one-step improvement. Then we extend this to the full infinite-horizon value.

One-step improvement. By construction of the greedy policy:

$$V^{\pi_i}(s) \leq \max_{a} Q^{\pi_i}(s, a) = \max_{a} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi_i}(s') \right]$$

The right-hand side is precisely $Q^{\pi_i}(s, \pi_{i+1}(s))$, the value of taking the new policy's action for one step and then reverting to $\pi_i$. Since this is at least $V^{\pi_i}(s)$, taking $\pi_{i+1}$'s action for one step and then following $\pi_i$ is at least as good as following $\pi_i$ from the start.

Extension to all steps. But $\pi_{i+1}$ does not just take one improved action—it takes the improved action at every subsequent step as well. Applying the one-step argument recursively (or by induction on the horizon), the improvement compounds at each time step, yielding $V^{\pi_{i+1}}(s) \geq V^{\pi_i}(s)$ for all $s$.

Convergence of Policy Iteration

Since there are at most $|A|^{|S|}$ deterministic policies and each iteration strictly improves the policy (unless already optimal), policy iteration must terminate in a finite number of steps. When the policy stops changing—that is, $\pi_{i+1}(s) = \pi_i(s)$ for all $s$—the policy satisfies the Bellman optimality equation and is therefore optimal.

Key Insight

If the policy does not change after an improvement step, it can never change again. This is because $Q^{\pi_{i+1}} = Q^{\pi_i}$ when $\pi_{i+1} = \pi_i$, so the next greedy step produces the same policy: $\pi_{i+2}(s) = \arg\max_{a} Q^{\pi_{i+1}}(s, a) = \arg\max_{a} Q^{\pi_i}(s, a) = \pi_{i+1}(s)$. Convergence is permanent.

Bellman Backup Operators

We can formalize the update rules as operators that map one value function to another. This abstraction clarifies the relationship between policy evaluation, policy iteration, and value iteration.

Policy Bellman Operator

The Bellman backup operator for policy $\pi$ is defined as:

$$B^\pi V(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s' \mid s)\, V(s')$$

Policy evaluation amounts to finding the fixed point of $B^\pi$: the value function $V^\pi$ such that $B^\pi V^\pi = V^\pi$. Iteratively applying $B^\pi$ to any initial value function converges to this fixed point:

$$V^\pi = \lim_{k \to \infty} (B^\pi)^k V_0$$

Optimal Bellman Operator

The Bellman optimality operator (or simply the Bellman operator) replaces the policy average with a maximization:

$$BV(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \right]$$

This operator takes a value function as input and returns an improved value function. Its fixed point is the optimal value function $V^*$: $BV^* = V^*$.

Value Iteration

Value iteration (VI) takes a different approach from policy iteration. Instead of fully evaluating a policy and then improving it, value iteration directly computes the optimal value function by repeatedly applying the Bellman optimality operator. The idea is to maintain the optimal value for acting with $k$ steps remaining, and then extend to $k+1$ steps.

Algorithm: Value Iteration
  1. Initialize $V_0(s) = 0$ for all $s \in S$. Set $k = 0$.
  2. Repeat until $\|V_{k+1} - V_k\|_\infty \leq \epsilon$:
  3. For each $s \in S$: $$V_{k+1}(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_k(s') \right]$$
  4. Set $k = k + 1$.
  5. Extract policy: $$\pi(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_{k}(s') \right]$$

In Bellman operator notation: $V_{k+1} = B V_k$. The algorithm converges when the Bellman residual $\|V_{k+1} - V_k\|_\infty$ falls below threshold $\epsilon$.

An important distinction: value iteration does not maintain an explicit policy during the iterations. It only extracts the optimal policy at the end (or at any intermediate step, if desired) using the greedy operation.
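A sketch of value iteration under the same assumed `P[a, s, s']` / `R[s, a]` array layout, again run on the deterministic rover:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-10):
    """P[a, s, s'] transition model, R[s, a] rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                              # V_0 = 0
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)    # backup for all (s, a)
        V_new = Q.max(axis=1)                           # Bellman optimality operator
        if np.max(np.abs(V_new - V)) <= eps:            # Bellman residual check
            return Q.argmax(axis=1), V_new              # extract greedy policy
        V = V_new

# Deterministic 7-state rover (sketch): action 0 = left, action 1 = right.
n_states = 7
P = np.zeros((2, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n_states - 1)] = 1.0
R = np.zeros((n_states, 2))
R[0, :], R[6, :] = 1.0, 10.0

pi_vi, V_vi = value_iteration(P, R, gamma=0.5)
```

Note that no policy is represented inside the loop; the greedy policy is extracted only once the values have converged.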

Finite-Horizon Value Iteration

For problems with a finite horizon $H$, value iteration has an especially clean interpretation. We run exactly $H$ iterations, and $V_k$ represents the optimal value with $k$ steps remaining. The optimal policy for a finite-horizon problem is generally non-stationary: the best action in a state may depend on how many steps remain. This contrasts with the infinite-horizon case, where the optimal policy is always stationary.

Example — Mars Rover with Finite Horizon

Consider running the Mars rover MDP with horizon $H = 4$ and $\gamma = 1/2$, starting from state $s_4$. We can estimate the value by simulating episodes and averaging returns:

  • Episode $s_4 \to s_5 \to s_6 \to s_7$: return $= 0 + \frac{1}{2}\cdot 0 + \frac{1}{4}\cdot 0 + \frac{1}{8}\cdot 10 = 1.25$
  • Episode $s_4 \to s_4 \to s_5 \to s_4$: return $= 0 + \frac{1}{2}\cdot 0 + \frac{1}{4}\cdot 0 + \frac{1}{8}\cdot 0 = 0$
  • Episode $s_4 \to s_3 \to s_2 \to s_1$: return $= 0 + \frac{1}{2}\cdot 0 + \frac{1}{4}\cdot 0 + \frac{1}{8}\cdot 1 = 0.125$

The average return over many such simulated episodes converges to the true value $V^\pi(s_4)$ by concentration inequalities—no Markov assumption is required for this Monte Carlo estimate.
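The three returns above can be checked with a few lines of plain Python:

```python
gamma = 0.5

def episode_return(rewards, gamma):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Reward sequences along the three sampled trajectories above (H = 4).
g1 = episode_return([0, 0, 0, 10], gamma)   # s4 -> s5 -> s6 -> s7
g2 = episode_return([0, 0, 0, 0], gamma)    # s4 -> s4 -> s5 -> s4
g3 = episode_return([0, 0, 0, 1], gamma)    # s4 -> s3 -> s2 -> s1
```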

Contraction and Convergence

Why do these iterative algorithms converge? The answer lies in the contraction property of the Bellman operators.

Definition — Contraction Operator

An operator $O$ is a contraction (with respect to a norm $\|\cdot\|$) if there exists a constant $0 \leq c < 1$ such that for all value functions $V, V'$:

$$\|OV - OV'\| \leq c \cdot \|V - V'\|$$

A contraction operator brings any two inputs closer together after each application. By the Banach fixed-point theorem, a contraction on a complete metric space has a unique fixed point, and iterating the operator from any starting point converges to it.

Theorem — Bellman Operator is a $\gamma$-Contraction

For $\gamma < 1$, the Bellman optimality operator $B$ is a contraction in the infinity norm with modulus $\gamma$:

$$\|BV - BV'\|_\infty \leq \gamma \|V - V'\|_\infty$$

The same holds for the policy Bellman operator $B^\pi$. This guarantees that both value iteration and (the iterative component of) policy iteration converge to their respective fixed points.

Proof: Bellman Backup is a Contraction

Let $\|V - V'\|_\infty = \max_s |V(s) - V'(s)|$. We want to show that $\|BV - BV'\|_\infty \leq \gamma \|V - V'\|_\infty$.

The key inequality is that for any functions $f, g$ over actions, $\max_a f(a) - \max_a g(a) \leq \max_a \left[f(a) - g(a)\right]$: the action that maximizes $f$ is still available under $g$, so the gap between the two maxima is at most the largest pointwise gap. Applying this at any state $s$ with $f(a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s')$ and $g(a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V'(s')$, the reward terms cancel and:

$$BV(s) - BV'(s) \leq \max_a \gamma \sum_{s' \in S} P(s'|s,a)\left[V(s') - V'(s')\right]$$ $$\leq \gamma \max_a \sum_{s'} P(s'|s,a)\, \|V - V'\|_\infty = \gamma \|V - V'\|_\infty$$

The last step uses the fact that probabilities sum to 1. By symmetry (swapping $V$ and $V'$), we also get $BV'(s) - BV(s) \leq \gamma \|V - V'\|_\infty$. Combining both directions:

$$|BV(s) - BV'(s)| \leq \gamma \|V - V'\|_\infty \quad \forall s$$

Taking the max over $s$ yields $\|BV - BV'\|_\infty \leq \gamma \|V - V'\|_\infty$. $\square$

Regardless of how we initialize $V_0$, value iteration converges to $V^*$. The distance to the fixed point shrinks by a factor of $\gamma$ per iteration, so after $k$ iterations the error is at most $\gamma^k \|V_0 - V^*\|_\infty$. For $\gamma$ close to 1, convergence is slow; for small $\gamma$, it is fast.
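The contraction property is easy to observe numerically: applying $B$ once to two arbitrary value functions shrinks their $\infty$-norm distance by at least a factor of $\gamma$. This sketch uses a random MDP with assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: normalize each row of P into a probability distribution.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman_op(V):
    # BV(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
    return (R + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)

# Two arbitrary value functions: one backup shrinks their distance.
V1, V2 = rng.random(n_states), 10 * rng.random(n_states)
d_before = np.max(np.abs(V1 - V2))
d_after = np.max(np.abs(bellman_op(V1) - bellman_op(V2)))
```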

Comparing Policy Iteration and Value Iteration

Both policy iteration and value iteration converge to the optimal policy, but they take different routes. Their trade-offs matter for choosing the right algorithm in practice.

Policy iteration computes the exact infinite-horizon value of the current policy before improving it. Each outer iteration is expensive (the evaluation step may require many inner iterations or an $O(|S|^3)$ matrix solve), but the number of outer iterations is typically small—bounded by $|A|^{|S|}$ in the worst case, but often far fewer.

Value iteration performs a single Bellman optimality backup per iteration (combining evaluation and improvement into one step). Each iteration is cheaper than a full policy evaluation, but more iterations may be needed for the values to converge.

Key Insight

Policy iteration guarantees monotonic improvement in the policy's true infinite-horizon value at each outer iteration. Value iteration, by contrast, monotonically improves the value function estimates, but the policy extracted at each intermediate step is not guaranteed to improve monotonically. Despite this, both algorithms provably converge to the same optimal solution. Policy iteration is closely related to the policy gradient methods that are widely used in modern RL.

Summary

This lecture established the mathematical framework for sequential decision-making when a model is known. The key concepts and results:

  • Markov reward processes satisfy the Bellman equation, solvable exactly in $O(N^3)$ via matrix inversion or iteratively at $O(|S|^2)$ per backup.
  • MDPs add actions; fixing a policy $\pi$ reduces an MDP to an MRP, so policy evaluation reuses the MRP machinery.
  • The Bellman optimality equation characterizes $V^*$, but the $\max$ operator makes it nonlinear, ruling out a single matrix solve.
  • Policy iteration alternates exact evaluation with greedy improvement, guaranteeing monotonic improvement and convergence in finitely many iterations.
  • Value iteration repeatedly applies the Bellman optimality operator, a $\gamma$-contraction, and so converges geometrically to $V^*$ from any initialization.

All of the algorithms above require a complete model of the environment ($P$ and $R$). Next, we address the far more common scenario: how to evaluate a policy when the model is unknown and the agent must learn from experience.