Recap: Model-Free Policy Evaluation
Last lecture we developed methods for estimating $V^\pi(s)$ without access to the MDP's dynamics or reward model. We covered three approaches.
Monte Carlo (MC) policy evaluation waits until the end of an episode, computes the empirical return $G_t$, and updates:
$$V^\pi(s) \leftarrow (1 - \alpha) V^\pi(s) + \alpha\, G_t$$

Temporal-difference TD(0) updates after every transition $(s, a, r, s')$, bootstrapping off the current estimate:
$$V^\pi(s) \leftarrow V^\pi(s) + \alpha \bigl[ r + \gamma V^\pi(s') - V^\pi(s) \bigr]$$

Certainty equivalence maintains maximum-likelihood estimates of the dynamics and reward from observed tuples, then solves for $V^\pi$ using dynamic programming. This approach exploits the Markov structure directly.
Batch MC and TD Convergence
When we reuse data (batch/offline setting), MC and TD converge to different solutions. Given $K$ episodes, MC converges to the value minimizing mean-squared error against observed returns, while TD(0) converges to the certainty-equivalence solution—the value function that would be correct if the maximum-likelihood MDP model were the true model.
Consider two states $A$ and $B$ with $\gamma = 1$ and 8 episodes of experience: one episode $A, 0, B, 0$; six episodes of $B, 1$; and one episode $B, 0$. Both MC and TD agree that $V(B) = 0.75$. However, they disagree on $V(A)$. MC sets $V(A) = 0$ (the only observed return from $A$ was 0), whereas TD sets $V(A) = 0.75$ because the MLE model says $A$ always transitions to $B$, and $V(B) = 0.75$. TD exploits the Markov structure; MC does not.
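The disagreement above is easy to check numerically. The sketch below lists the observed returns from each state and computes the batch MC answer (average return) alongside the certainty-equivalence answer that batch TD(0) converges to:

```python
# Batch MC vs. batch TD(0) on the two-state A/B example (gamma = 1).
# Episodes: one (A,0 -> B,0), six (B,1), one (B,0).

# Monte Carlo: average the observed returns from each state.
# B is visited 8 times with returns [0, 1, 1, 1, 1, 1, 1, 0];
# A is visited once with return 0 + 0 = 0.
returns = {"A": [0.0], "B": [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]}
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Certainty equivalence (what batch TD(0) converges to): the MLE model
# says A goes to B with probability 1 and reward 0, so solve directly.
V_td = {"B": sum(returns["B"]) / len(returns["B"])}
V_td["A"] = 0.0 + 1.0 * V_td["B"]   # r(A) + gamma * V(B)

print(V_mc)  # {'A': 0.0, 'B': 0.75}
print(V_td)  # {'A': 0.75, 'B': 0.75}
```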
Key metrics for comparing these algorithms:
- Data efficiency — how many samples are needed.
- Computational efficiency — cost per update (both MC and TD(0) are $O(L)$ per episode of length $L$).
- Accuracy — bias, variance, and mean-squared error.
TD exploits the Markov property and can be more data-efficient when the domain is truly Markov, but it introduces bias through bootstrapping.
From Evaluation to Control: Generalized Policy Iteration
Policy evaluation tells us how good a policy is; control tells us how to find a better one. Generalized Policy Iteration (GPI) alternates between two steps:
- Policy evaluation: estimate $Q^\pi(s, a)$ (we use $Q$ rather than $V$ because, without a model, we cannot do the one-step lookahead required to extract a policy from $V$ alone).
- Policy improvement: set the new policy $\pi'$ to be greedy (or approximately greedy) with respect to the estimated $Q^\pi$.
A subtlety arises immediately: if $\pi$ is deterministic, we only ever take action $\pi(s)$ in state $s$, so we cannot estimate $Q(s, a)$ for $a \neq \pi(s)$. This is the exploration problem—we must try different actions to learn about them, even if that sacrifices short-term reward.
Epsilon-Greedy Policies
A simple and effective exploration strategy is the $\epsilon$-greedy policy:
An $\epsilon$-greedy policy selects the greedy action with high probability and explores uniformly otherwise:
$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|A|} & \text{otherwise} \end{cases}$$

With probability $1 - \epsilon$ take the best-known action; with probability $\epsilon$ choose uniformly at random (including the greedy one).
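The distribution above can be written as a small helper (a sketch; ties in the argmax go to the first maximal action here):

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action distribution of an epsilon-greedy policy for one state.

    q_row: 1-D array of Q(s, a) over actions; epsilon: exploration rate.
    """
    n = len(q_row)
    probs = np.full(n, epsilon / n)           # epsilon/|A| to every action
    probs[np.argmax(q_row)] += 1.0 - epsilon  # extra mass on the greedy one
    return probs

probs = epsilon_greedy_probs(np.array([1.0, 3.0, 2.0]), epsilon=0.3)
# Greedy action (index 1) gets 0.7 + 0.1 = 0.8; the others get 0.1 each.
```

Sampling an action is then `np.random.choice(len(probs), p=probs)`.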
The crucial property: policy improvement still works. If we form $\pi_{i+1}$ as $\epsilon$-greedy with respect to $Q^{\pi_i}$, the resulting policy is at least as good as $\pi_i$ everywhere.
For any $\epsilon$-greedy policy $\pi_i$, the $\epsilon$-greedy policy $\pi_{i+1}$ with respect to $Q^{\pi_i}$ satisfies $V^{\pi_{i+1}}(s) \geq V^{\pi_i}(s)$ for all $s$. The proof follows from expanding:
$$Q^{\pi_i}(s, \pi_{i+1}(s)) = \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s,a) + (1-\epsilon)\max_a Q^{\pi_i}(s,a) \geq V^{\pi_i}(s)$$

The inequality holds because the max is at least as large as any weighted average. In particular, $\max_a Q^{\pi_i}(s,a) \geq \sum_a \frac{\pi_i(a \mid s) - \epsilon/|A|}{1-\epsilon}\, Q^{\pi_i}(s,a)$ (the weights are nonnegative and sum to one because $\pi_i$ is $\epsilon$-greedy), and substituting this bound collapses the right-hand side to $\sum_a \pi_i(a \mid s)\, Q^{\pi_i}(s,a) = V^{\pi_i}(s)$.
Monte Carlo Control
Combining MC evaluation with $\epsilon$-greedy improvement gives a complete model-free control algorithm. After each episode, update $Q$ using observed returns, then recompute the policy.
- Initialize $Q(s,a) = 0$, $N(s,a) = 0$ for all $(s,a)$. Set $\epsilon = 1$, $k = 1$.
- Set $\pi_k = \epsilon\text{-greedy}(Q)$.
- Loop:
- Sample episode $k$: $(s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, \ldots, s_{k,T})$ following $\pi_k$.
- Compute returns: $G_{k,t} = r_{k,t} + \gamma\, r_{k,t+1} + \gamma^2\, r_{k,t+2} + \cdots$
- For each first-visit $(s_t, a_t)$ in episode $k$:
- $N(s_t, a_t) \leftarrow N(s_t, a_t) + 1$
- $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \frac{1}{N(s_t, a_t)} \bigl[ G_{k,t} - Q(s_t, a_t) \bigr]$
- $k \leftarrow k + 1$, $\epsilon \leftarrow 1/k$.
- $\pi_k \leftarrow \epsilon\text{-greedy}(Q)$. // Policy improvement
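The evaluation half of the loop above (compute returns backward, then apply first-visit incremental averaging) can be sketched as follows, with `Q` and `N` as dicts keyed by state-action pairs:

```python
from collections import defaultdict

def mc_control_update(episode, Q, N, gamma):
    """First-visit incremental MC update for one episode.

    episode: list of (state, action, reward) tuples in time order.
    Q, N: dicts keyed by (state, action), modified in place.
    """
    # Returns computed backward: G_t = r_t + gamma * G_{t+1}.
    G, returns = 0.0, []
    for (s, a, r) in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    seen = set()
    for (s, a, r), G_t in zip(episode, returns):
        if (s, a) in seen:          # first-visit: skip later revisits
            continue
        seen.add((s, a))
        N[(s, a)] += 1
        Q[(s, a)] += (G_t - Q[(s, a)]) / N[(s, a)]   # incremental mean

Q, N = defaultdict(float), defaultdict(int)
mc_control_update([("s1", "a1", 1.0), ("s2", "a1", 2.0)], Q, N, gamma=0.5)
# G_0 = 1 + 0.5 * 2 = 2.0 and G_1 = 2.0, so both entries become 2.0.
```

The policy-improvement half is just recomputing the $\epsilon$-greedy policy from the updated `Q` with $\epsilon = 1/k$.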
GLIE: Greedy in the Limit of Infinite Exploration
For MC control to converge to the optimal action-value function, we need the exploration schedule to satisfy the GLIE condition:
A learning policy satisfies Greedy in the Limit of Infinite Exploration (GLIE) if:
- All state-action pairs are visited infinitely often: $\lim_{i \to \infty} N_i(s,a) = \infty$ for all $(s, a)$.
- The behavior policy converges to the greedy policy as $i \to \infty$.
A simple GLIE strategy: $\epsilon$-greedy with $\epsilon_i = 1/i$, which decays to zero while ensuring sufficient early exploration.
GLIE Monte Carlo control converges to the optimal state-action value function: $Q(s,a) \to Q^*(s,a)$ for all $(s, a)$.
On-Policy vs. Off-Policy Learning
Before introducing TD-based control, let's clarify a fundamental distinction in how agents use experience.
On-policy methods learn the value of the policy currently being followed. The same policy generates experience and gets improved. SARSA is the canonical example.
Off-policy methods learn the value of a target policy $\pi$ while following a different behavior policy $\pi_b$, decoupling exploration from the policy being optimized. Q-learning is the canonical example—it estimates $Q^*$ regardless of the behavior policy, as long as $\pi_b$ provides sufficient exploration.
Off-policy learning is more flexible (reuse data, learn from demonstrations, learn about multiple policies simultaneously), but introduces challenges around sample corrections and stability.
SARSA: On-Policy TD Control
SARSA applies TD ideas to control. Instead of waiting for full episode returns, it updates $Q$ after every transition. The name comes from the quintuple used in each update: $(S_t, A_t, R_t, S_{t+1}, A_{t+1})$.
After observing transition $(s_t, a_t, r_t, s_{t+1})$ and choosing the next action $a_{t+1} \sim \pi(s_{t+1})$, update:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr]$$

The bracketed term is the TD error. Because SARSA uses the action $a_{t+1}$ actually taken under the current policy, it is on-policy.
The full algorithm interleaves TD updates with $\epsilon$-greedy improvement at every step:
- Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}$. Set $t = 0$, initial state $s_t = s_0$.
- Choose $a_t \sim \pi(s_t)$ where $\pi$ is $\epsilon$-greedy w.r.t. $Q$.
- Loop:
- Take action $a_t$, observe $(r_t, s_{t+1})$.
- Choose $a_{t+1} \sim \pi(s_{t+1})$. // next action from current policy
- $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr]$
- $\pi(s_t) \leftarrow \epsilon\text{-greedy}(Q)$. // improve policy
- $t \leftarrow t + 1$.
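The core of the loop is the single-transition update, sketched here with a dict-backed table and illustrative numbers:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One tabular SARSA update; Q is a dict keyed by (state, action)."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

Q = {("s1", "a"): 0.0, ("s2", "b"): 4.0}
sarsa_update(Q, "s1", "a", r=1.0, s_next="s2", a_next="b",
             alpha=0.5, gamma=0.5)
# td_error = 1 + 0.5 * 4 - 0 = 3, so Q[(s1, a)] = 0.5 * 3 = 1.5
```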
Q-Learning: Off-Policy TD Control
Q-learning modifies the SARSA update in one consequential way: instead of using the action $a_{t+1}$ actually taken, it bootstraps off the best possible action at the next state. This makes Q-learning off-policy—it directly estimates $Q^*$ regardless of the behavior policy.
After observing $(s_t, a_t, r_t, s_{t+1})$, update:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \bigr]$$

The key difference from SARSA is the $\max_{a'}$: "What is the best I could do from $s_{t+1}$?" regardless of what the behavior policy will actually take. This directly approximates the Bellman optimality equation.
- Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}$. Set $t = 0$, initial state $s_t = s_0$.
- Set behavior policy $\pi_b = \epsilon\text{-greedy}(Q)$.
- Loop:
- Take $a_t \sim \pi_b(s_t)$. // sample from behavior policy
- Observe $(r_t, s_{t+1})$.
- $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \bigr]$
- $\pi_b(s_t) \leftarrow \epsilon\text{-greedy}(Q)$.
- $t \leftarrow t + 1$.
Consider a Mars rover with 7 states, two actions $a_1, a_2$, rewards $r(\cdot, a_1) = [1, 0, 0, 0, 0, 0, 10]$ and $r(\cdot, a_2) = [0, 0, 0, 0, 0, 0, 5]$, $\gamma = 1$, and $\alpha = 0.5$. Suppose the current estimates at $s_7$ match the immediate rewards, $Q(s_7, a_1) = 10$ and $Q(s_7, a_2) = 5$, while $Q(s_6, a_1) = 0$. The rover is in state $s_6$, takes $a_1$, receives reward $0$, transitions to $s_7$, then takes $a_2$, receiving reward $5$.
With SARSA, the update uses the actual next action $a_2$: $Q(s_6, a_1) \leftarrow 0 + 0.5 \cdot (0 + Q(s_7, a_2) - 0) = 0.5 \cdot 5 = 2.5$.
With Q-learning, the update uses the max over actions: $Q(s_6, a_1) = 0 + 0.5 \cdot (0 + \max_{a'} Q(s_7, a') - 0) = 0.5 \cdot 10 = 5$.
Q-learning gives a higher value because it assumes the best action will be taken next, while SARSA accounts for the $\epsilon$-greedy policy sometimes exploring suboptimal actions.
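The arithmetic above checks out directly (using the assumed estimates $Q(s_7, a_1) = 10$, $Q(s_7, a_2) = 5$, $Q(s_6, a_1) = 0$ from the example):

```python
# Reproducing the Mars rover updates: alpha = 0.5, gamma = 1.
alpha, gamma = 0.5, 1.0
Q = {("s6", "a1"): 0.0, ("s7", "a1"): 10.0, ("s7", "a2"): 5.0}

# SARSA target uses the action actually taken next (a2):
sarsa = Q[("s6", "a1")] + alpha * (
    0 + gamma * Q[("s7", "a2")] - Q[("s6", "a1")])

# Q-learning target uses the max over next actions (a1 here):
best_next = max(Q[("s7", "a1")], Q[("s7", "a2")])
qlearn = Q[("s6", "a1")] + alpha * (
    0 + gamma * best_next - Q[("s6", "a1")])

print(sarsa, qlearn)  # 2.5 5.0
```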
Convergence of Q-Learning
Q-learning for finite-state, finite-action MDPs converges to the optimal action-value function, $Q(s,a) \to Q^*(s,a)$, under two conditions:
- The behavior policy $\pi_t(a \mid s)$ satisfies GLIE (all state-action pairs visited infinitely often).
- The step sizes satisfy the Robbins-Monro conditions: $$\sum_{t=1}^{\infty} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
For example, $\alpha_t = 1/t$ works. The proof relies on stochastic approximation theory, the contraction property of the Bellman optimality operator, and bounded rewards.
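A quick numeric illustration of why $\alpha_t = 1/t$ qualifies: the partial sums of $1/t$ keep growing (like $\ln T$), while the partial sums of $1/t^2$ stay bounded (converging to $\pi^2/6 \approx 1.645$):

```python
# Partial sums of the Robbins-Monro conditions for alpha_t = 1/t.
T = 1_000_000
sum_alpha = sum(1.0 / t for t in range(1, T + 1))
sum_alpha_sq = sum(1.0 / t**2 for t in range(1, T + 1))
print(sum_alpha)     # ~14.39: still growing as T increases
print(sum_alpha_sq)  # ~1.6449: bounded below pi^2 / 6
```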
Scaling Up: Value Function Approximation
Tabular methods maintain a separate value for every state-action pair—infeasible when the state space is large or continuous (pixel-based Atari observations, joint angles of a robot). Value function approximation (VFA) replaces the table with a parameterized function $\hat{Q}(s, a; \mathbf{w}) \approx Q^\pi(s, a)$, offering three advantages:
- Memory: a compact vector $\mathbf{w}$ instead of $|\mathcal{S}| \times |\mathcal{A}|$ entries.
- Computation: faster updates when the parameter count is much smaller than the state-action space.
- Generalization: updating $\hat{Q}$ at one state-action pair automatically changes estimates at similar states, reducing the experience needed.
Learning with Stochastic Gradient Descent
Suppose an oracle provided the true $Q^\pi(s,a)$ at any query point. We would minimize:
$$J(\mathbf{w}) = \mathbb{E}_\pi \bigl[ (Q^\pi(s,a) - \hat{Q}(s,a;\mathbf{w}))^2 \bigr]$$

Gradient descent gives the update direction:
$$\Delta \mathbf{w} = -\frac{1}{2}\alpha\, \nabla_\mathbf{w} J(\mathbf{w}) = \alpha\, \mathbb{E}_\pi \bigl[ (Q^\pi(s,a) - \hat{Q}(s,a;\mathbf{w})) \nabla_\mathbf{w} \hat{Q}(s,a;\mathbf{w}) \bigr]$$

In practice, we sample transitions and use stochastic gradient descent (SGD), replacing the expectation with a single-sample estimate. The expected SGD update equals the full gradient, so it is unbiased.
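For a linear approximator $\hat{Q}(s, a; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$, the gradient $\nabla_\mathbf{w} \hat{Q}$ is just the feature vector, so one SGD step against an oracle target is (the feature vector here is a hypothetical example):

```python
import numpy as np

def sgd_step(w, x, q_true, alpha):
    """One SGD step for linear Q-hat(s, a; w) = w . x(s, a) toward an
    oracle target q_true; the gradient of Q-hat w.r.t. w is just x."""
    q_hat = w @ x
    return w + alpha * (q_true - q_hat) * x

w = np.zeros(3)
x = np.array([1.0, 0.5, 0.0])   # hypothetical features for some (s, a)
w = sgd_step(w, x, q_true=2.0, alpha=0.1)
# error = 2.0 - 0.0 = 2.0, so w = 0.1 * 2.0 * x = [0.2, 0.1, 0.0]
```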
Monte Carlo VFA
The return $G_t$ is an unbiased (but noisy) sample of $Q^\pi(s_t, a_t)$, so we substitute it for the true value. MC VFA reduces to supervised learning on pairs $\langle (s_t, a_t), G_t \rangle$ with the update:
$$\Delta \mathbf{w} = \alpha \bigl( G_t - \hat{Q}(s_t, a_t; \mathbf{w}) \bigr) \nabla_\mathbf{w} \hat{Q}(s_t, a_t; \mathbf{w})$$

TD(0) VFA
For TD(0), the target $r + \gamma \hat{V}(s'; \mathbf{w})$ replaces the true value, introducing three layers of approximation: sampling (one transition instead of an expectation), bootstrapping ($\hat{V}$ instead of true $V^\pi$), and function approximation (parameterized function instead of a table). The update:
$$\Delta \mathbf{w} = \alpha \bigl( r + \gamma\, \hat{V}(s'; \mathbf{w}) - \hat{V}(s; \mathbf{w}) \bigr) \nabla_\mathbf{w} \hat{V}(s; \mathbf{w})$$

The gradient is only taken with respect to $\hat{V}(s; \mathbf{w})$, not the target—a semi-gradient method. Despite the theoretical complications, semi-gradient TD works well in practice with linear function approximation.
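The semi-gradient distinction is easy to see in code for a linear $\hat{V}(s; \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$: the target term $\gamma\, \hat{V}(s')$ is computed with the same weights but treated as a constant, so only $\mathbf{x}(s)$ appears in the update (a sketch with one-hot features as a hypothetical example):

```python
import numpy as np

def semi_gradient_td0(w, x_s, x_next, r, alpha, gamma):
    """Semi-gradient TD(0) for linear V-hat(s; w) = w . x(s).

    The target r + gamma * V-hat(s') is treated as a constant: we
    differentiate only V-hat(s), so the gradient is just x_s.
    """
    td_error = r + gamma * (w @ x_next) - (w @ x_s)
    return w + alpha * td_error * x_s

w = np.array([1.0, 0.0])
x_s, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = semi_gradient_td0(w, x_s, x_next, r=1.0, alpha=0.5, gamma=0.9)
# td_error = 1 + 0.9 * 0 - 1 = 0, so w is unchanged here
```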
Control with Function Approximation
For control with VFA, combine approximate policy evaluation (MC or TD targets with parameterized $\hat{Q}$) with $\epsilon$-greedy improvement. The update rules generalize naturally:
MC control:
$$\Delta \mathbf{w} = \alpha \bigl( G_t - \hat{Q}(s_t, a_t; \mathbf{w}) \bigr) \nabla_\mathbf{w} \hat{Q}(s_t, a_t; \mathbf{w})$$

SARSA with VFA:
$$\Delta \mathbf{w} = \alpha \bigl( r + \gamma\, \hat{Q}(s', a'; \mathbf{w}) - \hat{Q}(s, a; \mathbf{w}) \bigr) \nabla_\mathbf{w} \hat{Q}(s, a; \mathbf{w})$$

Q-learning with VFA:
$$\Delta \mathbf{w} = \alpha \bigl( r + \gamma \max_{a'} \hat{Q}(s', a'; \mathbf{w}) - \hat{Q}(s, a; \mathbf{w}) \bigr) \nabla_\mathbf{w} \hat{Q}(s, a; \mathbf{w})$$

However, combining function approximation with control introduces potential instability.
The Deadly Triad
Three individually useful techniques can cause divergence or oscillation when combined:
- Function approximation — parameterized model rather than a table.
- Bootstrapping — updating estimates from other estimates (TD), rather than waiting for complete returns.
- Off-policy learning — learning about one policy while following another (Q-learning).
Any two can be combined safely; the danger emerges when all three are present. Intuitively, the Bellman backup is a contraction in the tabular case, but projecting back onto the function approximation space can be an expansion, amplifying rather than shrinking errors. Off-policy learning exacerbates this because the state distribution under the behavior policy may differ significantly from that of the target policy.
Deep Q-Networks (DQN)
Mnih et al. (2015) showed that Q-learning with a deep neural network could learn Atari games directly from pixels, matching or exceeding human performance. Making this work required addressing two problems that arise when naively combining neural networks with Q-learning:
- Correlated samples: consecutive transitions are highly correlated, violating the i.i.d. assumption of SGD.
- Non-stationary targets: the target $r + \gamma \max_{a'} \hat{Q}(s', a'; \mathbf{w})$ depends on the same weights being updated, so it shifts with every gradient step.
DQN addresses both issues with two key innovations: experience replay and fixed target networks.
Experience Replay
Instead of learning from only the most recent transition, DQN stores transitions $(s, a, r, s')$ in a large replay buffer $\mathcal{D}$. At each step, a random minibatch is sampled from $\mathcal{D}$ for the gradient update. This breaks temporal correlations (the minibatch mixes transitions from many episodes and time steps) and improves data efficiency by reusing each experience many times.
Fixed Q-Targets
To stabilize the target, DQN maintains a separate target network with parameters $\mathbf{w}^-$ held fixed for $C$ steps:
$$y_i = r_i + \gamma \max_{a'} \hat{Q}(s_{i+1}, a'; \mathbf{w}^-)$$

The main network $\mathbf{w}$ is updated to minimize $(y_i - \hat{Q}(s_i, a_i; \mathbf{w}))^2$, while $\mathbf{w}^-$ is synchronized with $\mathbf{w}$ only every $C$ steps. This provides a stable regression target for many consecutive updates. While this doubles memory (two copies of the weights), it does not double computation since the target network only does forward passes.
- Input: target update frequency $C$, learning rate $\alpha$.
- Initialize replay buffer $\mathcal{D} = \{\}$, network weights $\mathbf{w}$, target weights $\mathbf{w}^- = \mathbf{w}$, $t = 0$.
- Get initial state $s_0$.
- Loop:
- Choose $a_t$ using $\epsilon$-greedy policy w.r.t. $\hat{Q}(s_t, \cdot\,; \mathbf{w})$.
- Execute $a_t$, observe $(r_t, s_{t+1})$.
- Store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$.
- Sample random minibatch $\{(s_j, a_j, r_j, s_{j+1})\}$ from $\mathcal{D}$.
- For each $(s_j, a_j, r_j, s_{j+1})$ in minibatch:
- If $s_{j+1}$ is terminal: $y_j = r_j$
- Else: $y_j = r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \mathbf{w}^-)$
- $\Delta \mathbf{w} = \alpha (y_j - \hat{Q}(s_j, a_j; \mathbf{w})) \nabla_\mathbf{w} \hat{Q}(s_j, a_j; \mathbf{w})$
- $t \leftarrow t + 1$.
- If $t \bmod C = 0$: $\mathbf{w}^- \leftarrow \mathbf{w}$ // sync target network
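The loop above can be sketched in miniature. This sketch swaps the deep network for a tabular weight array (so no deep-learning dependency is needed) and uses a hypothetical two-state episodic task with a fixed $\epsilon$; the replay buffer, minibatch sampling, and periodic target sync mirror the pseudocode:

```python
import random
from collections import deque

import numpy as np

random.seed(0)

class ReplayBuffer:
    """Fixed-capacity buffer with uniform random minibatch sampling."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def add(self, transition):               # (s, a, r, s_next, done)
        self.buf.append(transition)

    def sample(self, k):
        return random.sample(list(self.buf), min(k, len(self.buf)))

# Toy stand-in for the network: one weight per (state, action). Two
# states; state 1 is terminal. In state 0, action 1 pays 1 and action 0
# pays 0, and both end the episode, so Q*(0,1) = 1 and Q*(0,0) = 0.
gamma, alpha, epsilon, C = 0.9, 0.1, 0.2, 20
w = np.zeros((2, 2))            # "network" weights
w_target = w.copy()             # fixed target weights w^-
buffer = ReplayBuffer(capacity=1000)

for t in range(1, 2001):
    s = 0
    if random.random() < epsilon:            # epsilon-greedy behavior
        a = random.randrange(2)
    else:
        a = int(np.argmax(w[s]))
    r, s_next, done = float(a == 1), 1, True
    buffer.add((s, a, r, s_next, done))
    for sj, aj, rj, sjn, dj in buffer.sample(4):    # minibatch of 4
        yj = rj if dj else rj + gamma * w_target[sjn].max()
        w[sj, aj] += alpha * (yj - w[sj, aj])       # SGD on (y - Q)^2
    if t % C == 0:
        w_target = w.copy()                         # sync every C steps

print(w[0])   # Q(0, a=1) close to 1; Q(0, a=0) stays near 0
```

In a real DQN, `w` would be the parameters of a convolutional network and the inner update a backpropagation step on the squared error against `yj`.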
DQN in Practice: Atari Results
The DQN architecture takes the last 4 raw game frames (pixels), processes them through convolutional layers, and outputs $Q(s, a)$ for each of the 18 possible joystick/button positions. The reward is simply the change in game score. Remarkably, the same architecture and hyperparameters were used across all games.
Ablation experiments reveal that experience replay is by far the most important ingredient:
- Linear baseline: reasonable scores with simple features.
- Deep network alone (no replay, no fixed targets): often worse than linear—simply adding depth can hurt.
- DQN with fixed targets only: moderate improvement over the deep baseline.
- DQN with replay only: large improvements (e.g., Breakout jumps from 3 to 241).
- Full DQN (replay + fixed targets): best performance (Breakout reaches 317, River Raid reaches 7447).
Beyond decorrelating samples, replay provides massive data efficiency gains—each transition feeds many gradient updates instead of just one.
Beyond DQN
DQN sparked a wave of improvements:
- Double DQN (Van Hasselt et al., 2016): addresses the overestimation bias of Q-learning by decoupling action selection from evaluation in the target.
- Prioritized Experience Replay (Schaul et al., 2016): samples transitions with high TD error more frequently, focusing learning where it is most needed.
- Dueling DQN (Wang et al., 2016): separates the network into a state-value stream and an advantage stream, improving learning when many actions have similar values.
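The Double DQN idea in the list above fits in a few lines: the online network picks the next action, and the target network scores it. The next-state values here are hypothetical numbers chosen so the two networks disagree:

```python
import numpy as np

def dqn_target(r, q_next_online, q_next_target, gamma):
    """Standard DQN target: select AND evaluate with the target network."""
    return r + gamma * q_next_target.max()

def double_dqn_target(r, q_next_online, q_next_target, gamma):
    """Double DQN: the online network selects the action, the target
    network evaluates it, reducing the overestimation from the max."""
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

q_online = np.array([1.0, 2.0])    # online net prefers action 1
q_target = np.array([5.0, 0.5])    # target net scores action 0 highest
print(dqn_target(0.0, q_online, q_target, 0.9))         # 0.9 * 5.0 = 4.5
print(double_dqn_target(0.0, q_online, q_target, 0.9))  # 0.9 * 0.5 = 0.45
```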
Summary
This lecture covered the pipeline from model-free evaluation to model-free control, then scaled these ideas to large state spaces with function approximation.
- Generalized Policy Iteration (GPI) alternates evaluation and improvement. Using $Q$ values (rather than $V$) avoids the need for a dynamics model.
- Epsilon-greedy exploration ensures all actions are tried while mostly exploiting current knowledge. Under GLIE conditions ($\epsilon \to 0$ at rate $1/k$), MC control converges to $Q^*$.
- SARSA is on-policy TD control: it updates $Q(s_t, a_t)$ toward $r_t + \gamma Q(s_{t+1}, a_{t+1})$ using the action actually taken.
- Q-learning is off-policy TD control: it updates toward $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$, directly estimating $Q^*$ regardless of the behavior policy.
- Function approximation replaces the lookup table with $\hat{Q}(s, a; \mathbf{w})$, enabling generalization across states. SGD-based updates extend naturally from the tabular case.
- The deadly triad—function approximation + bootstrapping + off-policy learning—can cause divergence. Care must be taken when all three are present.
- DQN stabilizes deep Q-learning via experience replay (breaks correlations, improves data efficiency) and fixed target networks (stabilizes the regression target), enabling human-level Atari play from raw pixels.