Lecture 3

Model-Free Policy Evaluation

Estimating value functions without a model using Monte Carlo and Temporal Difference methods.


Why Model-Free Evaluation?

In the previous lecture, we saw how to evaluate a policy $\pi$ when we have complete knowledge of the MDP's dynamics $P(s' \mid s, a)$ and reward function $R(s, a)$. Dynamic programming gave us a clean iterative solution: apply the Bellman operator repeatedly until convergence. But in most real-world problems, we do not have access to those models. A hospital does not know the precise transition probabilities governing patient outcomes, and a robotics engineer cannot write down the exact physics of every contact interaction. We need methods that can estimate a policy's value directly from experience—trajectories of states, actions, and rewards collected by executing the policy in the environment.

This lecture introduces three families of model-free policy evaluation methods: Monte Carlo (MC) estimation, Temporal Difference (TD) learning, and the certainty-equivalence (model-based-from-data) approach. We will study their algorithms, analyze their statistical properties (bias, variance, consistency), and compare what they converge to on a fixed batch of data.

Background: Value Functions and Returns

Let us recall the quantities we want to estimate. All definitions below assume a policy $\pi$ is being followed inside an MDP.

Definition — Return

The return $G_t$ from time step $t$ is the discounted sum of future rewards:

$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots$$

where $\gamma \in [0, 1)$ is the discount factor.
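Computationally, the return is most easily accumulated with a single backward pass over the rewards, since $G_k = r_k + \gamma\, G_{k+1}$. A minimal sketch in Python (the function name is ours):

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite reward sequence [r_t, r_{t+1}, ..., r_T].

    Accumulating backwards turns the discounted sum into the recursion
    G_k = r_k + gamma * G_{k+1}, with G equal to 0 after the last reward.
    """
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```

For example, `discounted_return([1.0, 0.0, 2.0], 0.5)` evaluates $1 + 0.5 \cdot 0 + 0.25 \cdot 2 = 1.5$.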

Definition — State Value Function

The state value function $V^\pi(s)$ is the expected return when starting in state $s$ and following policy $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\, G_t \mid s_t = s \,\right] = \mathbb{E}_\pi\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s \,\right]$$

Definition — Action Value Function

The action value function $Q^\pi(s, a)$ is the expected return when starting in state $s$, taking action $a$, and then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\, G_t \mid s_t = s,\, a_t = a \,\right]$$

When the dynamics and rewards are known, the Bellman equation lets us compute $V^\pi$ via dynamic programming:

$$V^\pi_k(s) = R(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, \pi(s))\, V^\pi_{k-1}(s')$$

The key idea here is bootstrapping: substituting our current estimate $V^\pi_{k-1}(s')$ for the true future expected return. This idea reappears in TD learning, but applied to sampled transitions rather than full sweeps over the state space.
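For contrast with the model-free methods that follow, here is a sketch of the model-based iteration above in Python, assuming the policy's transition matrix `P[s, s']` and reward vector `R[s]` are known (names are ours):

```python
import numpy as np

def evaluate_policy_dp(P, R, gamma, num_sweeps=100):
    """Iterative policy evaluation with a known model.

    P[s, s'] is the transition probability under pi, R[s] the expected
    one-step reward. Each sweep applies the Bellman backup
    V_k = R + gamma * P @ V_{k-1} to every state at once.
    """
    V = np.zeros(len(R))
    for _ in range(num_sweeps):
        V = R + gamma * P @ V
    return V
```

On a two-state chain where state 0 always moves to absorbing state 1 (with $R = [0, 1]$ and $\gamma = 0.5$), the sweeps converge to the fixed point $V = [1, 2]$.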

Monte Carlo Policy Evaluation

The simplest model-free approach is to take the definition of $V^\pi(s)$ literally. Since $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$, we can estimate this expectation by its sample mean: generate many trajectories under $\pi$, compute the return from every visit to state $s$, and average them.

Monte Carlo methods have several attractive properties:

  • No model required: they learn directly from sampled trajectories, with no knowledge of $P$ or $R$.
  • No bootstrapping: estimates are built from actual observed returns, never from other estimates.
  • No Markov assumption: averaging returns is valid even when the state is not Markov.
  • Conceptual simplicity: the estimator is just a sample mean.

The price is that we must wait until an episode terminates to compute $G_t$, so MC is restricted to episodic settings where every trajectory eventually ends.

First-Visit Monte Carlo

In first-visit MC, we only use the first time a state $s$ is encountered within each episode. If the agent visits $s$ at time steps 2, 7, and 15 in one episode, only the return $G_2$ from the first visit contributes to the estimate. This restriction ensures the samples are independent across episodes, giving us an unbiasedness guarantee.

Algorithm: First-Visit Monte Carlo Policy Evaluation
  1. Initialize $N(s) = 0$ and $G_{\text{total}}(s) = 0$ for all $s \in \mathcal{S}$.
  2. Loop (for each episode $i$):
    1. Sample episode $i$: $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, \ldots, s_{i,T_i}, a_{i,T_i}, r_{i,T_i}$.
    2. Compute return: $G_{i,t} = r_{i,t} + \gamma\, r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - t}\, r_{i,T_i}$.
    3. For each time step $t = 1, \ldots, T_i$:
      1. If $s_{i,t}$ has not been visited earlier in episode $i$:
      2. $N(s_{i,t}) \leftarrow N(s_{i,t}) + 1$
      3. $G_{\text{total}}(s_{i,t}) \leftarrow G_{\text{total}}(s_{i,t}) + G_{i,t}$
      4. $V^\pi(s_{i,t}) \leftarrow G_{\text{total}}(s_{i,t})\, /\, N(s_{i,t})$

Every-Visit Monte Carlo

Every-visit MC uses all visits to state $s$ within each episode, not just the first. The algorithm is identical to the one above, except the conditional "if this is the first visit" is removed—we always update. The samples within a single episode are no longer independent (later visits overlap with earlier returns), making the estimator biased for a finite number of episodes. However, it is still consistent: the bias vanishes as the number of episodes grows. In practice, every-visit MC often achieves lower mean squared error (MSE) than first-visit MC because it uses each episode more efficiently.
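Since the two variants differ by a single conditional, one sketch covers both. A minimal Python implementation (function and argument names are ours), taking each episode as a list of (state, reward) pairs under the fixed policy:

```python
from collections import defaultdict

def monte_carlo_evaluate(episodes, gamma, first_visit=True):
    """First-visit or every-visit Monte Carlo policy evaluation.

    Each episode is a list of (state, reward) pairs in time order; the
    action is omitted since the policy is fixed. Returns a dict mapping
    state -> average of the contributing returns.
    """
    N = defaultdict(int)
    G_total = defaultdict(float)
    V = {}
    for episode in episodes:
        # Backward pass: returns[t] = G_t for every time step.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if first_visit and s in seen:
                continue  # first-visit mode: skip repeat visits
            seen.add(s)
            N[s] += 1
            G_total[s] += returns[t]
            V[s] = G_total[s] / N[s]
    return V
```

On the Mars rover trajectory from the example below with $\gamma = 0.5$, first-visit mode gives $V(s_2) = \gamma^2 = 0.25$ while every-visit mode gives $(\gamma^2 + \gamma)/2 = 0.375$.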

Example — Mars Rover MC Evaluation

Consider a Mars rover MDP with seven states $s_1, \ldots, s_7$ and reward vector $R(s) = [+1, 0, 0, 0, 0, 0, +10]$. Suppose the policy always takes action $a_1$, and we observe the trajectory:

$$(s_3, a_1, 0,\; s_2, a_1, 0,\; s_2, a_1, 0,\; s_1, a_1, +1,\; \text{terminal})$$

With discount $\gamma < 1$, the first-visit MC estimate of $V^\pi(s_2)$ uses only the return from the first visit at time step 2: $G_2 = 0 + \gamma \cdot 0 + \gamma^2 \cdot (+1) = \gamma^2$. The every-visit estimate additionally incorporates the return from the second visit at time step 3: $G_3 = 0 + \gamma \cdot (+1) = \gamma$. The every-visit estimate of $V^\pi(s_2)$ is $(\gamma^2 + \gamma) / 2$.

Incremental Monte Carlo Updates

Maintaining running sums and counts is equivalent to an incremental update after each episode. Writing $N(s)$ for the updated visit count:

$$V^\pi(s) \leftarrow V^\pi(s) + \frac{1}{N(s)}\Big(G_{i,t} - V^\pi(s)\Big)$$

More generally, we can replace $1/N(s)$ with a fixed or decaying learning rate $\alpha$:

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\,\Big(G_{i,t} - V^\pi(s_t)\Big)$$

This "error-correction" pattern is fundamental—it reappears in TD learning, SARSA, Q-learning, and many other algorithms throughout the course. The quantity $G_{i,t} - V^\pi(s_t)$ is the MC error: how far the observed return was from our current prediction.
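A quick sanity check that the $\alpha = 1/N(s)$ update reproduces the running sample mean exactly (a sketch; the function name is ours):

```python
def incremental_mean(returns):
    """Apply V <- V + (1/N) * (G - V) for each observed return G.

    After n updates, V equals the ordinary sample mean of the first n
    returns, so the incremental form loses nothing.
    """
    V, N = 0.0, 0
    for G in returns:
        N += 1
        V += (G - V) / N
    return V
```

For instance, `incremental_mean([1.0, 2.0, 3.0, 4.0])` returns 2.5, the sample mean.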

Convergence of Incremental MC

If the learning rate schedule $\{\alpha_n(s)\}$ satisfies the Robbins-Monro conditions:

$$\sum_{n=1}^{\infty} \alpha_n(s) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(s) < \infty$$

then incremental Monte Carlo converges to the true value $V^\pi(s)$ for all $s$. The first condition ensures the step sizes are large enough to overcome any initial bias; the second ensures they shrink fast enough for the variance to vanish. A standard choice is $\alpha_n = 1/n$.

Limitations of Monte Carlo Methods

Monte Carlo estimators are generally high variance. The return $G_t$ is a sum of many random variables—each transition and reward along the trajectory—and the variance of that sum can be substantial. Reducing variance to an acceptable level may require many episodes, which is problematic when data is expensive: clinical trials, physical robot experiments, or high-stakes financial decisions.

MC also requires episodes to terminate. In continuing tasks—a server managing requests indefinitely, a thermostat regulating temperature around the clock—there is no natural episode boundary, so pure MC cannot be applied.

Temporal Difference Learning

Temporal Difference learning addresses both limitations of MC. As Sutton and Barto write, "If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference learning." TD combines sampling from Monte Carlo with bootstrapping from dynamic programming, producing an algorithm that updates after every transition and works in non-episodic settings.

TD(0): The One-Step TD Algorithm

In Monte Carlo, the update target is the full return $G_{i,t}$. In TD(0), we replace it with a one-step estimate: observe the immediate reward $r_t$ and next state $s_{t+1}$, then bootstrap by plugging in our current estimate $V^\pi(s_{t+1})$ for the remaining future return:

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\,\Big[\underbrace{r_t + \gamma\, V^\pi(s_{t+1})}_{\text{TD target}} - V^\pi(s_t)\Big]$$

Definition — TD Target and TD Error

The TD target is the quantity $r_t + \gamma\, V^\pi(s_{t+1})$. It serves as a one-step sample-based approximation to the return $G_t$.

The TD error (also called the temporal difference) is:

$$\delta_t = r_t + \gamma\, V^\pi(s_{t+1}) - V^\pi(s_t)$$

It measures how much our value estimate must be corrected after observing a single transition.

Algorithm: TD(0) Policy Evaluation
  1. Input: learning rate $\alpha$, policy $\pi$, discount factor $\gamma$.
  2. Initialize $V^\pi(s) = 0$ for all $s \in \mathcal{S}$.
  3. Loop (for each time step):
    1. Observe current state $s_t$, take action $a_t \sim \pi(s_t)$, receive reward $r_t$, observe next state $s_{t+1}$.
    2. $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\,\big[r_t + \gamma\, V^\pi(s_{t+1}) - V^\pi(s_t)\big]$

Each update requires just a single transition $(s_t, a_t, r_t, s_{t+1})$—no need to wait for an episode to end. This makes TD(0) applicable to continuing tasks with no terminal state, and lets the agent learn online, refining its estimates after every step.
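A minimal sketch of this loop in Python, assuming transitions are supplied as (s, r, s_next) tuples with `s_next = None` marking a terminal transition (names are ours):

```python
def td0_update(transitions, gamma, alpha, V=None):
    """Apply one TD(0) update per recorded transition, in order.

    Terminal transitions bootstrap with value 0, and unseen states are
    treated as initialized to 0. Returns the updated value table.
    """
    if V is None:
        V = {}
    for s, r, s_next in transitions:
        v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
        v = V.get(s, 0.0)
        V[s] = v + alpha * (r + gamma * v_next - v)  # V += alpha * TD error
    return V
```

Running this on the Mars rover trajectory ($\gamma = 1$, $\alpha = 1$) leaves $V(s_3)$ and $V(s_2)$ at 0 and sets $V(s_1) = +1$ after a single pass.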

Example — Mars Rover TD(0) Evaluation

Using the same Mars rover MDP with $R(s) = [+1, 0, 0, 0, 0, 0, +10]$ and trajectory $(s_3, a_1, 0,\; s_2, a_1, 0,\; s_2, a_1, 0,\; s_1, a_1, +1,\; \text{terminal})$, suppose $\gamma = 1$ and $\alpha = 1$ (for simplicity), with all values initialized to zero.

The TD updates proceed one transition at a time, in the order the transitions occur. The first update uses $(s_3, a_1, 0, s_2)$ and gives $V(s_3) \leftarrow 0 + 1 \cdot [0 + 1 \cdot 0 - 0] = 0$, since $V(s_2) = 0$ at that point. The next two updates, for $(s_2 \to s_2)$ and $(s_2 \to s_1)$, likewise leave $V(s_2) = 0$, because their bootstrap targets are still zero. Only the final transition $(s_1, a_1, +1, \text{terminal})$ changes anything: $V(s_1) \leftarrow 0 + 1 \cdot [+1 + 1 \cdot 0 - 0] = +1$. A second pass through the same episode would then propagate this value backward, giving $V(s_2) \leftarrow 0 + 1 \cdot [0 + 1 \cdot (+1) - 0] = +1$. Information propagates slowly, one step per update, unlike MC, which propagates the entire return immediately.

Insight — Bootstrapping

The central idea in TD learning is bootstrapping: using the agent's own current estimate $V^\pi(s_{t+1})$ as a stand-in for the true expected future return. In dynamic programming, bootstrapping is exact because we sum over all possible next states weighted by their true probabilities. In TD, we combine bootstrapping with sampling—we see only one next state $s_{t+1}$, drawn from the actual environment. This combination is what makes TD uniquely powerful.

Bias-Variance Tradeoff: MC vs. TD

Monte Carlo and TD sit at opposite ends of a spectrum, and their tradeoffs are best understood through bias and variance.

Definition — Bias, Variance, and MSE

For an estimator $\hat{\theta}$ of a parameter $\theta$:

Bias: $\text{Bias}_\theta(\hat{\theta}) = \mathbb{E}_{x \mid \theta}[\hat{\theta}] - \theta$

Variance: $\text{Var}(\hat{\theta}) = \mathbb{E}_{x \mid \theta}\!\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right]$

Mean Squared Error: $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}_\theta(\hat{\theta})^2$

An estimator is consistent if $\lim_{n \to \infty} \Pr(|\hat{\theta}_n - \theta| > \epsilon) = 0$ for all $\epsilon > 0$.

Monte Carlo: unbiased, high variance

First-visit MC uses the actual sampled return $G_t$ as its target. Since $\mathbb{E}_\pi[G_t \mid s_t = s] = V^\pi(s)$ by definition, the target is an unbiased estimate of the true value. However, $G_t$ depends on a long chain of random transitions and rewards, so its variance grows with episode length and environment stochasticity.

TD(0): biased, lower variance

TD(0) uses the target $r_t + \gamma V^\pi(s_{t+1})$. Because our current estimate $V^\pi(s_{t+1})$ is generally wrong, the TD target is a biased estimate of $V^\pi(s_t)$—especially early on, when initialization dominates. However, it depends on only one random transition (from $s_t$ to $s_{t+1}$), so its variance is much lower than the full-return target used by MC.

Insight — The Bias-Variance Spectrum

MC has zero bias but high variance because it uses the full (noisy) return. TD(0) has low variance but nonzero bias because it substitutes an imperfect estimate for the tail of the return. As value estimates improve, TD's bias shrinks, and under appropriate learning-rate conditions both methods converge to $V^\pi$. In practice, TD's lower variance often means it learns a usable estimate faster than MC, even though each individual update carries some bias.

TD(0) is also consistent under the same Robbins-Monro conditions ($\sum \alpha_n = \infty$, $\sum \alpha_n^2 < \infty$). TD exploits the Markov property: by bootstrapping from $V^\pi(s_{t+1})$, it leverages the fact that the future is conditionally independent of the past given the present state. MC does not rely on this property, so it remains sound in non-Markov environments, where TD's bootstrapped targets can be systematically wrong.
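The variance gap is easy to see numerically. Below is a small simulation, assuming a toy episodic chain of 10 steps with i.i.d. unit-variance Gaussian rewards and $\gamma = 1$, so the true value of every non-terminal state is 0 and the TD target with the true $V$ plugged in is just the first reward (all names are ours):

```python
import random

def target_variances(num_episodes=5000, horizon=10, seed=0):
    """Empirical variance of the MC target (full return G_0) versus the
    TD target (r_0 + gamma * V(s_1)) when the bootstrap value is exact.

    Rewards are i.i.d. N(0, 1), so Var(G_0) = horizon, while the TD
    target's variance comes from a single reward.
    """
    rng = random.Random(seed)
    mc_targets, td_targets = [], []
    for _ in range(num_episodes):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
        mc_targets.append(sum(rewards))      # MC target: full return G_0
        td_targets.append(rewards[0] + 0.0)  # TD target: r_0 + V(s_1), V = 0
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)
    return variance(mc_targets), variance(td_targets)
```

With the default settings the MC target's empirical variance comes out near 10 and the TD target's near 1: roughly a horizon-fold reduction, at the cost of the bias introduced whenever the bootstrap value is imperfect.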

Interpolating MC and TD: $n$-Step Returns and TD($\lambda$)

Since MC and TD(0) sit at opposite ends of the bias-variance spectrum, can we interpolate between them? Yes—and there are two elegant ways to do so.

$n$-Step Returns

Instead of bootstrapping after exactly one step, we can take $n$ real steps and then bootstrap:

$$G_t^{(n)} = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V^\pi(s_{t+n})$$

When $n = 1$ this recovers TD(0); when $n = \infty$ (or $n$ reaches the episode end) it recovers the full MC return. Intermediate values trade off the low variance of short bootstraps against the low bias of long rollouts:

$$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\,\big[G_t^{(n)} - V^\pi(s_t)\big]$$
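Computing $G_t^{(n)}$ is the same backward accumulation as for the full return, seeded with the bootstrap value instead of zero; a sketch (names ours):

```python
def n_step_return(rewards, v_bootstrap, gamma):
    """G_t^(n) for n = len(rewards) observed rewards followed by a
    bootstrap from V(s_{t+n}).

    Passing the full remaining reward sequence with v_bootstrap = 0
    recovers the MC return; a single reward recovers the TD(0) target.
    """
    G = v_bootstrap
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```

For example, `n_step_return([1.0, 2.0], 10.0, 0.5)` gives the 2-step return $1 + 0.5 \cdot 2 + 0.25 \cdot 10 = 4.5$.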

TD($\lambda$) and Eligibility Traces

Rather than committing to a single $n$, TD($\lambda$) takes a geometrically weighted average over all $n$-step returns, controlled by $\lambda \in [0, 1]$:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}$$

When $\lambda = 0$, only the 1-step return survives, recovering TD(0). When $\lambda = 1$, the full Monte Carlo return dominates. Intermediate values blend short-horizon and long-horizon returns smoothly.

In practice, TD($\lambda$) is implemented with eligibility traces—a per-state memory $e_t(s)$ that records how recently and frequently each state has been visited. When a TD error $\delta_t$ is observed, all states are updated in proportion to their eligibility:

$$e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbf{1}(s_t = s)$$

$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s$$

Eligibility traces propagate TD errors backward through time without storing entire trajectories, unifying the forward view (which $n$-step return are we targeting?) with the backward view (how do we distribute credit to past states?).
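The two updates above translate directly into code. Here is a backward-view sketch over a single recorded episode, assuming accumulating traces and zero-initialized values (names ours):

```python
def td_lambda(transitions, gamma, lam, alpha):
    """Backward-view TD(lambda) over one episode of (s, r, s_next)
    transitions, with s_next = None marking the terminal transition.

    Every state with a nonzero eligibility trace shares in each TD
    error; lam = 0 collapses to TD(0).
    """
    V, e = {}, {}
    for s, r, s_next in transitions:
        v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
        delta = r + gamma * v_next - V.get(s, 0.0)  # TD error
        for state in e:
            e[state] *= gamma * lam                 # decay all traces
        e[s] = e.get(s, 0.0) + 1.0                  # bump the visited state
        for state, trace in e.items():
            V[state] = V.get(state, 0.0) + alpha * delta * trace
    return V
```

On the Mars rover trajectory ($\gamma = 1$, $\alpha = 1$), $\lambda = 0$ reproduces the single-pass TD(0) result, while $\lambda = 0.9$ already pushes credit back in one pass: $V(s_3)$ becomes $(\gamma\lambda)^3 = 0.729$ because the final TD error of $+1$ reaches $s_3$ through its decayed trace.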

Insight — The $\lambda$ Continuum

$\lambda$ gives you a smooth knob between high-variance Monte Carlo ($\lambda = 1$) and low-variance, bootstrapped TD(0) ($\lambda = 0$). In practice, intermediate values like $\lambda = 0.8$ or $\lambda = 0.9$ often yield the best performance, capturing the benefits of both ends.

Certainty Equivalence: A Model-Based Approach from Data

A third strategy does not fit neatly into the "model-free" category. Instead of directly estimating $V^\pi$, we first estimate the MDP model from data and then apply dynamic programming to the estimated model.

From a collection of transitions $(s_k, a_k, r_k, s_{k+1})$, we compute maximum likelihood estimates:

$$\hat{P}(s' \mid s, a) = \frac{\sum_{k} \mathbf{1}(s_k = s,\; a_k = a,\; s_{k+1} = s')}{\sum_{k} \mathbf{1}(s_k = s,\; a_k = a)}$$

$$\hat{R}(s, a) = \frac{\sum_{k} \mathbf{1}(s_k = s,\; a_k = a)\, r_k}{\sum_{k} \mathbf{1}(s_k = s,\; a_k = a)}$$

We then treat this estimated MDP as if it were the true model (hence "certainty equivalence") and solve for $V^\pi$ using any DP method from Lecture 2.

This approach is very data-efficient but computationally expensive: recomputing the MDP solution after each new data point costs $O(|\mathcal{S}|^3)$ for the matrix-inverse solution or $O(|\mathcal{S}|^2)$ per sweep for iterative policy evaluation. It is consistent when the process is truly Markov and naturally supports off-policy evaluation since the model can be built from any data source.
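A sketch of the whole pipeline, assuming transitions under a fixed policy are given as (s, r, s_next) tuples with `s_next = None` for terminal transitions (names ours):

```python
import numpy as np

def certainty_equivalence_value(transitions, states, gamma):
    """MLE model estimation followed by exact policy evaluation.

    Builds P_hat(s'|s) and R_hat(s) from counts, treating terminal
    transitions as moving to an absorbing zero-reward state, then
    solves the linear system V = R_hat + gamma * P_hat @ V.
    """
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = np.zeros((n, n))
    reward_sum = np.zeros(n)
    visits = np.zeros(n)
    for s, r, s_next in transitions:
        i = idx[s]
        visits[i] += 1
        reward_sum[i] += r
        if s_next is not None:
            counts[i, idx[s_next]] += 1
    denom = np.maximum(visits, 1.0)   # avoid 0/0 for unvisited states
    P_hat = counts / denom[:, None]
    R_hat = reward_sum / denom
    # Certainty-equivalence solution: V = (I - gamma * P_hat)^(-1) R_hat
    return np.linalg.solve(np.eye(n) - gamma * P_hat, R_hat)
```

On the AB batch data in the next section, this returns $V(A) = V(B) = 0.75$, the same answer batch TD converges to.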

Batch Methods: What MC and TD Converge To

Given a fixed, finite dataset of episodes, we can repeatedly cycle through the data applying MC or TD updates until convergence. This "batch" setting reveals a subtle but important difference between the two methods.

Example — The AB Problem (Sutton & Barto, Example 6.4)

Consider two states $A$ and $B$ with $\gamma = 1$. We observe eight episodes of experience:

  • $A, 0, B, 0$ (one episode)
  • $B, 1$ (six episodes)
  • $B, 0$ (one episode)

Both MC and TD agree that $V(B) = 6/8 = 0.75$ (the average reward from $B$).

But what about $V(A)$? State $A$ was visited once, and the observed return from that episode was $0 + 0 = 0$. Batch MC converges to $V(A) = 0$—the average observed return from $A$. Batch TD converges to $V(A) = 0.75$—because it learns $V(A) = 0 + \gamma \cdot V(B) = 0.75$, exploiting the Markov structure: every time $A$ was observed, it transitioned to $B$.

This example crystallizes the difference. Batch MC minimizes mean squared error between predictions and observed returns, without assuming any structure. Batch TD finds the value function consistent with the maximum likelihood MDP model implied by the data—exactly the certainty-equivalence solution. When the Markov assumption holds, TD is statistically more efficient (lower MSE for small datasets) because it leverages the MDP's factored structure. When Markov is violated, MC may be more robust.
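The two batch solutions can be reproduced in a few lines. A sketch (names ours) that averages observed returns for batch MC and repeatedly sweeps small-step TD(0) updates over the batch until it settles:

```python
def batch_mc(episodes, gamma):
    """Average every observed return per state (every-visit batch MC)."""
    totals, counts = {}, {}
    for episode in episodes:
        G = 0.0
        for s, r, _ in reversed(episode):  # backward pass over returns
            G = r + gamma * G
            totals[s] = totals.get(s, 0.0) + G
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

def batch_td(episodes, gamma, alpha=0.01, sweeps=5000):
    """Repeated TD(0) sweeps over a fixed batch of (s, r, s_next) episodes."""
    V = {}
    for _ in range(sweeps):
        for episode in episodes:
            for s, r, s_next in episode:
                v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
                v = V.get(s, 0.0)
                V[s] = v + alpha * (r + gamma * v_next - v)
    return V
```

On the AB data, `batch_mc` gives exactly $V(A) = 0$ and $V(B) = 0.75$, while `batch_td` converges to approximately $V(A) = V(B) = 0.75$, the certainty-equivalence answer.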

Comparing the Three Approaches

Let us summarize the tradeoffs along several dimensions that matter for practical algorithm selection.

Consistency

All three—MC, TD, and certainty equivalence—converge to $V^\pi$ given sufficient data and appropriate learning rates. First-visit MC is unbiased; every-visit MC, TD(0), and certainty equivalence are biased but consistent.

Computational cost

Incremental MC and TD(0) both cost $O(1)$ per update. Certainty equivalence requires solving a system of $|\mathcal{S}|$ equations, costing $O(|\mathcal{S}|^2)$ to $O(|\mathcal{S}|^3)$ per model update.

Data efficiency

Certainty equivalence is the most data-efficient, since it extracts a full model. TD is generally more data-efficient than MC in Markov environments because bootstrapping lets information propagate faster. MC can be more efficient in non-Markov environments where bootstrapping is violated.

Applicability

MC requires episodic tasks. TD and certainty equivalence work in both episodic and continuing settings. MC does not require the Markov property; TD and certainty equivalence do.

Summary

This lecture established the foundations of model-free policy evaluation:

  • Monte Carlo estimation averages complete sampled returns. First-visit MC is unbiased; every-visit MC is biased but consistent; both have high variance and require episodic tasks.
  • Temporal Difference learning bootstraps from the current value estimate after every transition. It is biased but lower variance, learns online, and works in continuing tasks.
  • $n$-step returns and TD($\lambda$) interpolate between the two, trading bias against variance with a single knob.
  • Certainty equivalence fits a maximum likelihood MDP model and solves it with dynamic programming: highly data-efficient, computationally expensive, and exactly what batch TD converges to.

All of these address the prediction problem: estimating the value of a given, fixed policy $\pi$. In the next lecture, we turn to the control problem: how to use these evaluation tools to find a better policy, still without access to the true MDP model.