Recap: REINFORCE and the Variance Problem
In the previous lecture we derived the REINFORCE algorithm, a Monte Carlo policy gradient method that updates policy parameters by ascending the gradient of expected return. The core update rule uses the likelihood ratio trick to express the policy gradient as an expectation under the current policy:
$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \, G_t^{(i)}$$where $G_t^{(i)} = \sum_{t'=t}^{T-1} r_{t'}^{(i)}$ is the empirical return from time step $t$ onward in the $i$-th sampled trajectory. The temporal structure improvement (only weighting by future rewards rather than the entire trajectory return) already helps, but this estimator remains unbiased yet extremely high-variance. In practice, the raw REINFORCE estimator is too noisy to be useful without further refinement. This lecture introduces three key ideas for taming that variance: baselines, the advantage function, and the actor-critic architecture.
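As a concrete reference point, the reward-to-go estimator above can be sketched in a few lines of NumPy (function names here are illustrative, not from the lecture):

```python
import numpy as np

def rewards_to_go(rewards):
    """G_t = sum_{t'=t}^{T-1} r_{t'} for every t, in one backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

def reinforce_gradient(grad_log_probs, rewards):
    """sum_t grad_theta log pi(a_t|s_t) * G_t for a single trajectory.
    grad_log_probs: array of shape (T, d) holding the score vectors."""
    G = rewards_to_go(rewards)
    return (np.asarray(grad_log_probs) * G[:, None]).sum(axis=0)
```

Averaging `reinforce_gradient` over $m$ sampled trajectories gives the estimator in the equation above.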
Introducing a Baseline
The central idea of a baseline is simple: instead of weighting each log-probability gradient by the raw return $G_t$, we subtract a state-dependent function $b(s_t)$ and use $G_t - b(s_t)$ instead. The resulting policy gradient estimator is:
$$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t;\, \theta) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$$
The Baseline Does Not Introduce Bias
For any function $b(s)$ that depends only on the state, the expected gradient remains unchanged — variance reduction with no bias penalty.
Proof
We break the expectation over the full trajectory into nested expectations. Since $b(s_t)$ depends only on $s_t$ (and the history up to $s_t$), and $\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)$ depends on $a_t$ given $s_t$, we can condition on the history up to state $s_t$ and marginalize over the action:
$$\mathbb{E}_\tau\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\, b(s_t)\right] = \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{a_t}\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\right]\right]$$Now we evaluate the inner expectation over $a_t$. Using the likelihood ratio identity in reverse:
$$\mathbb{E}_{a_t}\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\right] = \sum_a \pi(a \mid s_t;\, \theta)\, \frac{\nabla_\theta \pi(a \mid s_t;\, \theta)}{\pi(a \mid s_t;\, \theta)} = \sum_a \nabla_\theta \pi(a \mid s_t;\, \theta) = \nabla_\theta \sum_a \pi(a \mid s_t;\, \theta) = \nabla_\theta\, 1 = 0$$Since the inner expectation is zero regardless of $s_t$, the entire expression vanishes. $\square$
The Optimal Baseline
Since any baseline preserves unbiasedness, we should choose the one that minimizes variance. The variance of a single gradient term is:
$$\text{Var}\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\,(G_t - b(s_t))\right]$$Minimizing over $b$ leads to a weighted least-squares problem. Taking the derivative with respect to $b(s)$ and setting it to zero yields the optimal baseline:
$$b^*(s) = \frac{\mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[\|\nabla_\theta \log \pi(a \mid s;\, \theta)\|^2\; \mathbb{E}[G_t \mid s_t = s,\, a_t = a]\right]}{\mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[\|\nabla_\theta \log \pi(a \mid s;\, \theta)\|^2\right]}$$This is a weighted average of returns, with weights given by the squared score-function magnitudes. In practice those weights are roughly constant across actions, so the optimal baseline simplifies to:
$$b^*(s) \approx \mathbb{E}[G_t \mid s_t = s] = V^\pi(s)$$This gives strong motivation for using $V^\pi(s)$ as the baseline — it is near-optimal and has a clean interpretation.
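To see the baseline's effect numerically, here is a minimal sketch: a two-armed bandit with a sigmoid-parameterized policy, where the exact mean and variance of the single-sample gradient estimator can be enumerated rather than sampled. The reward values and $p = 0.5$ are invented for illustration:

```python
# Two-armed bandit with Bernoulli policy pi(a=1) = p.  For a
# sigmoid-parameterized policy, the score is d/dtheta log pi(a) = a - p.
p = 0.5
rewards = {0: 10.0, 1: 12.0}

def grad_stats(baseline):
    """Exact mean and variance of the one-sample gradient estimator
    g = (a - p) * (r(a) - baseline), enumerated over both actions."""
    probs = {0: 1 - p, 1: p}
    mean = sum(probs[a] * (a - p) * (rewards[a] - baseline) for a in (0, 1))
    second = sum(probs[a] * ((a - p) * (rewards[a] - baseline)) ** 2
                 for a in (0, 1))
    return mean, second - mean ** 2

mean_raw, var_raw = grad_stats(0.0)              # no baseline
b = sum(probs * r for probs, r in ((1 - p, rewards[0]), (p, rewards[1])))
mean_b, var_b = grad_stats(b)                    # b = E[G] = V(s) = 11
# mean_raw == mean_b == 0.5 (the baseline adds no bias),
# while the variance drops from 30.25 to 0.0.
```

The means agree exactly, confirming unbiasedness; only the spread of the estimator changes.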
The Advantage Function
When we use $V^\pi(s)$ as the baseline and $Q^\pi(s, a)$ in place of the Monte Carlo return, the resulting quantity has a name and a deep significance: the advantage function, $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, which measures how much better taking action $a$ in state $s$ is than acting according to the policy on average.
With this definition, the policy gradient takes the elegant form:
$$\nabla_\theta \mathbb{E}_\tau[R] \approx \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\, A^\pi(s_t, a_t)\right]$$Recall the definitions of $Q^\pi$ and $V^\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s,\, a_0 = a\right]$$ $$V^\pi(s) = \mathbb{E}_\pi\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\right] = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$$
Vanilla Policy Gradient with Baseline
Putting together temporal structure, the baseline, and the advantage function, we arrive at the "vanilla" policy gradient algorithm — the practical starting point for most policy gradient implementations.
- Initialize policy parameters $\theta$ and baseline $b$.
- for iteration $= 1, 2, \ldots$ do
- Collect a set of trajectories $\{\tau^{(i)}\}$ by executing the current policy $\pi_\theta$.
- for each timestep $t$ in each trajectory $\tau^{(i)}$ do
- Compute the return: $G_t^{(i)} = \sum_{t'=t}^{T-1} r_{t'}^{(i)}$
- Compute the advantage estimate: $\hat{A}_t^{(i)} = G_t^{(i)} - b(s_t^{(i)})$
- end for
- Re-fit the baseline by minimizing $\sum_{i,t} |b(s_t^{(i)}) - G_t^{(i)}|^2$.
- Compute the policy gradient estimate: $\hat{g} = \sum_{i,t} \nabla_\theta \log \pi(a_t^{(i)} \mid s_t^{(i)};\, \theta)\, \hat{A}_t^{(i)}$.
- Update $\theta$ using $\hat{g}$ via SGD or Adam.
- end for
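One way to sketch the baseline-refit and advantage steps of the loop above, assuming a linear baseline $b(s) = w^\top \phi(s)$ (the lecture leaves the baseline's functional form unspecified):

```python
import numpy as np

def refit_baseline(features, returns):
    """Least-squares refit of a linear baseline b(s) = w . phi(s),
    minimizing sum_{i,t} (b(s_t) - G_t)^2 as in the algorithm above.
    features: (N, d) matrix of state features phi(s_t) across all (i, t)."""
    w, *_ = np.linalg.lstsq(features, returns, rcond=None)
    return w

def advantages(features, returns, w):
    """A_hat_t = G_t - b(s_t), using the previous iteration's baseline."""
    return returns - features @ w
```

Note the ordering in the algorithm: advantages are computed with the old baseline before it is re-fit, so the baseline never depends on the very returns it is subtracted from within an iteration.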
This already offers a substantial improvement over bare REINFORCE, but it still relies on Monte Carlo returns $G_t$, which are high-variance because they accumulate randomness over an entire trajectory from time $t$ onward. Can we do better?
Actor-Critic Methods
The Monte Carlo return $G_t$ is an unbiased but high-variance estimate of $Q^\pi(s_t, a_t)$. Just as TD methods offered a lower-variance alternative to MC estimation in value-based RL, we can replace $G_t$ with a learned estimate of the value function. This introduces some bias from function approximation but can dramatically reduce variance. An actor-critic method maintains two interacting components:
- The actor: a parameterized policy $\pi_\theta(a \mid s)$ that selects actions.
- The critic: a learned value function $V_w(s)$ (or $Q_w(s, a)$) that evaluates how good the current state (or state-action pair) is.
TD Error as Advantage Estimate
The simplest actor-critic uses the one-step TD error as an estimate of the advantage:
$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$This is a biased but low-variance estimate of $A^\pi(s_t, a_t)$. When the critic is perfect ($V_w = V^\pi$), the expectation recovers the advantage exactly:
$$\mathbb{E}[\delta_t \mid s_t, a_t] = \mathbb{E}[r_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t] - V^\pi(s_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$$In practice, function approximation introduces some bias, but the variance reduction is usually worth it.
- Initialize policy parameters $\theta$ and critic parameters $w$.
- for each episode do
- for $t = 0, 1, \ldots, T-1$ do
- Take action $a_t \sim \pi_\theta(\cdot \mid s_t)$, observe $r_t, s_{t+1}$.
- Compute TD error: $\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$.
- Update critic: $w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(s_t)$.
- Update actor: $\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
- end for
- end for
The critic uses the TD error to adjust its value estimates, while the actor uses the same TD error as a stand-in for the advantage to adjust the policy. This interleaving of policy improvement and policy evaluation is characteristic of all actor-critic methods.
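A minimal tabular sketch of one inner-loop step, assuming a softmax actor with per-state logits `theta[s, :]` (the lecture does not fix a parameterization; for a softmax, $\nabla_\theta \log \pi(a \mid s)$ is one-hot$(a) - \pi(\cdot \mid s)$):

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, gamma, alpha_w, alpha_theta):
    """One step of the TD actor-critic loop: tabular critic V[s],
    softmax actor with logits theta[s, :].  Updates both in place."""
    delta = r + gamma * V[s_next] - V[s]        # TD error
    V[s] += alpha_w * delta                     # critic update
    pi = np.exp(theta[s] - theta[s].max())      # softmax over actions
    pi /= pi.sum()
    grad_log = -pi                              # grad log pi = one_hot(a) - pi
    grad_log[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log  # actor update
    return delta
```

The same scalar `delta` drives both updates, exactly as in the pseudocode: the critic uses it as a value-prediction error, the actor as an advantage estimate.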
N-Step Advantage Estimators
The one-step TD error sits at one extreme of a spectrum of advantage estimators. At the other extreme is the full Monte Carlo return. In between lie the $n$-step estimators, which blend bootstrapping and sampling:
$$\hat{R}_t^{(1)} = r_t + \gamma V(s_{t+1})$$ $$\hat{R}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$$ $$\vdots$$ $$\hat{R}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$Subtracting the baseline $V(s_t)$ from each gives the corresponding advantage estimators:
$$\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t) \quad \text{(TD error, low variance, higher bias)}$$ $$\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t) \quad \text{(MC, high variance, low bias)}$$The one-step estimator $\hat{A}_t^{(1)}$ has low variance because it depends on only one step of randomness, but it carries bias from the critic's approximation error. The Monte Carlo estimator $\hat{A}_t^{(\infty)}$ is unbiased (no bootstrapping) but has high variance because it accumulates randomness over the entire trajectory. Intermediate values of $n$ offer a smooth trade-off, and in practice methods like Generalized Advantage Estimation (GAE) take a weighted combination across all $n$-step returns.
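The whole family of $n$-step advantage estimators can be computed in a single pass; this sketch assumes `values` holds $V(s_0), \dots, V(s_T)$ with the terminal bootstrap value included (often $0$ for a finished episode):

```python
import numpy as np

def n_step_advantage(rewards, values, n, gamma):
    """A_hat_t^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k}
                     + gamma^n V(s_{t+n}) - V(s_t),
    truncating n at the end of the trajectory.
    rewards: length T; values: length T+1 (terminal bootstrap included)."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)                      # truncate near the end
        ret = sum(gamma ** k * rewards[t + k] for k in range(horizon))
        ret += gamma ** horizon * values[t + horizon]  # bootstrap term
        adv[t] = ret - values[t]                       # subtract the baseline
    return adv
```

With `n=1` this reduces to the TD error; letting `n` exceed the trajectory length recovers the Monte Carlo estimator. GAE would exponentially average these over all `n` with weight $\lambda^{n-1}$.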
Problems with Vanilla Policy Gradients
Even with baselines and actor-critic enhancements, vanilla policy gradient methods suffer from two fundamental issues:
Poor sample efficiency. Policy gradients are inherently on-policy: each batch of data is used for a single gradient step and then discarded. The gradient estimate $\hat{g}$ is only valid for the current policy $\pi_\theta$; once $\theta$ changes, the old data no longer provides an unbiased estimate. Importance sampling can in principle reuse old data, but the trajectory-level importance weights tend to vanish or explode:
$$\frac{P(\tau_t \mid \theta)}{P(\tau_t \mid \theta')} = \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'} )}{\pi_{\theta'}(a_{t'} \mid s_{t'})}$$Even for policies that are only slightly different, many small ratios multiply together to produce extreme values, making the estimator impractical for long horizons.
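A quick simulation makes the explosion concrete. The per-step log-ratio distribution below is invented for illustration, with its mean chosen so that each per-step ratio has expectation exactly 1 (as a ratio of proper policies must):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n_traj = 0.1, 500, 10_000

# Per-step log importance ratios log(pi_theta / pi_theta'), only mildly
# noisy; mean -sigma^2/2 makes E[per-step ratio] = 1 exactly.
log_r = rng.normal(-sigma**2 / 2, sigma, size=(n_traj, T))
w = np.exp(log_r.sum(axis=1))   # trajectory-level importance weights

# E[w] = 1, but the distribution is wildly skewed: the median is
# roughly exp(-T sigma^2 / 2) ~ 0.08, so most weights nearly vanish
# while a handful of trajectories dominate the estimator.
```

Even though each per-step ratio is close to 1, the product of 500 of them concentrates almost all the estimator's mass on a few lucky trajectories.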
Sensitivity to step size. Policy gradient algorithms perform stochastic gradient ascent: $\theta_{k+1} = \theta_k + \alpha_k \hat{g}_k$. If $\alpha_k$ is too large, performance can collapse — the policy may enter a region of parameter space from which recovery is difficult. If too small, progress is unacceptably slow. The fundamental issue is that distance in parameter space does not correspond to distance in policy space: a small change in $\theta$ can produce a dramatic change in $\pi_\theta$ (near a sigmoid saturation point, for instance), and vice versa.
Policy Performance Bounds
To design update rules that respect distance in policy space, the key tool is the performance difference lemma:
$$J(\pi') - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi'}}\!\left[A^\pi(s, a)\right]$$
The performance difference is expressed entirely in terms of the advantage of the old policy $\pi$, evaluated under the discounted state-action distribution $d^{\pi'}$ of the new policy $\pi'$. The catch: it still requires samples from $\pi'$, which is the policy we are trying to optimize.
The Surrogate Objective
To make this tractable, we approximate $d^{\pi'} \approx d^{\pi}$ and define the surrogate objective:
$$L_\pi(\pi') = \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^{\pi} \\ a \sim \pi}}\!\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^\pi(s, a)\right]$$This can be estimated from old-policy trajectories: the importance weight $\pi'/\pi$ applies only at the action level, not across entire trajectories. A formal bound (Achiam et al., 2017) guarantees:
$$\left|J(\pi') - J(\pi) - L_\pi(\pi')\right| \leq C\, \sqrt{\mathbb{E}_{s \sim d^\pi}\!\left[D_{\mathrm{KL}}(\pi' \| \pi)[s]\right]}$$where $C$ depends on the discount factor and the range of the advantage. When $\pi'$ and $\pi$ are close in KL-divergence, the surrogate $L_\pi(\pi')$ faithfully approximates the true performance difference.
Proximal Policy Optimization (PPO)
The bound suggests a natural strategy: maximize $L_\pi(\pi')$ while keeping the KL-divergence small. Proximal Policy Optimization (PPO) implements this in two practical variants.
Variant 1: Adaptive KL Penalty
The first variant adds KL-divergence as a penalty term in an unconstrained problem:
$$\theta_{k+1} = \arg\max_\theta\; L_{\theta_k}(\theta) - \beta_k\, \bar{D}_{\mathrm{KL}}(\theta \| \theta_k)$$where $\bar{D}_{\mathrm{KL}}(\theta \| \theta_k) = \mathbb{E}_{s \sim d^{\pi_k}}[D_{\mathrm{KL}}(\pi_\theta(\cdot|s) \,\|\, \pi_{\theta_k}(\cdot|s))]$, matching the KL direction in the performance bound. The penalty coefficient $\beta_k$ adapts between iterations to approximately enforce a target KL-divergence $\delta$:
- If $\bar{D}_{\mathrm{KL}}(\theta_{k+1} \| \theta_k) \geq 1.5\,\delta$, the step was too large: double $\beta_{k+1} = 2\beta_k$.
- If $\bar{D}_{\mathrm{KL}}(\theta_{k+1} \| \theta_k) \leq \delta / 1.5$, the step was conservative: halve $\beta_{k+1} = \beta_k / 2$.
This adaptive scheme means the initial choice of $\beta_0$ is not critical — the coefficient quickly self-calibrates.
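The adaptation rule is a few lines of code:

```python
def adapt_kl_coef(beta, measured_kl, target_kl):
    """PPO's adaptive-penalty rule: double beta if the measured KL
    overshot 1.5x the target, halve it if it undershot target/1.5,
    otherwise leave it alone."""
    if measured_kl >= 1.5 * target_kl:
        return 2.0 * beta          # step was too large; penalize harder
    if measured_kl <= target_kl / 1.5:
        return beta / 2.0          # step was conservative; loosen up
    return beta
```

Because the rule multiplies rather than adds, a badly chosen $\beta_0$ is corrected within a handful of iterations.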
Variant 2: Clipped Objective
The second and more popular variant avoids computing KL-divergence entirely. It defines the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_k}(a_t \mid s_t)$ and clips it to prevent large policy changes:
$$L_{\theta_k}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{\tau \sim \pi_k}\!\left[\sum_{t=0}^{T} \min\!\Big(r_t(\theta)\, \hat{A}_t^{\pi_k},\; \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t^{\pi_k}\Big)\right]$$where $\epsilon$ is a hyperparameter (typically $\epsilon = 0.2$). The clipping works as follows:
- When $\hat{A}_t > 0$ (the action was better than expected), $r_t(\theta)$ is clipped from above at $1 + \epsilon$. The objective has no incentive to increase the probability ratio beyond $1 + \epsilon$.
- When $\hat{A}_t < 0$ (the action was worse than expected), $r_t(\theta)$ is clipped from below at $1 - \epsilon$. The objective has no incentive to decrease the probability ratio beyond $1 - \epsilon$.
In both cases, the minimum selects the more pessimistic estimate, preventing the optimization from exploiting policy changes large enough to invalidate the surrogate objective.
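The clipped objective itself is essentially a one-liner over a batch of ratios and advantage estimates:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate, averaged over the batch:
    mean( min(r * A, clip(r, 1-eps, 1+eps) * A) )."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()
```

Note that the `min` only ever lowers the objective relative to the unclipped surrogate, which is exactly the pessimism described above: large favorable ratios earn no extra credit, while large unfavorable ones are still fully penalized.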
- Initialize policy parameters $\theta_0$ and clipping threshold $\epsilon$.
- for $k = 0, 1, 2, \ldots$ do
- Collect a set of partial trajectories $\mathcal{D}_k$ under policy $\pi_k = \pi(\theta_k)$.
- Estimate advantages $\hat{A}_t^{\pi_k}$ using any advantage estimation algorithm.
- Compute policy update: $$\theta_{k+1} = \arg\max_\theta\; L_{\theta_k}^{\mathrm{CLIP}}(\theta)$$ by taking $K$ steps of minibatch SGD (via Adam).
- end for
The clipped objective is simpler to implement than the KL-penalty variant and performs at least as well empirically. Crucially, $K$ gradient steps can be taken per batch of data, improving sample efficiency over vanilla policy gradients.
Empirical Performance
PPO has become one of the most widely used deep RL algorithms. It demonstrates strong performance across a range of continuous-control benchmarks (e.g., MuJoCo locomotion tasks) while being considerably simpler to implement and tune than trust-region methods like TRPO. Notably, PPO was a key algorithmic component in the training of ChatGPT and other RLHF-based systems, where it is used to fine-tune language models against a learned reward model of human preferences. Implementation details like reward scaling, learning rate annealing, and advantage normalization can significantly affect performance (Engstrom et al., ICLR 2020).
Monotonic Improvement Theory
The surrogate-plus-bound framework leads to a powerful guarantee. From the performance bound:
$$J(\pi') - J(\pi) \geq L_\pi(\pi') - C\, \sqrt{\mathbb{E}_{s \sim d^\pi}\!\left[D_{\mathrm{KL}}(\pi' \| \pi)[s]\right]}$$Define the update rule as:
$$\pi_{k+1} = \arg\max_{\pi'}\; L_{\pi_k}(\pi') - C\, \sqrt{\mathbb{E}_{s \sim d^{\pi_k}}\!\left[D_{\mathrm{KL}}(\pi' \| \pi_k)[s]\right]}$$Since $\pi_k$ itself is always a feasible point with objective value zero ($L_{\pi_k}(\pi_k) = 0$ and $D_{\mathrm{KL}}(\pi_k \| \pi_k) = 0$), the optimal value must be non-negative, and the performance bound then gives $J(\pi_{k+1}) - J(\pi_k) \geq 0$. Every update is guaranteed to improve (or at least not degrade) the policy: a monotonic improvement guarantee.
In practice, $C$ is quite large when $\gamma$ is close to 1, making the strict update rule overly conservative. PPO's adaptive KL penalty and clipped objective are best understood as practical approximations that trade the strict guarantee for larger, faster improvement steps.
Looking Ahead
This lecture traced a path from the high-variance REINFORCE estimator to the practical and widely deployed PPO algorithm. We saw how baselines reduce variance without introducing bias, how the advantage function provides a natural and centered measure of action quality, and how actor-critic architectures blend the benefits of policy gradient and value-function methods. The theoretical framework of policy performance bounds and surrogate objectives motivates constraining policy updates, which PPO implements via clipping or adaptive KL penalties.
In the next lecture, we will explore advanced policy gradient topics including Trust Region Policy Optimization (TRPO), natural policy gradients, and imitation learning—extending the policy optimization toolkit to settings where expert demonstrations are available and where we want even stronger guarantees on update quality.