Recap: REINFORCE and the Variance Problem
In the previous lecture we derived the REINFORCE algorithm, a Monte Carlo policy gradient method that updates policy parameters by ascending the gradient of expected return. The core update rule uses the likelihood ratio trick to express the policy gradient as an expectation under the current policy:
$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \, G_t^{(i)}$$where $G_t^{(i)} = \sum_{t'=t}^{T-1} r_{t'}^{(i)}$ is the empirical return from time step $t$ onward in the $i$-th sampled trajectory. The temporal structure improvement (only weighting by future rewards rather than the entire trajectory return) already helps, but this estimator remains unbiased yet extremely high-variance. In practice, the raw REINFORCE estimator is too noisy to be useful without further refinement. This lecture introduces three key ideas for taming that variance: baselines, the advantage function, and the actor-critic architecture.
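As a concrete reference point, the reward-to-go estimator above can be sketched in a few lines of NumPy (function names here are illustrative, not from the lecture):

```python
import numpy as np

def rewards_to_go(rewards):
    """G_t = sum_{t'=t}^{T-1} r_{t'} for every t, in one backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        G[t] = running
    return G

def reinforce_gradient(grad_log_probs, rewards):
    """sum_t grad_theta log pi(a_t|s_t) * G_t for a single trajectory.
    grad_log_probs: array of shape (T, d) holding the score vectors."""
    G = rewards_to_go(rewards)
    return (np.asarray(grad_log_probs) * G[:, None]).sum(axis=0)
```

Averaging `reinforce_gradient` over $m$ sampled trajectories gives the estimator in the equation above.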
Introducing a Baseline
The central idea of a baseline is simple: instead of weighting each log-probability gradient by the raw return $G_t$, we subtract a state-dependent function $b(s_t)$ and use $G_t - b(s_t)$ instead. The resulting policy gradient estimator is:
$$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t;\, \theta) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$$
The Baseline Does Not Introduce Bias
For any function $b(s)$ that depends only on the state, the expected gradient remains unchanged — variance reduction with no bias penalty.
Proof
We break the expectation over the full trajectory into nested expectations. Since $b(s_t)$ depends only on $s_t$ (and the history up to $s_t$), and $\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)$ depends on $a_t$ given $s_t$, we can condition on the history up to state $s_t$ and marginalize over the action:
$$\mathbb{E}_\tau\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\, b(s_t)\right] = \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{a_t}\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\right]\right]$$Now we evaluate the inner expectation over $a_t$. Using the likelihood ratio identity in reverse:
$$\mathbb{E}_{a_t}\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\right] = \sum_a \pi(a \mid s_t;\, \theta)\, \frac{\nabla_\theta \pi(a \mid s_t;\, \theta)}{\pi(a \mid s_t;\, \theta)} = \sum_a \nabla_\theta \pi(a \mid s_t;\, \theta) = \nabla_\theta \sum_a \pi(a \mid s_t;\, \theta) = \nabla_\theta\, 1 = 0$$Since the inner expectation is zero regardless of $s_t$, the entire expression vanishes. $\square$
The Optimal Baseline
Since any baseline preserves unbiasedness, we should choose the one that minimizes variance. The variance of a single gradient term is:
$$\text{Var}\!\left[\nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\,(G_t - b(s_t))\right]$$Minimizing over $b$ leads to a weighted least-squares problem. Taking the derivative with respect to $b(s)$ and setting it to zero yields the optimal baseline:
$$b^*(s) = \frac{\mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[\|\nabla_\theta \log \pi(a \mid s;\, \theta)\|^2\; \mathbb{E}[G_t \mid s_t = s,\, a_t = a]\right]}{\mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[\|\nabla_\theta \log \pi(a \mid s;\, \theta)\|^2\right]}$$This is a weighted average of returns, with weights given by the squared score-function magnitudes. In practice those weights are roughly constant across actions, so the optimal baseline simplifies to:
$$b^*(s) \approx \mathbb{E}[G_t \mid s_t = s] = V^\pi(s)$$This gives strong motivation for using $V^\pi(s)$ as the baseline — it is near-optimal and has a clean interpretation.
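To see the baseline's effect numerically, here is a minimal sketch: a two-armed bandit with a sigmoid-parameterized policy, where the exact mean and variance of the single-sample gradient estimator can be enumerated rather than sampled. The reward values and $p = 0.5$ are invented for illustration:

```python
# Two-armed bandit with Bernoulli policy pi(a=1) = p.  For a
# sigmoid-parameterized policy, the score is d/dtheta log pi(a) = a - p.
p = 0.5
rewards = {0: 10.0, 1: 12.0}

def grad_stats(baseline):
    """Exact mean and variance of the one-sample gradient estimator
    g = (a - p) * (r(a) - baseline), enumerated over both actions."""
    probs = {0: 1 - p, 1: p}
    mean = sum(probs[a] * (a - p) * (rewards[a] - baseline) for a in (0, 1))
    second = sum(probs[a] * ((a - p) * (rewards[a] - baseline)) ** 2
                 for a in (0, 1))
    return mean, second - mean ** 2

mean_raw, var_raw = grad_stats(0.0)              # no baseline
b = sum(probs * r for probs, r in ((1 - p, rewards[0]), (p, rewards[1])))
mean_b, var_b = grad_stats(b)                    # b = E[G] = V(s) = 11
# mean_raw == mean_b == 0.5 (the baseline adds no bias),
# while the variance drops from 30.25 to 0.0.
```

The means agree exactly, confirming unbiasedness; only the spread of the estimator changes.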
The Advantage Function
When we use $V^\pi(s)$ as the baseline and $Q^\pi(s, a)$ in place of the Monte Carlo return, the resulting quantity has a name and a deep significance: the advantage function, $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, which measures how much better taking action $a$ in state $s$ is than acting according to the policy on average.
With this definition, the policy gradient takes the elegant form:
$$\nabla_\theta \mathbb{E}_\tau[R] \approx \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t;\, \theta)\, A^\pi(s_t, a_t)\right]$$Recall the definitions of $Q^\pi$ and $V^\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s,\, a_0 = a\right]$$ $$V^\pi(s) = \mathbb{E}_\pi\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\right] = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$$
Vanilla Policy Gradient with Baseline
Putting together temporal structure, the baseline, and the advantage function, we arrive at the "vanilla" policy gradient algorithm — the practical starting point for most policy gradient implementations.
- Initialize policy parameters $\theta$ and baseline $b$.
- for iteration $= 1, 2, \ldots$ do
- Collect a set of trajectories $\{\tau^{(i)}\}$ by executing the current policy $\pi_\theta$.
- for each timestep $t$ in each trajectory $\tau^{(i)}$ do
- Compute the return: $G_t^{(i)} = \sum_{t'=t}^{T-1} r_{t'}^{(i)}$
- Compute the advantage estimate: $\hat{A}_t^{(i)} = G_t^{(i)} - b(s_t^{(i)})$
- end for
- Re-fit the baseline by minimizing $\sum_{i,t} |b(s_t^{(i)}) - G_t^{(i)}|^2$.
- Compute the policy gradient estimate: $\hat{g} = \sum_{i,t} \nabla_\theta \log \pi(a_t^{(i)} \mid s_t^{(i)};\, \theta)\, \hat{A}_t^{(i)}$.
- Update $\theta$ using $\hat{g}$ via SGD or Adam.
- end for
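One way to sketch the baseline-refit and advantage steps of the loop above, assuming a linear baseline $b(s) = w^\top \phi(s)$ (the lecture leaves the baseline's functional form unspecified):

```python
import numpy as np

def refit_baseline(features, returns):
    """Least-squares refit of a linear baseline b(s) = w . phi(s),
    minimizing sum_{i,t} (b(s_t) - G_t)^2 as in the algorithm above.
    features: (N, d) matrix of state features phi(s_t) across all (i, t)."""
    w, *_ = np.linalg.lstsq(features, returns, rcond=None)
    return w

def advantages(features, returns, w):
    """A_hat_t = G_t - b(s_t), using the previous iteration's baseline."""
    return returns - features @ w
```

Note the ordering in the algorithm: advantages are computed with the old baseline before it is re-fit, so the baseline never depends on the very returns it is subtracted from within an iteration.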
This already offers a substantial improvement over bare REINFORCE, but it still relies on Monte Carlo returns $G_t$, which are high-variance because they accumulate randomness over an entire trajectory from time $t$ onward. Can we do better?
Actor-Critic Methods
The Monte Carlo return $G_t$ is an unbiased but high-variance estimate of $Q^\pi(s_t, a_t)$. Just as TD methods offered a lower-variance alternative to MC estimation in value-based RL, we can replace $G_t$ with a learned estimate of the value function. This introduces some bias from function approximation but can dramatically reduce variance. An actor-critic method maintains two interacting components:
- The actor: a parameterized policy $\pi_\theta(a \mid s)$ that selects actions.
- The critic: a learned value function $V_w(s)$ (or $Q_w(s, a)$) that evaluates how good the current state (or state-action pair) is.
TD Error as Advantage Estimate
The simplest actor-critic uses the one-step TD error as an estimate of the advantage:
$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$This is a biased but low-variance estimate of $A^\pi(s_t, a_t)$. When the critic is perfect ($V_w = V^\pi$), the expectation recovers the advantage exactly:
$$\mathbb{E}[\delta_t \mid s_t, a_t] = \mathbb{E}[r_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t] - V^\pi(s_t) = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$$In practice, function approximation introduces some bias, but the variance reduction is usually worth it.
- Initialize policy parameters $\theta$ and critic parameters $w$.
- for each episode do
- for $t = 0, 1, \ldots, T-1$ do
- Take action $a_t \sim \pi_\theta(\cdot \mid s_t)$, observe $r_t, s_{t+1}$.
- Compute TD error: $\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$.
- Update critic: $w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(s_t)$.
- Update actor: $\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
- end for
- end for
The critic uses the TD error to adjust its value estimates, while the actor uses the same TD error as a stand-in for the advantage to adjust the policy. This interleaving of policy improvement and policy evaluation is characteristic of all actor-critic methods.
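A minimal tabular sketch of one inner-loop step, assuming a softmax actor with per-state logits `theta[s, :]` (the lecture does not fix a parameterization; for a softmax, $\nabla_\theta \log \pi(a \mid s)$ is one-hot$(a) - \pi(\cdot \mid s)$):

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, gamma, alpha_w, alpha_theta):
    """One step of the TD actor-critic loop: tabular critic V[s],
    softmax actor with logits theta[s, :].  Updates both in place."""
    delta = r + gamma * V[s_next] - V[s]        # TD error
    V[s] += alpha_w * delta                     # critic update
    pi = np.exp(theta[s] - theta[s].max())      # softmax over actions
    pi /= pi.sum()
    grad_log = -pi                              # grad log pi = one_hot(a) - pi
    grad_log[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log  # actor update
    return delta
```

The same scalar `delta` drives both updates, exactly as in the pseudocode: the critic uses it as a value-prediction error, the actor as an advantage estimate.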
N-Step Advantage Estimators
The one-step TD error sits at one extreme of a spectrum of advantage estimators. At the other extreme is the full Monte Carlo return. In between lie the $n$-step estimators, which blend bootstrapping and sampling:
$$\hat{R}_t^{(1)} = r_t + \gamma V(s_{t+1})$$ $$\hat{R}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$$ $$\vdots$$ $$\hat{R}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$Subtracting the baseline $V(s_t)$ from each gives the corresponding advantage estimators:
$$\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t) \quad \text{(TD error, low variance, higher bias)}$$ $$\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t) \quad \text{(MC, high variance, low bias)}$$The one-step estimator $\hat{A}_t^{(1)}$ has low variance because it depends on only one step of randomness, but it carries bias from the critic's approximation error. The Monte Carlo estimator $\hat{A}_t^{(\infty)}$ is unbiased (no bootstrapping) but has high variance because it accumulates randomness over the entire trajectory. Intermediate values of $n$ offer a smooth trade-off, and in practice methods like Generalized Advantage Estimation (GAE) take a weighted combination across all $n$-step returns.
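The whole family of $n$-step advantage estimators can be computed in a single pass; this sketch assumes `values` holds $V(s_0), \dots, V(s_T)$ with the terminal bootstrap value included (often $0$ for a finished episode):

```python
import numpy as np

def n_step_advantage(rewards, values, n, gamma):
    """A_hat_t^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k}
                     + gamma^n V(s_{t+n}) - V(s_t),
    truncating n at the end of the trajectory.
    rewards: length T; values: length T+1 (terminal bootstrap included)."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(n, T - t)                      # truncate near the end
        ret = sum(gamma ** k * rewards[t + k] for k in range(horizon))
        ret += gamma ** horizon * values[t + horizon]  # bootstrap term
        adv[t] = ret - values[t]                       # subtract the baseline
    return adv
```

With `n=1` this reduces to the TD error; letting `n` exceed the trajectory length recovers the Monte Carlo estimator. GAE would exponentially average these over all `n` with weight $\lambda^{n-1}$.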
Problems with Vanilla Policy Gradients
Even with baselines and actor-critic enhancements, vanilla policy gradient methods suffer from two fundamental issues:
Poor sample efficiency. Policy gradients are inherently on-policy: each batch of data is used for a single gradient step and then discarded. The gradient estimate $\hat{g}$ is only valid for the current policy $\pi_\theta$; once $\theta$ changes, the old data no longer provides an unbiased estimate. Importance sampling can in principle reuse old data, but the trajectory-level importance weights tend to vanish or explode:
$$\frac{P(\tau_t \mid \theta)}{P(\tau_t \mid \theta')} = \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'} )}{\pi_{\theta'}(a_{t'} \mid s_{t'})}$$Even for policies that are only slightly different, many small ratios multiply together to produce extreme values, making the estimator impractical for long horizons.
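A quick simulation makes the explosion concrete. The per-step log-ratio distribution below is invented for illustration, with its mean chosen so that each per-step ratio has expectation exactly 1 (as a ratio of proper policies must):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n_traj = 0.1, 500, 10_000

# Per-step log importance ratios log(pi_theta / pi_theta'), only mildly
# noisy; mean -sigma^2/2 makes E[per-step ratio] = 1 exactly.
log_r = rng.normal(-sigma**2 / 2, sigma, size=(n_traj, T))
w = np.exp(log_r.sum(axis=1))   # trajectory-level importance weights

# E[w] = 1, but the distribution is wildly skewed: the median is
# roughly exp(-T sigma^2 / 2) ~ 0.08, so most weights nearly vanish
# while a handful of trajectories dominate the estimator.
```

Even though each per-step ratio is close to 1, the product of 500 of them concentrates almost all the estimator's mass on a few lucky trajectories.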
Sensitivity to step size. Policy gradient algorithms perform stochastic gradient ascent: $\theta_{k+1} = \theta_k + \alpha_k \hat{g}_k$. If $\alpha_k$ is too large, performance can collapse — the policy may enter a region of parameter space from which recovery is difficult. If too small, progress is unacceptably slow. The fundamental issue is that distance in parameter space does not correspond to distance in policy space: a small change in $\theta$ can produce a dramatic change in $\pi_\theta$ (near a sigmoid saturation point, for instance), and vice versa.
Policy Performance Bounds
To design update rules that respect distance in policy space, the key tool is the performance difference lemma:
$$J(\pi') - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi'}}\!\left[A^\pi(s, a)\right]$$
The performance difference is expressed entirely in terms of the advantage of the old policy $\pi$, evaluated under the discounted state-action distribution $d^{\pi'}$ of the new policy $\pi'$. The catch: it still requires samples from $\pi'$, which is the policy we are trying to optimize.
The Surrogate Objective
To make this tractable, we approximate $d^{\pi'} \approx d^{\pi}$ and define the surrogate objective:
$$L_\pi(\pi') = \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s \sim d^{\pi} \\ a \sim \pi}}\!\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^\pi(s, a)\right]$$This can be estimated from old-policy trajectories: the importance weight $\pi'/\pi$ applies only at the action level, not across entire trajectories. A formal bound (Achiam et al., 2017) guarantees:
$$\left|J(\pi') - J(\pi) - L_\pi(\pi')\right| \leq C\, \sqrt{\mathbb{E}_{s \sim d^\pi}\!\left[D_{\mathrm{KL}}(\pi' \| \pi)[s]\right]}$$where $C$ depends on the discount factor and the range of the advantage. When $\pi'$ and $\pi$ are close in KL-divergence, the surrogate $L_\pi(\pi')$ faithfully approximates the true performance difference.
Proximal Policy Optimization (PPO)
The bound suggests a natural strategy: maximize $L_\pi(\pi')$ while keeping the KL-divergence small. Proximal Policy Optimization (PPO) implements this in two practical variants.
Variant 1: Adaptive KL Penalty
The first variant adds KL-divergence as a penalty term in an unconstrained problem:
$$\theta_{k+1} = \arg\max_\theta\; L_{\theta_k}(\theta) - \beta_k\, \bar{D}_{\mathrm{KL}}(\theta \| \theta_k)$$where $\bar{D}_{\mathrm{KL}}(\theta \| \theta_k) = \mathbb{E}_{s \sim d^{\pi_k}}[D_{\mathrm{KL}}(\pi_\theta(\cdot|s) \,\|\, \pi_{\theta_k}(\cdot|s))]$, matching the KL direction in the performance bound. The penalty coefficient $\beta_k$ adapts between iterations to approximately enforce a target KL-divergence $\delta$:
- If $\bar{D}_{\mathrm{KL}}(\theta_{k+1} \| \theta_k) \geq 1.5\,\delta$, the step was too large: double $\beta_{k+1} = 2\beta_k$.
- If $\bar{D}_{\mathrm{KL}}(\theta_{k+1} \| \theta_k) \leq \delta / 1.5$, the step was conservative: halve $\beta_{k+1} = \beta_k / 2$.
This adaptive scheme means the initial choice of $\beta_0$ is not critical — the coefficient quickly self-calibrates.
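The adaptation rule is a few lines of code:

```python
def adapt_kl_coef(beta, measured_kl, target_kl):
    """PPO's adaptive-penalty rule: double beta if the measured KL
    overshot 1.5x the target, halve it if it undershot target/1.5,
    otherwise leave it alone."""
    if measured_kl >= 1.5 * target_kl:
        return 2.0 * beta          # step was too large; penalize harder
    if measured_kl <= target_kl / 1.5:
        return beta / 2.0          # step was conservative; loosen up
    return beta
```

Because the rule multiplies rather than adds, a badly chosen $\beta_0$ is corrected within a handful of iterations.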
Variant 2: Clipped Objective
The second and more popular variant avoids computing KL-divergence entirely. It defines the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_k}(a_t \mid s_t)$ and clips it to prevent large policy changes:
$$L_{\theta_k}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{\tau \sim \pi_k}\!\left[\sum_{t=0}^{T} \min\!\Big(r_t(\theta)\, \hat{A}_t^{\pi_k},\; \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t^{\pi_k}\Big)\right]$$where $\epsilon$ is a hyperparameter (typically $\epsilon = 0.2$). The clipping works as follows:
- When $\hat{A}_t > 0$ (the action was better than expected), $r_t(\theta)$ is clipped from above at $1 + \epsilon$. The objective has no incentive to increase the probability ratio beyond $1 + \epsilon$.
- When $\hat{A}_t < 0$ (the action was worse than expected), $r_t(\theta)$ is clipped from below at $1 - \epsilon$. The objective has no incentive to decrease the probability ratio beyond $1 - \epsilon$.
In both cases, the minimum selects the more pessimistic estimate, preventing the optimization from exploiting policy changes large enough to invalidate the surrogate objective.
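The clipped objective itself is essentially a one-liner over a batch of ratios and advantage estimates:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate, averaged over the batch:
    mean( min(r * A, clip(r, 1-eps, 1+eps) * A) )."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()
```

Note that the `min` only ever lowers the objective relative to the unclipped surrogate, which is exactly the pessimism described above: large favorable ratios earn no extra credit, while large unfavorable ones are still fully penalized.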
- Initialize policy parameters $\theta_0$ and clipping threshold $\epsilon$.
- for $k = 0, 1, 2, \ldots$ do
- Collect a set of partial trajectories $\mathcal{D}_k$ under policy $\pi_k = \pi(\theta_k)$.
- Estimate advantages $\hat{A}_t^{\pi_k}$ using any advantage estimation algorithm.
- Compute policy update: $$\theta_{k+1} = \arg\max_\theta\; L_{\theta_k}^{\mathrm{CLIP}}(\theta)$$ by taking $K$ steps of minibatch SGD (via Adam).
- end for
The clipped objective is simpler to implement than the KL-penalty variant and performs at least as well empirically. Crucially, $K$ gradient steps can be taken per batch of data, improving sample efficiency over vanilla policy gradients.
Empirical Performance
PPO has become one of the most widely used deep RL algorithms. It demonstrates strong performance across a range of continuous-control benchmarks (e.g., MuJoCo locomotion tasks) while being considerably simpler to implement and tune than trust-region methods like TRPO. Notably, PPO was a key algorithmic component in the training of ChatGPT and other RLHF-based systems, where it is used to fine-tune language models against a learned reward model of human preferences. Implementation details like reward scaling, learning rate annealing, and advantage normalization can significantly affect performance (Engstrom et al., ICLR 2020).
Monotonic Improvement Theory
The surrogate-plus-bound framework leads to a powerful guarantee. From the performance bound:
$$J(\pi') - J(\pi) \geq L_\pi(\pi') - C\, \sqrt{\mathbb{E}_{s \sim d^\pi}\!\left[D_{\mathrm{KL}}(\pi' \| \pi)[s]\right]}$$Define the update rule as:
$$\pi_{k+1} = \arg\max_{\pi'}\; L_{\pi_k}(\pi') - C\, \sqrt{\mathbb{E}_{s \sim d^{\pi_k}}\!\left[D_{\mathrm{KL}}(\pi' \| \pi_k)[s]\right]}$$Since $\pi_k$ itself is always a feasible point with objective value zero ($L_{\pi_k}(\pi_k) = 0$ and $D_{\mathrm{KL}}(\pi_k \| \pi_k) = 0$), the optimal value must be non-negative, and the performance bound then gives $J(\pi_{k+1}) - J(\pi_k) \geq 0$. Every update is guaranteed to improve (or at least not degrade) the policy: a monotonic improvement guarantee.
In practice, $C$ is quite large when $\gamma$ is close to 1, making the strict update rule overly conservative. PPO's adaptive KL penalty and clipped objective are best understood as practical approximations that trade the strict guarantee for larger, faster improvement steps.
Looking Ahead
This lecture traced a path from the high-variance REINFORCE estimator to the practical and widely deployed PPO algorithm. We saw how baselines reduce variance without introducing bias, how the advantage function provides a natural and centered measure of action quality, and how actor-critic architectures blend the benefits of policy gradient and value-function methods. The theoretical framework of policy performance bounds and surrogate objectives motivates constraining policy updates, which PPO implements via clipping or adaptive KL penalties.
In the next lecture, we will explore advanced policy gradient topics including Trust Region Policy Optimization (TRPO), natural policy gradients, and imitation learning—extending the policy optimization toolkit to settings where expert demonstrations are available and where we want even stronger guarantees on update quality.