Lecture 5

Policy Gradient I

Moving from value-based to policy-based methods: the policy gradient theorem and REINFORCE.

Topics: Policy Gradient · REINFORCE · Score Function · Likelihood Ratio

From Value-Based Methods to Policy Search

In the preceding lectures we developed a suite of value-based methods for reinforcement learning—tabular TD, Monte Carlo, function approximation, and deep Q-networks (DQN). The core idea was always the same: learn $Q^\pi(s, a)$ or $V^\pi(s)$, then extract a policy implicitly (e.g., act $\epsilon$-greedily with respect to the learned Q-values). This approach yielded impressive results, including DQN's Atari breakthrough (Mnih et al., 2015), but it also revealed fundamental limitations.

Recall the "deadly triad" that plagues value-based methods with function approximation: (1) bootstrapping, using our own value estimates as targets; (2) function approximation, which can magnify errors; and (3) off-policy learning (as in Q-learning), where the data distribution differs from the policy being evaluated. When all three are present, convergence is no longer guaranteed, and the system can oscillate or diverge.

DQN: Successes and Stabilizing Tricks

DQN addressed two pressing problems with naive Q-learning on neural networks. First, experience replay stores transitions $(s, a, r, s')$ in a buffer and samples random mini-batches, breaking temporal correlations between consecutive samples. Second, fixed Q-targets maintain a separate weight copy $w^-$ updated only every $C$ steps, stabilizing the moving target problem. The DQN update:

$$\Delta w = \alpha \bigl(r + \gamma \max_{a'} \hat{Q}(s', a'; w^-) - \hat{Q}(s, a; w)\bigr) \nabla_w \hat{Q}(s, a; w)$$
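As a minimal sketch of this update, assuming linear function approximation $\hat{Q}(s, a; w) = \phi(s, a)^\top w$ (so $\nabla_w \hat{Q}(s, a; w) = \phi(s, a)$); the feature map `phi` and the action set are placeholders:

```python
import numpy as np

def dqn_update(w, w_target, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step with fixed Q-targets, for linear Q(s,a;w) = phi(s,a) @ w."""
    q_sa = phi(s, a) @ w
    # The bootstrap target uses the frozen weights w_target, not the live weights w.
    target = r + gamma * max(phi(s_next, ap) @ w_target for ap in actions)
    td_error = target - q_sa
    return w + alpha * td_error * phi(s, a)   # grad_w Q(s,a;w) = phi(s,a) for linear Q
```

In a full DQN loop, `w_target` would be overwritten with `w` every $C$ steps, and transitions would be sampled from a replay buffer rather than consumed in order.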

Ablation studies on Atari showed experience replay was by far the most important ingredient—a dramatic boost across nearly all games. Fixed Q-targets help further with stability. Nevertheless, value-based methods face structural limitations that motivate a fundamentally different approach.

Why search directly over policies rather than learning value functions? Policy-based methods have better convergence properties, handle high-dimensional and continuous action spaces naturally (no $\max$ over actions is required), and can learn stochastic policies. The trade-offs: they typically converge to a local rather than a global optimum, and naive gradient estimates have high variance.

Key Insight
The aliased gridworld. Consider a gridworld where two distinct states look identical to the agent (same features). An optimal deterministic policy must take the same action in both aliased states—which can trap the agent bouncing back and forth in a corridor. A stochastic policy can assign equal probability to going left and right, letting the agent escape and reach the goal. Value-based methods produce near-deterministic policies and fail here; policy-based methods can learn the optimal stochastic solution.

Policy Parameterization

In policy-based RL, we directly parameterize the policy with a vector $\theta$. The policy $\pi_\theta(a \mid s)$ gives the probability of selecting action $a$ in state $s$, and our goal is to find $\theta$ that maximizes the policy's value.

Definition
Parameterized Policy. A stochastic policy $\pi_\theta(a \mid s) = P[a \mid s;\, \theta]$ maps states to probability distributions over actions, where $\theta \in \mathbb{R}^n$ is a tunable parameter vector. We require $\pi_\theta$ to be differentiable with respect to $\theta$ wherever it is non-zero.

This contrasts with value-based RL, where we parameterized $\hat{Q}_w(s, a)$ and derived the policy indirectly. The taxonomy of RL agents now has three categories: value-based (learned value function, implicit policy), policy-based (learned policy, no value function), and actor-critic (both a learned policy and a learned value function).

Softmax Policy (Discrete Actions)

For discrete action spaces, a natural parameterization is the softmax policy. Each state-action pair gets a score $\phi(s, a)^\top \theta$, converted to probabilities via softmax:

$$\pi_\theta(s, a) = \frac{e^{\phi(s, a)^\top \theta}}{\sum_{a'} e^{\phi(s, a')^\top \theta}}$$

The score function (defined precisely below) for the softmax policy has an elegant form:

$$\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$$

The gradient points toward the features of the chosen action minus the expected features under the current policy. Actions whose features deviate most from the average get the strongest gradient signal.
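This identity is easy to verify numerically. A sketch with an arbitrary feature matrix, checking the closed-form score against central finite differences of $\log \pi_\theta$:

```python
import numpy as np

def softmax_policy(theta, phis):
    """phis: (num_actions, d) feature matrix for one state; returns action probs."""
    logits = phis @ theta
    logits -= logits.max()                  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def softmax_score(theta, phis, a):
    """Score function: grad_theta log pi(a|s) = phi(s,a) - E_pi[phi(s,.)]."""
    p = softmax_policy(theta, phis)
    return phis[a] - p @ phis
```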

Gaussian Policy (Continuous Actions)

In continuous action spaces, a Gaussian policy is the natural choice. The mean is a linear function of state features, $\mu(s) = \phi(s)^\top \theta$, and the variance $\sigma^2$ can be fixed or learned. Actions are sampled as:

$$a \sim \mathcal{N}(\mu(s),\, \sigma^2)$$

The score function for the Gaussian policy takes the form:

$$\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2}$$

The gradient is proportional to how far the sampled action deviates from the mean, scaled by the feature vector. If a surprisingly good action was sampled far from the mean, the gradient pushes the mean toward it. More expressive approximators like deep neural networks can also represent the policy—what matters is that we can compute $\nabla_\theta \log \pi_\theta(a \mid s)$.
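The Gaussian score can likewise be checked against finite differences of the log-density; the feature vector and parameters below are arbitrary:

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma=1.0):
    """grad_theta log N(a; mu = phi_s @ theta, sigma^2) = (a - mu) phi_s / sigma^2."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma ** 2
```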

The Policy Gradient Objective

Policy-based RL is an optimization problem. We seek $\theta$ that maximizes the policy value:

$$J(\theta) = V(s_0, \theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} R(s_t, a_t) \;\middle|\; \pi_\theta,\, s_0\right]$$

where the expectation is over states and actions visited under $\pi_\theta$ from $s_0$. Equivalently, summing over complete trajectories $\tau = (s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$:

$$J(\theta) = \sum_{\tau} P(\tau;\, \theta)\, R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

where $P(\tau;\, \theta)$ is the probability of trajectory $\tau$ under $\pi_\theta$, and $R(\tau) = \sum_{t=0}^{T-1} R(s_t, a_t)$ is the total reward along that trajectory. We optimize this objective via gradient ascent:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$$

The central question: how do we compute $\nabla_\theta J(\theta)$? The value depends on $\theta$ through a chain of policy choices, stochastic transitions, and accumulated rewards over an entire trajectory. Computing this gradient analytically seems daunting—but the likelihood ratio trick provides an elegant solution.

The Likelihood Ratio Trick

The key insight behind policy gradients is the log-derivative trick (also called the likelihood ratio trick or REINFORCE trick). Starting from the gradient of the objective:

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} P(\tau;\, \theta)\, R(\tau) = \sum_{\tau} \nabla_\theta P(\tau;\, \theta)\, R(\tau)$$

We multiply and divide by $P(\tau;\, \theta)$ to introduce the identity $\nabla_\theta P(\tau;\, \theta) = P(\tau;\, \theta)\, \nabla_\theta \log P(\tau;\, \theta)$:

$$\nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\, \theta) \cdot \frac{\nabla_\theta P(\tau;\, \theta)}{P(\tau;\, \theta)} \cdot R(\tau) = \sum_{\tau} P(\tau;\, \theta)\, R(\tau)\, \nabla_\theta \log P(\tau;\, \theta)$$

This is now an expectation under the trajectory distribution $P(\tau;\, \theta)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau)\, \nabla_\theta \log P(\tau;\, \theta)\right]$$

which we can approximate by sampling $m$ trajectories from $\pi_\theta$:

$$\nabla_\theta J(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)})\, \nabla_\theta \log P(\tau^{(i)};\, \theta)$$
Key Insight
Why the log-derivative trick works. Consider the generic form $\hat{g}_i = f(x_i)\, \nabla_\theta \log p(x_i \mid \theta)$. The quantity $f(x)$ measures how good sample $x$ is. Moving in the direction $\hat{g}_i$ increases the log-probability of the sample in proportion to how good it is. This estimator is valid even when $f(x)$ is discontinuous or unknown, and even when the sample space is discrete. We never need to differentiate the reward function or the dynamics—only the policy.
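A small simulation illustrates this. For a three-outcome softmax distribution (the values of $f$ and $\theta$ below are arbitrary), the Monte Carlo score-function estimate of $\nabla_\theta \mathbb{E}[f(x)]$ matches the exact gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.array([1.0, 5.0, 2.0])               # "goodness" f(x) of three outcomes

def probs(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.array([0.2, -0.1, 0.4])
p = probs(theta)

# Exact gradient: grad E[f] = sum_x p(x) f(x) grad log p(x), using the softmax score.
exact = sum(p[x] * f[x] * (np.eye(3)[x] - p) for x in range(3))

# Likelihood ratio Monte Carlo estimate from samples x ~ p_theta.
xs = rng.choice(3, size=200_000, p=p)
est = (f[xs][:, None] * (np.eye(3)[xs] - p)).mean(axis=0)
```

Note that `f` enters only through its sampled values; the estimator never differentiates it.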

Decomposing the Trajectory Probability

Next we decompose $\nabla_\theta \log P(\tau;\, \theta)$ and show that we do not need the environment dynamics. The trajectory probability factors as:

$$P(\tau;\, \theta) = \mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

where $\mu(s_0)$ is the initial state distribution. Taking the log and then the gradient with respect to $\theta$:

$$\nabla_\theta \log P(\tau;\, \theta) = \underbrace{\nabla_\theta \log \mu(s_0)}_{= 0} + \sum_{t=0}^{T-1} \left[\nabla_\theta \log \pi_\theta(a_t \mid s_t) + \underbrace{\nabla_\theta \log P(s_{t+1} \mid s_t, a_t)}_{= 0}\right]$$

The initial state distribution and transition dynamics do not depend on $\theta$, so their gradients vanish:

$$\nabla_\theta \log P(\tau;\, \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

A remarkable result: the gradient of the log-trajectory-probability depends only on the policy, not on the dynamics model. Policy gradient methods are inherently model-free.

Definition
Score Function. The score function of a parameterized policy is $\nabla_\theta \log \pi_\theta(a \mid s)$—the gradient of the log-probability of taking action $a$ in state $s$. It measures how sensitive the policy's log-likelihood is to changes in $\theta$.

Substituting back into our gradient estimate:

$$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$$

This is the likelihood ratio policy gradient estimator: unbiased, model-free, and applicable to any differentiable policy—but it can suffer from high variance, which we address shortly.

The Policy Gradient Theorem

The derivation above used the trajectory-level formulation for episodic tasks. The policy gradient theorem gives a more general statement that applies across multiple objective functions.

Theorem
Policy Gradient Theorem. For any differentiable policy $\pi_\theta(s, a)$ and any of the standard policy objectives—episodic reward $J_1$, average reward per time step $J_{\text{avR}}$, or average value $\frac{1}{1-\gamma} J_{\text{avV}}$—the policy gradient takes the form: $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$
Proof: Policy Gradient Theorem (Episodic Case)

We sketch the proof for the episodic case with discrete states, following Sutton and Barto (2018), Section 13.2. Start from the value of the initial state:

$$J(\theta) = V^{\pi_\theta}(s_0) = \sum_a \pi_\theta(a \mid s_0)\, Q^{\pi_\theta}(s_0, a)$$

Taking the gradient:

$$\nabla_\theta V^{\pi_\theta}(s_0) = \sum_a \left[\nabla_\theta \pi_\theta(a \mid s_0)\, Q^{\pi_\theta}(s_0, a) + \pi_\theta(a \mid s_0)\, \nabla_\theta Q^{\pi_\theta}(s_0, a)\right]$$

Now expand $Q^{\pi_\theta}(s_0, a) = R(s_0, a) + \gamma \sum_{s'} P(s' \mid s_0, a)\, V^{\pi_\theta}(s')$. Since $R(s_0, a)$ and $P(s' \mid s_0, a)$ do not depend on $\theta$, we have:

$$\nabla_\theta Q^{\pi_\theta}(s_0, a) = \gamma \sum_{s'} P(s' \mid s_0, a)\, \nabla_\theta V^{\pi_\theta}(s')$$

Substituting and unrolling the recursion introduces a sum over all states weighted by the probability of reaching each from $s_0$ under $\pi_\theta$. Defining the (unnormalized) discounted state visitation frequency $d^{\pi_\theta}(s) = \sum_{t=0}^{T-1} \gamma^t P(s_t = s \mid s_0, \pi_\theta)$, the result telescopes to:

$$\nabla_\theta J(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)$$

Applying the log-derivative identity $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$, and recognizing the resulting double sum as an expectation under the state-action distribution induced by $\pi_\theta$, we obtain:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right] \qquad \blacksquare$$

The gradient depends on $Q^{\pi_\theta}$, the action-value function of the current policy—no access to environment dynamics is needed, only samples generated by the policy itself.

Exploiting Temporal Structure: REINFORCE

The basic likelihood ratio estimator weights the entire trajectory return $R(\tau)$ by the sum of all score functions along the trajectory. This means the reward at $t=0$ influences the gradient at $t=T-1$, even though a past reward cannot be affected by a future action—introducing unnecessary variance.

We can do better by exploiting causal structure. The gradient of the expected reward at time step $t'$ depends only on policy choices up to $t'$:

$$\nabla_\theta \mathbb{E}[r_{t'}] = \mathbb{E}\!\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

Summing over all time steps and swapping the order of summation, each policy gradient at time $t$ gets paired with only future rewards from $t$ onward:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T-1} r_{t'}\right] = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{t'=t}^{T-1} r_{t'}$ is the return (undiscounted, i.e. $\gamma = 1$, appropriate for the finite-horizon episodic setting) from time step $t$. This gives us the REINFORCE algorithm.

REINFORCE (Monte Carlo Policy Gradient)
  1. Initialize policy parameters $\theta$ arbitrarily.
  2. for each episode do:
  3. Generate trajectory $\{s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1}\} \sim \pi_\theta$
  4. for $t = 0$ to $T-1$ do:
  5. Compute return: $G_t = \sum_{t'=t}^{T-1} r_{t'}$
  6. Update parameters: $\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$
  7. end for
  8. end for
  9. return $\theta$
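Step 5 of the loop above need not be recomputed from scratch for each $t$: all reward-to-go returns follow from one backward pass. A sketch (with an optional discount factor, defaulting to the undiscounted case used here):

```python
def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum_{t' >= t} gamma**(t' - t) * r_{t'}, computed in O(T) by a backward pass."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]
```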

REINFORCE is conceptually simple: roll out complete episodes, and for each action, nudge the parameters to increase its log-probability in proportion to the future reward that followed. Good actions (high $G_t$) get reinforced; poor actions get suppressed. The estimator is unbiased but can have high variance, since $G_t$ is a single Monte Carlo sample of the expected return.
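A minimal end-to-end sketch: REINFORCE with a softmax policy on a hypothetical two-armed bandit (episodes are one step long, so $G_t$ is just the sampled reward; the payoffs 1.0 and 0.2 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def pull(a):
    # Hypothetical bandit: arm 0 pays 1.0 on average, arm 1 pays 0.2.
    return rng.normal(loc=(1.0, 0.2)[a], scale=0.1)

theta = np.zeros(2)
alpha = 0.05
for episode in range(2000):
    p = pi(theta)
    a = rng.choice(2, p=p)
    G = pull(a)                               # one-step episode: G_0 = r_0
    score = np.eye(2)[a] - p                  # softmax score function
    theta += alpha * G * score                # REINFORCE update
```

After training, the policy should concentrate most of its probability on the better arm.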

Variance Reduction with Baselines

The raw REINFORCE estimator, while unbiased, can be very noisy. A powerful technique for reducing variance without introducing bias is to subtract a baseline $b(s)$ from the return:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(G_t - b(s_t)\right)\right]$$

Any baseline that depends only on the state (not on the action) leaves the gradient unbiased.

Proof: Baselines Do Not Introduce Bias

We need to show that $\mathbb{E}_\tau[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)] = 0$. Decompose the expectation by conditioning on the history up to time $t$:

$$\mathbb{E}_\tau[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)] = \mathbb{E}_{s_{0:t}, a_{0:t-1}}\!\left[b(s_t)\, \mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]\right]$$

Now evaluate the inner expectation over $a_t$:

$$\mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)] = \sum_a \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} = \sum_a \nabla_\theta \pi_\theta(a \mid s_t) = \nabla_\theta \sum_a \pi_\theta(a \mid s_t) = \nabla_\theta\, 1 = 0$$

Since the inner expectation vanishes, the entire expression is zero regardless of the choice of $b(s_t)$. $\blacksquare$
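The heart of the proof — that the score function has zero mean under the policy — can be checked numerically for a softmax policy with arbitrary logits:

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])           # arbitrary softmax logits
e = np.exp(theta - theta.max())
p = e / e.sum()

scores = np.eye(3) - p                       # row a: grad_theta log pi(a)
expected_score = p @ scores                  # E_{a~pi}[grad log pi(a)]
```

Because the score averages to zero under $\pi_\theta$, multiplying it by any state-dependent constant $b(s)$ contributes nothing to the expected gradient.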

A near-optimal baseline is the expected return from state $s_t$, which is simply the value function:

$$b(s_t) \approx \mathbb{E}[G_t \mid s_t] = V^{\pi_\theta}(s_t)$$

By subtracting the baseline, we convert the return into an advantage—how much better (or worse) the chosen action turned out compared to average performance from that state. Above-average actions get reinforced; below-average actions get suppressed. This eliminates the problem where all returns are positive (or all negative), pushing all log-probabilities in the same direction regardless of action quality.

Definition
Advantage Function. The advantage of taking action $a$ in state $s$ under policy $\pi$ is: $$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$ It measures how much better action $a$ is compared to the average action under $\pi$. By construction, $\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$.

The Vanilla Policy Gradient Algorithm

Putting together the score function estimator, temporal structure, and baselines, we arrive at the practical "vanilla" policy gradient algorithm:

Vanilla Policy Gradient with Baseline
  1. Initialize policy parameters $\theta$ and baseline $b$ (e.g., a value function approximation).
  2. for iteration $= 1, 2, \ldots$ do:
  3. Collect a set of trajectories $\{\tau^{(i)}\}$ by executing the current policy $\pi_\theta$.
  4. for each timestep $t$ in each trajectory $\tau^{(i)}$ do:
  5. Compute return: $G_t^{(i)} = \sum_{t'=t}^{T-1} r_{t'}^{(i)}$
  6. Compute advantage estimate: $\hat{A}_t^{(i)} = G_t^{(i)} - b(s_t^{(i)})$
  7. end for
  8. Re-fit baseline by minimizing $\sum_i \sum_t \bigl(b(s_t^{(i)}) - G_t^{(i)}\bigr)^2$.
  9. Compute policy gradient estimate: $\hat{g} = \frac{1}{m} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \hat{A}_t^{(i)}$
  10. Update $\theta$ using $\hat{g}$ with SGD or Adam.
  11. end for

Toward Actor-Critic Methods

In the vanilla algorithm above, we used the Monte Carlo return $G_t$ as our estimate of $Q^{\pi_\theta}(s_t, a_t)$—unbiased but high-variance, since it depends on the entire stochastic future of a single trajectory. Just as with TD versus MC for value estimation, we can trade bias for variance by introducing bootstrapping: using a learned value function to estimate part of the return.

The general policy gradient estimator is:

$$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, \hat{A}_t^{(i)}$$

where $\hat{A}_t$ can be any of a family of advantage estimators. Using an estimated value function $V_w(s)$ as both the baseline and a bootstrap target, we get $n$-step advantage estimators:

$$\hat{A}_t^{(1)} = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$

$$\hat{A}_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_w(s_{t+n}) - V_w(s_t)$$

$$\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1} - V_w(s_t)$$

The 1-step estimator has low variance but high bias (it relies heavily on $V_w$'s accuracy). The $\infty$-step (MC) estimator has zero bias but high variance. Intermediate $n$ provides a smooth trade-off.
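A sketch of the $n$-step advantage computation for one recorded episode, assuming value estimates $V_w(s_0), \ldots, V_w(s_T)$ are available (with $V_w(s_T) = 0$ at termination):

```python
import numpy as np

def n_step_advantages(rewards, values, n, gamma=0.99):
    """n-step advantage estimates for one recorded episode.

    rewards: [r_0, ..., r_{T-1}]; values: [V(s_0), ..., V(s_T)] with
    values[T] = 0 for a terminal state. If n exceeds the remaining episode
    length, the estimate falls back to the Monte Carlo return.
    """
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(t + n, T)
        # n-step return: discounted rewards plus a bootstrapped tail value.
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        g += gamma ** (horizon - t) * values[horizon]
        adv[t] = g - values[t]
    return adv
```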

Key Insight
The bias-variance knob. The choice of $n$ in the $n$-step estimator controls a fundamental trade-off. Short-horizon estimates ($n=1$) bootstrap aggressively, producing low-variance but potentially biased gradient estimates. Long-horizon estimates ($n = \infty$) use Monte Carlo returns, which are unbiased but noisy. In practice, intermediate values or weighted blends (like GAE/$\lambda$-returns) often work best.

When we maintain both a parameterized policy $\pi_\theta$ (the actor) and a parameterized value function $V_w$ (the critic), we have an actor-critic method. The critic estimates $V^{\pi_\theta}$; the actor improves the policy using the critic's estimates in the gradient. Popular algorithms like A3C (Mnih et al., ICML 2016) follow this paradigm.

Summary and Looking Ahead

This lecture introduced a different paradigm for reinforcement learning: instead of learning a value function and deriving a policy, we directly parameterize and optimize the policy. Key concepts: the score function $\nabla_\theta \log \pi_\theta(a \mid s)$; the likelihood ratio trick, which converts the gradient of an expectation into an expectation over samples; the policy gradient theorem; REINFORCE, which pairs each action's score with its reward-to-go $G_t$; and baselines, which reduce variance without introducing bias.

Policy gradient methods form the backbone of algorithms like REINFORCE, PPO (Proximal Policy Optimization), and TRPO (Trust Region Policy Optimization). They have been applied to robotic locomotion, game-playing agents, and—perhaps most prominently—aligning large language models through RLHF. In the next lecture, we will explore actor-critic architectures, trust region methods, and advanced variance reduction techniques that make policy gradients practical at scale.