Lecture 7

PPO, GAE & Imitation Learning

Trust regions, proximal policy optimization, generalized advantage estimation, and an introduction to learning from demonstrations.


Problems with Vanilla Policy Gradients

Policy gradient algorithms optimize the objective $\max_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ by taking stochastic gradient ascent steps using the policy gradient

$$g = \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right].$$

While conceptually elegant, vanilla policy gradients suffer from two critical limitations: they are sample-inefficient, because each batch of on-policy data supports only a single gradient step before it must be discarded, and they are unstable, because a step that is small in parameter space can correspond to a destructively large change in the policy itself.

Key Insight

For tabular policies, the policy space is the set of stochastic matrices $\Pi = \{\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|} : \sum_a \pi_{sa} = 1,\; \pi_{sa} \geq 0\}$. Policy gradients take steps in parameter space, which may not align with meaningful directions in this policy space. This mismatch is the root cause of instability.

These issues motivate the central question of this lecture: how can we take larger, more reliable policy improvement steps while guaranteeing that performance does not degrade?

Importance Sampling for Policy Gradients

One natural idea for improving sample efficiency is to reuse data collected under an old policy $\pi_{\theta'}$ when estimating the gradient for a new policy $\pi_\theta$. This requires importance sampling, a technique for estimating expectations under one distribution using samples from another:

$$\mathbb{E}_{x \sim P}\!\left[f(x)\right] = \mathbb{E}_{x \sim Q}\!\left[\frac{P(x)}{Q(x)} f(x)\right] \approx \frac{1}{|D|} \sum_{x \in D} \frac{P(x)}{Q(x)} f(x), \quad D \sim Q.$$

The ratio $P(x)/Q(x)$ is the importance sampling weight. Applying this to the policy gradient, we can express the gradient under policy $\theta$ as an expectation under a different policy $\theta'$:

$$g = \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[\sum_{t=0}^{\infty} \frac{P(\tau_t \mid \theta)}{P(\tau_t \mid \theta')} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right].$$

However, this formulation has a serious practical problem. The trajectory-level importance weight decomposes as a product over time steps:

$$\frac{P(\tau_t \mid \theta)}{P(\tau_t \mid \theta')} = \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta'}(a_{t'} \mid s_{t'})},$$

where the transition dynamics cancel. Even for policies that are only slightly different, many small ratios multiply together to produce extreme values—either vanishing to zero or exploding to infinity. This makes the variance of the estimator impractically large.
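This blow-up is easy to see numerically. The sketch below is a toy model (not a real policy pair): each per-step ratio is drawn as a small lognormal fluctuation around 1, so the trajectory weight is their product. The mean stays near 1, but the spread of the weights grows rapidly with the horizon:

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_weight(horizon, scale=0.1):
    """Product of per-step ratios pi_theta/pi_theta', each fluctuating near 1.

    Log-ratios are drawn N(-scale^2/2, scale^2) so every per-step ratio has
    mean exactly 1; the product is then a lognormal whose variance grows
    exponentially with the horizon.
    """
    log_ratios = rng.normal(-scale**2 / 2, scale, size=horizon)
    return np.exp(log_ratios.sum())

for horizon in [1, 10, 100]:
    weights = np.array([trajectory_weight(horizon) for _ in range(100_000)])
    print(f"T={horizon:3d}  mean={weights.mean():.3f}  std={weights.std():.3f}")
```

Even with per-step ratios that deviate from 1 by only about 10%, the standard deviation of the T = 100 trajectory weight is an order of magnitude larger than at T = 1, while the mean remains 1. A few extreme weights dominate the estimator, which is exactly the variance problem described above.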

Key Insight

The core challenge is: how can we make efficient use of data collected under an old policy while avoiding the catastrophic variance of full trajectory importance sampling? The answer lies in constraining how far the new policy can deviate from the old one.

Monotonic Improvement Theory

The theoretical foundation for trust-region methods comes from bounding the performance difference between two policies. In the previous lecture, we introduced the surrogate objective:

$$L_\pi(\pi') = \frac{1}{1 - \gamma} \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi}} \!\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)} A^\pi(s, a)\right],$$

which approximates $J(\pi') - J(\pi)$ by using the state visitation distribution $d^\pi$ of the old policy $\pi$ instead of the (unknown) distribution $d^{\pi'}$ of the new policy. This approximation is accurate when $\pi'$ and $\pi$ are close in KL divergence.

Performance Bound

Relative Policy Performance Bound (Achiam, Held, Tamar, Abbeel, 2017). The true performance improvement is bounded by:

$$\left|J(\pi') - J(\pi) - L_\pi(\pi')\right| \leq C \sqrt{\mathbb{E}_{s \sim d^\pi}\!\left[D_{\mathrm{KL}}(\pi' \| \pi)[s]\right]}$$

where $C$ is a constant that depends on the MDP. Rearranging, this gives a guaranteed improvement lower bound:

$$J(\pi') - J(\pi) \geq L_\pi(\pi') - C \sqrt{\mathbb{E}_{s \sim d^\pi}\!\left[D_{\mathrm{KL}}(\pi' \| \pi)[s]\right]}.$$

This bound suggests an algorithm: maximize the right-hand side with respect to $\pi'$. If we define the update rule as

$$\pi_{k+1} = \arg\max_{\pi'} \; L_{\pi_k}(\pi') - C \sqrt{\mathbb{E}_{s \sim d^{\pi_k}}\!\left[D_{\mathrm{KL}}(\pi' \| \pi_k)[s]\right]},$$

then we can prove that $J(\pi_{k+1}) \geq J(\pi_k)$ at every step.

Proof of Monotonic Improvement

Suppose $\pi_{k+1}$ and $\pi_k$ are related by the optimization above. Observe that $\pi_k$ itself is a feasible point, and the objective evaluated at $\pi_k$ equals zero:

$L_{\pi_k}(\pi_k) \propto \mathbb{E}_{s, a \sim d^{\pi_k}, \pi_k}\!\left[A^{\pi_k}(s,a)\right] = 0$ (since the advantage is zero on average under the current policy), and $D_{\mathrm{KL}}(\pi_k \| \pi_k)[s] = 0$ for all $s$.

Therefore the optimal value of the objective is $\geq 0$, which by the performance bound implies $J(\pi_{k+1}) - J(\pi_k) \geq 0$. This argument holds even when we restrict the domain to a parametric policy class $\Pi_\theta$, as long as $\pi_k \in \Pi_\theta$.

In practice, the constant $C$ provided by the theory is quite large when $\gamma$ is close to 1, which makes the penalty too conservative and the resulting steps too small. Two practical remedies have emerged: tuning the KL penalty coefficient (leading to PPO) and using a hard KL constraint (leading to TRPO).

Proximal Policy Optimization (PPO)

Proximal Policy Optimization (Schulman et al., 2017) is a family of methods that approximately enforce the KL constraint without the computational expense of computing natural gradients (as TRPO requires). PPO has become one of the most widely used RL algorithms, powering applications from game-playing agents to the fine-tuning of ChatGPT. It comes in two variants.

Variant 1: Adaptive KL Penalty

The first variant replaces the hard KL constraint with a penalty term, solving the unconstrained optimization problem:

$$\theta_{k+1} = \arg\max_\theta \; L_{\theta_k}(\theta) - \beta_k \bar{D}_{\mathrm{KL}}(\theta \| \theta_k).$$

The penalty coefficient $\beta_k$ is adapted between iterations: if the KL divergence after the update is too large, $\beta_k$ is increased; if it is too small, $\beta_k$ is decreased. This mechanism approximately enforces the desired trust-region constraint without requiring a fixed threshold.
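The adaptation rule can be stated in a few lines. The thresholds below (a 1.5x band around the target KL, factor-of-2 updates) follow the heuristic in Schulman et al. (2017); they are a rough control law, not finely tuned constants:

```python
def update_kl_coef(beta, measured_kl, target_kl):
    """Adapt the KL penalty coefficient beta_k between PPO iterations.

    If the last update moved the policy too far (KL above the band),
    penalize distance more; if it barely moved, penalize less.
    """
    if measured_kl > 1.5 * target_kl:
        beta *= 2.0
    elif measured_kl < target_kl / 1.5:
        beta /= 2.0
    return beta

print(update_kl_coef(1.0, 0.10, 0.01))   # KL too large -> 2.0
print(update_kl_coef(1.0, 0.001, 0.01))  # KL too small -> 0.5
print(update_kl_coef(1.0, 0.01, 0.01))   # within band  -> 1.0
```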

Variant 2: Clipped Surrogate Objective

The second variant—and the one used most frequently in practice—avoids the KL divergence entirely and instead clips the objective function directly.

Definition — PPO Clipped Objective

Define the importance sampling ratio between the new and old policies:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}.$$

The clipped surrogate objective is:

$$L^{\mathrm{CLIP}}_{\theta_k}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_k}}\!\left[\sum_{t=0}^{T} \min\!\Big(r_t(\theta)\,\hat{A}_t^{\pi_k},\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t^{\pi_k}\Big)\right],$$

where $\epsilon$ is a hyperparameter (typically $\epsilon = 0.2$). The policy update is $\theta_{k+1} = \arg\max_\theta L^{\mathrm{CLIP}}_{\theta_k}(\theta)$.

Why Clipping Works

The clipping mechanism creates a pessimistic bound on the policy improvement. Consider two cases:

  • When $\hat{A}_t > 0$ (the action was better than expected): the objective encourages increasing $\pi_\theta(a_t \mid s_t)$, but the clip at $1 + \epsilon$ prevents the ratio from growing too large. This stops the policy from overcommitting to a seemingly good action based on noisy advantage estimates.
  • When $\hat{A}_t < 0$ (the action was worse than expected): the objective encourages decreasing $\pi_\theta(a_t \mid s_t)$, but the clip at $1 - \epsilon$ prevents the ratio from shrinking too far. This keeps the policy from aggressively avoiding actions that may have been unlucky rather than truly bad.

By taking the minimum of the clipped and unclipped terms, PPO constructs a lower bound on the unclipped objective, ensuring conservative updates in both directions.
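The per-sample objective is simple enough to compute directly; the sketch below evaluates $\min(r\hat{A}, \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\hat{A})$ on a few hand-picked ratios to make the two cases above concrete:

```python
import numpy as np

def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Per-sample PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - epsilon, 1 + epsilon) * advantages
    return np.minimum(unclipped, clipped)

# Good action (A = +1): pushing the ratio past 1+eps earns no extra objective.
print(ppo_clip_objective([1.0, 1.2, 2.0], [1.0, 1.0, 1.0]))    # 1.0, 1.2, 1.2
# Bad action (A = -1): shrinking the ratio below 1-eps earns no extra objective,
# but an *increased* ratio is penalized without clipping (pessimistic bound).
print(ppo_clip_objective([0.5, 0.8, 2.0], [-1.0, -1.0, -1.0]))  # -0.8, -0.8, -2.0
```

Note the asymmetry in the second case: the min leaves the unclipped term $-2.0$ in place when the ratio moves in the wrong direction, so the bound is pessimistic rather than merely clipped.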

The PPO Algorithm

Proximal Policy Optimization (PPO-Clip)
  1. Initialize policy parameters $\theta_0$ and value function parameters $\phi_0$.
  2. For $k = 0, 1, 2, \ldots$:
    1. Collect a batch of trajectories $\{(s_t, a_t, r_t)\}$ by running policy $\pi_{\theta_k}$ in the environment for $T$ timesteps.
    2. Compute advantage estimates $\hat{A}_t$ using GAE (see below) with the current value function $V_\phi$.
    3. For several epochs over the collected batch:
      • Update $\theta$ by maximizing $L^{\mathrm{CLIP}}_{\theta_k}(\theta)$ via minibatch SGD.
      • Update $\phi$ by minimizing $(V_\phi(s_t) - \hat{R}_t)^2$ via minibatch SGD.

Key advantage: the same batch of data is reused for multiple gradient steps before collecting new trajectories, dramatically improving sample efficiency compared to vanilla policy gradients.

PPO converges to a local optimum and, despite its simplicity, achieves strong empirical performance across a wide range of tasks. Its ease of implementation, data efficiency through multiple gradient steps per batch, and conservative updates through clipping have made it the default algorithm for many practitioners.

Generalized Advantage Estimation (GAE)

A key question inside PPO (and indeed any advantage-based policy gradient method) is: how should we estimate the advantage function $\hat{A}_t$? Generalized Advantage Estimation (Schulman et al., ICLR 2016) provides an elegant answer by interpolating between low-bias, high-variance estimators and high-bias, low-variance estimators.

Review: $n$-Step Advantage Estimators

Recall that different $n$-step returns yield different advantage estimators. Define the TD residual:

$$\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t).$$

Then the $k$-step advantage estimator is:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^V = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k}) - V(s_t).$$
Example — Special Cases
  • 1-step ($k=1$): $\hat{A}_t^{(1)} = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. This is just the TD(0) advantage—low variance but potentially high bias if the value function $V$ is inaccurate.
  • 2-step ($k=2$): $\hat{A}_t^{(2)} = \delta_t^V + \gamma \delta_{t+1}^V = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$. Uses one more actual reward before bootstrapping.
  • $\infty$-step ($k=\infty$): $\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$. This is the Monte Carlo advantage—zero bias but high variance since it depends on the entire trajectory.

Shorter horizons lean on the (possibly inaccurate) value function, trading bias for low variance; longer horizons use more reward signal but accumulate variance. GAE provides a principled way to blend all of these estimators.
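The two forms of $\hat{A}_t^{(k)}$ in the equation above (a sum of discounted TD residuals versus a bootstrapped $k$-step return minus the baseline) are equal by a telescoping argument, which a quick numerical check confirms on arbitrary rewards and values:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.99
T = 8
rewards = rng.normal(size=T)
values = rng.normal(size=T + 1)  # V(s_0), ..., V(s_T)

# TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
deltas = rewards + gamma * values[1:] - values[:-1]

for k in [1, 2, 4]:
    # Left form: sum of discounted TD residuals
    lhs = sum(gamma**l * deltas[l] for l in range(k))
    # Right form: k-step return bootstrapped with V, minus baseline V(s_0)
    rhs = sum(gamma**l * rewards[l] for l in range(k)) + gamma**k * values[k] - values[0]
    assert np.isclose(lhs, rhs)
    print(f"k={k}: both forms give {lhs:.4f}")
```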

The GAE Formula

GAE defines the advantage as an exponentially weighted average of all $k$-step estimators:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1 - \lambda)\!\left(\hat{A}_t^{(1)} + \lambda\,\hat{A}_t^{(2)} + \lambda^2\,\hat{A}_t^{(3)} + \cdots\right).$$

Expanding and collecting terms:

Definition — Generalized Advantage Estimation
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l}^V$$

where $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual and $\lambda \in [0, 1]$ controls the bias-variance tradeoff.

Derivation of the GAE Formula

Starting from the weighted sum and substituting the $k$-step estimators as sums of TD residuals:

$$\hat{A}_t^{\mathrm{GAE}} = (1 - \lambda)\!\Big(\delta_t^V + \lambda(\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2(\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V) + \cdots\Big).$$

Grouping by the TD residual index:

$$= (1 - \lambda)\Big(\delta_t^V(1 + \lambda + \lambda^2 + \cdots) + \gamma\delta_{t+1}^V(\lambda + \lambda^2 + \cdots) + \gamma^2\delta_{t+2}^V(\lambda^2 + \lambda^3 + \cdots) + \cdots\Big).$$

The geometric series multiplying $\gamma^l \delta_{t+l}^V$ starts at $\lambda^l$ and therefore sums to $\lambda^l \sum_{j=0}^{\infty} \lambda^j = \frac{\lambda^l}{1-\lambda}$. Substituting:

$$= (1 - \lambda)\Big(\delta_t^V \cdot \frac{1}{1-\lambda} + \gamma\delta_{t+1}^V \cdot \frac{\lambda}{1-\lambda} + \gamma^2\delta_{t+2}^V \cdot \frac{\lambda^2}{1-\lambda} + \cdots\Big) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V.$$

Bias-Variance Tradeoff in GAE

GAE Bias-Variance Tradeoff

The parameter $\lambda$ smoothly interpolates between two extremes:

  • $\lambda = 0$: $\hat{A}_t^{\mathrm{GAE}(\gamma, 0)} = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. This is the 1-step TD advantage—low variance but biased if $V$ is imperfect.
  • $\lambda = 1$: $\hat{A}_t^{\mathrm{GAE}(\gamma, 1)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$. This is the Monte Carlo advantage—unbiased but high variance.

In practice, values of $\lambda \in (0, 1)$ (commonly $\lambda = 0.95$ or $\lambda = 0.97$) strike a good balance. The optimal choice depends on the quality of the learned value function: a better $V$ permits lower $\lambda$ values without incurring too much bias.

GAE in PPO: The Truncated Version

In practice, PPO uses a truncated version of GAE. Instead of collecting infinite-length trajectories, the agent runs for $T$ timesteps and computes:

$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l}^V.$$

This truncation introduces a small additional bias, but the tradeoff is worthwhile: advantage estimates are substantially lower variance than full Monte Carlo returns, and updates can be triggered after just $T$ timesteps rather than waiting for episode termination.
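In implementations, this truncated sum is usually computed with a single backward pass using the recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. A minimal sketch, with a cross-check that the recursion reproduces the direct truncated sum:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated GAE over a T-step rollout.

    `values` has length T+1: V(s_0), ..., V(s_T), where values[T]
    bootstraps the tail of the rollout. Runs the backward recursion
    A_t = delta_t + gamma*lam*A_{t+1}.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rng = np.random.default_rng(0)
rewards = rng.normal(size=5)
values = rng.normal(size=6)

adv = compute_gae(rewards, values)
# Cross-check A_0 against the direct sum  sum_l (gamma*lam)^l * delta_l
deltas = rewards + 0.99 * values[1:] - values[:-1]
direct = sum((0.99 * 0.95) ** l * deltas[l] for l in range(5))
assert np.isclose(adv[0], direct)
print(adv)
```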

PPO in Perspective

PPO brings together several ideas into a cohesive, practical algorithm: the surrogate objective and KL-penalty analysis from monotonic improvement theory, importance-sampled reuse of on-policy data, and a simple clipping rule that approximates a trust region without any second-order computation.

Conservative policy updating has deep roots in RL, going back to the early 2000s. PPO is its most practical incarnation to date.

Imitation Learning: Learning from Demonstrations

In many settings, expert behavior already exists—a surgeon operating, a driver navigating, a skilled player completing a game. Rather than learning from reward signals through trial and error (which is sample-intensive and requires careful reward engineering), imitation learning learns a policy directly from those demonstrations.

Definition — Imitation Learning Setup

Input:

  • State space $\mathcal{S}$ and action space $\mathcal{A}$.
  • Transition model $P(s' \mid s, a)$ (sometimes known, sometimes not).
  • No reward function $R$.
  • A set of expert demonstration trajectories $(s_0, a_0, s_1, a_1, \ldots)$ where actions are drawn from an expert policy $\pi^*$.

Goal: Learn a policy that performs as well as (or better than) the expert.

Imitation learning is useful when it is easier for the expert to demonstrate the desired behavior than to specify a reward function that would generate it, or to write the policy directly. Three main approaches exist:

  1. Behavioral Cloning: Directly learn the expert's policy using supervised learning.
  2. Inverse RL: Recover a reward function $R$ from the demonstrations, then optimize a policy for that reward.
  3. Apprenticeship Learning via Inverse RL: Combine reward recovery with policy optimization to match or exceed the expert.

Behavioral Cloning

The most straightforward approach to imitation learning is behavioral cloning, which reduces the problem to standard supervised learning. Given expert state-action pairs $(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots$, we choose a policy class (neural network, decision tree, etc.) and train a model to predict the expert's action given the state.
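In the simplest tabular setting, "supervised learning" reduces to a majority vote over the expert's actions at each state. The sketch below is a toy illustration (the chain environment and data are invented for the example); with a neural network policy class, the voting step would be replaced by ordinary gradient-based training on a classification loss:

```python
from collections import Counter, defaultdict

def behavioral_cloning_tabular(demos):
    """Fit a tabular policy by majority vote over expert actions per state.

    `demos` is a list of (state, action) pairs; with a function
    approximator this step becomes standard supervised training.
    """
    actions_by_state = defaultdict(Counter)
    for s, a in demos:
        actions_by_state[s][a] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in actions_by_state.items()}

# Expert on a 1-D chain: always move right (+1) toward the goal.
demos = [(s, +1) for s in range(5)] * 3 + [(2, -1)]  # one noisy label at state 2
policy = behavioral_cloning_tabular(demos)
print(policy)  # majority vote recovers +1 everywhere, despite the noisy label
```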

Behavioral cloning has a long and successful history. Two early landmarks are Pomerleau's ALVINN system (NIPS 1989), which learned to steer an autonomous vehicle from camera images, and Sammut et al. (ICML 1992), who learned to fly in a flight simulator. More recently, behavioral cloning with recurrent neural networks (BCRNN) has achieved strong results on robot manipulation tasks (Mandlekar et al., CoRL 2021).

The Compounding Error Problem

Despite its simplicity, behavioral cloning has a fundamental flaw rooted in the violation of supervised learning's i.i.d. assumption. In supervised learning, training and test data come from the same distribution. But in an MDP:

When the learned policy makes a small error, it reaches a state that the expert would never have visited. From this unfamiliar state, the policy is likely to make another error, which compounds over time.

Compounding Errors Bound

If the learned policy makes an error at each time step with probability at most $\epsilon$, the expected total number of errors over a horizon $T$ does not scale linearly as $\epsilon T$ (as it would under i.i.d. data). Instead, due to distribution shift, the errors compound quadratically:

$$\mathbb{E}[\text{total errors}] \propto \epsilon T^2.$$

This result is formalized in Ross et al. (2011), "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning."

This quadratic scaling means that even a policy with a very low per-step error rate can perform poorly over long horizons.
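A worst-case toy model makes the scaling concrete. Assume the learner errs with probability $\epsilon$ while on the expert's distribution, and (pessimistically) errs at every remaining step once it has fallen off it. This is an illustrative caricature of the Ross et al. analysis, not the theorem itself:

```python
def expected_errors(eps, T):
    """Expected total errors under a worst-case compounding model.

    A first error at step t contributes that error plus the T-t-1
    remaining steps, since the learner never recovers off-distribution.
    """
    total = 0.0
    p_on = 1.0  # probability no error has occurred before step t
    for t in range(T):
        total += p_on * eps * (T - t)  # first error happens now
        p_on *= 1 - eps
    return total

eps, T = 0.01, 100
print(f"i.i.d. baseline eps*T = {eps * T:.1f}")
print(f"compounding model    = {expected_errors(eps, T):.1f}")
```

With $\epsilon = 0.01$ and $T = 100$, the i.i.d. baseline predicts about 1 error per episode, while the compounding model yields dozens, consistent with the $\epsilon T^2$ scaling.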

DAgger: Dataset Aggregation

DAgger (Ross et al., 2011) addresses the distribution shift problem by iteratively collecting expert labels along the trajectories induced by the current learned policy, rather than the expert policy.

The key idea is to run the learned policy in the environment, observe the states it visits, and query the expert for the correct action at each of those states. Training on this aggregated dataset—which includes states from the learned policy's own distribution—fixes the distribution mismatch that breaks behavioral cloning.

DAgger provably avoids the quadratic compounding error, achieving a stationary deterministic policy with bounded error. However, it has a key practical limitation: it requires ongoing access to the expert, who must label new states at each iteration. In many applications (surgical robotics, autonomous driving), this interactive querying is expensive or impractical.
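The DAgger loop can be sketched end to end on a toy problem. Here the environment is an invented 1-D chain, the expert always steps right, and the "supervised learner" is a majority-vote table; the essential structure (roll out the learner, query the expert on the learner's states, retrain on the aggregate) is the real algorithm:

```python
import random
from collections import Counter, defaultdict

random.seed(0)
N = 10                      # chain states 0..N-1, start at state 0
expert = lambda s: +1       # expert always steps right toward the goal

def rollout(policy_table, horizon=N - 1):
    """Run the *learned* policy from state 0; unseen states act randomly."""
    s, visited = 0, []
    for _ in range(horizon):
        visited.append(s)
        a = policy_table.get(s, random.choice([-1, +1]))
        s = max(0, min(N - 1, s + a))
    return visited

dataset = []                # aggregated (state, expert action) pairs
policy = {}
for _ in range(5):          # DAgger iterations
    states = rollout(policy)                      # 1. run the learner
    dataset += [(s, expert(s)) for s in states]   # 2. expert labels its states
    votes = defaultdict(Counter)                  # 3. retrain on the aggregate
    for s, a in dataset:
        votes[s][a] += 1
    policy = {s: c.most_common(1)[0][0] for s, c in votes.items()}

print(policy)  # matches the expert on every state the learner has visited
```

Because the expert labels come from states the learner actually visits, the training distribution tracks the learner's own distribution, which is precisely what behavioral cloning lacks.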

Inverse Reinforcement Learning

An alternative to cloning the policy is to recover the reward function the expert is (implicitly) optimizing—the task of inverse reinforcement learning (IRL). A reward function is often more compact and transferable than the policy itself: learn it once, and you can optimize a new policy in a changed environment.

Linear Feature Reward IRL

A common assumption is that the reward function is linear in a set of state features:

$$R(s) = w^T x(s), \quad w \in \mathbb{R}^n, \; x: \mathcal{S} \to \mathbb{R}^n.$$

Under this assumption, the value of any policy $\pi$ can be expressed compactly as:

$$V^\pi(s_0) = \mathbb{E}_{s \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; s_0\right] = w^T \mathbb{E}_{s \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t x(s_t) \;\middle|\; s_0\right] = w^T \mu(\pi),$$

where $\mu(\pi)$ denotes the discounted state feature frequency vector under policy $\pi$. Since the expert's policy is optimal, $V^{\pi^*} \geq V^\pi$ for all $\pi$, which means $w^{*T} \mu(\pi^*) \geq w^{*T} \mu(\pi)$ for all $\pi \neq \pi^*$.
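For a small known MDP, $\mu(\pi)$ can be computed in closed form from the discounted state occupancy $d = (I - \gamma P_\pi^\top)^{-1} d_0$, giving $\mu(\pi) = X^\top d$. The sketch below uses a randomly generated MDP and feature map, and verifies the identity $V^\pi(s_0) = w^\top \mu(\pi)$ against a direct solve of the Bellman equation:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nF, gamma = 4, 3, 0.9

P_pi = rng.dirichlet(np.ones(nS), size=nS)   # state->state transitions under pi
X = rng.normal(size=(nS, nF))                # feature map x(s), one row per state
d0 = np.array([1.0, 0.0, 0.0, 0.0])          # start-state distribution
w = rng.normal(size=nF)                      # reward weights, R(s) = w^T x(s)

# Discounted state occupancy: d[s] = sum_t gamma^t P(s_t = s)
d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, d0)
mu = X.T @ d                                  # feature expectations mu(pi)

# Cross-check: w^T mu(pi) equals the start-state value with reward R = X w
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, X @ w)
assert np.isclose(w @ mu, d0 @ V)
print("w^T mu(pi) =", w @ mu)
```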

Feature Matching

Abbeel and Ng (2004) showed that if a policy $\pi$ matches the expert's discounted feature expectations, it is guaranteed to perform well. Specifically, if $\|\mu(\pi) - \mu(\pi^*)\|_1 \leq \epsilon$, then for all weight vectors $w$ with $\|w\|_\infty \leq 1$ (by Hölder's inequality):

$$|w^T \mu(\pi) - w^T \mu(\pi^*)| \leq \epsilon.$$

However, a fundamental ambiguity remains: there are infinitely many reward functions that make the expert's policy optimal, and infinitely many stochastic policies that can match the expert's feature counts. Which should we choose?

Maximum Entropy Inverse RL

The maximum entropy principle provides an elegant resolution to the ambiguity problem. Among all distributions over trajectories that match the observed feature expectations, choose the one with maximum entropy—that is, the one that introduces no additional preferences beyond what the data demands.

Definition — Maximum Entropy IRL

Given expert demonstrations with average feature counts $\tilde{\mu}$, find the distribution $P(\tau)$ over trajectories that solves:

$$\max_{P} -\sum_\tau P(\tau) \log P(\tau) \quad \text{s.t.} \quad \sum_\tau P(\tau)\mu_\tau = \tilde{\mu}, \quad \sum_\tau P(\tau) = 1,$$

where $\mu_\tau = \sum_{s_i \in \tau} x(s_i)$ is the feature count vector for trajectory $\tau$.

For linear reward functions, this optimization yields an exponential family distribution over trajectories:

$$P(\tau \mid w) = \frac{1}{Z(w)} \exp\!\left(w^T \mu_\tau\right) = \frac{1}{Z(w)} \exp\!\left(\sum_{s_i \in \tau} w^T x(s_i)\right),$$

where $Z(w) = \sum_\tau \exp(w^T \mu_\tau)$ is the partition function. This distribution assigns exponentially higher probability to trajectories with higher cumulative reward, while treating trajectories of equal reward as equally likely.

For stochastic MDPs, the transition dynamics must be incorporated:

$$P(\tau \mid w, P(s' \mid s, a)) \approx \frac{\exp(w^T \mu_\tau)}{Z(w, P(s' \mid s,a))} \prod_{s_i, a_i \in \tau} P(s_{i+1} \mid s_i, a_i).$$

The reward weights $w$ are learned by maximizing the log-likelihood of the observed demonstrations. The gradient takes an intuitive form:

$$\nabla_w L(w) = \tilde{\mu} - \sum_{s_i} D(s_i)\, x(s_i),$$

where $D(s_i)$ is the state visitation frequency under the current model. The gradient is the difference between the expert's empirical feature counts and the learner's expected feature counts; when these match, the gradient is zero and the reward function is consistent with the observed behavior.
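For a finite, enumerable set of trajectories (a simplification of the general setting, where dynamic programming over states is needed), both the exponential-family distribution and this gradient can be computed directly. The sketch below uses randomly generated feature counts and confirms the closed-form gradient against finite differences of the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
n_traj, n_feat = 6, 3
mus = rng.normal(size=(n_traj, n_feat))   # feature counts mu_tau per trajectory
tilde_mu = mus[0]                          # pretend the expert showed trajectory 0

def log_likelihood(w):
    logits = mus @ w
    log_Z = np.log(np.exp(logits).sum())   # partition function Z(w)
    return tilde_mu @ w - log_Z            # log P(expert demo | w)

def gradient(w):
    probs = np.exp(mus @ w)
    probs /= probs.sum()                   # P(tau | w) = exp(w^T mu_tau) / Z(w)
    return tilde_mu - probs @ mus          # expert counts minus expected counts

w = rng.normal(size=n_feat)
g = gradient(w)

# Finite-difference check of the closed-form gradient
eps = 1e-6
for i in range(n_feat):
    e = np.zeros(n_feat); e[i] = eps
    fd = (log_likelihood(w + e) - log_likelihood(w - e)) / (2 * eps)
    assert np.isclose(g[i], fd, atol=1e-5)
print("gradient matches finite differences:", np.round(g, 3))
```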

MaxEnt IRL vs. Behavioral Cloning

MaxEnt IRL requires knowledge of the transition model (or the ability to sample from it) to compute state visitation frequencies. Behavioral cloning requires no dynamics model—it operates purely on observed state-action pairs. The tradeoff is clear: IRL recovers a more transferable representation but at the cost of additional assumptions and computational overhead.

MaxEnt IRL forms the basis for several subsequent methods, including Generative Adversarial Imitation Learning (Ho and Ermon, NeurIPS 2016), which frames the problem as a GAN-style game between a generator (the learned policy) and a discriminator (distinguishing expert from learned trajectories).

Summary and Looking Ahead

This lecture covered two major themes. First, we completed our treatment of advanced policy gradient methods by developing PPO and GAE—the practical tools that make trust-region policy optimization work at scale. PPO's clipped objective and GAE's bias-variance interpolation together form one of the most effective and widely deployed RL algorithms. Second, we introduced imitation learning as an alternative to reward-based RL, studying behavioral cloning (and its compounding error problem), DAgger's interactive solution, and inverse RL's approach of recovering the underlying reward function.

A recurring theme connects both halves: the importance of working with the right distribution. In PPO, we must keep the new policy close to the old one to avoid distribution shift in importance sampling. In behavioral cloning, distribution shift between expert and learned trajectories causes quadratic error growth. DAgger and inverse RL each address this challenge from a different angle.

In the next lecture, we will dive deeper into imitation learning, explore reinforcement learning from human feedback (RLHF), and see how these techniques are combined to align large language models with human preferences—one of the most consequential applications of RL today.