Lecture 8

Imitation Learning & RLHF

Learning from demonstrations with DAgger, reward learning from human preferences, and aligning language models with RLHF and DPO.


Imitation Learning

So far in this course, we have assumed the agent has access to a reward function that defines the task it should solve. But in many real-world settings—autonomous driving, robotic manipulation, dialogue systems—specifying a reward function is extremely difficult. What if, instead, we have access to an expert who can demonstrate the desired behavior? Imitation learning is the family of methods that learns a policy directly from expert demonstrations, bypassing the need for an explicit reward signal.

The simplest form of imitation learning is behavioral cloning, which treats the problem as a straightforward supervised learning task.

Definition — Behavioral Cloning

Given a dataset of expert demonstrations $\{(s_t, a_t)\}$ collected by rolling out an expert policy $\pi^*$, behavioral cloning trains a parameterized policy $\pi_\theta$ by minimizing a supervised loss:

$$\min_\theta \sum_{(s,a) \in \mathcal{D}} \ell\!\left(\pi_\theta(s),\; a\right)$$

where $\ell$ is a suitable loss function (cross-entropy for discrete actions, mean squared error for continuous actions). The training data consists of state-action pairs drawn from the expert's state distribution $d^{\pi^*}$.

Behavioral cloning is appealingly simple: collect demonstrations, train a classifier or regressor, and deploy. However, it harbors a critical failure mode rooted in the sequential nature of decision-making.
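As a concrete toy illustration, here is behavioral cloning in a tabular, discrete setting, where minimizing cross-entropy reduces to taking each state's empirical action distribution (the `demos` dataset and state/action encodings below are made up for this sketch):

```python
# Minimal tabular behavioral cloning (illustrative sketch; the demo data
# and discrete state/action spaces are assumptions, not from the lecture).
from collections import Counter, defaultdict

def behavioral_cloning(demos):
    """Fit a tabular policy by maximum likelihood: per state, the
    cross-entropy-minimizing policy is the empirical action distribution;
    here we return its mode as a deterministic cloned policy."""
    counts = defaultdict(Counter)
    for s, a in demos:
        counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Expert state-action pairs drawn from d^{pi*}.
demos = [(0, "left"), (0, "left"), (1, "right"), (1, "right"), (1, "left")]
policy = behavioral_cloning(demos)
```

Note that the cloned policy is only defined on states the expert visited; this is exactly the gap that the compounding-error analysis below exposes.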

The Compounding Error Problem

Supervised learning assumes that training and test data are drawn from the same distribution. Behavioral cloning violates this assumption in a fundamental way. During training, the state-action pairs come from the expert's trajectory distribution: $s_t \sim d^{\pi^*}$. At test time, however, the learned policy $\pi_\theta$ generates its own trajectory, visiting states drawn from its own distribution: $s_t \sim d^{\pi_\theta}$. Any small mistake early in the trajectory pushes the agent into states the expert never visited—states where the cloned policy has no useful training signal—leading to further errors that cascade.

Distribution Shift Intuition

Consider a self-driving car trained via behavioral cloning. The expert always stays centered in the lane, so the training data only contains near-centered states. If the learned policy drifts slightly to the left, it enters a state unlike anything in the training data and may overcorrect or fail entirely. Each error compounds: one mistake at time $t$ makes all future states unfamiliar, causing more mistakes.

If the per-step error probability is $\epsilon$, naive analysis suggests total expected errors of $\epsilon T$ (treating errors as independent). But with distribution shift and compounding, the true scaling is:

$$\mathbb{E}[\text{total errors}] \propto \epsilon T^2$$

This quadratic dependence on the horizon $T$ means that even small per-step errors become catastrophic in long-horizon tasks. This result is formalized by Ross et al. (2011) in "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning."

The key insight is that behavioral cloning treats each state-action pair as independent and identically distributed, ignoring the temporal structure of decision-making. In a standard supervised learning setting, a small test-time error on one example does not affect the input distribution of subsequent examples. In an MDP, however, the current state depends on all previous actions. This mismatch between the expert's state distribution $d^{\pi^*}$ (used at training time) and the learner's state distribution $d^{\pi_\theta}$ (encountered at test time) is the root cause of compounding errors.
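A small computation makes the quadratic scaling concrete. The toy model below is deliberately pessimistic (an assumption of this sketch, not part of the formal analysis): once the learner makes its first mistake, it is off-distribution and errs on every remaining step.

```python
def expected_errors(eps, T):
    """Expected total errors under a pessimistic compounding model: the
    first error occurs at step k with probability eps*(1-eps)**(k-1), after
    which the agent errs on all T-k remaining steps as well."""
    return sum(eps * (1 - eps) ** (k - 1) * (T - k + 1) for k in range(1, T + 1))

e10 = expected_errors(0.01, 10)    # i.i.d. estimate eps*T would give 0.1
e100 = expected_errors(0.01, 100)  # i.i.d. estimate eps*T would give 1.0
```

Growing the horizon tenfold multiplies expected errors far more than tenfold, matching the $\epsilon T^2$ scaling rather than the naive $\epsilon T$ estimate.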

DAgger: Dataset Aggregation

DAgger (Dataset Aggregation), introduced by Ross, Gordon, and Bagnell (2011), addresses the distribution shift problem directly. Instead of training only on the expert's state distribution, it iteratively collects labels on states that the learner actually visits.

Definition — DAgger

DAgger is an iterative imitation learning algorithm that aggregates training data across multiple rounds. In each round, the current learned policy is executed in the environment to collect a set of visited states. The expert then labels these states with the correct actions, and the resulting state-action pairs are added to the training dataset. The policy is re-trained on the entire aggregated dataset.

By training on states from the learner's own distribution, DAgger directly addresses the train-test distribution mismatch that plagues behavioral cloning.

Algorithm: DAgger (Dataset Aggregation)
  1. Initialize dataset $\mathcal{D} \leftarrow \emptyset$. Train initial policy $\hat{\pi}_1$ from expert demonstrations.
  2. For $n = 1, 2, \ldots, N$:
  3. (a) Form the mixing policy $\pi_n = \beta_n \, \pi^* + (1 - \beta_n) \, \hat{\pi}_n$, where $\beta_n \to 0$ over iterations.
  4. (b) Execute $\pi_n$ in the environment to sample $T$-step trajectories and collect visited states $\{s_1, s_2, \ldots, s_T\}$.
  5. (c) Query the expert $\pi^*$ on these states to obtain labels: $\mathcal{D}_n = \{(s_t, \pi^*(s_t))\}$.
  6. (d) Aggregate datasets: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_n$.
  7. (e) Train $\hat{\pi}_{n+1}$ on the full aggregated dataset $\mathcal{D}$.

The key steps are (b) and (c): the learner's own rollouts determine which states get labeled, so the expert provides corrective labels for the states the learner actually encounters, not just states the expert would have visited.
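The loop can be sketched end-to-end on a toy corridor environment; `env_step`, `expert`, and the memorizing `train` function are illustrative stand-ins, and for simplicity the learner falls back to the expert on states it has never seen:

```python
# DAgger sketch on a toy 1-D corridor (all components are assumptions of
# this illustration, not the lecture's formal algorithm).
import random

def dagger(env_step, expert, train, n_rounds=5, horizon=10, beta0=1.0):
    """Roll out a mixture of expert and learner, have the expert label the
    visited states, aggregate the data, and retrain each round."""
    data = []
    policy = {}
    for n in range(n_rounds):
        beta = beta0 * (0.5 ** n)  # beta_n -> 0 over iterations
        s = 0
        for _ in range(horizon):
            if random.random() < beta:
                a = expert(s)
            else:
                a = policy.get(s, expert(s))  # expert fallback on unseen states
            data.append((s, expert(s)))       # (c) expert labels the visited state
            s = env_step(s, a)                # (b) learner-induced state distribution
        policy = train(data)                  # (d)-(e) aggregate and retrain
    return policy

# Toy corridor: actions are +1/-1; expert always moves right toward state 4.
env_step = lambda s, a: max(0, min(4, s + a))
expert = lambda s: +1
train = lambda data: {s: a for s, a in data}  # memorize expert labels
random.seed(0)
pi = dagger(env_step, expert, train)
```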

Theorem — DAgger Regret Bound

Let $\hat{\pi}$ be the policy returned by DAgger after $N$ rounds, and let $\epsilon_N = \min_{n \in \{1, \ldots, N\}} \epsilon_n$ be the best per-step classification error achieved on the aggregated dataset. Then the expected cost (total errors over horizon $T$) under $\hat{\pi}$ satisfies:

$$J(\hat{\pi}) \leq J(\pi^*) + T \epsilon_N + O(1/N)$$

As $N \to \infty$, the regret scales as $O(\epsilon T)$—linear in $T$, not quadratic. This matches the "no distribution shift" case and is a significant improvement over behavioral cloning's $O(\epsilon T^2)$ scaling.

DAgger produces a stationary, deterministic policy with provably good performance under its own induced state distribution. The trade-off is that it requires interactive access to the expert: the expert must be available to label new states each round. This is feasible when an expert controller can be queried in a simulator, but may be impractical when the expert is a human who cannot be called upon repeatedly. This limitation motivates approaches that learn from weaker forms of human feedback.

Reward Learning from Human Preferences

In many settings, we want an RL agent whose behavior aligns with human values, but we lack an explicit reward function and cannot easily produce expert demonstrations. A natural alternative is to learn a reward function from human preferences: instead of asking "how much do you like this behavior?" (which yields noisy, poorly calibrated scalar ratings), we ask "which of these two behaviors do you prefer?" Pairwise comparisons are more reliable and easier for humans to provide.

This approach occupies a sweet spot between two extremes. Full demonstrations (behavioral cloning, DAgger) require the expert to show the correct action in every state; learning with no human input requires a hand-coded reward. Pairwise preference labels sit in between: less effort than demonstrations, richer signal than a fixed reward function.

The Reward Ambiguity Problem

Learning rewards from behavior faces a fundamental difficulty. Given expert demonstrations and the transition dynamics, infinitely many reward functions make the expert's policy optimal—the mapping from behavior to reward is one-to-many. This is the core challenge of inverse reinforcement learning: the trivial reward $R(s,a) = 0$ for all $(s,a)$, for instance, makes every policy optimal. To obtain a useful reward function, we need additional structure or assumptions.

The Bradley-Terry Model

The Bradley-Terry model (1952) provides exactly the structural assumption needed to learn rewards from pairwise comparisons. Originally developed for ranking competitors in tournaments, it connects latent "strength" scores (which we interpret as rewards) to observable preference probabilities.

Definition — Bradley-Terry Model

Given $K$ items $b_1, b_2, \ldots, b_K$ with latent reward scores $r(b_i)$, the probability that a human prefers item $b_i$ over item $b_j$ is:

$$P(b_i \succ b_j) = \frac{\exp(r(b_i))}{\exp(r(b_i)) + \exp(r(b_j))} = \sigma\!\left(r(b_i) - r(b_j)\right)$$

where $\sigma(\cdot)$ is the logistic sigmoid function. The model is transitive: the pairwise probability $P(b_i \succ b_k)$ is fully determined by $P(b_i \succ b_j)$ and $P(b_j \succ b_k)$.

To fit the parameters of a Bradley-Terry model, suppose we have a dataset $\mathcal{D}$ of $N$ tuples $(b_i, b_j, \mu)$ where $\mu = 1$ if the human preferred $b_i$, $\mu = 0$ if the human preferred $b_j$, and $\mu = 0.5$ for a tie. The reward parameters are learned by maximizing the likelihood of observed comparisons, equivalently minimizing the cross-entropy loss:

$$\mathcal{L}(r) = -\sum_{(b_i, b_j, \mu) \in \mathcal{D}} \left[\mu \log P(b_i \succ b_j) + (1 - \mu) \log P(b_j \succ b_i)\right]$$
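A minimal fit of this model by gradient ascent on the log-likelihood (the comparison data, learning rate, and step count below are made up for illustration):

```python
# Fitting a two-item Bradley-Terry model by gradient ascent (toy sketch).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_bradley_terry(comparisons, n_items, lr=0.1, steps=2000):
    """Learn latent scores r from (i, j, mu) tuples, where mu = 1 means
    item i was preferred; gradient of the log-likelihood wrt r_i is mu - p."""
    r = [0.0] * n_items
    for _ in range(steps):
        for i, j, mu in comparisons:
            p = sigmoid(r[i] - r[j])  # P(b_i > b_j)
            r[i] += lr * (mu - p)
            r[j] -= lr * (mu - p)
    return r

# Item 0 wins 3 of 4 comparisons, so the fitted gap r[0] - r[1] should
# approach logit(0.75) = log 3.
data = [(0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 0)]
scores = fit_bradley_terry(data, 2)
```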

Extending to Trajectory Preferences

The Bradley-Terry framework extends naturally from individual items to entire trajectories. Given two trajectories $\tau^1$ and $\tau^2$, define the trajectory reward as the sum of per-step rewards: $R(\tau) = \sum_{t=0}^{T-1} r(s_t, a_t)$. The probability that a human prefers trajectory $\tau^1$ over $\tau^2$ is then:

$$P(\tau^1 \succ \tau^2) = \frac{\exp\!\left(\sum_{t=0}^{T-1} r^1_t\right)}{\exp\!\left(\sum_{t=0}^{T-1} r^1_t\right) + \exp\!\left(\sum_{t=0}^{T-1} r^2_t\right)} = \sigma\!\left(R(\tau^1) - R(\tau^2)\right)$$

This formulation lets us learn a per-step reward function $r_\phi(s, a)$ (parameterized by a neural network $\phi$) from trajectory-level preference labels. Once trained, this reward model can be used with any RL algorithm—such as PPO—to optimize a policy.
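The same fitting procedure works at the trajectory level. The sketch below learns a tabular per-step reward `r[s]` from trajectory preference pairs; the states, trajectories, and hyperparameters are toy assumptions standing in for a neural network $r_\phi$:

```python
# Learning a per-step reward from trajectory-level preferences (toy sketch).
import math

def fit_stepwise_reward(pref_pairs, n_states, lr=0.1, steps=2000):
    """Each pair (tau_w, tau_l) says tau_w was preferred; under Bradley-Terry,
    P(tau_w > tau_l) = sigmoid(R(tau_w) - R(tau_l)) with R = sum of r[s]."""
    r = [0.0] * n_states
    for _ in range(steps):
        for tau_w, tau_l in pref_pairs:
            R_w = sum(r[s] for s in tau_w)
            R_l = sum(r[s] for s in tau_l)
            p = 1.0 / (1.0 + math.exp(-(R_w - R_l)))
            for s in tau_w:           # gradient ascent on log P(tau_w > tau_l)
                r[s] += lr * (1 - p)
            for s in tau_l:
                r[s] -= lr * (1 - p)
    return r

# The human always prefers trajectories that visit state 2.
pairs = [([0, 2, 2], [0, 1, 1]), ([2, 2, 1], [1, 1, 0])]
r = fit_stepwise_reward(pairs, 3)
```

The learner never sees a per-step label, yet credit flows to state 2 because it distinguishes winning trajectories from losing ones.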

Example — Learning to Backflip from Human Preferences

Christiano et al. (2017) demonstrated that a simulated robot could learn to perform a backflip using only 900 bits of human feedback. A human evaluator was shown pairs of short video clips of the robot's behavior and asked which clip showed better progress toward a backflip. These pairwise labels were used to train a reward model via the Bradley-Terry framework, and PPO optimized the robot's policy against this learned reward. The result was a fluid backflip—achieved without ever writing down a mathematical reward function for "backflipping."

Reinforcement Learning from Human Feedback (RLHF)

Reward learning from human preferences has found its most prominent application in aligning large language models (LLMs). The RLHF pipeline, used in systems like InstructGPT and ChatGPT, combines supervised fine-tuning, reward modeling, and reinforcement learning to steer language model outputs toward human preferences.

Algorithm: The RLHF Pipeline
  1. Step 1 — Supervised Fine-Tuning (SFT). Start with a pretrained language model $\pi_{\text{ref}}$. Fine-tune it on a dataset of high-quality demonstrations (e.g., human-written instruction-response pairs) to obtain an instruction-following base model.
  2. Step 2 — Reward Model Training. Sample pairs of responses $(y_w, y_l)$ from the SFT model for a given prompt $x$. Collect human preference labels indicating which response is better. Train a reward model $r_\phi(x, y)$ by minimizing the Bradley-Terry loss: $$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$ where $y_w$ is the preferred (winning) response and $y_l$ is the dispreferred (losing) response.
  3. Step 3 — RL Fine-Tuning with KL Constraint. Optimize the language model policy $\pi_\theta$ to maximize the learned reward while staying close to the reference model $\pi_{\text{ref}}$: $$\max_\theta \;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[r_\phi(x, y)\right] - \beta\, D_{\text{KL}}\!\left(\pi_\theta \| \pi_{\text{ref}}\right)$$ This is typically optimized using PPO. The KL penalty prevents the policy from diverging too far from the pretrained model, which would lead to reward hacking and degenerate outputs.
Why the KL Constraint Matters

Without the KL penalty, the RL optimization would find adversarial outputs that exploit weaknesses in the learned reward model—a phenomenon known as reward hacking or reward over-optimization. The reward model is an imperfect proxy for true human preferences, so maximizing it without constraint can lead to outputs that score highly according to $r_\phi$ but are gibberish or degenerate to a human reader.

The KL term $D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ acts as a regularizer, anchoring the optimized policy to the pretrained model. In practice the penalty is folded directly into the reward; written at the sequence level (implementations typically apply an analogous penalty per token), the modified reward is:

$$\tilde{r}(x, y) = r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

The strength of this penalty is controlled by the hyperparameter $\beta$. Larger $\beta$ keeps the policy closer to the reference but limits improvement; smaller $\beta$ allows more reward optimization but risks reward hacking.
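A small numeric illustration of the shaped reward (the reward and log-probability values below are arbitrary):

```python
def kl_shaped_reward(reward, logp_theta, logp_ref, beta=0.1):
    """Shaped reward used in RLHF's RL stage: the learned reward minus a
    KL-style penalty on deviation from the reference model (sketch)."""
    return reward - beta * (logp_theta - logp_ref)

# A response the policy now finds much likelier than the reference did
# (log 0.135 vs log 0.007) has its reward discounted, discouraging drift.
shaped = kl_shaped_reward(reward=1.0, logp_theta=-2.0, logp_ref=-5.0, beta=0.1)
# shaped = 1.0 - 0.1 * 3.0 = 0.7
```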

RLHF has been validated at scale. Stiennon et al. (2020) showed it significantly improves text summarization beyond supervised fine-tuning alone. Ouyang et al. (2022) scaled the approach to InstructGPT across tens of thousands of tasks. Notably, a 1.3 billion parameter model fine-tuned with RLHF was preferred by human evaluators over a 175 billion parameter model without it—demonstrating that alignment training can be more impactful than raw scale.

Direct Preference Optimization (DPO)

While RLHF is effective, its three-stage pipeline is complex: it requires training a separate reward model, running PPO with all its hyperparameter sensitivity, and managing the interplay between the policy, reward model, and KL constraint. Direct Preference Optimization (DPO), introduced by Rafailov et al. (2023), eliminates both the explicit reward model and the RL loop, collapsing the entire procedure into a single supervised learning objective.

Deriving DPO from the KL-Constrained Objective

The derivation of DPO begins with the KL-constrained RL objective used in RLHF:

$$\max_\pi \;\mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[r(x, y)\right] - \beta\, D_{\text{KL}}\!\left(\pi(\cdot \mid x) \| \pi_{\text{ref}}(\cdot \mid x)\right)$$

This objective has a closed-form optimal policy:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$ is a normalizing partition function.
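This closed form can be checked numerically on a toy discrete action set (the reference distribution and rewards below are arbitrary): the softmax-reweighted policy should score at least as high on the KL-constrained objective as any other distribution.

```python
# Numeric check of the KL-constrained optimum (toy sketch).
import math

def objective(pi, pi_ref, r, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) over a finite action set."""
    return sum(p * (ri - beta * math.log(p / q))
               for p, q, ri in zip(pi, pi_ref, r) if p > 0)

def optimal_policy(pi_ref, r, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y)/beta)."""
    w = [q * math.exp(ri / beta) for q, ri in zip(pi_ref, r)]
    Z = sum(w)
    return [wi / Z for wi in w]

pi_ref = [0.5, 0.3, 0.2]
r = [1.0, 0.0, 2.0]
beta = 1.0
pi_star = optimal_policy(pi_ref, r, beta)
# pi_star beats both the reference and the uniform distribution on the objective.
```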

The key DPO insight is to rearrange this relationship: instead of writing the optimal policy as a function of the reward, write the reward as a function of the optimal policy. Solving for $r(x, y)$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Now substitute this expression for the reward into the Bradley-Terry preference model. Because the Bradley-Terry loss depends only on the difference in rewards between the preferred and dispreferred responses, the partition function $Z(x)$ cancels:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$$

This yields the DPO loss function, which optimizes the policy directly from preference data:

Definition — DPO Loss

The Direct Preference Optimization loss is:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $\sigma$ is the logistic sigmoid, $y_w$ is the preferred response, $y_l$ is the dispreferred response, $\pi_\theta$ is the policy being optimized, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ controls the deviation strength.
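The loss is straightforward to compute from sequence log-probabilities; a minimal per-example sketch (the log-probability values below are arbitrary):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from policy and reference log-probs of the
    preferred (w) and dispreferred (l) responses: -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy raises the winner and lowers the loser relative to the
# reference, the loss is small; reversing the ranking makes it large.
good = dpo_loss(logp_w=-1.0, logp_l=-5.0, ref_logp_w=-3.0, ref_logp_l=-3.0)
bad = dpo_loss(logp_w=-5.0, logp_l=-1.0, ref_logp_w=-3.0, ref_logp_l=-3.0)
```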

Theorem — DPO-RLHF Equivalence

Under the Bradley-Terry preference model and the KL-constrained RL objective, optimizing the DPO loss $\mathcal{L}_{\text{DPO}}(\theta)$ yields the same optimal policy as the full RLHF pipeline (reward model training followed by KL-constrained RL). DPO implicitly learns a reward model whose optimal policy under the KL constraint is $\pi_\theta$ itself.

The implicit reward is given by $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$.

Why DPO Works

Consider the log-ratio $\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$: it is positive when the policy assigns higher probability to response $y$ than the reference model does, and negative otherwise. The DPO loss increases this ratio for preferred responses $y_w$ and decreases it for dispreferred responses $y_l$.

In effect, DPO performs a form of contrastive learning on the policy: it pushes up the probability of winning responses relative to the reference, while pushing down the probability of losing responses. The sigmoid ensures that the gradient is strongest when the model's implicit ranking disagrees with the human preference, and weakest when they already agree.
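The gradient weighting is easy to see numerically: differentiating the DPO loss shows each example's gradient is scaled by $\sigma(-\text{margin})$, which is largest when the model's implicit ranking disagrees with the label.

```python
import math

def dpo_grad_weight(margin):
    """Gradient scale of the DPO loss, sigmoid(-margin): near 1 when the
    implicit ranking contradicts the human preference, near 0 when it agrees."""
    return 1.0 / (1.0 + math.exp(margin))

agree = dpo_grad_weight(2.0)      # model already ranks the winner higher
disagree = dpo_grad_weight(-2.0)  # model ranks the loser higher
```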

Practical Advantages of DPO

DPO offers several practical advantages over the full RLHF pipeline: there is no separate reward model to train, no on-policy sampling loop, and no PPO machinery with its hyperparameter sensitivity. The objective is a simple classification-style loss that trains stably on fixed preference data with standard supervised learning infrastructure, and the only extra cost over supervised fine-tuning is forward passes through the frozen reference model.

Empirically, DPO achieves performance competitive with or superior to PPO-based RLHF across a range of tasks. On sentiment control with GPT-2, it achieves a better reward/KL trade-off than PPO. At larger scale, DPO has been adopted in models such as Mistral and LLaMA 3, demonstrating its viability for production-scale alignment.

Putting It Together: LLM Alignment

Example — Aligning a Language Model

Consider the task of making a language model produce helpful, accurate summaries of news articles. The full alignment workflow proceeds as follows:

  1. Pretrain a large language model on a broad corpus (web text, books, code) to acquire general language competence.
  2. Supervised fine-tuning: further train the model on a curated dataset of (article, human-written summary) pairs to teach the format and style of good summaries.
  3. Preference collection: for each article, generate two candidate summaries from the SFT model. Present both to a human annotator who selects the better summary (or marks a tie).
  4. Alignment training: use either RLHF (train a reward model, then optimize with PPO + KL penalty) or DPO (directly optimize the policy on the preference pairs) to steer the model toward producing summaries humans prefer.

Stiennon et al. (2020) showed that RLHF-trained summarization models were significantly preferred by human evaluators over both the SFT baseline and models trained with larger datasets but no preference optimization. The same paradigm powers the alignment of general-purpose assistants like ChatGPT.

Summary

This lecture explored how agents can learn from human guidance rather than from explicit reward functions, tracing a path from imitation learning to modern LLM alignment. The key ideas are:

  1. Behavioral cloning reduces imitation to supervised learning, but distribution shift makes its errors compound, scaling as $O(\epsilon T^2)$ in the horizon.
  2. DAgger repairs the train-test mismatch by querying the expert on the learner's own visited states, achieving $O(\epsilon T)$ regret at the cost of interactive expert access.
  3. The Bradley-Terry model turns pairwise human preferences into a learnable reward function, for individual items or whole trajectories.
  4. RLHF combines supervised fine-tuning, Bradley-Terry reward modeling, and KL-constrained RL (typically PPO) to align language models with human preferences.
  5. DPO collapses reward modeling and RL fine-tuning into a single supervised objective that provably recovers the same optimal policy.

The progression from behavioral cloning through DAgger to RLHF and DPO traces a consistent pattern: as human supervision gets weaker (from full demonstrations to pairwise preferences), the algorithms grow more sophisticated to compensate. These methods have become foundational in modern AI, powering the alignment of the large language models that underpin today's conversational AI systems.