Chapter 2 · 2022 · lever: alignment

Teaching it to be helpful

GPT-3 could write, but it could not answer. It was trained to continue documents, not to follow you. 2022 is the year one group at OpenAI closed that gap with a single recipe shipped twice - once as a paper, once as a product - while another group at Google, almost in passing, showed the model could already reason if you just asked it to show its work.

RLHF InstructGPT ChatGPT CoT BBH

A base language model predicts the next token in a document - it does not know there is a user across the table. The intent gap is the whole motivation for post-training. This chapter walks the recipe that closed it (RLHF via SFT → reward model → PPO), the product launch that turned it into the iPhone moment of AI, and one small prompting trick on the side that we will return to twice more in this book.

1. The gap a base model leaves open

Ask a base GPT-3 "Explain photosynthesis to me." A model trained on the open web will often respond by continuing the document - maybe a list of related homework questions, maybe a confident non-answer, maybe nothing at all. It is doing exactly what it was trained to do: predict the next token of plausible internet text. The pretraining objective never mentioned the word user.

Ouyang et al. named the gap directly. From the abstract of the InstructGPT paper: "we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback." The three properties they wanted out the other side were already in the air - helpful, honest, and harmless (the "3 H's", from Askell et al. 2021) - and the paper became the canonical recipe for getting there. (Ouyang et al., 2022; Askell et al., 2021.)

The intent-alignment gap

A pretrained LM is optimized to predict the next token over a giant text corpus. A useful assistant is optimized to satisfy the user's request. Those objectives overlap, but they are not the same. Closing the gap means moving the model's behavior from "completes documents" toward "answers questions" - without destroying the knowledge it learned in pretraining. That move is post-training.

2. The three-stage recipe

InstructGPT's contribution was less an algorithm than a clean sequence. Read the paper for the math; the recipe in three lines is:

RLHF, as InstructGPT runs it
  1. Collect demonstration data; train a supervised policy.
  2. Collect comparison data; train a reward model.
  3. Optimize a policy against the reward model using PPO.

Each stage builds on the previous one - stage 2's comparisons rank the outputs of stage 1's policy, and stage 3's PPO updates stage 1's model using stage 2's reward.

Stage 1 · SFT, the cheap foundation

Hire about 40 labelers through Upwork and Scale AI. Sample prompts from real API traffic and from the labelers' own imaginations. Have them write the ideal completion - the response a careful, well-informed human would give. Then supervised-fine-tune the base GPT-3 on the resulting (prompt, ideal completion) pairs.

This step alone moves the model dramatically. It learns the surface form of helpfulness: start with a direct answer, use the user's terminology, avoid drifting. SFT is by far the cheapest of the three stages and does most of the visible work. The next two stages push past what demonstrations can teach.
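The supervised objective in stage 1 can be sketched in a few lines, assuming the model's per-token log-probabilities are already available. The name `sft_loss` and the toy numbers are illustrative, not taken from the paper, and leaving prompt tokens out of the loss is a common convention that the text above does not specify.

```python
import math

def sft_loss(token_logprobs, is_completion):
    """Mean negative log-likelihood over completion tokens only.

    token_logprobs: log p(token | previous tokens) for every token in
                    prompt + completion, as produced by the model.
    is_completion:  parallel list of booleans; True marks tokens of the
                    labeler-written completion (the tokens we train on).
    """
    picked = [lp for lp, keep in zip(token_logprobs, is_completion) if keep]
    if not picked:
        raise ValueError("no completion tokens to train on")
    return -sum(picked) / len(picked)

# Toy example: 2 prompt tokens, 3 completion tokens (made-up probabilities).
logprobs = [math.log(0.5), math.log(0.5),
            math.log(0.9), math.log(0.8), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)
```

Minimizing this quantity raises the probability of the labeler's completion given the prompt, which is all "supervised fine-tuning" means here.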

Stage 2 · the reward model is a small judge

You can write a few thousand demonstrations. You cannot write a million. To scale, InstructGPT switches from generation to comparison: a labeler is shown 4 to 9 candidate responses to one prompt and ranks them. Comparisons are easier than writing - and they directly capture relative preference, which is what you ultimately want.
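One ranking carries more signal than it first appears: a ranking of K responses implies K(K-1)/2 pairwise preferences. A minimal sketch of that expansion, assuming the ranking is given best-to-worst with no ties (the function name is hypothetical):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn one best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])  # A is best, D is worst
```

With K = 4 this yields 6 pairs, and with K = 9 it yields 36, so each labeling task produces many training comparisons for the reward model.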

A reward model is a transformer with a scalar head. It is trained on the pairwise comparisons with a simple loss: for each pair where $y_w$ was preferred over $y_l$, push the reward of the winner above the loser. The loss is the Bradley-Terry form - the reward difference is the log-odds that a labeler will prefer the first response:

$\mathcal{L}_{\text{RM}} = -\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$

Here $r_\phi(x, y_w)$ is the scalar reward the RM assigns to the preferred response $y_w$, and $r_\phi(x, y_l)$ is the scalar reward it assigns to the rejected response $y_l$.
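The pairwise loss is small enough to write out directly. This is a sketch for a single comparison using plain floats; a real reward model would compute the two rewards with a network and average the loss over a batch. The function names are illustrative.

```python
import math

def rm_pair_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_w - r_l)) for one comparison, computed stably."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def preference_probability(reward_a, reward_b):
    """Bradley-Terry probability that a labeler prefers a over b."""
    # sigmoid(r_a - r_b) is exactly exp(-loss) for the pair (a, b).
    return math.exp(-rm_pair_loss(reward_a, reward_b))
```

When the two rewards are equal the loss is log 2 and the implied preference probability is 0.5; the loss shrinks toward zero as the preferred response's reward pulls ahead.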

The detail that always surprises: InstructGPT's biggest policy is 175B parameters, but its reward model is only 6B. A small judge is enough. The reward model does not need to write - it only needs to recognize. (Ouyang 2022, §3.5.)

Stage 3 · PPO, with a leash

Now the reward model is frozen and used as the reward function for reinforcement learning. Sample a prompt, have the policy produce a response, score it with the RM, push the policy's weights in the direction that earns more reward. The optimizer is PPO - chosen for stability rather than elegance.

The catch is reward hacking. A policy that purely climbs the RM will drift into word-salad that scores high but reads as gibberish - the RM has its own blind spots, and the policy will find them. So the PPO objective adds a KL penalty measured against the SFT model: for each sampled response, penalize how far the policy's token distribution has drifted from the SFT policy's distribution. The leash keeps it honest.
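The leash amounts to one subtraction on the reward. A minimal sketch, assuming we already have the log-probability that each model assigns to every token of the sampled response; the function name and the default `beta` are illustrative, not the paper's settings.

```python
def kl_shaped_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """RM score minus a penalty for drifting from the SFT policy.

    policy_logprobs / sft_logprobs: log-probability each model assigns to
    every token of the sampled response. Their summed difference is a
    single-sample estimate of KL(policy || SFT) for this response.
    beta: penalty strength (illustrative default).
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob sequences must align token by token")
    drift = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * drift

# The policy has become much more confident than the SFT model was.
shaped = kl_shaped_reward(1.0, [-0.1, -0.2], [-1.0, -1.2])
```

If the policy still matches the SFT model the penalty is zero and the reward is just the RM score; the further the policy moves, the more of its RM score it gives back.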

Why a KL anchor, not a clip

The RM is a finite-sample approximation of human preference. If the policy is allowed to climb it without bound, the gap between RM score and actual human preference widens - classic Goodhart's law. The KL penalty charges the policy for every bit of distance it puts between itself and a known-decent starting point, slowing the divergence. It is not a fix; it is a brake. Reward hacking is going to come back in every later chapter.

The result that made the recipe famous

The numbers are sharp enough to be worth memorizing. On the labelers' held-out preference ratings:

Comparison | Preference for InstructGPT
1.3B InstructGPT vs 175B GPT-3 | preferred, despite 100× fewer parameters
175B InstructGPT vs 175B GPT-3 (no few-shot) | ~85%
175B InstructGPT vs few-shot 175B GPT-3 | ~71%

A 100×-smaller post-trained model preferred to a giant raw one. Most of the knowledge was already there in GPT-3; the missing piece was knowing what was being asked of it. Alignment unlocked the existing capability rather than adding new capability - a pattern that will recur. (Ouyang 2022, Fig. 1.)

3. ChatGPT · the iPhone moment

Just under nine months after the InstructGPT paper landed on arXiv, OpenAI shipped the same recipe on a GPT-3.5 model as a free chat interface. ChatGPT launched on November 30, 2022 as a "research preview." The launch post described the training as using the same methods as InstructGPT, with slight differences in the data collection setup to handle multi-turn dialogue. (OpenAI, "Introducing ChatGPT," Nov 30 2022.)

What followed was the fastest consumer adoption curve in the history of software. A million users in roughly five days, ~100M monthly actives by January 2023 according to UBS / SimilarWeb estimates reported by Reuters. The model itself was barely different from InstructGPT. The interface and the moment were. (Reuters, Feb 1, 2023.)

Why this hit when the API had been live for years

GPT-3 had been available through a public API since June 2020. The leap from API to chatbox is small in code, enormous in audience. RLHF made the model converse instead of merely complete - and conversation is the medium most humans already speak. Alignment was the technical change; the UI was the product change. They arrived together, which is how a research preview becomes a cultural artifact.

4. The other thing that happened in 2022: just ask it to show its work

Five weeks before InstructGPT, a different paper out of Google Brain - Wei, Wang, Schuurmans, and colleagues - pointed out something almost embarrassing. If you give a large language model a few worked examples of step-by-step reasoning and then ask it your question, its accuracy on multi-step problems goes up. A lot. No weights moved. (Wei et al., 2022.)

The technique is so simple it feels like cheating. Prepend a few demonstrations like "Q: ... A: First we ... then we ... so the answer is 27." instead of "Q: ... A: 27." The model picks up the pattern and produces its own intermediate steps before committing to a final answer. This is chain-of-thought prompting. The improvement is largest where you most need it: arithmetic, multi-step word problems, symbolic manipulation.

A GSM8K-style question, two prompt styles

Question. Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

Answer-only prompt. "A: 27." (wrong; the model pattern-matches numbers and rushes.)

Chain-of-thought prompt. "A: Roger started with 5. 2 cans × 3 balls = 6 new balls. 5 + 6 = 11. The answer is 11." (right; same model, same weights.)
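Mechanically, the two prompt styles differ only in what goes into each exemplar's answer slot. A small sketch of the assembly, assuming exemplars are stored as (question, reasoning, answer) triples; the function name and the second question are hypothetical.

```python
def build_prompt(exemplars, question, with_reasoning=True):
    """Assemble a few-shot prompt from (question, reasoning, answer) triples."""
    blocks = []
    for q, reasoning, answer in exemplars:
        if with_reasoning:
            a = f"{reasoning} The answer is {answer}."
        else:
            a = f"The answer is {answer}."
        blocks.append(f"Q: {q}\nA: {a}")
    # The final question is left open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

exemplars = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5. 2 cans × 3 balls = 6 new balls. 5 + 6 = 11.",
    "11",
)]
# Hypothetical new question, not from any benchmark.
new_question = ("A baker has 4 trays of 6 rolls and sells 5 rolls. "
                "How many rolls are left?")
cot_prompt = build_prompt(exemplars, new_question)
plain_prompt = build_prompt(exemplars, new_question, with_reasoning=False)
```

Both prompts go to the same frozen model; the only difference is whether the exemplars show intermediate steps for it to imitate.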

The headline number from the paper: PaLM 540B with 8-shot chain-of-thought reached 56.9% on GSM8K, where the prior state-of-the-art - a fine-tuned GPT-3 with a learned verifier - was 55%. With an external calculator the same setup pushed to 58.6%. Prompt engineering matched and then beat a system that touched weights.
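Scores like these come from comparing only the final number, not the reasoning. A sketch of that scoring step, assuming completions end with the phrase "The answer is N." as the exemplars in this chapter do; the regular expression and function names are illustrative rather than the paper's evaluation code.

```python
import re

# Captures integers, decimals, and comma-grouped numbers after the phrase.
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")

def extract_answer(completion):
    """Return the last 'The answer is N' number in a completion, or None."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def accuracy(completions, gold_answers):
    """Fraction of completions whose extracted answer equals the gold value."""
    correct = sum(
        extract_answer(c) == g for c, g in zip(completions, gold_answers)
    )
    return correct / len(gold_answers)

outputs = [
    "The answer is 27.",
    "Roger started with 5. 2 cans × 3 balls = 6 new balls. "
    "5 + 6 = 11. The answer is 11.",
]
score = accuracy(outputs, [11, 11])
```

A chain of reasoning earns no credit on its own under this metric: a completion counts only if the number it commits to at the end matches the reference.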

The other observation buried in the paper turned out to matter more than the headline: chain-of-thought only helps once the model is large enough. Below roughly 100B parameters, asking for reasoning steps does not help and often hurts. Above it, the gap opens up dramatically. Wei et al. describe this as an emergent ability - a capability that appears suddenly with scale rather than improving smoothly.

BBH: the eval bed for the seed

Numbers on GSM8K are not enough to establish that reasoning has been unlocked broadly. Later in 2022, Suzgun et al. cut a 23-task subset out of BIG-Bench called BIG-Bench Hard (BBH) - chosen because prior model evaluations had failed to beat the average human rater on them. The residual hard pile. With plain prompting, large models still trailed humans. Add chain-of-thought, and Codex (code-davinci-002) surpassed the average human on 17 of 23 tasks; PaLM 540B on 10 of 23. (Suzgun et al., 2022.)

BBH is going to be the eval shadow over the next three chapters. "+X points on BBH" appears in almost every prompt-optimization paper from 2022 onward. The point of the benchmark is not to be the final word on reasoning - it is to be a hard-enough bar that prompt-level interventions can register as real progress.

5. What this chapter plants for later

Chain-of-thought arrives here as a prompt trick - zero weights touched. That framing is going to grow twice. In Chapter 4 we will see the same idea promoted into an optimization target: GEPA and friends search for the best CoT-shaped prompt automatically, treating wording as something you can optimize. In Chapter 5 the idea grows again, this time directly into the weights: models such as o1 and DeepSeek-R1 are trained to produce long internal reasoning traces by reinforcement learning on verifiable rewards. "Make the model show its work" surfaces at three different layers of the stack. The seed is here.

Three appearances of the same idea

Ch 2 (here): CoT as a prompt trick. No training. Few-shot exemplars include intermediate reasoning, the model copies the pattern.
Ch 4: CoT as a search target. Automatic prompt-optimization methods evolve the wording of those exemplars. The prompt becomes a parameter.
Ch 5: CoT as a trained capability. RL on verifiable rewards bakes the "think step by step" trace into the weights themselves.

6. The recipe is great. It is also a mess.

Read the InstructGPT paper carefully and the cost shows up everywhere. Forty labelers on contract. ~13,000 prompts with written demonstrations. ~33,000 prompts with ranked comparisons. Three sequential training runs, each on top of the previous one. PPO is famously high-variance - small hyperparameter changes wreck a run. You need a frozen reference policy in memory to compute the KL penalty, a separate critic to estimate value, and the reward model alongside the policy you are actually updating. Four models loaded at once, and an RL loop on top.
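A back-of-envelope tally makes the "four models" point concrete. The 175B policy and 6B reward model sizes come from this chapter; the critic size and 16-bit weights are assumptions made for illustration, and the total counts weights only, ignoring gradients, optimizer state, and activations.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit weights

models_in_memory = {
    "policy (being trained)": 175e9,
    "frozen SFT reference": 175e9,
    "reward model": 6e9,
    "critic / value model": 6e9,  # assumption: same size as the RM
}

def weights_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

total_gb = sum(weights_gb(n) for n in models_in_memory.values())
```

Under these assumptions the weights alone come to roughly 724 GB before a single gradient is stored, which is why dropping even one of the four models matters.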

It works. It also screams to be simplified. The next chapter is the cost curve breaking downward: DPO drops the reward model and the RL loop and learns directly from preference pairs with a single classification loss; GRPO keeps RL but drops the critic by using a sampled group as its own baseline. The same alignment, with fewer moving parts and a much shorter cycle time.
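As a preview, both simplifications fit in a few lines each. This is a hedged sketch of the commonly stated forms, with summed response log-probabilities passed in as plain floats; the function names, `beta`, and `eps` are illustrative. Note that the DPO loss still reads log-probabilities from a frozen reference model, even though the reward model and the RL loop are gone.

```python
import math
from statistics import mean, pstdev

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]) for one pair."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)), written in an overflow-safe form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses sampled for one prompt, with illustrative rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The first function has the same shape as the reward-model loss from stage 2, applied to the policy itself; the second replaces the learned critic with statistics of the sampled group.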