Chapter 3 · Alignment, continued · lever: democratization

The recipe gets cheaper

RLHF worked, but the recipe was expensive. Over 2023 and 2024 it was taken apart in two directions at once. The alignment algorithm shed pieces - DPO dropped the reward model and the RL loop, GRPO dropped the critic. The base model stopped being a secret - LLaMA's weights leaked into the open, and Alpaca and Vicuna showed you could fine-tune a respectable assistant on a credit card. The compute stopped being a wall - LoRA, then QLoRA, fit a 65B fine-tune onto a single GPU. Same destination as Chapter 2, an order of magnitude cheaper at every step.

The Chapter 2 recipe - supervised fine-tuning, then a reward model, then PPO - works, and ChatGPT was the proof. But running it required keeping four models resident at once (policy, reference, reward, critic), and PPO is notoriously finicky to tune. Most of the next two years is the same field removing moving parts from that recipe, while a parallel democratization wave brings the base models and the compute within reach of anyone with a single GPU and a weekend.

3.1 DPO - your LM is secretly a reward model

The DPO paper, posted to arXiv in May 2023 and later given outstanding-paper recognition at NeurIPS 2023, has the best title in the field - "Your Language Model is Secretly a Reward Model." The idea: under the KL-constrained RL objective that RLHF actually optimizes, the optimal policy is related to the reference policy through the reward in closed form. Invert that mapping and the reward function disappears into the policy itself. You can train directly on pairwise preference data - a "chosen" answer and a "rejected" answer - with a single classification loss. No reward model. No PPO rollouts. No critic. (Rafailov et al., 2023; NeurIPS award.)
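The inversion can be sketched in four lines. The notation here is borrowed from the DPO paper rather than from the text above: $r(x,y)$ is the reward, $\pi^*$ the optimal policy, $Z(x)$ a per-prompt normalizing constant, and $y_w$, $y_l$ the chosen and rejected responses. The KL-constrained objective is

$\max_{\pi}\ \mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] - \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big)$

and its maximizer has the closed form

$\pi^*(y\mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)$

Taking logs and solving for the reward gives

$r(x,y) = \beta \log \frac{\pi^*(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta \log Z(x)$

Under the Bradley–Terry model the preference probability depends only on a reward difference, so the intractable $\beta \log Z(x)$ term cancels:

$p(y_w \succ y_l \mid x) = \sigma\big(r(x,y_w) - r(x,y_l)\big) = \sigma\!\left(\beta \log \frac{\pi^*(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi^*(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)$

Replacing $\pi^*$ with the trainable $\pi_\theta$ and maximizing the log-likelihood of the observed preferences yields a loss with no explicit reward model in it.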

The DPO loss, in one line

Given a chosen response $y_w$ and a rejected one $y_l$ for prompt $x$, minimize a loss that widens the log-likelihood gap between them, measured against a frozen reference policy:

$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left( \beta \log \htmlData{tip=how much more likely the preferred answer is under the trained model vs the frozen reference}{\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right)$

No reward network appears in this loss. The Bradley–Terry preference model has been folded directly into the policy, and the KL anchor to the reference policy survives as the $\beta$ coefficient.
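As a rough illustration, the loss for a single preference pair can be computed from four numbers: the summed log-probability of each response under the trained policy and under the frozen reference. The sketch below uses only the Python standard library. The function names, the argument names, and the default `beta=0.1` are assumptions made for this example, not values taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a whole response given the prompt,
    i.e. the sum of its per-token log-probabilities. beta=0.1 is an
    illustrative default, not a recommendation.
    """
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta(y_l|x) / pi_ref(y_l|x)
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is zero
# and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response's log-probability relative to the reference
# lowers the loss.
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)

print(loss_at_init, loss_after_update)
```

In a real training loop the same arithmetic would run on batched tensors with gradients flowing only through the policy log-probabilities; the reference terms are constants.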

In the paper's experiments (sentiment control, summarization, single-turn dialogue) DPO is stable where PPO is twitchy, runs without an online sampling loop, and matches or beats PPO-based RLHF. It also lowered the barrier: you no longer needed a reward-modeling infrastructure to ship an aligned model. A small team with a dataset of preference pairs and an SFT checkpoint could now do alignment with a regular supervised-learning workflow.

3.2 GRPO - the group is its own baseline

DPO took alignment out of reinforcement learning. GRPO stayed in, and shed something different. Introduced by DeepSeek in February 2024 inside the DeepSeekMath paper, GRPO drops the PPO critic - the value head that estimates expected return. The critic is roughly as big as the policy and has to be held in memory alongside it; dropping it frees a significant chunk of memory and compute. (Shao et al., 2024; PPO baseline: Schulman et al., 2017.)

The trick is the name: group-relative. For each prompt, sample a group of $G$ candidate responses, score all of them with the reward model, and use the group's own statistics as the baseline. Each response's advantage is just how far its reward sits above the group mean, in units of the group's standard deviation:

GRPO advantage

$\hat{A}_{i,t} = \dfrac{r_i - \htmlData{tip=average reward across the G responses sampled for this prompt}{\text{mean}(r_1,\dots,r_G)}}{\text{std}(r_1,\dots,r_G)}$

The group of $G$ samples sharing a prompt is the baseline. No critic network needed.
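A minimal sketch of that normalization, using only the standard library. Two details are assumptions of this example rather than something the formula specifies: it uses the population standard deviation, and it adds a small `eps` to the denominator so a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for the G responses sampled for one prompt.

    rewards: list of scalar rewards, one per sampled response.
    eps: small constant guarding against a zero standard deviation
         (an assumption of this sketch).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; sample std is another common choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to the same prompt, two rewarded and two not.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # roughly [1, -1, -1, 1]
```

Every token in response $i$ receives the same advantage, which is why the subscript $t$ on $\hat{A}_{i,t}$ does not appear on the right-hand side.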

That's the whole change. Policy and reference model still live in memory. Reward model still scores. PPO's clipped surrogate update still applies. Only the critic - a second model-sized set of activations and optimizer state - is gone. The same GRPO comes back in Chapter 5: it's the algorithm DeepSeek uses to train R1, the open reasoning model that rivals o1.
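For reference, the clipped surrogate that carries over from PPO can be sketched for a single token as follows. Here `ratio` stands for the probability of the token under the current policy divided by its probability under the policy that generated the sample, and `clip_eps=0.2` is an illustrative default; both names are assumptions of this example. The sketch leaves out the KL penalty toward the reference policy that the GRPO objective also includes.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one token (to be maximized).

    ratio: pi_theta(token) / pi_old(token) for the sampled token.
    advantage: in GRPO, the group-relative advantage of the whole response.
    clip_eps: clipping range; 0.2 is an illustrative value.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # further outside the clipping range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 1.0)     # positive advantage: capped at 1.2
penalty_kept = clipped_surrogate(0.5, -1.0)   # negative advantage: uses the clipped ratio 0.8
unclipped = clipped_surrogate(1.0, 2.0)       # ratio inside the range: plain ratio * advantage
print(gain_capped, penalty_kept, unclipped)
```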

Three recipes, side by side

The story across this chapter so far is one of progressive simplification. PPO holds four models in memory. DPO needs two and no RL loop. GRPO needs the reward signal but drops the critic.

Which models are in play, by method

| | PPO | DPO | GRPO |
|---|---|---|---|
| Reward model needed? | Yes | No | Yes |
| Critic / value head? | Yes | No | No |
| On-policy sampling? | Yes | No - static pairs | Yes - group of $G$ |
| Baseline for advantage | Learned critic | (no RL update) | Group mean / std |

Two different ways to shrink the same recipe - and a hint of what's coming. DPO showed you could align without RL at all. GRPO showed you could keep RL but drop the most expensive piece. By Chapter 5 the lane reaches its logical end: reasoning models trained with verifiable rewards, where the human-labeled reward model is gone too.

3.3 The open-weights wave

The other half of this chapter isn't about algorithms. It's about who gets to run them. In February 2023, Meta released LLaMA - seven, thirteen, thirty-three, and sixty-five-billion-parameter base models under a non-commercial research license, available to approved researchers. The headline claim in the paper was already a shock: LLaMA-13B outperforms the 175B GPT-3 on most benchmarks, and the 65B sits in the same league as Chinchilla-70B and PaLM-540B. A 13B model wasn't supposed to be able to do that. (Touvron et al., 2023; Meta announcement.)

A week later, on 3 March 2023, the weights leaked. Someone posted a torrent on 4chan, the files propagated through HuggingFace pull requests, and by the time Meta filed DMCA takedowns the leak was already everywhere. From that moment forward, "open-weights LLaMA derivatives" were a fact on the ground regardless of the license. (Timeline.) When Meta released LLaMA 2 in July 2023 under a commercial-permissive license - with one carve-out for licensees above 700M monthly active users - the open-weights wave stopped being gray-zone leakage and became official Meta strategy. (LLaMA 2 license.)

What happened in the weeks after the LLaMA-1 leak set the template for everything since.

Two weeks in March 2023

Alpaca (Stanford CRFM, 13 March). Take LLaMA-7B, generate 52K instruction-following examples by self-instruct from text-davinci-003, fine-tune. Cost: under $600 total (~$500 in OpenAI API calls plus <$100 in cloud compute). In a blind pairwise evaluation, Alpaca tied text-davinci-003, 90 wins to 89. (CRFM.)

Vicuna (LMSYS, 30 March). Take LLaMA, fine-tune on ~70K real ChatGPT conversations from ShareGPT. LMSYS reported ~$300 for the 13B run and used GPT-4 as a judge to call Vicuna-13B "more than 90% of the quality of ChatGPT and Bard." Their own blog post flagged the eval as "fun and non-scientific" - a useful reminder that an LLM judge can be Goodharted long before the underlying model actually catches up. (LMSYS.)

The headline numbers in those posts aged unevenly, but the shape of the moment is what mattered: in two weeks, a research community with no access to GPT-4-scale training infrastructure had produced credible instruction-tuned assistants for a few hundred dollars each, on top of a base model someone else had paid millions to train. The cost floor of "make a useful chat model" had dropped two orders of magnitude in a fortnight.

3.4 PEFT - the compute wall comes down

Alpaca and Vicuna were still full fine-tunes, which is why they leaned on the 7B and 13B sizes. To touch 65B you needed a small cluster. The fix was already waiting: parameter-efficient fine-tuning - freeze the pretrained weights and train a small adapter beside them.

LoRA (Hu et al., June 2021) is the version that won. The pretrained weight matrix $W \in \mathbb{R}^{d\times k}$ stays frozen; the update is constrained to a low-rank factorization $\Delta W = BA$, where $A \in \mathbb{R}^{r\times k}$, $B \in \mathbb{R}^{d \times r}$, and the rank $r$ is much smaller than $\min(d, k)$. Train only $A$ and $B$. At inference, merge $BA$ back into $W$ - no extra latency. Compared with full Adam fine-tuning of GPT-3 175B, the paper reports ~10,000× fewer trainable parameters and ~3× less GPU memory, with no quality drop. (Hu et al., 2021.)
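A toy sketch of those mechanics in plain Python follows. It checks two things: that the adapter path $Wx + B(Ax)$ and the merged matrix $(W + BA)x$ give the same output, and how the trainable-parameter count compares with the full matrix. The matrix sizes, helper names, and random "trained" values for $B$ are all made up for illustration. The zero initialization of $B$ follows the paper's description, while the paper's $\alpha/r$ scaling factor is omitted here for brevity.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Elementwise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_param_counts(d, k, r):
    """Return (parameters in W, trainable parameters in B and A)."""
    return d * k, r * (d + k)


rng = random.Random(0)
d, k, r = 6, 4, 2  # toy sizes, chosen only for illustration

W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B_init = [[0.0] * r for _ in range(d)]                          # trainable, d x r

# With B at zero the update BA is zero, so training starts from the pretrained model.
delta_at_start = matmul(B_init, A)

# Stand-in for a trained B (random values, not the result of real training).
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]
x = [[rng.gauss(0, 1)] for _ in range(k)]                       # input column, k x 1

h_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))  # W x + B (A x)
h_merged = matmul(add(W, matmul(B, A)), x)              # (W + BA) x

full, trainable = lora_param_counts(4096, 4096, 8)
print(full, trainable, full // trainable)
```

For a single 4096 × 4096 matrix at rank 8, that is 16,777,216 frozen parameters against 65,536 trainable ones, a 256× reduction for that matrix. The much larger whole-model ratio quoted from the paper comes from adapting only a subset of the weight matrices at very low rank.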

QLoRA (Dettmers et al., May 2023) took the next step: keep LoRA's frozen-base / trainable-adapter idea, but also quantize the frozen base to 4-bit. Three new tricks made it work without degradation: NF4, a 4-bit NormalFloat data type that is information-theoretically optimal for normally distributed weights; double quantization, which quantizes the quantization constants themselves and saves another ~0.4 bits per parameter on average; and paged optimizers, which use NVIDIA unified memory to spill optimizer state to CPU during the memory spikes that would otherwise OOM. The payoff: fine-tuning a 65B model on a single 48 GB GPU, 24 hours of training, matching 16-bit full fine-tuning quality. The Guanaco family that came out of this hit 99.3% of ChatGPT on the Vicuna benchmark. (Dettmers et al., 2023.)
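The memory arithmetic behind those numbers is easy to check by hand. The sketch below counts only the frozen base weights and their quantization constants; it ignores activations, the LoRA adapters, and optimizer state. The block sizes (64 weights per first-level block, 256 constants per second-level block) and the constant widths (32-bit, then 8-bit) follow the QLoRA paper's description. Treat them as assumptions if your setup differs.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 65e9  # a 65B-parameter base model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit base weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit base weights, before constants

# Plain block-wise quantization: one 32-bit constant per block of 64 weights.
BLOCK = 64
overhead_single = 32 / BLOCK

# Double quantization: the constants become 8-bit, with one 32-bit
# second-level constant shared by every 256 first-level constants.
overhead_double = 8 / BLOCK + 32 / (BLOCK * 256)

saving_bits = overhead_single - overhead_double
saving_gb = weight_memory_gb(N_PARAMS, saving_bits)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
print(f"double quantization saves {saving_bits:.3f} bits/param, about {saving_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone need 130 GB, the 4-bit weights 32.5 GB, and double quantization recovers roughly 0.37 bits per parameter, or about 3 GB at 65B. That is how the base model fits under a 48 GB ceiling with room left for adapters and activations.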

[Interactive figure in the original: 65B fine-tune, three recipes]

A node on the map - you are here

If you've been ML-literate since 2019 and you've been scoping a "fine-tune a small model on my own writing in my own voice" project, this is exactly the node where that plan lives. QLoRA on a single consumer-ish GPU was the credible 2023-vintage path to personal fine-tuning, and it is a real one. Hold onto it - Chapter 7's capstone returns to it and argues, with a chapter's worth of evidence, that for style specifically the prompt lane (Chapter 4) is the better path. Plant the seed; we'll resolve it.

3.5 The democratization curve, on one axis

Notice that the two halves of this chapter are the same curve seen from two angles.

| Lever | 2022 baseline | 2023 | 2024 |
|---|---|---|---|
| Alignment algorithm | PPO (4 models) | DPO (2 models, no RL) | GRPO (no critic) |
| Base model access | API-only | LLaMA → Alpaca / Vicuna | LLaMA-2 commercial |
| Compute footprint | Full FT (cluster) | LoRA (single GPU) | QLoRA (48 GB) |

Every row is the same gesture: keep what you're trying to do, drop something the previous generation needed. None of this changed the shape of the alignment recipe Chapter 2 laid out. Better algorithms, more open weights, smaller GPUs - the map is just being filled in cheaper.

3.6 What this chapter didn't touch

Two things, on purpose. The PEFT family is wider than just LoRA and QLoRA - adapters, prefix-tuning, IA³, others - and they all live in the weights lane. The PEFT survey is the place to go if you want the taxonomy. (Han et al., 2024.)

And note what every method in this chapter still does: it touches the weights. DPO, GRPO, LoRA, QLoRA - cheaper at every step, but still gradient descent into parameters. The next chapter asks a stranger question. What if you never touch the weights at all, freeze the model entirely, and treat the prompt as the thing you optimize? That move turns out to rebuild this entire RL machinery out of language.