Chapter 5 · Learning to think · From Attention to Agents

Three moves. First, a second knob appears: instead of (only) scaling training, you scale the amount of compute spent at inference - the model thinks longer before it answers, and accuracy keeps climbing. Second, the trick we met in Chapter 2 as a prompt (chain-of-thought) and chased in Chapter 4 as an optimization target now moves into the weights, taught by reinforcement learning with answers a program can check. Third, the recipe goes open in a single weekend with R1, and the cost story of frontier reasoning collapses.

5.1 A second scaling axis

Chapter 1's scaling laws said the same thing for years: pour more compute into pretraining and loss falls along a power law. The axis was always train-time. What if there is a second, parallel axis - spend more compute at inference, and let the model chew on the problem longer before it commits?

Snell, Lee, Xu, and Kumar named it cleanly in August 2024: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Their headline finding: when you spend test-time compute well - search against a process reward model or adaptively update the model's distribution during decoding - a small base model with a thinking budget can outperform a 14× larger model in a FLOPs-matched comparison. Same problem, same total compute; just spent differently.
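A back-of-the-envelope sketch of what "FLOPs-matched" buys at inference, assuming the common rule of thumb of roughly 2 FLOPs per parameter per generated token. The model sizes and the 500-token answer are hypothetical, and the sketch ignores pretraining cost and attention overhead, so it illustrates the trade rather than the paper's own accounting.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb inference cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params


def matched_token_budget(small_params: float, large_params: float,
                         large_tokens: int) -> float:
    """Tokens the small model can generate for the same inference FLOPs
    the large model spends on `large_tokens` tokens."""
    budget = forward_flops_per_token(large_params) * large_tokens
    return budget / forward_flops_per_token(small_params)


# Hypothetical sizes with a 14x parameter ratio.
small, large = 1e9, 14e9
tokens = matched_token_budget(small, large, large_tokens=500)
print(tokens)  # 7000.0
```

Under these assumptions, one 500-token answer from the large model costs the same as 7,000 tokens from the small one: a long trace, or many shorter samples to vote or search over.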

Insight - two axes

You can buy capability with training compute (bigger model, more tokens) or with inference compute (think longer, sample more, search). The two trade off. For problems where checking is cheaper than solving - math, code, anything with a verifier - inference compute is the cheaper coin.

The plumbing for this was already there. Lightman et al. published Let's Verify Step by Step in May 2023, introducing process reward models that score each intermediate step instead of just the final answer. PRM-guided best-of-N with a pre-RLHF GPT-4 generator hit 78.2% on MATH - a hint that the gains from thinking-longer plus checking-along-the-way were real.
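PRM-guided best-of-N can be sketched in a few lines. This is a minimal illustration, not the paper's code: `toy_prm` is a hypothetical stand-in for a trained process reward model, and scoring a solution as the product of its per-step probabilities is taken here as an assumption about how step scores are aggregated.

```python
import math
from typing import Callable, List, Sequence

Steps = List[str]


def solution_score(step_probs: Sequence[float]) -> float:
    """Collapse per-step correctness probabilities into one score.
    Assumption: the product, i.e. the chance that every step is right."""
    return math.prod(step_probs)


def best_of_n(candidates: Sequence[Steps],
              prm: Callable[[Steps], List[float]]) -> Steps:
    """Return the candidate solution whose steps the PRM trusts most."""
    return max(candidates, key=lambda steps: solution_score(prm(steps)))


def toy_prm(steps: Steps) -> List[float]:
    # Hypothetical scorer: distrust any step that admits to guessing.
    return [0.3 if "guess" in s else 0.95 for s in steps]


candidates = [
    ["2x = 10", "guess x = 4"],
    ["2x = 10", "x = 10 / 2", "x = 5"],
]
best = best_of_n(candidates, toy_prm)
print(best[-1])  # x = 5
```

Because one bad step drags the whole product down, a solution with a single shaky step loses to a longer solution whose steps all check out.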

5.2 o1, and the curve that sells the idea

On September 12, 2024, OpenAI released o1-preview and o1-mini. Their own framing is the cleanest statement of the new paradigm: “Performance consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).” The full o1 followed on December 5, 2024 as part of the 12 Days of OpenAI drop, with the o1 system card and a $200/month Pro tier launched the same day.

OpenAI doesn't show you the actual chain. The user-facing “thinking” you see is a summary generated from a hidden internal trace. Stated reasoning: keep the raw CoT clean for safety monitoring (don't train policy compliance onto it), and don't give competitors a corpus. The mechanism beneath - large-scale RL on long internal reasoning - is acknowledged; the algorithmic specifics aren't.

Note - what we actually know about o1

Training algorithm, model size, and total compute aren't disclosed. Anything beyond “large-scale RL on chain-of-thought” is speculation - PRMs, MCTS, GRPO have all been guessed at by outside observers, none confirmed.

The number that does the work is the AIME 2024 curve. One o1 sample lands 74%. Take the majority vote over 64 samples and you climb to 83%. Re-rank 1,000 candidates with a learned scoring function and you're at 93%. Same weights every time. The only thing that changes is how much compute you spend at inference.
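The two aggregation regimes are simple to state in code. A minimal sketch: the sampled answers and the `learned_scores` table are made up for illustration, and o1's actual scoring function is not public.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Consensus: return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def rerank(answers: Sequence[str], scorer: Callable[[str], float]) -> str:
    """Re-ranking: return the answer a scoring function rates highest."""
    return max(answers, key=scorer)


# Hypothetical final answers from five samples of the same model.
samples = ["204", "204", "210", "204", "197"]

# Hypothetical scores standing in for a learned scoring function.
learned_scores = {"204": 0.4, "210": 0.9, "197": 0.1}

print(majority_vote(samples))                   # 204
print(rerank(samples, learned_scores.__getitem__))  # 210
```

The two can disagree: voting rewards the answer the model reaches most often, while re-ranking can surface a rare answer if the scorer is good enough to recognize it.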

Demo · AIME 2024 vs reasoning budget

Codeforces went from the 11th percentile for GPT-4o to the 89th percentile for o1. The pattern repeats on MATH-500 and on GPQA Diamond, where o1 reaches PhD-level accuracy on physics, biology, and chemistry questions. The headline takeaway isn't any single number; it's that a budget knob you didn't have last year suddenly buys you capability nothing was buying you before.

5.3 R1: the open weekend

Four months later, on January 20, 2025, DeepSeek released DeepSeek-R1 with open weights and an MIT license. The paper, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, went up on arXiv on January 22 and was later published in Nature. Two siblings shipped:

DeepSeek-R1-Zero. RL applied directly to DeepSeek-V3-Base (671B-parameter MoE, ∼37B active per token). No supervised cold start. Reasoning behaviors - self-checking, branching, backing up to fix earlier steps - emerge under pure RL.
DeepSeek-R1. A small cold-start SFT on curated CoT traces before RL, then a multi-stage pipeline (reasoning RL with a language-consistency reward to stop the language-mixing → SFT → preference RL). Same idea, with the rough edges sanded down.

The training algorithm is GRPO from Chapter 3 - the critic-less, group-relative variant of PPO - paired with RLVR: instead of a learned reward model, the reward is a program that checks whether the answer is right. Math: did the final boxed number match? Code: did the unit tests pass? Pull the lever, get a 0 or a 1; no human labels in the loop.
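The "group-relative" part is the piece that replaces the critic. A minimal sketch of just the advantage computation, assuming 0/1 rewards from a checker; it omits the clipped policy-ratio objective and the KL penalty, and the `eps` guard and the use of population standard deviation are choices made for this illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Score each sampled answer against its own group: subtract the group
    mean and divide by the group standard deviation. No value network needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four answers sampled for one prompt; the checker passed the first and last.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers get a positive advantage and wrong ones a negative advantage. When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.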

Definition - RLVR

Reinforcement learning with verifiable rewards. Replace the learned reward model from RLHF (Ch 2) with a deterministic checker for tasks where correctness is mechanical: math-equality, unit-test pass, format match. Cheap, exact, un-hackable in the small. Coined as a recipe in Tülu 3 (Lambert et al., AI2, Nov 2024); the optimizer half of the mechanism, GRPO, was already running in DeepSeekMath in Feb 2024, there paired with a learned reward model rather than a rule-based checker.
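A verifiable reward for math can be as small as a regular expression and a string comparison. The sketch below is illustrative only: `extract_boxed` and `math_reward` are hypothetical names, the regex does not handle nested braces, and a real checker would also normalize equivalent forms such as 1/2 and 0.5.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces inside.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the text, if any."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the final boxed answer equals the reference string, else 0.0."""
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


print(math_reward(r"so the total is \boxed{42}", "42"))  # 1.0
print(math_reward("the answer is 42", "42"))             # 0.0
print(math_reward(r"\boxed{41}", "42"))                  # 0.0
```

The second case shows why format matters: an answer that is right but not boxed still scores 0, so the checker implicitly teaches the output format as well.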

5.3.1 The pipeline as a single picture

Diagram · R1 training pipeline

The benchmark numbers on R1 are DeepSeek's own evals against the o1-1217 snapshot (the December 17, 2024 build): 79.8% on AIME 2024, 97.3% on MATH-500, Codeforces Elo of 2029 (96.3rd percentile). Treat these as “DeepSeek reports parity with o1,” not “objectively matches o1” - OpenAI hasn't published a side-by-side.

5.3.2 The “aha moment”

During R1-Zero training, two things happen on their own. Response length grows steadily. And the model starts spontaneously stopping mid-derivation to revisit and correct earlier steps - the now-famous Wait... pattern. DeepSeek frames it as emergence under pure RL. Honest hedge: follow-up work (notably oat-zero) finds that base models already produce reflection tokens, so part of what RL is doing is eliciting a capability that was latent, not creating it from nothing. The behavior is real; the “emergence” framing has caveats.

Demo · reasoning trace, step by step

5.4 The CoT seed, three chapters later

The chain-of-thought trick we met in Chapter 2 was a prompting finding: Wei et al. (NeurIPS 2022) showed that few-shot exemplars with worked-out reasoning lifted math accuracy, and Kojima et al. (NeurIPS 2022) found that simply appending “Let's think step by step” did much the same zero-shot, no weights changed. In Chapter 4 it became a target for prompt optimizers - GEPA, OPRO, DSPy - searching for the best CoT-shaped instruction. Here it crosses the last bridge: it becomes part of the weights. The model doesn't need a prompt that says “think step by step” anymore. It just thinks step by step, because that's the behavior that paid off under RLVR.

The lever is the same idea each time - make the model show its work. Where it lives keeps changing.

Chapter	Where the thinking lives	Mechanism	Cost shape
Ch 2 · 2022	In the prompt	“Let's think step by step.” A discovery, not training.	Free at inference; nothing to train.
Ch 4 · 2022 – 2025	In the optimizer's search	GEPA / OPRO / DSPy search for the best CoT-shaped instruction.	Many forward passes per search step; no gradient updates.
Ch 5 · 2024 – 2025	In the weights	RL on long CoT against verifiable rewards (GRPO + RLVR).	Large RL bill once; expensive inference per query, but no prompt-engineering at deploy time.

5.5 The cost shock

The viral number from R1 week was $5.576M: the cost DeepSeek reported for the V3 base-model training run. The figure is real and the source is DeepSeek's own V3 technical report - but read the footnote. It covers the official training run only, at then-current GPU-rental rates. It excludes prior research, ablations, and all of the R1 post-training. Independent estimates (SemiAnalysis via CNBC) put DeepSeek's total hardware spend in the hundreds of millions.
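The headline figure is a straight multiplication of GPU-hours by a rental rate. The breakdown below is recalled from the V3 technical report's cost table and is worth checking against the report itself. Note that "post_training" here is V3's own post-training, not R1's.

```python
# H800 GPU-hours by stage, as recalled from the V3 technical report.
GPU_HOURS = {
    "pretraining": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,  # V3's own post-training; R1's RL is not included
}

# Assumed rental rate in USD per GPU-hour, the rate the report uses.
RATE_USD_PER_GPU_HOUR = 2.00

total_hours = sum(GPU_HOURS.values())
cost_usd = total_hours * RATE_USD_PER_GPU_HOUR

print(total_hours)  # 2788000
print(cost_usd)     # 5576000.0
```

Nothing in this arithmetic covers buying the hardware, failed runs, or salaries, which is why the total-spend estimates land orders of magnitude higher.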

Note - the punchline isn't the exact number

The shock wasn't “$6M trains a frontier model.” The shock was that an open-weights model rivaling o1 dropped at all - trained for something on the order of a frontier lab's monthly bill, in a regime widely assumed to require billions. Markets agreed: on January 27, 2025, NVIDIA shed roughly $589 billion of market cap, the largest single-day loss of market value by any company in US history.

And then it kept getting cheaper. Less than two weeks after R1, s1 (Muennighoff et al., Jan 31 2025) SFT'd Qwen2.5-32B-Instruct on a curated 1,000-trace dataset and added a single test-time trick: appending the token “Wait” when the model tries to stop, to force it to keep thinking, or cutting the trace off to make it stop. They call it budget forcing. The result beat o1-preview on competition math (MATH and AIME24) by up to 27%. A thousand examples, one ugly token, off-the-shelf model. The recipe had democratized in days.
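Budget forcing is a decoding-loop trick, so it fits in a short function. This is a sketch under stated assumptions, not the s1 implementation: `generate` is a hypothetical callable that returns the next chunk of the trace, `END_THINK` is a made-up delimiter, and whitespace-separated words stand in for tokens.

```python
from typing import Callable, Iterable

END_THINK = "</think>"  # hypothetical end-of-thinking delimiter


def budget_forced_thinking(prompt: str, generate: Callable[[str], str],
                           min_words: int, max_words: int,
                           max_rounds: int = 8) -> str:
    """Keep the reasoning trace between min_words and max_words."""
    trace = ""
    for _ in range(max_rounds):
        chunk = generate(prompt + trace)
        stopped = chunk.endswith(END_THINK)
        if stopped:
            chunk = chunk[: -len(END_THINK)]  # suppress the model's own stop
        trace += chunk
        n_words = len(trace.split())
        if n_words >= max_words:
            # Over budget: cut the trace and force the delimiter below.
            trace = " ".join(trace.split()[:max_words])
            break
        if stopped and n_words >= min_words:
            break  # the model stopped by itself after enough thinking
        if stopped:
            trace += " Wait,"  # stopped too early: nudge it to continue
    return trace + END_THINK


def scripted_model(chunks: Iterable[str]) -> Callable[[str], str]:
    """Toy stand-in for a model: replays fixed chunks, ignoring the context."""
    it = iter(chunks)
    return lambda _context: next(it)


model = scripted_model(["2+2 is 5" + END_THINK, " no, 2+2 is 4" + END_THINK])
out = budget_forced_thinking("Q: 2+2? ", model, min_words=5, max_words=50)
print(out)  # 2+2 is 5 Wait, no, 2+2 is 4</think>
```

In the toy run the model tries to stop after three words with a wrong answer. The appended “Wait,” gives it a second pass, which is the mechanism the paragraph above describes.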

5.6 What this chapter changed

A new lever joined the stack. You can now buy capability by spending compute at inference, not just at training - and on math-and-code shaped tasks where a verifier is cheap, that inference compute is the cheaper coin. The chain-of-thought seed planted in Chapter 2 finally went all the way through the stack and into the weights, via GRPO from Chapter 3 with RLVR. And the cost frontier of reasoning collapsed in a single weekend in January 2025, with the recipe spreading to a 32B distill within days.

The lever is now test-time compute: more thinking at inference, baked in by RL against a checker. What this still doesn't give you is action. A model that can think for 60 seconds about a math problem still can't open a terminal, run a script, read the output, and try again. That's the next chapter.

Through-line - CoT, three times

Same idea, three altitudes: prompt trick (Ch 2) → prompt-optimization target (Ch 4) → trained directly into the weights (Ch 5). When you see a finding in this field, ask the next question: which level of the stack does it eventually move to?