Chapter 7 - Computational cognition · From Attention to Agents

Six levers. Architecture, scale, alignment, context, test-time compute, agency - each one a separate bet, each one a separate paper trail. The system you woke up to in 2026 is what you get when all six are turned on at once. This chapter pins them on one map, names the villain that haunts every lever (Goodhart), and uses the map to design the small tool that has been waiting since Chapter 3: rewrite arbitrary text in your voice, without training anything.

7.1 The six levers on one page

You can describe a modern model two ways. One: a 2017 architecture (Vaswani et al.) trained at scale on the internet, post-trained for helpfulness, prompted carefully, given room to think, then handed tools. Two: a stack of optimizations applied to a frozen substrate - some baked into weights, some pinned into context. The through-line of the whole course is that the optimization moved from the weights toward the context as we figured out how, and the modern system uses both at once.

Chapter	Lever	What it optimized	Primary anchor
Ch0 · 2017–19	Architecture	replace recurrence with self-attention	Vaswani 2017
Ch1 · 2020–22	Pretraining / scale	a power law you can buy capability from	Brown 2020 · Kaplan 2020 · Hoffmann 2022
Ch2 · 2022	Alignment (RLHF)	a completer turned into an assistant	Ouyang 2022 · Wei 2022
Ch3 · 2023–24	Alignment, cheaper	drop the reward model, drop the critic	Rafailov 2023 · Shao 2024
Ch4 · 2022–25	Context / prompt	optimize the input, not the parameters	GEPA 2026 · DRPO 2024
Ch5 · 2024–25	Test-time compute	spend tokens at inference to think longer	o1-preview · DeepSeek-R1
Ch6 · 2024–26	Agency	thought → action → observation	ReAct (Yao 2022) · computer use (beta)

Toggle the levers and watch a (deliberately illustrative, not benchmark-accurate) capability bar assemble. The order is the order the field discovered them; it is not a strict dependency graph. Architecture without scale gets you BERT-era models that can read but not converse. Scale without alignment gets you GPT-3, which can recite anything and follow nothing. Add the rest and you get the assistant you used this morning.

[Interactive: the six levers. Toggle each lever to see its contribution.]

7.2 Three threads close here

The CoT seed grows three times

A single idea - make the model show its work - appeared at three levels of the stack. First in Chapter 2 as a prompt trick: show a few worked examples, or simply write "let's think step by step", and accuracy on multi-step problems jumps without touching a weight (Wei et al., 2022; the zero-shot phrase itself is from Kojima et al., 2022). Then in Chapter 4 as a target: MIPROv2 jointly optimizes instructions and few-shot demos, and GEPA rewrites instructions by reflecting on the model's own traces, so the search itself rewrites the chain of thought it asks the model to produce (Agrawal et al., 2026; MIPROv2 docs). Then in Chapter 5 as a weight update: DeepSeek-R1-Zero was trained with pure RL (GRPO + RLVR) on verifiable math, and a coherent reasoning trace emerged as the policy. On AIME 2024, R1-Zero went from 15.6% to 71.0% pass@1 during training, with 86.7% at majority-vote-of-64, "matching the performance of OpenAI-o1-0912" (DeepSeek-AI 2025).

Same idea, three levels: prompt → optimization target → weights. That is the whole point of the through-line. Once a useful behavior shows up at one level, the field finds a way to harvest it at every level.
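The two AIME numbers quoted above are different metrics, and the gap between them is easier to see in code. The sketch below is a minimal illustration, not DeepSeek's evaluation harness: the sampled answers are invented, and exact string match stands in for whatever answer checker a real benchmark uses.

```python
from collections import Counter


def pass_at_1(samples: list[str], gold: str) -> float:
    """Fraction of sampled answers that are correct.

    This estimates the accuracy of a single draw from the model.
    """
    return sum(s == gold for s in samples) / len(samples)


def majority_vote(samples: list[str]) -> str:
    """Most frequent final answer among the samples."""
    return Counter(samples).most_common(1)[0][0]


# Hypothetical final answers from 8 sampled reasoning traces for one problem.
samples = ["204", "204", "113", "204", "96", "204", "113", "204"]
gold = "204"

p1 = pass_at_1(samples, gold)    # 5 of 8 individual samples are right
voted = majority_vote(samples)   # the vote collapses the 8 samples to one answer

print(f"pass@1 estimate: {p1:.3f}")
print(f"majority-vote answer: {voted} (correct: {voted == gold})")
```

On this toy problem a single sample is right only part of the time, while the vote is right outright. That is the mechanism behind a majority-vote score sitting above pass@1.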

The democratization curve

Every breakthrough got cheaper and more open. The alignment recipe shrank from RLHF's three-stage pipeline to DPO's "your LM is secretly a reward model" (Rafailov 2023) and then to GRPO's group-relative advantage with no critic at all (Shao 2024). Adaptation shrank from full fine-tuning to LoRA and then to QLoRA's 4-bit NF4 trick, which fits a 65B model on a single 48GB GPU (Dettmers 2023). The model layer shrank from closed APIs to LLaMA to DeepSeek-R1, an open-weights reasoning model released MIT-licensed in January 2025.
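To make "group-relative advantage with no critic" concrete, here is a minimal sketch of the normalization step only. It assumes one scalar reward per sampled completion, and the use of the population standard deviation and a small `eps` are choices made for this illustration. It is not a full GRPO training loop.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each completion against its own group instead of a learned critic.

    All rewards come from completions sampled for the SAME prompt. The
    baseline is the group mean, and the scale is the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifiable rewards for 4 sampled answers to one math prompt
# (1.0 = final answer checked correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Correct answers get a positive advantage and incorrect ones a negative advantage, with no value network in sight. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.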

The pattern is not just "things get cheaper." The pattern is that each generation's frontier trick becomes the next generation's default, and the new frontier moves up.

The eval shadow

How we measure is its own history. BBH set the scoreboard for prompted reasoning (Suzgun 2022). MMLU framed broad knowledge (Hendrycks 2020). SWE-bench moved the goalposts from quiz answers to real software work, scoring on 2,294 GitHub issues across 12 Python repos (Jimenez 2024). And once the tasks got open-ended, the judge itself became a model: GPT-4 as a pairwise rater achieves ≥80% agreement with human raters on MT-Bench (Zheng 2023) - close enough to human-human agreement that the field shipped it.
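The agreement figure is a simple quantity: the share of pairwise comparisons where judge and human pick the same side. A minimal sketch follows, with invented votes. The MT-Bench paper's own protocol has details, such as how ties are handled, that this ignores.

```python
def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Share of pairwise comparisons where the judge and the human agree."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty vote lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


# Hypothetical votes on 10 comparisons: which response, "A" or "B", is better.
judge_votes = ["A", "B", "A", "A", "B", "A", "B", "B", "A", "A"]
human_votes = ["A", "B", "A", "B", "B", "A", "B", "A", "A", "A"]

rate = agreement_rate(judge_votes, human_votes)
print(f"judge-human agreement: {rate:.0%}")
```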

The villain

Goodhart's law, in Marilyn Strathern's popular phrasing (1997): "When a measure becomes a target, it ceases to be a good measure." It is the antagonist for every lever. In alignment it is reward hacking against the reward model. In prompt optimization it is the prompt evolving features that please an LLM judge but drift away from what humans actually want. In agency it is a tool-using agent finding the shortest path to the verifier, not to the task. The story of the next decade is partly the story of how the field stops being fooled by its own scoreboards.

7.3 Multimodality, the thread that opened in Ch1

When you read about ViT and CLIP in Chapter 1, the point was that vision joined the same architectural family. That thread closes here. By 2024, frontier models were natively multimodal across text, image, audio, and (with longer contexts) video. GPT-4o shipped May 13, 2024 with text/image/audio in and out. Gemini 1.5 Pro was announced February 15, 2024 as "natively multimodal" with a 1M-token context preview spanning image, audio and video alongside text (tech report). "Natively multimodal" is now the table stakes of a frontier release, not a separate research line.

Scope

Specific 2026 model versions, benchmark deltas, and leaderboard positions move fast. The verified statements above are anchored to release dates and primary reports. Newer frontier claims belong in a release-notes addendum, not the chapter spine.

7.4 The capstone: a voice tool, finally placed on the map

Back in Chapter 3, when we walked through QLoRA, there was a personal note. The reader had been carrying around a 2019-era idea: fine-tune a small model on their own writing so it could rewrite arbitrary text in their voice. Chapter 3 placed QLoRA as a real and valid democratization node - that idea was not wrong, exactly. It was simply the right idea for the wrong lever. The map we just drew tells us which lever it should be.

Style is steering, not knowledge. Voice is a direction in a frontier model's existing capability surface, not a new fact set you need to bake into weights. That is precisely the regime where discrete prompt optimization shines, and it is what makes the prompt lane (Ch4) the correct lever here.

Three things decide it, and all three favor a frozen frontier model:

Quality ceiling. A frontier model already writes better than any small model you could realistically fine-tune. You only need to steer it.
Iteration speed. "Sounds like me" is subjective and needs a human in the loop. Prompt search iterates in seconds; fine-tuning iterates in hours.
The artifact. The output is a prompt you can read, edit, and carry to any model, not a LoRA welded to one base.

The recommended build, in primitives the literature already verified

Nothing in the build below is new science. It is a composition of pieces every previous chapter introduced: GEPA's reflective mutation and Pareto frontier, DRPO's tuning-free prompt search, MT-Bench-style LLM-judge calibration, and Chatbot Arena's pairwise-to-Elo aggregation. The novelty is in the recipe and in naming the failure mode.

The build · Option 1

1. Generator: a frontier API model (the best voice, no fine-tune, no distill).
2. Engine: a GEPA/DRPO-style reflective prompt search. Reuse an existing optimizer (DSPy's GEPA implementation); do not rebuild genetic operators.
3. Search space: a distilled style rubric jointly optimized with a selected set of the user's real writing as exemplars - the MIPROv2 lesson, with register-matched exemplars per request (README voice is not texting voice).
4. Fitness signal: never a 1–10 score. Pairwise clicks → Elo, with an LLM judge calibrated against the human clicks. Active sampling spends the user's clicks where the judge is least sure; held-out prompts catch the search gaming the judge.
5. Selection: Pareto over (voice fidelity, task fidelity, length, safety) so a single objective cannot eat the others.
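The selection rule in the last item can be sketched as a plain non-dominated filter. This is an illustration under assumptions, not GEPA's or DSPy's implementation: the `Candidate` type, the scores, and the convention that every objective is scaled so higher is better (length as closeness to a target length) are all invented here.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    voice: float    # voice fidelity, higher is better
    task: float     # task fidelity, higher is better
    length: float   # closeness to the target length, higher is better (assumed scaling)
    safety: float   # higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    a_scores, b_scores = a[1:], b[1:]
    no_worse = all(x >= y for x, y in zip(a_scores, b_scores))
    better_somewhere = any(x > y for x, y in zip(a_scores, b_scores))
    return no_worse and better_somewhere


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical scores for four candidate prompts.
candidates = [
    Candidate("p1", 0.90, 0.60, 0.80, 0.95),
    Candidate("p2", 0.70, 0.90, 0.85, 0.95),
    Candidate("p3", 0.65, 0.85, 0.80, 0.90),  # worse than p2 on every objective
    Candidate("p4", 0.95, 0.30, 0.50, 0.99),
]

front = [c.name for c in pareto_front(candidates)]
print(front)
```

Note that p4 survives despite poor task fidelity, because nothing beats it on voice. A Pareto front keeps trade-offs visible, and a later step (or the user) still has to choose among them.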

That fitness signal is the only genuinely new component, so it is worth feeling. Click which rewrite sounds more like a single coherent voice; an Elo ranking emerges from nothing but binary choices. This is the loop running inside the fitness signal, the fourth item of the build.
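A minimal sketch of how binary clicks become a ranking, using the standard Elo update. The prompt names and click log are invented, and the starting rating of 1000, K-factor of 32, and 400-point scale are conventional defaults assumed here rather than anything the build prescribes.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo's predicted probability that the item rated r_a beats the item rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings by how surprising the click was."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise


# Hypothetical candidate prompts, all starting level.
ratings = {"prompt_a": 1000.0, "prompt_b": 1000.0, "prompt_c": 1000.0}

# Each click is (preferred rewrite's prompt, rejected rewrite's prompt).
clicks = [
    ("prompt_a", "prompt_b"),
    ("prompt_a", "prompt_c"),
    ("prompt_c", "prompt_b"),
    ("prompt_a", "prompt_b"),
]

for winner, loser in clicks:
    update(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Because each update adds to one rating exactly what it subtracts from the other, the total stays fixed, and an expected win moves the ratings less than an upset. Sequential Elo depends on click order. A batch fit such as Bradley-Terry removes that dependence if it matters.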

[Interactive: pairwise → Elo. Pick the one that sounds like a person.]

The villain, named

The same Goodhart attractor from earlier returns here as judge-drift. If the LLM judge is the loop's fitness function, the search will eventually find prompts that please the judge in ways the user would not endorse. The literature already gives three mitigations and they all stack:

Pareto selection over multiple objectives, so the search cannot collapse to one. GEPA does this natively.
Periodic human-pairwise re-calibration of the judge, anchoring it to the kind of human pairwise votes Chatbot Arena aggregates.
Held-out audit prompts the judge never sees during search, so reward-hacking is detectable instead of invisible.
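One way the held-out audit might look in practice: on prompts kept out of the search, compare how often the judge prefers the optimized prompt's rewrite with how often the human does. This is a sketch under assumptions. The votes are invented and the 0.15 tolerance is a placeholder, not a recommended value.

```python
def win_rate(prefers_new: list[bool]) -> float:
    """Share of comparisons where the optimized prompt's rewrite was preferred."""
    return sum(prefers_new) / len(prefers_new)


def judge_drift(
    judge_prefers_new: list[bool],
    human_prefers_new: list[bool],
    tolerance: float = 0.15,  # hypothetical threshold; tune on your own data
) -> tuple[float, bool]:
    """Return the judge-minus-human win-rate gap and whether it exceeds tolerance."""
    gap = win_rate(judge_prefers_new) - win_rate(human_prefers_new)
    return gap, gap > tolerance


# Hypothetical votes on 8 held-out audit prompts (optimized prompt vs. baseline).
judge = [True, True, True, True, True, True, True, False]
human = [True, False, True, False, False, True, False, False]

gap, flagged = judge_drift(judge, human)
print(f"judge win rate: {win_rate(judge):.3f}")
print(f"human win rate: {win_rate(human):.3f}")
print(f"gap: {gap:+.3f}, flagged: {flagged}")
```

A large positive gap means the search has found something the judge likes and the user does not. That is the signal to pause the search and re-calibrate the judge against fresh human clicks.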

What this resolves

The reader's original instinct - train a small model on my writing - was not wrong. It was a real and valid Chapter 3 node, and the QLoRA paper made it practical for someone with one consumer GPU. But the convergence the course has been tracking changes the answer. Optimizing the context can now rival optimizing the weights at far lower cost, and it leaves an artifact you can read, port, and audit. The voice tool is the same idea as the reader's 2019-era plan, executed on the lever the field has spent four years sharpening into the better one.

One sentence

For three years we learned to align models by training them. Then we learned we could often get there by writing a better prompt. So to rewrite text in your own voice, the right move is not to train anything at all. It is to search.