Steering without retraining - From Attention to Agents

Three chapters of alignment, and every step still touched the weights. Now we put the gradients down. The model is frozen; the lever is the prompt. We will trace the discrete branch of prompt optimization from APE in late 2022 to GEPA in 2025, watch the chain-of-thought trick from Ch 2 grow up into something a search can target, and arrive at the result that gives this chapter its punchline: a prompt optimizer that beats a reinforcement-learning baseline by 6% on average while spending up to 35× fewer rollouts to get there (Agrawal et al., ICLR 2026).

4.1 Two ways to change a mind

A language model has exactly two surfaces you can move. The weights - what Ch 2 and Ch 3 spent the whole alignment recipe optimizing. And the context - the tokens you put in front of it at inference time. Pick one and you have picked your lane.

The reason the second lane exists at all goes back to GPT-3. A frozen model could perform a brand-new task purely from a few examples placed in its prompt - in-context learning, no gradient steps (Brown et al., 2020). If behavior can be steered by what sits in the context window, then optimizing the context is a real alternative, not a workaround.

The second hint landed two years later. Get a large model to write out its reasoning before answering, either with worked examples in the prompt (Wei et al., 2022) or with the bare zero-shot cue "Let's think step by step" (Kojima et al., 2022), and its accuracy on multi-step problems jumps - not because the weights changed, but because the intermediate tokens give it room to compute. Chain-of-thought turned the prompt from a question into a small program the model runs on itself. It was the first loud proof that how you phrase a prompt is worth measurable accuracy - and once phrasing is worth points, it becomes something worth optimizing.

The scoreboard you'll see everywhere

Almost every paper in this chapter reports "+X% on BBH." BIG-Bench Hard is a suite of 23 tasks chosen because pre-2022 models failed to beat the average human on them. With plain prompting, models lagged; add chain-of-thought and PaLM exceeded the average human rater on 10 of the 23 tasks, code-davinci-002 on 17 of 23. Same weights, different prompt (Suzgun et al., 2022). BBH is the eval shadow over this whole chapter.

One name, two different methods

"Prompt tuning" refers to two methods that share nothing but a name - and the mix-up derails almost everyone arriving from classical ML. One uses gradients on continuous vectors and is really a member of the PEFT family from Ch 3. The other, this chapter's subject, searches over actual words and never touches a gradient.

	Soft / prefix tuning	Discrete prompt optimization
What you optimize	Continuous embedding vectors prepended to the input	The natural-language prompt itself (words)
How	Gradient descent - a PEFT method	Search; an LLM proposes edits - no gradients
Trains parameters?	Yes (a few learned vectors; the base model stays frozen)	No - the model is fully frozen and nothing is trained
Readable result?	No (opaque vectors)	Yes - you can read and edit the prompt
Examples	Prefix-Tuning (Li & Liang, 2021); Prompt Tuning (Lester et al., 2021)	APE → OPRO → DSPy → GEPA

Soft tuning is a legitimate cousin and lives in the weights lane we just left. From here on, "prompt optimization" means the right-hand column - discrete prompt optimization, words only, weights frozen.

4.2 APE: the prompt becomes a search target

The lane opens in late 2022 with the Automatic Prompt Engineer (Zhou et al., 2022, ICLR 2023). The setup is almost embarrassingly simple. Show an LLM a few input/output examples and ask it to infer the instruction that would produce them. Score each candidate by how well it performs on a held-out batch. Keep the best. The paper's title was the thesis - "Large Language Models Are Human-Level Prompt Engineers" - and the receipts backed it up: APE-found instructions matched or beat human-engineered prompts on 19 of 24 NLP tasks.

What APE quietly proposed was a frame the rest of this chapter inherits. The instruction is the program. The benchmark score is the fitness. The LLM is the operator that proposes new programs. Once you accept that frame, every classical optimization technique comes back on the table - in text.
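The propose → score → select frame can be sketched in a few lines. This is a minimal illustration, not APE's actual implementation: `propose` and `execute` stand in for the two LLM calls, and the toy versions below (`toy_propose`, `toy_execute`) are hypothetical stubs so the loop runs without a model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (input, expected output)


def ape_select(
    propose: Callable[[List[Pair], int], List[str]],
    execute: Callable[[str, str], str],
    demos: List[Pair],
    held_out: List[Pair],
    n_candidates: int = 4,
) -> Tuple[str, float]:
    """Propose instructions from demos, score each on held-out pairs, keep the best."""
    candidates = propose(demos, n_candidates)
    scored = []
    for instruction in candidates:
        hits = sum(execute(instruction, x) == y for x, y in held_out)
        scored.append((instruction, hits / len(held_out)))
    # Fitness is plain accuracy; ties go to the earliest candidate.
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stand-in for "LLM, infer the instruction behind these examples".
def toy_propose(demos: List[Pair], n: int) -> List[str]:
    pool = [
        "Repeat the input.",
        "Write the input in upper case.",
        "Reverse the input.",
        "Write the input in lower case.",
    ]
    return pool[:n]


# Hypothetical stand-in for "LLM, follow this instruction on this input".
def toy_execute(instruction: str, x: str) -> str:
    if "upper" in instruction:
        return x.upper()
    if "lower" in instruction:
        return x.lower()
    if "Reverse" in instruction:
        return x[::-1]
    return x


demos = [("cat", "CAT"), ("dog", "DOG")]
held_out = [("bird", "BIRD"), ("fish", "FISH")]
best, score = ape_select(toy_propose, toy_execute, demos, held_out)
print(best, score)
```

Swapping the stubs for real model calls changes nothing in `ape_select`; the search logic never needs to know what the operator is.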

4.3 OPRO, EvoPrompt, PromptBreeder: the loop tightens

A year later, OPRO - Optimization by PROmpting - made the search trajectory explicit. Feed the optimizer-LLM a running history of (prompt, score) pairs and ask it to propose a better one. The score functions as the gradient; the LLM functions as the optimizer. This is also where the now-famous line "Take a deep breath and work on this problem step by step" was surfaced - an instruction no human wrote, and one that bought OPRO up to +8% over human prompts on GSM8K and up to +50% on BIG-Bench Hard tasks (Yang et al., 2023, ICLR 2024).
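A rough sketch of that loop, assuming a meta-prompt that lists past (prompt, score) pairs from worst to best. The wording of the meta-prompt, the `top_k` cutoff, and the `toy_optimizer` / `toy_scorer` stubs are all assumptions made for illustration, not the text or settings used in the paper.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def build_meta_prompt(history: History, top_k: int = 3) -> str:
    """Render the best top_k (prompt, score) pairs, worst first, plus a request."""
    best = sorted(history, key=lambda h: h[1])[-top_k:]
    parts = ["Below are instructions with their scores, worst first."]
    for text, score in best:
        parts.append(f"text: {text}\nscore: {score:.2f}")
    parts.append("Write a new instruction that differs from the above and scores higher.")
    return "\n\n".join(parts)


def opro_loop(
    optimizer: Callable[[str], str],
    scorer: Callable[[str], float],
    seed: str,
    steps: int,
) -> Tuple[str, float]:
    """Grow the history one proposal at a time; return the best pair seen."""
    history: History = [(seed, scorer(seed))]
    for _ in range(steps):
        candidate = optimizer(build_meta_prompt(history))
        history.append((candidate, scorer(candidate)))
    return max(history, key=lambda h: h[1])


# Hypothetical scorer: a real one would run the prompt on a task batch.
def toy_scorer(prompt: str) -> float:
    keywords = ("step", "check")
    return sum(k in prompt.lower() for k in keywords) / len(keywords)


# Hypothetical optimizer-LLM: replays canned proposals and ignores the meta-prompt.
proposals = iter([
    "Think step by step.",
    "Think step by step and check your answer.",
])


def toy_optimizer(meta_prompt: str) -> str:
    return next(proposals)


best_prompt, best_score = opro_loop(toy_optimizer, toy_scorer, "Solve the problem.", steps=2)
print(best_prompt, best_score)
```

The only state the optimizer sees is the rendered history, which is why the trajectory can play the role a gradient would play in a numeric optimizer.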

EvoPrompt wrapped the whole machine in a genetic algorithm - a population of prompts, with the LLM acting as both the mutation and the crossover operator. Evaluated across 31 datasets, it beat human-engineered prompts by up to +25% on BBH (Guo et al., 2023, ICLR 2024). A week or two later DeepMind's PromptBreeder went one level meta and evolved not just the task-prompts but the mutation-prompts that mutate them - self-referential evolution that beat vanilla chain-of-thought on reasoning tasks (Fernando et al., 2023).
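To make the "LLM as mutation and crossover operator" idea concrete, here is a generic steady-state genetic loop over prompt strings. It is a sketch under stated assumptions, not EvoPrompt's exact procedure: `toy_crossover`, `toy_mutate`, and `toy_fitness` are hypothetical string-level stand-ins for what would be LLM calls and a benchmark score.

```python
import random
from typing import Callable, List


def evolve_prompts(
    population: List[str],
    fitness: Callable[[str], float],
    crossover: Callable[[str, str], str],
    mutate: Callable[[str], str],
    generations: int = 5,
    seed: int = 0,
) -> str:
    """Each generation: pick two fit parents, breed one child, replace the worst."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]  # top half breeds
        a, b = rng.sample(parents, 2)
        child = mutate(crossover(a, b))
        # The child only enters the population if it beats the current worst.
        if fitness(child) > fitness(ranked[-1]):
            ranked[-1] = child
        pop = ranked
    return max(pop, key=fitness)


# Hypothetical crossover: first half of one prompt, second half of the other.
def toy_crossover(a: str, b: str) -> str:
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:])


# Hypothetical mutation: add a reasoning cue if it is missing.
def toy_mutate(prompt: str) -> str:
    return prompt if "step by step" in prompt else prompt + " Think step by step."


# Hypothetical fitness: count desirable phrases instead of running a benchmark.
def toy_fitness(prompt: str) -> float:
    keywords = ("step by step", "answer", "carefully")
    return float(sum(k in prompt.lower() for k in keywords))


population = [
    "Solve the problem carefully.",
    "Give the final answer.",
    "Read the question.",
    "Respond briefly.",
]
best = evolve_prompts(population, toy_fitness, toy_crossover, toy_mutate)
print(best, toy_fitness(best))
```

Because the worst member is only ever replaced by something fitter, the best score in the population cannot go down from one generation to the next.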

CoT seed, second growth

Chain-of-thought arrived in Ch 2 as a prompt trick - a hand-written line that made the model show its work. Here it becomes a target: APE, OPRO, and EvoPrompt all search over CoT-style instructions, and PromptBreeder explicitly evolves them. The same idea - make the model think before it answers - gets mechanized. In Ch 5 it grows once more, and this time it ends up in the weights.

4.4 DSPy: program, don't prompt

Up to this point we've been searching for a single magic string. DSPy changed the unit of work (Khattab et al., 2023). Its slogan is "program, don't prompt" - you write a small program of declarative modules (generate_query → retrieve → answer) and a compiler figures out the actual prompts. The reported gain was substantial: compact DSPy programs beat standard few-shot prompting by >25% on GPT-3.5.

The workhorse optimizer behind it is MIPROv2 (Opsahl-Ong et al., EMNLP 2024; the paper introduces "MIPRO," and "v2" is the implementation name shipped in DSPy). The crucial design choice is that MIPROv2 jointly optimizes the instructions and the few-shot demonstrations of every module - which matters more than it sounds, because in practice the demonstrations usually carry more of the signal than the instruction wording. The paper reports gains of up to +13% over baseline optimizers on diverse tasks with Llama-3-8B.

What MIPROv2 is optimizing, in symbols

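With $\mathcal{P}$ the prompt program (instruction string $I$ plus demo set $D$, chosen from a pool of candidate demonstrations $\mathcal{D}$), $\mathcal{M}$ the frozen LM, $\mathcal{X}$ a task batch, $y^\star(x)$ the reference answer for input $x$, and $\mu$ a metric: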

$\mathcal{P}^\star \;=\; \arg\max_{I,\,D \subset \mathcal{D}}\; \mathbb{E}_{x \sim \mathcal{X}}\Big[\,\mu\!\big(\mathcal{M}(I, D, x),\, y^\star(x)\big)\Big]$

No rollout ever updates $\mathcal{M}$. The only variables that move are the words in $I$ and the choice of $D$.
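The objective can be read directly as a search over pairs $(I, D)$. The sketch below evaluates every pair exhaustively, which is a simplification chosen for clarity and not MIPROv2's actual search strategy; `toy_model` and `exact_match` are hypothetical stand-ins for $\mathcal{M}$ and $\mu$.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]


def best_program(
    instructions: Sequence[str],
    demo_pool: Sequence[Demo],
    k: int,
    model: Callable[[str, Tuple[Demo, ...], str], str],
    metric: Callable[[str, str], float],
    batch: List[Tuple[str, str]],
) -> Tuple[Tuple[str, Tuple[Demo, ...]], float]:
    """argmax over (I, D) of the mean metric on the batch; the model never changes."""
    best, best_score = None, float("-inf")
    for I in instructions:
        for D in combinations(demo_pool, k):  # every size-k demo subset
            score = mean(metric(model(I, D, x), y) for x, y in batch)
            if score > best_score:
                best, best_score = (I, D), score
    return best, best_score


# Hypothetical frozen model: the instruction controls casing,
# while the demos control whether the output ends with "!".
def toy_model(I: str, D: Tuple[Demo, ...], x: str) -> str:
    out = x.upper() if "upper" in I.lower() else x
    if any(demo_out.endswith("!") for _, demo_out in D):
        out += "!"
    return out


def exact_match(pred: str, gold: str) -> float:
    return float(pred == gold)


instructions = ["Copy the input.", "Write the input in upper case."]
demo_pool = [("a", "A"), ("b", "B!"), ("c", "c")]
batch = [("hi", "HI!"), ("yo", "YO!")]

best, best_score = best_program(instructions, demo_pool, 1, toy_model, exact_match, batch)
print(best, best_score)
```

In this toy setup neither the right instruction nor the right demo is enough on its own, which is the reason for searching over both jointly.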

4.5 DRPO and SPO: lose the human, lose the labels

DRPO - Dynamic Rewarding with Prompt Optimization - is the moment the prompt lane stops being a productivity aid and starts looking like an alignment method. It performs tuning-free self-alignment: the model scores its own responses with a dynamically chosen reward and rewrites its alignment instructions through search, with no SFT and no RLHF anywhere in the pipeline. Base models steered by DRPO outperformed their own SFT- and RLHF-tuned counterparts, and the auto-found prompts beat ones written by human experts (Singla et al., EMNLP 2024). Read that again: aligning by prompt, beating aligning by training. The lanes are now racing.

SPO - Self-Supervised Prompt Optimization - cut the next dependency. Earlier methods needed ground-truth labels or human judgments to score candidates. SPO instead lets the model compare its own outputs against each other, with no external supervision, and still landed top-2 across benchmarks at 1.1% to 5.6% of the cost of existing methods (Xiang et al., 2025). Look back at what each step deleted. APE needed a labeled task. OPRO needed only the trajectory. DRPO needed no alignment training data - no SFT pairs, no preference set. SPO needs no ground truth at all.

The same machine, written in text

The prompt lane has quietly rebuilt reinforcement learning out of language. The reward is a metric or an LLM judge. The policy gradient is an LLM proposing an edit. The rollouts are candidate prompts. If it's the same machine, can it win?

4.6 GEPA: reflection beats a scalar reward

The answer, as of ICLR 2026, is yes. GEPA - Genetic-Pareto - is a prompt optimizer with two deliberate departures from blind evolution (Agrawal et al., ICLR 2026 oral).

First, it reflects. After each system-level rollout, GEPA asks the LLM to read its own attempt and write a natural-language diagnosis of why the output fell short - then proposes a targeted edit aimed at exactly that failure. A scalar reward says only "that was a 0.3." A sentence of feedback - "you rambled before stating the answer; lead with the conclusion" - carries vastly more bits and points straight at the fix.
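One reflective step might look like the following. This is a minimal sketch, assuming four pluggable callables (`run`, `score`, `reflect`, `rewrite`) that would each be an LLM call or a metric in practice; the toy versions are hypothetical and hard-coded so the example runs on its own.

```python
from typing import Callable, Tuple


def reflective_step(
    prompt: str,
    task_input: str,
    run: Callable[[str, str], str],
    score: Callable[[str], float],
    reflect: Callable[[str, str, str], str],
    rewrite: Callable[[str, str], str],
) -> Tuple[str, float, str]:
    """Run, diagnose in words, edit the prompt, and keep the edit only if it helps."""
    output = run(prompt, task_input)
    before = score(output)
    feedback = reflect(prompt, task_input, output)  # natural-language diagnosis
    candidate = rewrite(prompt, feedback)           # targeted edit from the diagnosis
    after = score(run(candidate, task_input))
    if after > before:
        return candidate, after, feedback
    return prompt, before, feedback


# Hypothetical frozen model: rambles unless told to lead with the conclusion.
def toy_run(prompt: str, task_input: str) -> str:
    if "lead with the conclusion" in prompt.lower():
        return "Answer: 42."
    return "Well, there are many ways to look at this... Answer: 42."


# Hypothetical metric: full marks only when the answer comes first.
def toy_score(output: str) -> float:
    return 1.0 if output.startswith("Answer") else 0.3


# Hypothetical reflector: a real one would read the rollout and write this itself.
def toy_reflect(prompt: str, task_input: str, output: str) -> str:
    return "you rambled before stating the answer; lead with the conclusion"


# Hypothetical editor: append the fix named after the semicolon.
def toy_rewrite(prompt: str, feedback: str) -> str:
    fix = feedback.split(";")[-1].strip().capitalize()
    return f"{prompt} {fix}."


new_prompt, new_score, feedback = reflective_step(
    "Answer the question.", "What is 6 * 7?", toy_run, toy_score, toy_reflect, toy_rewrite
)
print(new_prompt, new_score)
```

The scalar 0.3 never reaches `rewrite`; the edit is driven entirely by the sentence of feedback, and the score is used only to accept or reject the result.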

Second, it keeps a Pareto frontier of candidates over the task distribution - the best prompt for math may be a different prompt than the best one for code - so the search never collapses to a single brittle winner and can combine complementary lessons.
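A Pareto frontier over per-task scores can be computed with a plain dominance check. The sketch uses the textbook definition of dominance (at least as good everywhere, strictly better somewhere) rather than GEPA's own selection rule, and the candidate names and scores are made up for illustration.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every task and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Keep every candidate that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    ]


# Illustrative scores: (math accuracy, code accuracy) per candidate prompt.
scores = {
    "math_prompt": (0.9, 0.4),
    "code_prompt": (0.5, 0.8),
    "generic": (0.5, 0.4),
    "balanced": (0.7, 0.7),
}
front = pareto_front(scores)
print(front)
```

Here `generic` is dropped because `math_prompt` matches or beats it on both tasks, while the two specialists and the balanced prompt all survive as parents for later edits.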

The numbers, honestly

Across 6 tasks, GEPA beats GRPO (the RL recipe from Ch 3) by +6% on average and by up to +20% on its best task, using up to 35× fewer rollouts. Against MIPROv2 the average gap is >+10%, including +12% on AIME-2025. The headline tail (+20, 35×) is the one everyone quotes; the +6% average is the one a reviewer would write down.

4.7 The eval shadow: who judges the judge?

We've buried something quietly across the whole chapter. Every method here needs a way to score a candidate. For APE on a labeled task that's accuracy. For DRPO it's win-rate against a baseline. For SPO and increasingly for everything else it's an LLM acting as judge. And the moment your fitness function is itself a model, you have a Goodhart problem - the search will happily find prompts the judge loves and humans don't.

One robust fix is to refuse to score on an absolute scale at all. Ask only the easier pairwise question - "which of these two is better?" - and let Elo turn those choices into a ranking. Pairwise is what reward models from Ch 2 were trained on, and it travels: a calibrated LLM judge plus a thin layer of human pairwise feedback is a surprisingly hard signal to game.
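Turning pairwise choices into a ranking takes only the standard Elo update. The sketch below assumes a starting rating of 1000 and a K-factor of 32, both conventional defaults chosen for illustration, and the three candidates and their match results are invented.

```python
from typing import Dict, List, Tuple


def expected(r_a: float, r_b: float) -> float:
    """Elo's predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Move rating points from loser to winner, more when the result was a surprise."""
    delta = k * (1.0 - expected(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Each tuple is one pairwise judgment: (preferred, rejected).
comparisons: List[Tuple[str, str]] = [("A", "B"), ("A", "C"), ("B", "C")]

ratings = {name: 1000.0 for name in ("A", "B", "C")}
for winner, loser in comparisons:
    update(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking, ratings)
```

No judgment ever states how good an output is in absolute terms; the scale emerges from the choices, which is what makes it harder for a search to inflate.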

The capstone in Ch 7 makes this concrete - the voice-rewriter we'll build there uses pairwise clicks → Elo as its fitness signal precisely to keep the judge honest. Park that thought.

4.8 Where this leaves us

By 2025 the picture is clean enough to summarize in a small table. The methods of this chapter are not variations on one trick - they each delete a different dependency the previous one assumed you'd keep.

Method	Year	Mechanism	What's new
APE	2022	propose → score → select	prompt as search target
OPRO	2023	LLM optimizes given history of (prompt, score)	trajectory as gradient
EvoPrompt / PromptBreeder	2023	evolutionary algorithm; PB also evolves mutators	population + meta-evolution
DSPy / MIPROv2	2023 / 2024	compiler over a multi-module program; joint search over instructions and demos	multi-step programs, demos matter
DRPO	2024	tuning-free self-alignment via search	no SFT, no RLHF
SPO	2025	self-supervised pairwise comparisons	no ground truth
GEPA	2025 / 26	reflective edits + Pareto frontier of candidates	natural-language feedback beats scalar reward

Three honest caveats keep this from being a romance. A long evolved prompt rides along as inference-time context cost on every request. The whole game leans on an already-capable base model - you cannot prompt-optimize knowledge that isn't latent in the weights. And a search that optimizes against a judge can learn to exploit that judge instead of improving the output - the Goodhart villain from earlier chapters, still lurking.

The whole chapter in one sentence

For three years we aligned models by training them; somewhere between APE and GEPA we discovered we could often get there by searching over the words instead - faster, cheaper, and with the optimization left in a form a human can still read.

Handoff to Chapter 5

We have done as much as freezing the model lets us do. The next move is the one that unfreezes it again, but on purpose. Chain-of-thought entered as a prompt trick (Ch 2), became something prompt search could optimize (this chapter), and is about to be trained directly into the weights via RL on verifiable rewards - the o1 and DeepSeek-R1 line of work. Same idea, third life. The model is about to learn to think.