Chapter 6 · 2024 – 2026 · lever: agency

Learning to act

A model that can think still can’t do anything in the world. This is the move from chatbot to agent - tools, loops, environments, errors.

ReAct · Tool use · MCP · SWE-bench · Computer use

A reasoning model can run a chain of thought all day and still not move a pixel. To affect the world it needs two missing pieces - a way to call out (tools) and a way to listen back (observations). Wire those into a loop and the chatbot becomes an agent. This chapter is that wiring: the ReAct loop as the spine, three answers to "how do tools work" (research, API, open standard), the wild 2023 wave of long-horizon autonomy, an eval that grew up alongside it, and the moment the model got a mouse and keyboard.

1. The loop that defines an agent

The cleanest definition of an agent is also the smallest: a model that interleaves thinking, acting, and observing, in a loop, until the task is done. That pattern has a name and a paper. ReAct - "Synergizing Reasoning and Acting in Language Models" - proposed it in October 2022 and the design has barely shifted since. (Yao et al., ICLR 2023.)

The trick is that the "thought" step is just chain-of-thought reused inside a control loop - the same seed planted back in Chapter 2 and trained into the weights in Chapter 5, now placed where it does mechanical work: deciding the next action. The "observation" is whatever the world hands back when you act on it: a search result, a stack trace, a screenshot. The loop closes because each observation feeds the next thought.
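The cycle described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `scripted_model` is a hypothetical stand-in for a real LLM call, the `TOOLS` registry and the `Action: name[argument]` / `Finish:` line format are assumptions chosen for readability.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: "Seoul is the capital of South Korea." if "Korea" in q else "no results",
}

def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM call: returns a Thought plus an Action or a Finish line."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: search[capital of South Korea]"
    return "Thought: The observation answers the question.\nFinish: Seoul"

def react(question: str, model: Callable[[str], str], max_steps: int = 5) -> str:
    """Interleave thinking, acting and observing until the model finishes or the budget runs out."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)                      # thought + proposed action
        transcript += "\n" + step
        last = step.strip().splitlines()[-1]
        if last.startswith("Finish:"):
            return last.removeprefix("Finish:").strip()
        if last.startswith("Action:"):
            name, _, arg = last.removeprefix("Action:").strip().partition("[")
            tool = TOOLS.get(name.strip())
            observation = tool(arg.rstrip("]")) if tool else f"unknown tool {name!r}"
            transcript += f"\nObservation: {observation}"  # feeds the next thought
    return "gave up: step budget exhausted"
```

Calling `react("What is the capital of South Korea?", scripted_model)` walks one full think, act, observe, think cycle. The step budget matters: without it, a model that never emits a finish line would loop forever.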


The numbers in the paper are worth pausing on, because they justify why this pattern took over. On ALFWorld, a text-based household-task benchmark, ReAct beat imitation- and RL-trained baselines by +34 absolute success-rate points. On WebShop the margin was +10 points. Both numbers came from one or two in-context examples - no fine-tuning, no new weights. The agency was sitting in the prompt all along; the loop just let it out.

CoT, third appearance

The same "show your work" move shows up here a third time. As a prompt trick (Chapter 2), as a target for prompt optimization (Chapter 4), as something trained directly into the weights (Chapter 5) - and now as the thought step inside a control loop, where each line of reasoning ends in an action instead of an answer.

2. Three answers to "how does it call a tool?"

A loop is only as useful as the things the model can do inside it. The story of tool use is a story of three answers - one from research, one from product, one from standards - arriving in under two years, between February 2023 and November 2024.

The research move: teach the model when to call

Toolformer (Meta AI, February 2023) showed a base LM can self-supervisedly learn when an API call would help: insert a candidate call into its own training data, check whether the call reduces the downstream loss, and keep only the helpful ones. Calculator, search, QA, translation, calendar. The model learns to reach for an external tool the way it learned to predict the next word. (Schick et al., NeurIPS 2023.)
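The "keep only the helpful ones" step is a simple filter once the losses are in hand. The sketch below assumes the three losses per candidate have already been measured by the language model; the `Candidate` class, the threshold value, and every number are hypothetical. Taking the baseline as the better of "no call" and "call without its result" follows this sketch's reading of the paper and should be treated as an assumption.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    call: str                # e.g. "Calculator(400/1400)"
    loss_no_call: float      # loss on the following tokens with no API call
    loss_call_only: float    # loss with the call inserted but its result withheld
    loss_with_result: float  # loss with the call and its result inserted

def keep(c: Candidate, threshold: float) -> bool:
    """Keep a call only if seeing its result lowers the loss by at least `threshold`."""
    baseline = min(c.loss_no_call, c.loss_call_only)
    return baseline - c.loss_with_result >= threshold

# Hypothetical measurements, for illustration only.
candidates = [
    Candidate("Calculator(400/1400)", loss_no_call=2.0, loss_call_only=1.9, loss_with_result=0.7),
    Candidate("Calendar()", loss_no_call=1.5, loss_call_only=1.5, loss_with_result=1.4),
]
kept = [c.call for c in candidates if keep(c, threshold=1.0)]
```

Here only the calculator call survives: its result makes the next tokens much easier to predict, while the calendar call barely helps. The surviving examples become fine-tuning data, which is how the habit ends up in the weights.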

The product move: a structured-JSON API

On June 13, 2023 OpenAI shipped function calling on gpt-4-0613 and gpt-3.5-turbo-0613 - a structured JSON name + arguments matching a developer-supplied schema. You describe your function, the model decides when to invoke it and emits clean JSON, your runtime executes it and feeds the result back. Tool use stopped being a prompt-engineering pattern and became a model capability with a fixed wire format. (OpenAI, June 2023.)

Function calling, in one round-trip
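// developer registers a function schema ("parameters" is a JSON Schema object)
{
  "name": "get_weather",
  "parameters": {
    "type": "object",
    "properties": {
      "location": { "type": "string" },
      "unit": { "type": "string" }
    },
    "required": ["location"]
  }
}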

// model emits, given "what's the weather in Seoul?"
{
  "name": "get_weather",
  "arguments": { "location": "Seoul, KR", "unit": "c" }
}

// developer runs the function; result is fed back as the next message
{ "temp": 14, "sky": "clear" }
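The developer's half of that round-trip is mostly parsing and dispatch. A minimal sketch, assuming the model's output arrives as a JSON string with `name` and `arguments` keys as shown above; `get_weather`, `REGISTRY`, and `handle_tool_call` are hypothetical names, and the weather values are hard-coded rather than fetched.

```python
import json

def get_weather(location: str, unit: str = "c") -> dict:
    """Hypothetical local function; a real one would query a weather service."""
    return {"temp": 14, "sky": "clear", "unit": unit, "location": location}

# Maps the names the model may emit to the functions the runtime is willing to run.
REGISTRY = {"get_weather": get_weather}

def handle_tool_call(raw: str) -> str:
    """Parse the model's JSON, run the named function, return a JSON result string."""
    try:
        call = json.loads(raw)
        func = REGISTRY[call["name"]]
        result = func(**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Report the failure as data so the model can correct itself on the next turn.
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return json.dumps(result)

reply = handle_tool_call(
    '{"name": "get_weather", "arguments": {"location": "Seoul, KR", "unit": "c"}}'
)
```

The string in `reply` is what gets appended to the conversation as the next message. Returning errors as JSON instead of raising is a design choice: it keeps the loop alive and gives the model something to react to.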

The standards move: an open protocol for tools

Some seventeen months later, on November 25, 2024, Anthropic introduced the Model Context Protocol (MCP), an open standard for how LLMs connect to tools and data sources. Python and TypeScript SDKs at launch; reference servers for Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer. The point of a protocol - rather than another vendor API - is that every model can talk to every tool without bespoke glue, and the tool describes itself once. (Anthropic, Nov 2024.)
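"The tool describes itself once" is easiest to see at the message level. MCP is built on JSON-RPC 2.0, with a `tools/list` method for discovery and a `tools/call` method for invocation. The toy handler below is a sketch of those two messages only, not the official SDK: the `TOOLS` table and `handle` function are hypothetical, and a real server also performs an initialization handshake and runs over a transport, both omitted here.

```python
# Hypothetical tool table for a toy server; inputSchema is a JSON Schema object.
TOOLS = {
    "get_weather": {
        "description": "Current weather for a location",
        "inputSchema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
        "run": lambda args: f"14 C and clear in {args['location']}",
    }
}

def handle(request: dict) -> dict:
    """Answer one JSON-RPC 2.0 request the way an MCP-style tool server might."""
    req_id, method = request.get("id"), request.get("method")
    params = request.get("params", {})

    def error(code: int, message: str) -> dict:
        return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}

    if method == "tools/list":
        # Discovery: the server publishes each tool's name, description and schema.
        result = {"tools": [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]}
    elif method == "tools/call":
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return error(-32602, f"Unknown tool: {params.get('name')}")
        text = tool["run"](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        return error(-32601, f"Method not found: {method}")
    return {"jsonrpc": "2.0", "id": req_id, "result": result}
```

Any client that speaks these two methods can discover `get_weather` and call it without knowing anything about the server in advance, which is the practical difference from a per-vendor function-calling schema.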

                           | Toolformer                   | Function calling       | MCP
What it is                 | A training method            | A model API            | A wire protocol
Who builds the integration | Model lab (at training time) | App developer, per app | Tool author, once
Touches weights?           | Yes                          | No                     | No
Ecosystem shape            | Fixed at training time       | Vendor-specific        | Tools as servers, any client

Notice the through-line. Research asked "can a model learn to call APIs?"; product asked "can we make that a routine capability?"; standards asked "how do tools describe themselves to any model?" Each move was needed; none of them, on its own, makes an agent.

3. The wild 2023 wave

Even before function calling shipped, builders were already asking the obvious next question - what if the model decides its own goals, splits them into sub-tasks, and runs the loop unattended? In spring 2023 AutoGPT wrapped GPT-4 in a self-prompting loop with web search and file I/O, decomposed a user goal, and ran without per-step approval. It became one of the fastest-starred repos in GitHub history. Most of the runs were unproductive - the loops chased themselves, ran up token bills, and got stuck - but the proof-of-concept was loud enough that "autonomous agent" became a category overnight.

The serious research counterpart arrived two months later. Voyager (NVIDIA / Caltech, May 2023) dropped an LLM agent into Minecraft with three pieces: an automatic curriculum that proposed the next task, an ever-growing skill library of executable code the agent wrote and reused, and an iterative prompting loop that fed environment errors and self-verification back into the next attempt. It was a code-writing ReAct agent for a sandbox world, and it learned faster and explored further than every prior LLM-in-Minecraft baseline. (Wang et al., 2023.)
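Two of those three pieces, the skill library and the error-feedback loop, fit in a short sketch. Everything here is a simplification under stated assumptions: retrieval uses plain word overlap as a swapped-in stand-in for similarity search, the environment's success flag stands in for self-verification, and `fake_writer` and `fake_env` are hypothetical substitutes for the LLM and the game.

```python
from typing import Callable, Optional

class SkillLibrary:
    """Stores working code under a task description; retrieves by word overlap."""
    def __init__(self) -> None:
        self.skills: dict[str, str] = {}

    def add(self, description: str, code: str) -> None:
        self.skills[description] = code

    def retrieve(self, task: str) -> Optional[str]:
        words = set(task.lower().split())
        best = max(self.skills, key=lambda d: len(words & set(d.lower().split())), default=None)
        if best is None or not words & set(best.lower().split()):
            return None
        return self.skills[best]

def attempt(task: str, write_code: Callable, run: Callable,
            library: SkillLibrary, max_rounds: int = 3) -> bool:
    """Write code, run it, feed errors back, and save the code once it works."""
    feedback = ""
    for _ in range(max_rounds):
        code = write_code(task, library.retrieve(task), feedback)
        ok, feedback = run(code)          # environment feedback drives the next round
        if ok:
            library.add(task, code)       # the skill is now reusable
            return True
    return False

def fake_writer(task: str, reference: Optional[str], feedback: str) -> str:
    # Hypothetical LLM: the first draft forgets the axe, the second fixes it.
    return "mine(wood)" if not feedback else "craft(axe); mine(wood)"

def fake_env(code: str) -> tuple:
    # Hypothetical environment: mining fails without an axe.
    return (True, "") if "craft(axe)" in code else (False, "error: no axe in inventory")
```

After `attempt("mine wood", fake_writer, fake_env, library)` succeeds on the second round, a later task such as "mine more wood" retrieves the saved code instead of starting from scratch.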

Why the Voyager pieces stuck

Skill library, environment feedback, self-verification - these aren’t Minecraft mechanics. They’re the structural ingredients every later coding agent ended up with. Voyager looks, in retrospect, like a small-scale rehearsal of what Claude Code and Devin would later need.

4. The eval shadow grows up

Every chapter has an eval that defined the scoreboard. BBH defined Chapter 2, MMLU defined Chapter 3 - Chapter 6’s is SWE-bench: 2,294 real GitHub issues across 12 popular Python repos. The task is unglamorous and exactly right for agents - read the issue, edit the codebase, make the project’s hidden tests pass. (Jimenez et al., ICLR 2024.) At publication the best system tested (Claude 2 with a retrieval pipeline) solved just 1.96% of issues. The benchmark was effectively unsolved.
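The grading rule behind "make the tests pass" can be sketched as follows. This assumes the convention that each task lists tests that must go from failing to passing and tests that must keep passing; the function names and sample data are hypothetical, and the real harness also handles repository checkout, patch application, and test execution.

```python
def is_resolved(results: dict[str, bool], fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """An issue counts as resolved only if every listed test passes after the patch."""
    return all(results.get(test, False) for test in fail_to_pass + pass_to_pass)

def resolved_rate(outcomes: list[bool]) -> float:
    """Percentage of issues resolved."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical run over four issues: only the first patch fixes the bug without breaking anything.
outcomes = [
    is_resolved({"test_fix": True, "test_old": True}, ["test_fix"], ["test_old"]),
    is_resolved({"test_fix": True, "test_old": False}, ["test_fix"], ["test_old"]),
    is_resolved({"test_fix": False, "test_old": True}, ["test_fix"], ["test_old"]),
    is_resolved({}, ["test_fix"], ["test_old"]),  # tests never ran: counted as unresolved
]
rate = resolved_rate(outcomes)
```

The all-or-nothing rule is what makes the scores so low at first: a patch that fixes the bug but breaks one unrelated test earns nothing.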

That number didn’t stay there. Within a year the field had to admit a separate problem: some SWE-bench problems were underspecified or had flaky test setups, so headline numbers stopped meaning much. On August 13, 2024 OpenAI released SWE-bench Verified: a 500-problem subset, each independently reviewed by three engineers, drawn from a pool of 1,699 candidates, with containerized Docker environments to make runs reproducible. (OpenAI, Aug 2024.) SWE-bench Verified became the headline metric in every frontier-model launch that followed.

And here the eval-shadow thread bites. A benchmark of self-contained "fix this issue, pass these tests" tasks is a beautiful target - and a partial one. It doesn’t capture long-horizon collaboration, code-review pushback, maintainability, or whether the patch would survive a real codebase six months later. Race the metric long enough and you get a model optimized for the metric. The villain across this whole course shows up here wearing a green test bar.

A second eval, briefly

For agents in browsers rather than repos, WebArena (July 2023) put 812 task instructions in a self-hosted environment - an e-commerce site, a Reddit clone, a GitLab, a CMS, plus utilities. At publication, GPT-4 agents reached roughly 14%, against ~78% for humans. (Zhou et al., 2023.) A different shape of task; the same gap.

5. The model gets a mouse and keyboard

Late 2024 collapsed three things into one beat. On October 22, 2024 Anthropic released computer use with the upgraded Claude 3.5 Sonnet - the model takes screenshots as input and emits mouse and keyboard actions, like a person at a desk. The launch numbers: OSWorld screenshot-only at 14.9%, against 7.8% for the next-best system, and 22.0% when the model was allowed more steps; SWE-bench Verified jumping from 33.4% → 49.0% over the previous Claude 3.5 Sonnet; TAU-bench retail 62.6% → 69.2%, airline 36.0% → 46.0%. (Anthropic, Oct 2024.)
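The "screenshots in, mouse and keyboard actions out" contract can be illustrated with a toy desktop. Nothing here reflects the actual computer-use API: `Click`, `TypeText`, `FakeDesktop`, and `scripted_policy` are hypothetical, and the "screenshot" is a text description rather than an image.

```python
from dataclasses import dataclass

@dataclass
class Click:
    x: int
    y: int

@dataclass
class TypeText:
    text: str

@dataclass
class FakeDesktop:
    """Toy stand-in for a screen: one text box occupying a rectangle."""
    box: tuple = (100, 100, 300, 140)   # left, top, right, bottom
    focused: bool = False
    contents: str = ""

    def screenshot(self) -> str:
        return f"textbox focused={self.focused} contents={self.contents!r}"

    def apply(self, action) -> None:
        if isinstance(action, Click):
            left, top, right, bottom = self.box
            self.focused = left <= action.x <= right and top <= action.y <= bottom
        elif isinstance(action, TypeText) and self.focused:
            self.contents += action.text   # keystrokes land only in a focused box

def scripted_policy(shot: str):
    """Stand-in for the model: look at the screen, choose the next action (None = done)."""
    if "focused=False" in shot:
        return Click(200, 120)
    if "contents=''" in shot:
        return TypeText("hello")
    return None

def run(desktop: FakeDesktop, policy, max_steps: int = 5) -> None:
    for _ in range(max_steps):
        action = policy(desktop.screenshot())
        if action is None:
            return
        desktop.apply(action)
```

Running `run(FakeDesktop(), scripted_policy)` clicks into the box, types, and stops. The point of the sketch is the narrow interface: the agent knows nothing about the application except what the screenshot shows, which is why typing before clicking silently does nothing.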

OSWorld is the benchmark that justified the capability: 369 real-world Ubuntu (and some Windows) computer tasks, end-to-end, execution-graded. At publication humans cleared 72.36%, the best model managed 12.24%. (Xie et al., NeurIPS 2024.) Then on January 23, 2025 OpenAI shipped Operator, a research preview running on a new model called CUA (Computer-Using Agent) built on GPT-4o, also browser-and-screenshot driven. (OpenAI, Jan 2025.)

A benchmark, an Anthropic release, an OpenAI release - all inside about three months. The pattern of the chapter is clearest here: it isn’t one paper, it’s a cluster, and the cluster keeps moving.

6. The agent the reader already knows

The on-ramp the reader actually has is a coding agent. The category arrived in March 2024 with Devin from Cognition - pitched as the first AI software engineer, end-to-end: plan, write code, run tests, open a PR. Eleven months later, on February 24, 2025, Anthropic announced Claude 3.7 Sonnet, its first "hybrid reasoning" model with an extended-thinking mode, alongside Claude Code, a command-line research preview for agentic coding - delegate a real engineering task from the terminal and let the agent run. (Anthropic, Feb 2025.)

If you’re reading this on the site this course is being built with, you already know the shape of the loop. Claude Code is a ReAct agent with file I/O, shell access, MCP tools, and the extended-thinking mode of Chapter 5 dropped in at the "thought" step. The metaphor closes in a slightly recursive way - this chapter is being typed by an agent of the kind it describes.

7. What agency needed that conversation didn’t

Pull the chapter back to first principles. A pure chat model needs only enough state to finish the next message. An agent needs four things conversation never asked it to have, and the history above is largely a story of inventing them.

Four ingredients of agency
  • Memory. One context window isn’t enough for long-horizon work - agents need persistent state across steps. Voyager’s skill library, AutoGPT’s file I/O, MCP-mediated stores in 2024–2025.
  • Environment feedback. Execution results, stack traces, screenshot diffs - the "Observation" in ReAct, Voyager’s iterative prompting, every tool response that comes back as text.
  • Error recovery. Re-plan on failure instead of charging ahead. Voyager’s self-verification; Claude Code’s tool-error retries.
  • Planning. Decompose before executing. ReAct’s reasoning trace; extended-thinking modes applied to coding work.
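The four ingredients can be seen working together in one small loop. This is an editorial sketch, like the list itself: `run_agent`, `fake_plan`, and `fake_act` are hypothetical, and the "recovery" here is a plain retry informed by memory rather than a full re-plan.

```python
from typing import Callable

def run_agent(goal: str, plan: Callable, act: Callable, max_retries: int = 2):
    """Plan, act, observe, remember, and retry failed steps within a budget."""
    memory: list[tuple] = []                         # Memory: persists across steps
    for step in plan(goal):                          # Planning: decompose before executing
        for attempt in range(max_retries + 1):
            ok, observation = act(step, memory)      # Environment feedback
            memory.append((step, attempt, observation))
            if ok:
                break
        else:
            return False, memory                     # Error recovery ran out of retries
    return True, memory

def fake_plan(goal: str) -> list[str]:
    # Hypothetical planner: a fixed two-step plan.
    return ["edit", "test"]

def fake_act(step: str, memory: list) -> tuple:
    # Hypothetical environment: the first test run fails; a retry that can see
    # the earlier failure in memory succeeds.
    if step == "test" and not any(s == "test" for s, _, _ in memory):
        return False, "1 test failed"
    return True, f"{step} ok"

ok, memory = run_agent("fix the bug", fake_plan, fake_act)
```

The run ends with three memory entries: the edit, the failed test, and the successful retry. Remove any one ingredient and the loop degrades in a recognizable way: no memory and the retry repeats the same mistake, no feedback and the failure goes unnoticed, no retry and the first error is fatal.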

That list is editorial - no single paper to cite - but each item points to a citation earlier in the chapter. The reason agency took until 2024–2025 to feel real is that all four ingredients had to land together. A model that could think (Chapter 5) but couldn’t remember, observe, recover, or plan was a clever conversationalist. Wire those in and you get the thing on your laptop.

Where the chapter leaves you

The arc so far: architecture (Ch 0), scale (Ch 1), alignment (Ch 2 – 3), prompt (Ch 4), test-time compute (Ch 5), and now agency. Six levers. The next chapter is the synthesis - what happens when you turn them all at once, and the small voice-tool you woke up wanting.