Scale is a strategy - From Attention to Agents

Chapter 0 ended with a bet: one architecture plus pretraining could replace the zoo of bespoke models. This chapter is what happens when you press the "more" button on that bet for three years. The deliverable is not a smarter model - it is a smarter way to plan. By the end, prompting has replaced fine-tuning as the default interface, capability has a power-law graph, Chinchilla has caught the field undertraining its giants, and the transformer has crossed into vision. The chapter closes on the gap that triggers Chapter 2: GPT-3 is powerful and deeply unhelpful.

1. The shift in posture

The headline finding of Brown et al. 2020 is buried in one line of their abstract: GPT-3 "is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model." Read that twice. The model is frozen. The task is described in English. The training signal - if you can still call it that - is a handful of examples typed into the context window. This is in-context learning, and it broke the loop everyone had been running since the BERT era.

Before GPT-3, the recipe for a new task was: take a pretrained model, fine-tune it on your labels, ship the per-task checkpoint. After GPT-3, the recipe was: take the one model, describe the task in English, paste a few examples in front of your input, read the output. The verb changed. You no longer train a model to do a task. You prompt one.

The interface shift

Programming a model became writing for a model. That is the sentence the rest of this course is a slow-motion reaction to. Once the prompt is the interface, "the prompt" becomes something worth studying, something worth optimizing, eventually something worth searching over. Chapters 2, 4, and 7 all live in this consequence.

1.1 In-context learning, on one screen

Same instruction, same query, three different prompts - the only thing that changes is how many demonstrations sit in the context. The model output is canned for this widget; the point is the shape of the conditioning, not a live inference.

[Interactive widget: in-context learning with 0, 1, or 3 shots]
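The three prompts can be sketched in a few lines. This is a minimal illustration, not the widget's actual code: the `build_prompt` helper, the `Input:`/`Output:` layout, and the translation task are all assumptions chosen for readability.

```python
def build_prompt(instruction, demos, query):
    """Assemble a k-shot prompt: instruction, k worked examples, then the query."""
    lines = [instruction, ""]
    for source, target in demos:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the frozen model to complete
    return "\n".join(lines)


# Hypothetical task and demonstrations, chosen only for illustration.
INSTRUCTION = "Translate English to French."
DEMOS = [
    ("cheese", "fromage"),
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]
QUERY = "plush giraffe"

# Same instruction, same query; only the number of demonstrations changes.
prompts = {k: build_prompt(INSTRUCTION, DEMOS[:k], QUERY) for k in (0, 1, 3)}

for k, prompt in prompts.items():
    print(f"--- {k}-shot ---")
    print(prompt)
    print()
```

Nothing about the model changes between the three strings. The only variable is how much conditioning text sits in front of the final `Output:`.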

Few-shot prompting is not "the model memorized this in pretraining." Brown et al. showed the effect generally grows with the number of demonstrations and with model scale, including on tasks engineered to be novel. One way to read that: the base model already contains the competence; the demonstrations choose which competence to evoke. Think of the prompt as a selector over a huge superposition of latent capabilities.

2. The shape of the bet

The shock of GPT-3 was not just what it could do - it was that the result was predictable. Four months earlier, Kaplan et al. 2020 had published the curves. Loss as a power law in compute. Loss as a power law in parameters. Loss as a power law in data. Each clean, each spanning multiple orders of magnitude. From the abstract, verbatim: "The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude."

Definition · power law

A relationship of the form $L(C) = A \cdot C^{-\alpha}$ - a straight line on log-log axes. The exponent $\alpha$ sets the slope of that line: every tenfold increase in $C$ multiplies the loss by $10^{-\alpha}$. Kaplan et al. report $\alpha_C \approx 0.05$ for compute, $\alpha_N \approx 0.076$ for parameters, and $\alpha_D \approx 0.095$ for data, with some of the fitted trends spanning more than seven orders of magnitude.
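To make the exponents concrete, here is a small sketch that converts each one into "how much does the loss fall per 10x". The `loss_ratio` helper is a hypothetical name. The calculation also assumes each variable is scaled on its own, without being bottlenecked by the other two.

```python
# Exponents as quoted above from Kaplan et al. 2020.
EXPONENTS = {"compute": 0.050, "parameters": 0.076, "data": 0.095}


def loss_ratio(alpha, scale_factor):
    """Factor by which L = A * x**(-alpha) shrinks when x grows by scale_factor."""
    return scale_factor ** (-alpha)


for name, alpha in EXPONENTS.items():
    per_decade = 1.0 - loss_ratio(alpha, 10.0)
    print(f"{name:>10}: 10x more -> loss falls by {per_decade:.1%}")
```

With $\alpha_C \approx 0.05$, a tenfold increase in compute trims the loss by roughly 11 percent. The gains are real but expensive, which is why the budget framing matters.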

What is load-bearing here is not the specific exponent. It is the existence of the line. Before scaling laws, "make the model smarter" was research alchemy: try things, see what works, publish. After scaling laws, it is a budget calculation. Pick a compute budget, read the corresponding loss off the line, decide if the gain is worth the money. Capability becomes a function you can plan against.
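"Read the loss off the line" is literal: fit a straight line in log-log space, then extrapolate. The sketch below does this with an ordinary least-squares fit. The data points are synthetic, generated from a made-up law ($A = 10$, $\alpha = 0.05$), and `fit_power_law` and `predict_loss` are hypothetical helper names, not anything from the papers.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (A, alpha) for y = A * x**(-alpha)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(prefactor, alpha, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return prefactor * compute ** (-alpha)


# Synthetic "small run" measurements from a made-up law (A=10, alpha=0.05).
budgets = [1e0, 1e1, 1e2, 1e3]
losses = [10.0 * c ** (-0.05) for c in budgets]

A, alpha = fit_power_law(budgets, losses)
print(f"fitted A={A:.3f}, alpha={alpha:.3f}")
print(f"predicted loss at budget 1e5: {predict_loss(A, alpha, 1e5):.3f}")
```

Real measurements are noisy and the fit is only as good as the range it was fit on, so an extrapolation like the last line is a planning estimate, not a guarantee.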

GPT-3 was the existence proof attached to that line. 175 billion parameters - "10x more than any previous non-sparse language model," in the paper's own framing - trained on about 300 billion tokens, sampled with weighting from a corpus of roughly 500 billion tokens. The compute bill was on the order of thousands of petaflop/s-days. Numbers that would have sounded absurd in 2019 read, in retrospect, as exactly the dose Kaplan's exponents predicted.
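The "thousands of petaflop/s-days" figure can be sanity-checked with the common rule of thumb that training costs about 6 FLOPs per parameter per token ($C \approx 6ND$). The sketch assumes that approximation and the roughly 300 billion training tokens reported by Brown et al.; `training_flops` is a hypothetical helper name.

```python
PFS_DAY_FLOPS = 1e15 * 86_400  # one petaflop/s sustained for one day


def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# Assumption: 175B parameters trained on about 300B tokens.
gpt3_flops = training_flops(175e9, 300e9)
gpt3_pfs_days = gpt3_flops / PFS_DAY_FLOPS

print(f"{gpt3_flops:.2e} FLOPs = {gpt3_pfs_days:,.0f} petaflop/s-days")
```

The estimate lands in the low thousands of petaflop/s-days, consistent with the order of magnitude quoted above.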

2.1 The line, on log-log axes

Slide the budget. Both curves obey the same power law - $L = A \cdot C^{-\alpha}$ - with the same exponent. They differ only in their prefactor $A$, which is set by how you spend the budget across parameters and tokens.

[Interactive plot: loss vs compute, Kaplan vs Chinchilla allocation]

The orange curve traces Kaplan's prescription: pour most of the budget into parameters, keep data modest, stop before convergence. The teal curve traces what Chinchilla would later find is actually compute-optimal: scale params and tokens together. Same exponent, lower prefactor, lower loss at every budget. The gap is the chapter's pivot - and the next section's subject.
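"Same exponent, lower prefactor" means the two curves are parallel on log-log axes, so their ratio is the same at every budget. A minimal sketch, with prefactors and exponent invented purely for illustration (they are not the widget's values or fitted numbers):

```python
def loss(prefactor, alpha, compute):
    """Power-law loss L = A * C**(-alpha)."""
    return prefactor * compute ** (-alpha)


ALPHA = 0.05      # shared exponent (illustrative)
A_ORANGE = 2.6    # hypothetical prefactor, params-heavy allocation
A_TEAL = 2.4      # hypothetical prefactor, balanced allocation

BUDGETS = (1e2, 1e4, 1e6)
ratios = []
for compute in BUDGETS:
    orange = loss(A_ORANGE, ALPHA, compute)
    teal = loss(A_TEAL, ALPHA, compute)
    ratios.append(teal / orange)
    print(f"C={compute:.0e}  orange={orange:.3f}  teal={teal:.3f}  ratio={ratios[-1]:.4f}")
```

The ratio column never moves: it is just `A_TEAL / A_ORANGE`. A better allocation shifts the whole line down without changing its slope.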

3. Chinchilla, or: eat your spinach

Two years of giants followed Kaplan. Gopher (280B). Megatron-Turing NLG (530B). Jurassic-1 (178B). All of them training on token counts roughly comparable to GPT-3's. All of them, it turned out, undertrained.

Hoffmann et al. 2022 at DeepMind trained more than 400 models across a wide range of (params, tokens) combinations, fit a joint scaling law to all of them, and read off the compute-optimal frontier. Their punchline: for every doubling of model size, the number of training tokens should also double. Params and data scale together. Kaplan's "spend on params, hold data modest" was an artifact of the corner of the budget surface he had sampled.
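The "double both" rule falls out of simple algebra if you assume $C \approx 6ND$ and hold the tokens-per-parameter ratio fixed: both $N$ and $D$ then grow as $\sqrt{C}$. The sketch below assumes a ratio of 20 tokens per parameter, which is Chinchilla's own configuration (1.4T tokens on 70B parameters). `compute_optimal` is a hypothetical helper, not the paper's fitted law.

```python
import math

TOKENS_PER_PARAM = 20.0  # assumption: 1.4T tokens / 70B params


def compute_optimal(flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) assuming C = 6*N*D and D = r*N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


base_n, base_d = compute_optimal(1e23)
big_n, big_d = compute_optimal(4e23)  # four times the compute

print(f"1e23 FLOPs: {base_n:.2e} params, {base_d:.2e} tokens")
print(f"4e23 FLOPs: {big_n:.2e} params, {big_d:.2e} tokens")
print(f"params x{big_n / base_n:.1f}, tokens x{big_d / base_d:.1f}")
```

Quadrupling the budget doubles both the parameter count and the token count, which is the balanced scaling the paper argues for.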

The proof was a model named Chinchilla: 70B params, 1.4T tokens - same compute budget as Gopher's 280B params on 300B tokens. Quarter the parameters, four times the data. Same money. The 70B model uniformly and significantly outperformed Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing (530B), including 67.5% on MMLU - seven points above Gopher.
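The "same money" claim is easy to check under the $C \approx 6ND$ rule of thumb. This is an approximation, so the two budgets come out close rather than identical; `training_flops` is a hypothetical helper name.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) as quoted in the text above.
MODELS = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

flops = {name: training_flops(n, d) for name, (n, d) in MODELS.items()}
for name, (n, d) in MODELS.items():
    print(f"{name:>10}: {flops[name]:.2e} FLOPs, {d / n:.1f} tokens per parameter")
```

The two totals land within about 15 percent of each other, while tokens per parameter go from roughly 1 to 20. The budget barely moved; the allocation changed completely.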

What Chinchilla actually changed

Not the existence of the scaling law. Just the ratio. The giants of 2020 and 2021 had been overweighting params and underweighting data. Re-anchored on the right ratio, the same hardware buys a noticeably better model. After Chinchilla, every serious lab refits its data pipeline first and its parameter count second. LLaMA 1, Mistral, and the whole open-weights wave from Chapter 3 all live downstream of this single ratio.

3.1 The two prescriptions, side by side

Rule	Kaplan (2020)	Chinchilla (2022)
Where to spend compute	Mostly parameters	Parameters and tokens equally
Train to convergence?	No - stop early	Yes - feed more tokens
Reference 70B model would use	~140B tokens	~1.4T tokens
Verdict, in hindsight	Undertrained	Compute-optimal

A subtle but important thing the table does not show: Chinchilla did not refute Kaplan, it refit him. The power-law form survives. The mistake was in the corner of (N, D) space that had been sampled. The lesson is methodological as much as numerical - if your scaling law was fit on undertrained models, do not extrapolate it.

4. Naming the noun: foundation models

By mid-2021, a strange thing had happened. The same pretrained model was being adapted - via fine-tuning, prompting, or both - for translation, summarization, code generation, search, classification, and a long tail of bespoke tasks. One model underneath, hundreds of applications on top. The field needed a noun for that thing.

The Stanford CRFM report (Bommasani et al. 2021, 114 authors) crystallized it. From the opening line: "AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character." The label has stuck.

A naming caveat

"Foundation model" was popularized, not coined, by the CRFM report - the term had appeared in adjacent contexts before. Safer phrasing for the historical record: the noun was crystallized by Stanford in August 2021. What mattered for the field was less the etymology and more that everyone now had the same word for the substrate.

5. The transformer walks into vision

The same two years quietly settled an older debate. Convolutions had ruled vision since 2012; attention had ruled NLP since 2017. Two papers stapled them together by noticing that the staple was unnecessary.

ViT (Dosovitskiy et al. 2020): split an image into 16x16 pixel patches, treat each patch as a token, feed them into a vanilla transformer encoder. From the paper: "a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks." With enough pretraining data, ViT matched or beat the best CNNs - using none of the convolutional inductive bias people had spent a decade engineering in.
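"Treat each patch as a token" is a reshape. The sketch below cuts an image into 16x16 patches and flattens each one into a vector, using plain nested lists so it runs without dependencies. The 224x224 RGB input size and the `patchify` name are assumptions for illustration. A full ViT would then apply a learned linear projection and add position embeddings, which this sketch omits.

```python
def patchify(image, patch=16):
    """Cut an H x W x C nested-list image into flattened patch vectors, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    vector.extend(pixel)  # append this pixel's channel values
            tokens.append(vector)
    return tokens


# Dummy 224 x 224 RGB image (all zeros); the size is an illustrative assumption.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)

print(len(tokens), "tokens of dimension", len(tokens[0]))
```

A 224x224 image becomes 196 tokens of 768 numbers each (16 x 16 x 3), a sequence short enough to feed to a standard transformer encoder.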

CLIP (Radford et al. 2021) went further. Scrape 400 million (image, caption) pairs from the web. Train one encoder for images and one for text using a contrastive objective - pull matching pairs together in a shared embedding space, push mismatched ones apart. The result was zero-shot ImageNet at ResNet-50 accuracy, without ever seeing the 1.28M ImageNet training images. Vision had learned to speak the same prompt-able dialect.
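The contrastive objective fits in a few functions. This is a minimal sketch of a symmetric contrastive loss over a batch of matched pairs, not CLIP's training code: the toy 2-D embeddings are made up, and the temperature is fixed at 0.07 here as an assumption, whereas CLIP learns it during training.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


# Toy 2-D embeddings standing in for encoder outputs (made up for illustration).
image_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = contrastive_loss(image_embs, [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]])
shuffled = contrastive_loss(image_embs, [[0.1, 0.9], [-0.9, 0.1], [0.9, 0.1]])

print(f"aligned pairs:  {aligned:.3f}")
print(f"shuffled pairs: {shuffled:.3f}")
```

When each image sits near its own caption, the loss is small; shuffle the captions and it jumps. Training pushes the two encoders toward the first situation across the whole batch.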

For the CV reader

If your last touchpoint with vision was 2019 ResNets and EfficientNets, this is the era where the convnet stops being the only game in town. ViT does not retire the CNN - convolutions remain excellent in small-data regimes - but it shows the same architecture from Chapter 0 generalizes across modalities once you give it enough data. Multimodality is a recurring thread, not a separate chapter. It returns in Chapter 7 when the native-multimodal frontier shows up.

6. The gap that opens Chapter 2

GPT-3 is a 175-billion-parameter document completer. That is the precise framing to carry into the next chapter. Ask it "explain photosynthesis to a child" and, half the time, you do not get an explanation - you get a continuation that looks like a textbook table of contents, or a Q&A list, or someone else's half-finished essay. The model has not been trained to be helpful. It has been trained to predict the next token of arbitrary internet text, and helpful text is a small fraction of that distribution.

The gap has a name: the intent-alignment gap. The capability is there. The behavior is not. Closing that gap is the work of Chapter 2 - it is the work of alignment, and the first solution will be a three-stage pipeline called RLHF.

Chapter pivot · the levers so far

Chapter 0 turned the architecture lever - one stack of self-attention replaced a zoo. Chapter 1 turned the scale lever - pretraining harder unlocked in-context learning and made capability a budget calculation. Chapter 2 turns the alignment lever - shaping a knowledgeable but unhelpful base model into an assistant. Same model. New behavior. No new architecture.

7. What to take with you

Prompting replaced fine-tuning as the default interface. Few-shot examples in the context window steer a frozen model. The verb changed from "train" to "prompt."
Capability became predictable. Loss is a power law in compute, params, and data. Bigger-is-better stops being a hunch; it is a graph you can read.
The ratio matters. Chinchilla showed that the 2020-era giants were undertrained - data and params should scale together, not just params.
One model became the substrate. "Foundation model" is the noun the field needed once everything downstream was a thin wrapper on the same pretrained backbone.
Vision joined. ViT and CLIP show the transformer is not a text architecture - it is a token architecture. Patches are tokens.
The villain enters. A powerful base model is not a helpful one. Chapter 2 fixes that.

Sources: Brown et al. 2020 (arXiv:2005.14165); Kaplan et al. 2020 (arXiv:2001.08361); Hoffmann et al. 2022 (arXiv:2203.15556); Bommasani et al. 2021 (arXiv:2108.07258); Dosovitskiy et al. 2020 (arXiv:2010.11929); Radford et al. 2021 (arXiv:2103.00020).