Chapter 0 · 2017 – 2019 · lever: architecture
Where you left off
The last time you looked, NLP was a zoo of bespoke models - LSTMs with attention bolted on for translation, CNNs for sentence classification, hand-engineered features everywhere else. By the end of 2019 the zoo was gone, replaced by one architecture and one recipe: a Transformer, pretrained on a pile of text, fine-tuned for whatever you needed. This chapter re-anchors you at that swap. We walk the 2017 paper that made it possible, then the 2018–2019 split between BERT and GPT that decided what "pretraining" would mean for the decade. By the end you'll be standing exactly where GPT-2 stood at the close of 2019 - on the doorstep of scale.
1. One architecture, no recurrence
June 2017. Eight authors at Google publish a paper with a cocky title and a quiet abstract: "Attention Is All You Need" (Vaswani et al., 2017). The headline result is a new state of the art on WMT 2014 English–German translation, 28.4 BLEU - two points over the previous best - trained in 3.5 days on eight P100 GPUs. The architecture is "based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."
Read that sentence with pre-2017 eyes and it sounds reckless. The seq2seq machinery you knew was an LSTM encoder feeding an LSTM decoder, with an attention layer attached to help the decoder peek back at the source. Bahdanau et al. (2014) added the attention; Vaswani et al. throw away everything else and keep only it. The stack that remains is just self-attention and position-wise feed-forwards, repeated.
Project every token into three vectors - query, key, value - then let every position aggregate a weighted mix of every other position's value, weighted by how well its query matches the other's key:
$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left( \htmlData{tip=scale the dot product by sqrt of d_k so the softmax does not saturate when d_k is large}{\frac{QK^\top}{\sqrt{d_k}}} \right) V$
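The formula is short enough to write out directly. What follows is a minimal NumPy sketch for a single sequence, not the paper's implementation: the toy sizes and random inputs are assumptions for illustration, and batching, masking, and the learned projections that produce Q, K, and V are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence.

    Q, K: (n, d_k); V: (n, d_v). Returns (output, weights).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): query i scored against key j
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights          # weighted mix of the value vectors

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 64                  # toy sizes, chosen for illustration
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))         # every row of w sums to 1
```

Row i of `w` is position i's attention distribution over all positions, and `out[i]` is the matching blend of value vectors.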
Multi-head attention runs h = 8 of these in parallel on different learned subspaces, concatenates the outputs, and projects once more. Each head can specialize.
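A rough sketch of the split-attend-concatenate-project pattern, using h = 8 and a model width of 512 so that each head works in a 64-dimensional subspace. The weight matrices here are random stand-ins for learned parameters, and their names (`W_q`, `W_k`, `W_v`, `W_o`) are illustrative, not taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Multi-head self-attention over one sequence X of shape (n, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h                                  # per-head width

    def split_heads(M):
        # (n, d_model) -> (h, n, d_k): each head gets its own slice of features
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n), one map per head
    heads = softmax(scores, axis=-1) @ V                # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # heads side by side
    return concat @ W_o                                 # final output projection

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
X = rng.normal(size=(n, d_model))
# Random stand-ins for learned weights (assumption for illustration).
W_q, W_k, W_v, W_o = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                      for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(Y.shape)
```

The output has the same shape as the input, which is what lets the block be stacked.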
The "why" is mostly about hardware. An RNN must finish position i before it can start i+1 - the sequence is the unrolling. Self-attention has no such dependency: every position attends to every other in a single matmul. On a GPU, that difference is the whole game.
The accounting underneath: self-attention is $O(n^2 \cdot \htmlData{tip=model hidden dimension; d_model = 512 in the base Transformer}{d})$ floating-point work per layer but only $O(1)$ sequential operations, while a recurrent layer is $O(n \cdot d^2)$ work and $O(n)$ sequential (Vaswani et al., 2017, §4). Quadratic-in-sequence cost is the price the paper pays for total parallelism. In the GPU era, that trade was a gift.
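To get a feel for where the trade flips, it helps to plug in numbers. The sketch below drops all constant factors and compares only the leading terms, so read the ratios as orders of magnitude; the sequence lengths are arbitrary picks.

```python
def self_attention_cost(n, d):
    """Rough per-layer work for self-attention: n^2 * d, constants ignored."""
    return n * n * d

def recurrent_cost(n, d):
    """Rough per-layer work for a recurrent layer: n * d^2, constants ignored."""
    return n * d * d

d = 512  # base Transformer width
for n in (50, 512, 4096):
    ratio = self_attention_cost(n, d) / recurrent_cost(n, d)   # simplifies to n / d
    print(f"n={n:5d}  attention/recurrent work = {ratio:.2f}  "
          f"sequential steps: attention 1, recurrent {n}")
```

The ratio reduces to n / d, so self-attention does less raw work than recurrence whenever the sequence is shorter than the model width, and it pays the quadratic penalty only on long inputs.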
What the block looks like up close
One encoder block is a small, repeatable unit - the kind of thing that gets stacked N = 6 times in the base model.
Two pieces deserve a flag because they're what makes deep stacks trainable, not just expressible. The positional encoding added at the bottom is non-negotiable: without it, swap any two input tokens and self-attention returns the same outputs with the same two rows swapped - it cannot tell one ordering from the other (the operation is permutation-equivariant). And the residual connections threading around every sub-layer let gradients reach the bottom of a deep stack - lift them out and training falls apart.
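The permutation claim is easy to check numerically. This sketch runs a single-head self-attention layer with random stand-in weights (an assumption for illustration) on a toy sequence, first without positions and then with the sinusoidal encoding from Vaswani et al. (2017) added to the input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def sinusoidal_positions(n, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pos = np.arange(n)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
n, d = 6, 16                                   # toy sizes
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
perm = np.array([1, 0, 2, 3, 4, 5])            # swap the first two tokens

# Without positions: permuting the input only permutes the output rows.
plain = self_attention(X, W_q, W_k, W_v)
plain_swapped = self_attention(X[perm], W_q, W_k, W_v)
print(np.allclose(plain_swapped, plain[perm]))        # True

# With positions added, the swapped sequence is a genuinely different input.
pe = sinusoidal_positions(n, d)
with_pos = self_attention(X + pe, W_q, W_k, W_v)
with_pos_swapped = self_attention(X[perm] + pe, W_q, W_k, W_v)
print(np.allclose(with_pos_swapped, with_pos[perm]))  # False
```

Once the position signal is mixed into the token vectors, word order becomes visible to a layer that otherwise treats its input as a set.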
Convolution bakes in spatial locality. Recurrence bakes in left-to-right time. Self-attention bakes in nothing - any token can attend to any other. The Transformer is the reader's first concrete instance of what Rich Sutton called the bitter lesson: general computation plus more compute beats clever inductive bias. It's a pattern you'll see again in every chapter that follows.
2. Pretrain once, fine-tune cheap
The Transformer arrived at the same moment NLP was figuring out a separate trick: transfer learning. Computer vision had been doing it for years - train a CNN on ImageNet, lop off the head, fine-tune on your tiny task dataset. NLP held out longer because words don't have an ImageNet, but by 2018 two papers showed what the language-side version looked like. ULMFiT (Howard & Ruder) pretrained a language model and fine-tuned it for text classification with a discriminative learning-rate schedule. ELMo (Peters et al.) showed that the hidden states of a bidirectional LM were a far better source of word representations than fixed embeddings like word2vec or GloVe.
Both were stepping stones. The shape the field actually settled on was set by two models that landed within four months of each other, both built on the Transformer.
GPT-1: a decoder that just predicts the next token
In June 2018 OpenAI released a tech report titled Improving Language Understanding by Generative Pre-Training (Radford et al., 2018). The recipe: take the decoder half of a Transformer, train it on next-token prediction over BookCorpus (roughly 7,000 unpublished books), then fine-tune the whole thing on each downstream task with a small task-specific head. Twelve layers, 768 hidden, around 117 million parameters. It beat the prior state of the art on 9 of the 12 NLU tasks the paper studied.
The shape was simple and consequential: a decoder-only Transformer, looking only at tokens to its left, generating one token at a time. That asymmetry - the model can only see the past - is what makes it good at generation.
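"Only sees the past" is usually enforced with a mask on the attention scores. Here is a minimal sketch of that idea, assuming a single head and using the raw token vectors as queries, keys, and values to keep it short (a real model would apply learned projections first).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Self-attention where position i may only attend to positions <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where key j > query i
    scores = np.where(future, -np.inf, scores)           # future gets zero weight
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
out, w = causal_self_attention(X, X, X)

X2 = X.copy()
X2[-1] = rng.normal(size=d)                  # change only the last token
out2, _ = causal_self_attention(X2, X2, X2)

print(np.allclose(np.triu(w, k=1), 0))       # True: no weight on future positions
print(np.allclose(out[:-1], out2[:-1]))      # True: earlier outputs are unaffected
```

Because changing a later token never alters an earlier output, every position can be trained to predict its successor in one parallel pass, and generation can proceed one token at a time.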
BERT: an encoder that fills in blanks
Four months later, Google publishes BERT (Devlin et al., 2018) and the score tables collapse: +7.7 on the GLUE average, 93.2 F1 on SQuAD v1.1, new state of the art on eleven tasks at once. The architecture is the encoder half of the Transformer, but the real move is the training objective. Instead of predicting the next token autoregressively, BERT masks 15% of the input tokens at random and asks the model to fill them back in (Devlin et al., NAACL 2019).
Given a sentence with some positions replaced by [MASK], maximize the log-likelihood of the true tokens at those positions, conditioned on the entire unmasked context to the left and right:
$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta\!\left( x_i \mid \htmlData{tip=the input sequence with all masked positions hidden — the model sees both left and right context}{x_{\setminus \mathcal{M}}} \right)$
Bidirectional context is the unlock. Next-token prediction can only ever look left; MLM looks both ways, which is why BERT's representations are stronger for understanding tasks.
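A small sketch of the masking step and the loss, under stated assumptions: the vocabulary is a toy one, `MASK_ID` is a hypothetical id, and random logits stand in for a model. It also includes a detail from the BERT paper that the formula glosses over: of the 15% of positions selected, 80% become [MASK], 10% become a random token, and 10% are left unchanged.

```python
import numpy as np

MASK_ID = 0          # hypothetical id for the [MASK] token
VOCAB_SIZE = 100     # toy vocabulary; ids 1..99 are ordinary tokens

def mask_tokens(token_ids, rng, select_prob=0.15):
    """Pick positions to predict and corrupt them with the 80/10/10 split.

    Returns (corrupted_ids, selected), where `selected` is a boolean array
    marking the positions that contribute to the loss.
    """
    token_ids = np.asarray(token_ids)
    selected = rng.random(token_ids.shape) < select_prob
    roll = rng.random(token_ids.shape)
    corrupted = token_ids.copy()
    corrupted[selected & (roll < 0.8)] = MASK_ID               # 80%: [MASK]
    randomize = selected & (roll >= 0.8) & (roll < 0.9)        # 10%: random token
    corrupted[randomize] = rng.integers(1, VOCAB_SIZE, size=randomize.sum())
    return corrupted, selected                                 # last 10%: unchanged

def mlm_loss(logits, targets, selected):
    """Sum of -log p(x_i | corrupted input) over the selected positions only."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -(picked * selected).sum()

rng = np.random.default_rng(0)
tokens = rng.integers(1, VOCAB_SIZE, size=200)
corrupted, selected = mask_tokens(tokens, rng)
logits = rng.normal(size=(len(tokens), VOCAB_SIZE))   # stand-in for model output
print(selected.mean(), mlm_loss(logits, tokens, selected))
```

Only the selected positions contribute to the loss, so each pass supervises roughly one token in seven. A next-token objective, by contrast, gets a training signal from every position.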
BERT-base is 110M parameters; BERT-large is 340M. Both are pretrained on BookCorpus plus English Wikipedia (about 3.3B words total), and after pretraining you fine-tune the whole model with a one-layer classification head on whatever task you need (Devlin et al., 2018, §3). The fine-tuning takes minutes to hours on a single GPU. The pretraining you only do once.
The split that decided what came next
GPT and BERT are the same machine pointed in two directions. The contrast looks small on paper and turned out to define the next half-decade.
| | BERT (Oct 2018) | GPT-1 / GPT-2 (Jun 2018 / Feb 2019) |
|---|---|---|
| Shape | Encoder-only, bidirectional | Decoder-only, left-to-right |
| Objective | Mask 15% of tokens, predict them | Predict the next token, always |
| Natural fit | Classification, QA, retrieval - understanding | Text generation - writing |
| Sizes shipped | 110M / 340M | ~117M / up to 1.5B |
| Use pattern | Pretrain, then fine-tune on every new task | Pretrain, then fine-tune - soon, just prompt |
For a year, BERT looked like the winner. It dominated every leaderboard a CV person would recognize - the GLUE benchmark, SQuAD, classification tasks of every shape - and the academic community fanned out into a small industry of variants (RoBERTa, ALBERT, DistilBERT). GPT-1 was a strong paper that got less attention. The conventional wisdom said: encoders are how you do NLP now.
The conventional wisdom was about to be wrong.
3. The doorstep of scale
February 2019. OpenAI announces GPT-2 and does something the field had not seen: they decline to release the full model, citing misuse risk. They publish a paper titled "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019), push out a 124M model, then a 355M, then a 774M over the year, and finally release the full 1.5B-parameter version on 5 November 2019. Same shape as GPT-1 - decoder-only Transformer, next-token prediction - 48 layers deep, trained on roughly 40 GB of text scraped from outbound Reddit links with at least three karma (the WebText corpus).
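The parameter counts can be sanity-checked with a common back-of-envelope rule: about 12·d² weights per block (4·d² for the attention projections, 8·d² for a feed-forward layer with 4x expansion) plus the token embedding table. The hidden widths and vocabulary sizes below are assumptions taken from the published model configurations, not from the text above, and the estimate ignores biases, layer norms, and position embeddings.

```python
def approx_params(n_layers, d_model, vocab_size):
    """Rough Transformer parameter count.

    Per block: ~4*d^2 for the Q, K, V, and output projections plus ~8*d^2
    for a feed-forward layer with 4x expansion. Token embeddings are counted
    once; biases, layer norms, and position embeddings are ignored.
    """
    return 12 * n_layers * d_model ** 2 + vocab_size * d_model

# (layers, hidden width, vocabulary size) - assumed values, see the note above.
configs = {
    "GPT-1":        (12, 768, 40478),
    "GPT-2 (full)": (48, 1600, 50257),
}
for name, (layers, width, vocab) in configs.items():
    print(f"{name}: ~{approx_params(layers, width, vocab) / 1e6:.0f}M parameters")
```

Under those assumptions the estimate lands close to the reported figures of roughly 117M and 1.5B, and it makes the scaling lever visible: width enters squared, depth only linearly.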
Two findings from the GPT-2 paper landed harder than the model itself.
The first was zero-shot task performance. Without any task-specific fine-tuning - no labeled examples, no gradient updates - GPT-2 set a new state of the art on 7 of 8 language-modeling benchmarks, and it could attempt other tasks just by reading a prompt that framed them. Translation, summarization, question answering: ask in plain English, get an answer back. Quality on those tasks was rough, but the curve was unmistakable. Task structure was being learned implicitly, from the data, without anyone telling the model what the tasks were.
The second was the staged release itself. It was the first time a frontier lab said, in effect, "this model is too dangerous to release." That sentence is going to come back - it's the seed of the entire alignment story in Chapter 2.
By late 2019 a clean picture has formed. One architecture - the Transformer - replaces LSTMs with attention bolted on, task-specific CNNs, and every other bespoke contraption. Pretraining on raw text replaces hand-labeled supervision. Fine-tuning replaces architecture design. The question stops being "what model for this task?" and starts being "how big, on what data, to do what?" Attention, in the literal sense, ate everything.
GPT-2 also leaves a thread dangling. Zero-shot worked - barely. Fine-tuning still beat it on every task that mattered. The decoder-only side of the family looked like an interesting curiosity; the encoder side looked like the future. A reasonable bet at the close of 2019 would have been that scaling BERT was the next move.
The next chapter is what happened instead.