AI/LLM Mastery · Part 9 of 20 — we built the engine in Part 8, but its dials started random, so it spoke gibberish. Pretraining is the enormous one-time phase that fills those dials with the patterns of human language. The data, the objective, the cost, and what emerges.
An empty brain
By the end of Part 8 we had a complete GPT — tokeniser, embeddings, a deep stack of decoder blocks, causal masking, an output head, the lot. And it was useless. Every weight in it started as a random number, so asking it to continue “The cat sat on the…” would produce something like “the the qx blarg the.” We built the brain’s wiring but left it empty. This part is about filling it.
That filling process is called pretraining, and it is where essentially everything an LLM knows about language and the world comes from. It is also, by a wide margin, the most expensive and dramatic step in the whole pipeline. We will look at four things: the data it learns from, the objective it optimises, the staggering compute it takes, and what actually emerges as it runs. By the end you will understand both why these models are so capable and why so few organisations can build one.
Where pretraining sits
First, where pretraining fits in the life of a model.
A freshly-built GPT has random dials and outputs noise. Pretraining runs the exact training loop from Part 4 — forward pass, loss, backpropagation, update — over a colossal amount of text, nudging every dial trillions of times. What comes out the other side is a base model (also called a foundation model): the same architecture, but with dials now saturated in the patterns of human language, so it predicts and continues text fluently. The “pre” in pretraining means it happens before any task-specific tuning (which we cover in Parts 12–13). This single phase is the well it all draws from.
The fuel: trillions of tokens
The only ingredient pretraining needs is text — because the objective, as we will see, is just predicting the next token. The catch is the amount.
The text comes from a giant, mixed corpus: web crawls (most famously Common Crawl, a public scrape of much of the open web), books, Wikipedia, news, forums and large quantities of code. Mixing sources matters — it exposes the model to many styles, domains and languages. The scale is hard to picture: measured in tokens (Part 2), GPT-3 was trained on roughly 300 billion tokens; today’s frontier models train on several trillion. That is far more text than any human could read in a thousand lifetimes. And quality counts as much as quantity: raw web text is full of junk, spam and duplicates, so it is heavily filtered and deduplicated first — garbage in, garbage out. That whole data pipeline is involved enough to get its own article, Part 11.
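To make "filtered and deduplicated" concrete, here is a minimal sketch of the idea in plain Python. It is an illustration only, not how any particular lab's pipeline works: the helper names (`normalize`, `looks_like_junk`, `clean_corpus`) and the thresholds are hypothetical, and real pipelines add fuzzy deduplication, language detection and learned quality classifiers on top.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def looks_like_junk(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> bool:
    """Crude, hypothetical heuristic: too short, or dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio


def clean_corpus(documents):
    """Drop junk and exact duplicates (after normalization), keeping first occurrences."""
    seen = set()
    kept = []
    for doc in documents:
        if looks_like_junk(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


raw_docs = [
    "The cat sat on the mat and looked out of the window.",
    "the cat sat on   the mat and looked out of the window.",  # duplicate once normalized
    "buy buy buy buy buy buy now",                              # spammy repetition
    "Click here",                                               # too short
    "Pretraining fills a model's weights with patterns from text.",
]
cleaned = clean_corpus(raw_docs)
print(len(raw_docs), "->", len(cleaned))
```

Of the five toy documents, only the first and the last survive. Hashing the normalized text keeps memory use small, since only the digests need to be remembered rather than the documents themselves.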
The objective: predict the next token
With the text in hand, what exactly is the model trying to do? Nothing new — it is the same objective from Part 1 and Part 8: predict the next token.
Slide along the corpus; at each position, hide the next token, let the model assign a probability to every token in the vocabulary, and look up the probability it gave to the token that actually comes next. If that probability is high, good; if the model was caught off guard, that is bad. The cross-entropy loss is a precise measure of that surprise: the negative logarithm of the probability assigned to the true token, so it is close to zero when the true token was predicted well and large when it was not. Crucially, this is self-supervised (Part 1): the “label” for every position is just the next token already sitting in the text, so no humans label anything. That is what makes training on trillions of tokens possible at all.
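A tiny worked example may help here. For a single position, cross-entropy is the negative natural logarithm of the probability the model gave to the token that actually came next. The two probability tables below are invented for illustration, and the 50,000-token vocabulary is an assumed round number, not a figure from this series.

```python
import math


def cross_entropy(predicted_probs: dict, true_token: str) -> float:
    """Loss for one position: negative log of the probability given to the true token."""
    return -math.log(predicted_probs[true_token])


# Hypothetical model outputs for the context "The cat sat on the"; the numbers are made up.
confident = {"mat": 0.90, "sofa": 0.05, "moon": 0.03, "qx": 0.02}
surprised = {"mat": 0.02, "sofa": 0.05, "moon": 0.03, "qx": 0.90}

loss_confident = cross_entropy(confident, "mat")
loss_surprised = cross_entropy(surprised, "mat")

# A model that spreads probability evenly over an assumed 50,000-token vocabulary.
loss_uniform = -math.log(1 / 50_000)

print(f"true token predicted well:   loss = {loss_confident:.3f}")
print(f"true token predicted poorly: loss = {loss_surprised:.3f}")
print(f"uniform guess over 50,000:   loss = {loss_uniform:.3f}")
```

The well-predicted case gives a loss of about 0.105 and the poorly predicted case about 3.912. The uniform guess, roughly 10.8, is a useful reference point for where an untrained model's loss begins.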
for batch in corpus:                                 # trillions of tokens, in chunks
    preds = gpt(batch)                               # forward: predict next token at every position (Part 8)
    loss = cross_entropy(preds, actual_next_tokens)  # how surprised?
    grads = backward(loss)                           # backprop (Part 4)
    update(weights, grads, lr)                       # nudge every dial downhill
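The loop above is pseudocode. As a runnable stand-in, here is a deliberately tiny version in plain Python, under heavy assumptions: instead of a GPT it trains a character-level lookup table (one logit for every pair of previous and next character), the corpus is two sentences, and the gradient is written out by hand rather than computed by backpropagation. It is a sketch of the same forward, loss, gradient, update rhythm, not a real pretraining run.

```python
import math
import random

random.seed(0)

text = "the cat sat on the mat. the cat sat on the hat. "
vocab = sorted(set(text))
index = {ch: i for i, ch in enumerate(vocab)}
tokens = [index[ch] for ch in text]
V = len(vocab)

# The "dials": one logit per (previous token, next token) pair, starting random.
weights = [[random.gauss(0.0, 1.0) for _ in range(V)] for _ in range(V)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch(lr):
    """One pass over the text: forward, loss, gradient and update at every position."""
    total_loss = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(weights[prev])           # forward pass
        total_loss += -math.log(probs[nxt])      # cross-entropy at this position
        for j in range(V):
            # For softmax plus cross-entropy, d(loss)/d(logit j) = probs[j] - one_hot[j].
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            weights[prev][j] -= lr * grad        # nudge the dial downhill
    return total_loss / (len(tokens) - 1)


losses = [train_epoch(lr=0.5) for _ in range(50)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

The average loss falls from the first pass to the last as the table absorbs which characters tend to follow which. A real run differs in scale and architecture, but the loop has the same four steps.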
That is the entire training program. One humble objective, applied an astronomical number of times. As in Part 1, the model only gets good at the guessing game by quietly absorbing grammar, facts and reasoning along the way.
The price: thousands of GPUs, for months
That innocent-looking loop hides a brutal amount of arithmetic.
Each dial-nudge is cheap, but you run a full forward and backward pass for every token, across trillions of tokens, through billions of parameters. Multiply it out and the numbers are astronomical. No single machine can do it in a reasonable time, so pretraining runs across thousands of specialised chips (GPUs or TPUs) in parallel, often for weeks to months without stopping. The electricity and hardware add up to millions of dollars for one large model, and more for the largest, which is precisely why only a handful of well-funded labs train frontier base models from scratch. The one consolation: this cost is paid once, up front. Adapting the finished model (Part 12) and using it are cheap by comparison. The base model is the heavy capital investment.
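To "multiply it out" roughly, a common rule of thumb puts training cost at about 6 floating-point operations per parameter per token (forward plus backward pass). The sketch below applies it to a GPT-3-sized run. The 175 billion parameter count is the publicly reported figure and the 300 billion tokens come from the text above; the cluster (1,000 chips, 100 TFLOP/s peak each, 40% sustained utilisation) is entirely hypothetical.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb cost: about 6 floating-point operations per parameter per token."""
    return 6.0 * n_params * n_tokens


def wall_clock_days(total_flops: float, n_chips: int,
                    peak_flops_per_chip: float, utilization: float) -> float:
    """Days of continuous running if every chip sustains `utilization` of its peak."""
    seconds = total_flops / (n_chips * peak_flops_per_chip * utilization)
    return seconds / 86_400


# GPT-3-scale example: 175 billion parameters, 300 billion training tokens.
flops = training_flops(175e9, 300e9)

# Hypothetical cluster: 1,000 chips, 100 TFLOP/s peak each, 40% sustained utilization.
days = wall_clock_days(flops, n_chips=1_000, peak_flops_per_chip=100e12, utilization=0.4)

print(f"total compute: {flops:.2e} FLOPs")
print(f"wall-clock time on the hypothetical cluster: {days:.0f} days")
```

Under these assumptions the run needs about 3.15e23 floating-point operations and roughly 91 days of uninterrupted running. Changing the chip count or utilisation shifts the duration proportionally, which is why both the hardware and how efficiently it is used matter so much.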
The loss curve: fast wins, then a grind
If you watch the loss during all this, it traces a remarkably consistent shape.
It starts high (the random model is hopeless), then drops steeply as the model snaps up the easy, high-value patterns: common words, spelling, basic grammar. After that the curve flattens into a long, slow grind, where each further improvement demands far more data and compute. These are diminishing returns. The striking thing is how smooth and predictable this curve is. It is regular enough that researchers can forecast a model’s final loss before training even finishes, a predictability so reliable it became a field of its own, the scaling laws of Part 10. And the loss is not just a number: lower next-token loss tends to go hand in hand with a better model across a wide range of downstream tasks.
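The "fast wins, then a grind" shape is what a power law looks like. The sketch below uses a made-up curve of that form, with invented constants (`a`, `b` and `floor` are illustrative, not fitted to any real model), to show how each tenfold increase in compute buys a smaller drop in loss than the one before. The real fitted versions are the subject of Part 10.

```python
def sketch_loss(compute: float, a: float = 10.0, b: float = 0.1, floor: float = 1.5) -> float:
    """Illustrative power law: loss falls as compute**(-b) toward an irreducible floor."""
    return a * compute ** (-b) + floor


budgets = [10.0 ** k for k in range(0, 7)]   # 1x, 10x, ..., 1,000,000x (arbitrary units)
curve = [sketch_loss(c) for c in budgets]

# Improvement bought by each successive tenfold increase in compute.
gains = [before - after for before, after in zip(curve, curve[1:])]

for c, loss in zip(budgets, curve):
    print(f"compute x{c:>9,.0f}  loss {loss:.3f}")
print("gain per 10x step:", ", ".join(f"{g:.3f}" for g in gains))
```

Every step still lowers the loss, but each gain is smaller than the last. Because the shape is so regular, a few early points on such a curve are enough to extrapolate where it is heading.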
What emerges: knowledge as a side-effect
So as that loss falls, what is the model actually gaining? The capabilities tend to appear in a rough order.
First the surface patterns: spelling, punctuation, common words, basic grammar — the frequent, cheap stuff that predicting text rewards immediately. Then deeper structure: world facts, syntax, translation, simple question answering — because to predict text about the world, a model is forced to absorb the world that text describes. And at sufficient scale, models show genuinely surprising emergent abilities that smaller ones simply lack: multi-step reasoning, arithmetic patterns, and “in-context learning” (following examples you put in the prompt). These can appear rather suddenly as the model crosses a size threshold. Underneath, it is all one thing — the model compressing the patterns of human text into its dials. Push next-word prediction hard enough, on enough data, and what falls out looks a great deal like knowledge.
The catch: a base model is not an assistant
There is one twist that surprises almost everyone, and it matters for everything that follows.
A pretrained base model is a magnificent text-continuer, but continuing text is all it learned to do; it is not yet a helpful assistant. Prompt a raw base model with “What is the capital of France?” and it may not answer at all. It might instead continue with “What is the capital of Germany? What is the capital of Italy?”, because in its training data a question is often followed by more questions, and it is faithfully continuing that pattern. The unsettling part: the model knows the answer is Paris; the fact is sitting in its dials. It simply was never taught the behaviour of answering helpfully. Bridging that gap, from “knows things” to “acts like a helpful assistant”, is the job of fine-tuning and RLHF in Parts 12–13.
Pretraining, and the question it raises
Pretraining, then, is one big idea executed at an almost absurd scale.
One self-supervised objective (predict the next token), trillions of tokens of filtered text, the Part 4 training loop, and weeks to months on thousands of GPUs costing millions. Out comes a base model whose dials hold the patterns of human language. The random brain from Part 8 is now full.
That smooth, predictable loss curve, though, raises a very practical and very expensive question. Say you have a fixed budget of compute to spend on pretraining. Should you spend it on a bigger model, or on training a smaller model for longer on more data? Get that trade-off wrong and you waste millions. The answer turns out to follow precise mathematical rules — the scaling laws — and getting them right (the famous Chinchilla result) is what Part 10 is all about.