Scaling Laws

Part 10 of the AI/LLM mastery series — the maths of "bigger, more data, more compute". The three levers (parameters, tokens, compute), why loss falls as a power law with diminishing returns, the compute budget (C ≈ 6ND), the Kaplan-vs-Chinchilla story (and why GPT-3 was under-trained), the ~20-tokens-per-parameter rule, the inference-cost twist, and how predictability de-risks $100M training runs.

AI/LLM Mastery · Part 10 of 20 — the loss curve from Part 9 was so smooth it became a science. Scaling laws are the maths of "bigger, more data, more compute" — and the famous Chinchilla result that proved almost everyone had been doing it wrong.

The most expensive question in AI

Part 9 ended on a cliffhanger disguised as a chart. The pretraining loss curve was eerily smooth and predictable, and it left us with a very expensive question: if you have a fixed budget of compute, should you spend it on a bigger model, or on training a smaller model for longer on more data? Guess wrong and you set fire to millions of dollars. This part is the answer, and it is one of the few places in machine learning where vague intuition gives way to genuine, predictive maths.

These are the scaling laws: the rules that connect how big a model is, how much data it sees, and how much compute it burns, to how good it ends up. We will build them up gently — every symbol defined — and then walk through the plot twist that reshaped the entire field in 2022. By the end you will understand why a smaller model sometimes beats a bigger one, and why people can now forecast a hundred-million-dollar training run before it starts.

Three levers: parameters, data, compute

First, what actually moves the loss? Three things, and the scaling laws are the rules tying them together.

There are exactly three levers. N — the number of parameters (the dials from Part 4); a bigger model. D — the amount of data, measured in tokens (Part 2); more text. C — the total compute you spend (roughly, how much hardware for how long). Turn any one of them up and the loss goes down, smoothly and predictably. The scaling laws are simply the precise formulas relating N, D and C to the loss — and the surprise is that such simple relationships hold across many orders of magnitude.

It is a power law: diminishing returns

How does the loss fall as you scale? Not in a straight, steady line — it follows a power law, and understanding its shape saves you from very expensive disappointment.

A power law means the loss falls in proportion to scale raised to a small negative power, so each further improvement demands a multiplicative, not additive, increase in scale. The tell-tale sign: plot loss against scale on a log-log chart (both axes in powers of ten) and the curve becomes a near-perfect straight line. In practical terms, to drop the loss by one notch you need roughly 10× more scale; for the next notch, another 10×. Going from a bad model to a decent one is cheap; squeezing a great model out of a good one costs a fortune. There is also an irreducible loss, a floor you can never cross, because some of what comes next in text is genuinely unpredictable (no model can be sure whether you will write “cat” or “dog”). Scaling buys you progress toward that floor, never past it.
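To make the shape concrete, here is a minimal sketch of a power law with an irreducible floor. The functional form (floor plus a coefficient times scale to a negative power) is a common way to write such curves, but the three constants below are invented purely for illustration and do not come from any paper.

```python
# Illustrative power law with an irreducible floor.
# All three constants are made-up values, chosen only to show the shape.
FLOOR = 1.7    # irreducible loss (hypothetical)
A = 10.0       # coefficient (hypothetical)
ALPHA = 0.1    # power-law exponent (hypothetical)


def loss(scale: float) -> float:
    """Loss at a given scale (parameters, tokens, or compute)."""
    return FLOOR + A * scale ** (-ALPHA)


def reducible(scale: float) -> float:
    """The part of the loss that more scale can still remove."""
    return loss(scale) - FLOOR


if __name__ == "__main__":
    for exponent in range(6, 13, 2):
        scale = 10.0 ** exponent
        print(f"scale=1e{exponent:<2}  loss={loss(scale):.3f}  "
              f"reducible={reducible(scale):.3f}")
    # Every 10x in scale multiplies the reducible loss by the same factor,
    # which is why the curve is a straight line on a log-log chart.
    print("factor per 10x:", round(reducible(1e7) / reducible(1e6), 4))
```

With these toy constants, each extra 10× of scale removes a little over a fifth of whatever reducible loss is left, and the total never drops below the floor.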

The budget: compute ≈ 6 × N × D

Now the constraint that makes this a real decision. You never have unlimited everything — you have a fixed compute budget.

the budget, and the trade-off
# a standard approximation for transformer training compute (FLOPs):
C  ≈  6 × N × D
#   N = parameters, D = training tokens

# For a FIXED C, N and D trade off: more params -> fewer tokens, and vice versa.

That little formula is the whole drama. Because compute is roughly proportional to parameters times tokens, a fixed budget locks N and D together: choose a bigger model and you can afford fewer tokens; choose more tokens and the model must shrink. You cannot maximise both. So the real question becomes sharp and quantitative: given my fixed C, what split of N and D gives the lowest loss? Two landmark papers gave opposite answers, two years apart.
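The trade-off is easy to see in a few lines of Python. This is only a sketch built on the C ≈ 6 × N × D approximation from the text; the function names and the 1e23 FLOP budget are hypothetical, picked for illustration.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs, using C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def affordable_tokens(budget_flops: float, n_params: float) -> float:
    """Tokens D you can afford for a model of size N under a fixed budget C."""
    return budget_flops / (6.0 * n_params)


if __name__ == "__main__":
    budget = 1e23  # hypothetical fixed compute budget, in FLOPs
    for n_params in (1e9, 1e10, 1e11):
        d = affordable_tokens(budget, n_params)
        print(f"N={n_params:.0e}  D={d:.2e}  tokens/param={d / n_params:.1f}")
```

Each 10× increase in model size cuts the affordable tokens by 10×, so the tokens-per-parameter ratio falls by 100×. That ratio is the quantity the two papers below disagree about.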

Kaplan, 2020: "make it bigger"

The first answer, and the one that defined the early GPT era.

In 2020, Kaplan and colleagues at OpenAI published the original scaling-laws study and concluded: spend most of any extra budget on model size. Grow the parameters aggressively; grow the data more slowly. This guidance lit the fuse on the race to ever-bigger models — most famously GPT-3, at 175 billion parameters, trained on roughly 300 billion tokens. Do the division and that is under 2 tokens per parameter. As the next paper would show, that was nowhere near enough data for a model that large. GPT-3, the model that stunned the world, was actually badly under-trained.

Chinchilla, 2022: balance beats bigness

Then came the correction that rewrote the playbook.

In 2022, DeepMind’s Chinchilla paper (Hoffmann et al.) redid the analysis far more carefully and reached the opposite conclusion: for a fixed compute budget, you should scale parameters and data roughly equally, which works out to about 20 tokens per parameter, not under 2. Models built on Kaplan’s guidance had been starved of data. To prove it, they trained Chinchilla: only 70 billion parameters, smaller than GPT-3, but on a whopping 1.4 trillion tokens (about 20 per parameter), using the same compute budget as DeepMind’s own 280-billion-parameter Gopher. The result was decisive:

same compute, opposite strategy
GPT-3         :  175B params  ×  300B tokens   →  ~1.7 tokens/param  (under-trained)
Gopher        :  280B params  ×  300B tokens   →  ~1.1 tokens/param  (under-trained)
Chinchilla    :   70B params  ×  1.4T tokens   →   ~20 tokens/param  (compute-optimal)

# smaller, better-fed Chinchilla matched Gopher's compute budget and BEAT both bigger, under-fed models.

The lesson that reset the field: for years the bottleneck had not been model size at all; it was data. Balance beats raw bigness. "Chinchilla-optimal" (~20 tokens per parameter) became the default rule of thumb.
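The arithmetic behind these comparisons can be checked directly. The sketch below combines C ≈ 6 × N × D with the 20-tokens-per-parameter rule: substituting D = 20N gives C = 120N², so N = √(C / 120). The Gopher figures (280B parameters, 300B tokens) come from Hoffmann et al.; the helper names are hypothetical, and real compute-optimal fits are more nuanced than a single fixed ratio.

```python
import math

TOKENS_PER_PARAM = 20.0  # the rule of thumb quoted in the text


def training_flops(n_params, n_tokens):
    """Approximate training compute in FLOPs, using C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def compute_optimal_split(budget_flops):
    """Solve C = 6 * N * D with D = 20 * N, i.e. N = sqrt(C / 120)."""
    n_params = math.sqrt(budget_flops / (6.0 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


# (parameters, training tokens) as reported for each model
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(f"{name:<10} tokens/param={d / n:5.1f}  "
              f"FLOPs~{training_flops(n, d):.2e}")
    n_opt, d_opt = compute_optimal_split(training_flops(70e9, 1.4e12))
    print(f"20:1 split of Chinchilla's budget: N~{n_opt:.2e}, D~{d_opt:.2e}")
```

Under this approximation Chinchilla and Gopher land within about 15% of each other in training compute, while GPT-3's budget is roughly half of Chinchilla's. Feeding Chinchilla's budget back through the 20:1 rule returns 70B parameters and 1.4T tokens, as expected.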

The twist: training cost vs running cost

There is one more wrinkle, and it is why today’s popular models are often even smaller than Chinchilla would suggest.

Chinchilla answered “best model for a fixed training budget.” But training is a one-time cost. There is another cost that never stops: inference, which you pay every time you actually run the model to answer a query. A bigger model is more expensive on every single call, forever. If you are going to serve a model to millions of users billions of times, it can pay to make it even smaller than compute-optimal, and train it even longer on even more tokens to compensate. That is exactly why model families like Meta’s Llama (Touvron et al., 2023) pair relatively small models with a trillion or more training tokens: they are cheap and fast to run. So “optimal” is not one number: the one-time training cost and the forever inference cost pull toward different model sizes, and most production models lean smaller-and-longer.
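A rough way to see the pull toward smaller models is to add serving compute to training compute. The sketch below assumes inference costs about 2 × N FLOPs per token, a common approximation that is not stated in the text above, and uses a hypothetical lifetime volume of 10 trillion served tokens. It compares compute only and says nothing about whether the two models reach the same loss.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """One-time training compute, using C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def inference_flops(n_params: float, tokens_served: float) -> float:
    """Serving compute, ASSUMING roughly 2 * N FLOPs per token processed."""
    return 2.0 * n_params * tokens_served


def lifetime_flops(n_params: float, n_tokens: float, tokens_served: float) -> float:
    """Training plus serving compute over the model's lifetime."""
    return training_flops(n_params, n_tokens) + inference_flops(n_params, tokens_served)


if __name__ == "__main__":
    served = 1e13  # hypothetical lifetime serving volume, in tokens
    # Two configurations with identical training compute:
    configs = {
        "70B on 1.4T tokens": (70e9, 1.4e12),
        "35B on 2.8T tokens": (35e9, 2.8e12),
    }
    for name, (n, d) in configs.items():
        print(f"{name}: train={training_flops(n, d):.2e}  "
              f"serve={inference_flops(n, served):.2e}  "
              f"total={lifetime_flops(n, d, served):.2e}")
```

Both configurations cost the same to train, but the half-size model halves the serving bill, and at this volume serving already outweighs training. Whether the smaller model is good enough is the separate question the extra training tokens are meant to answer.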

The real superpower: predictability

Step back from the specific numbers, because the deepest consequence of scaling laws is not any single recipe — it is predictability.

Because the loss really does follow that clean power law, you can train a handful of small, cheap models at different scales, plot their losses, fit the straight line through them, and then extrapolate that line to forecast the loss of a model a hundred times bigger — before you spend a cent training it. This is what de-risks a hundred-million-dollar training run: you commit the budget already knowing, to a good approximation, what you will get out. Scaling laws are what turned frontier-model training from a wild gamble into something closer to predictable engineering — arguably the single most important reason the field could justify spending so much, so confidently.
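The fit-and-extrapolate recipe can be sketched with ordinary least squares in log-log space. The "small runs" below are synthetic points generated from a made-up pure power law, not real measurements, and the sketch ignores the irreducible floor for simplicity; a real fit would need to account for it.

```python
import math


def fit_power_law(scales, losses):
    """Least-squares fit of log(loss) = intercept + slope * log(scale)."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(scale, slope, intercept):
    """Extrapolate the fitted straight line back into loss units."""
    return math.exp(intercept + slope * math.log(scale))


# Synthetic "small runs": a made-up power law, for illustration only.
TRUE_A, TRUE_ALPHA = 20.0, 0.05
small_scales = [1e18, 1e19, 1e20, 1e21]  # hypothetical compute budgets (FLOPs)
small_losses = [TRUE_A * s ** (-TRUE_ALPHA) for s in small_scales]

if __name__ == "__main__":
    slope, intercept = fit_power_law(small_scales, small_losses)
    target = 1e23  # 100x beyond the largest small run
    print(f"fitted exponent: {slope:.4f}")
    print(f"forecast loss at {target:.0e} FLOPs: "
          f"{predict(target, slope, intercept):.4f}")
```

Because the synthetic data is an exact power law, the fit recovers the exponent and the forecast matches the generating curve. Real runs are noisier, so the forecast comes with error bars, but the procedure is the same.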

Scaling laws — and the data wall ahead

Scaling laws, in one breath:

Loss is governed by three levers (parameters, data, compute) and falls as a power law with steep diminishing returns. Compute, the real budget, is about 6 × N × D, which forces a trade-off between bigger and longer. Chinchilla showed the compute-optimal split is roughly 20 tokens per parameter; if you care about serving cost, go smaller and train even longer; and the whole thing is predictable enough to plan hundred-million-dollar runs in advance.

But notice the quiet assumption running underneath every one of these laws: that you actually have the tokens. Chinchilla wants 20 tokens for every parameter, so the largest models would need tens of trillions of high-quality tokens. Here is the uncomfortable truth the field is now bumping into: good text is finite. The open internet only contains so much quality writing, and the biggest models are starting to run low. When data becomes the scarce resource, the choice of which tokens you train on (how the data is sourced, cleaned, filtered and deduplicated) stops being plumbing and becomes one of the most important decisions of all. That data pipeline is Part 11.
