The Data Pipeline

Part 11 of the AI/LLM mastery series — the unglamorous machinery that decides model quality. How a raw web scrape (mostly junk) becomes clean training fuel: extraction, quality filtering, deduplication (and why it stops memorisation), mixing sources on purpose, why curated data can beat more data ("Textbooks Are All You Need"), and the looming data wall plus synthetic data and model collapse.

AI/LLM Mastery · Part 11 of 20 — a model is what it eats, and the open web is mostly junk. This is the unglamorous, decisive machinery that turns a raw web scrape into clean training fuel: extraction, filtering, deduplication, mixing — and why good data is fast becoming the scarcest resource of all.

The most underrated part of an LLM

Part 10 ended at a wall. Scaling laws cheerfully demand more tokens, but they all quietly assume those tokens exist — and high-quality human text turns out to be finite. The moment data becomes the scarce resource, a question that used to be an afterthought becomes one of the most important in the whole field: not just how much text you train on, but which text, and how you clean it. That is the data pipeline, and it is where a surprising amount of a model’s quality is actually decided.

This part is the least glamorous and most underrated stage of building an LLM. There is no clever equation here — just the careful, large-scale janitorial work of turning a raw internet scrape into something worth learning from. We will walk the whole pipeline, stage by stage, and finish on the looming “data wall” and the controversial idea meant to get around it.

A model is what it eats

Start with the principle that makes data matter at all.

From Part 9, a model’s entire knowledge comes from its training text — there is no other source, no separate “facts database.” The data is the curriculum. Feed a model clean, well-written articles, books and correct code, and it learns good language, real facts and sound reasoning. Feed it spam, gibberish and errors, and it learns — just as faithfully — to produce spam, gibberish and errors. It cannot tell good from bad; it only imitates what it is shown. And here is the crux: at the trillion-token scale from Part 10, nobody can read the data by hand. So the quality of an automated pipeline silently sets the ceiling on how good the model can ever be. Garbage in, garbage out — at planetary scale.

The raw material: a mostly-junk web

Now meet the raw material, in all its ugliness.

The biggest single source is a public web scrape like Common Crawl — petabytes of pages, far more than books or Wikipedia could ever provide. The problem is that the raw web is mostly not what you want. It is dominated by spam, SEO sludge, navigation menus and ad boilerplate, adult content, machine-generated gibberish and broken markup. The clean, well-written prose you would actually want a model to learn from is a small fraction of the whole. You cannot simply dump Common Crawl into pretraining and hope — doing so produces a model that talks like a spam page. It has to be cleaned, hard, first. That cleaning is the pipeline.

The pipeline, stage by stage

So here is the assembly line that turns that mess into fuel. Each stage throws away more, and the pile shrinks dramatically from start to finish.

the data pipeline, end to end
raw web  (petabytes)
   1. EXTRACT     pull the main text out of HTML  (drop menus, ads, footers)
   2. FILTER      keep quality text, drop junk    (heuristics + a classifier)
   3. DEDUPLICATE remove exact + near-duplicate documents
   4. SAFETY      strip toxic content and personal info (PII)
   5. MIX         blend sources in chosen proportions  (web/books/code/wiki)
   6. TOKENIZE    train the tokenizer (Part 2), turn it all into token IDs
-> clean, deduplicated, tokenized corpus  (the fuel Part 9 consumes)
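As a rough illustration of stage 1, here is a minimal sketch that pulls visible text out of HTML with Python's standard-library `HTMLParser`. The `SKIP_TAGS` list and the names `MainTextExtractor` and `extract_main_text` are assumptions made for this example; production pipelines typically rely on more sophisticated, dedicated extraction tools.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch.
# This list is an assumption for illustration, not a standard.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a boilerplate tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the non-boilerplate text of a page, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<html><head><style>p {color: red}</style></head><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<p>Deduplication keeps one copy of each document.</p>
<footer>Copyright notice</footer>
<script>track()</script>
</body></html>
"""
print(extract_main_text(page))
```

Only the paragraph survives here; the menu links, footer, style rules and script body are dropped because they sit inside skipped tags.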

It is worth saying plainly: a huge share of the real effort at top labs goes into this, not the model architecture. The architecture has been broadly settled since 2017 (Parts 5–8); the data pipeline is where much of the competitive differentiation now lives. Let us zoom in on the three stages that matter most.

Stage one: quality filtering

First, quality filtering — deciding what even counts as “good” text.

Two complementary tools do the work. Heuristics are cheap rules that catch obvious junk: documents that are too short, mostly symbols or numbers, in the wrong language, full of repeated lines, or with no real sentences. They are fast and remove the worst offenders. Then a quality classifier — a small model trained to answer “does this read like trusted reference text, such as Wikipedia or a book?” — scores every document, and only the high-scoring ones are kept. It catches the subtler junk that simple rules miss. The catch is the threshold: set it too strict and you throw away good, diverse text (and risk a bland model); too loose and junk leaks back in. Tuning that line is genuine engineering, and a quiet battleground between labs.
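The two-step filter described above might be sketched as follows. Every threshold here (`min_words`, `max_symbol_ratio`, `max_dup_line_ratio`, `threshold`) is an illustrative assumption rather than a value any lab is known to use, the language check is omitted for brevity, and `score_fn` is a hypothetical stand-in for a trained quality classifier.

```python
import re


def heuristic_flags(text, min_words=50, max_symbol_ratio=0.3,
                    max_dup_line_ratio=0.3):
    """Return the reasons a document looks like junk (empty list = passes)."""
    flags = []

    # Too short to be a useful document.
    if len(text.split()) < min_words:
        flags.append("too_short")

    # Mostly symbols or digits rather than letters.
    non_space = [c for c in text if not c.isspace()]
    if non_space:
        non_alpha = sum(1 for c in non_space if not c.isalpha())
        if non_alpha / len(non_space) > max_symbol_ratio:
            flags.append("mostly_symbols_or_digits")

    # Full of repeated lines (menus, boilerplate).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        dup_ratio = 1 - len(set(lines)) / len(lines)
        if dup_ratio > max_dup_line_ratio:
            flags.append("repeated_lines")

    # No run of text that starts with a letter and ends in . ! or ?
    if not re.search(r"[A-Za-z][^.!?]*[.!?]", text):
        flags.append("no_sentences")

    return flags


def keep_document(text, score_fn=None, threshold=0.5, **heuristic_kwargs):
    """Cheap rules first; then the (hypothetical) classifier score, if given."""
    if heuristic_flags(text, **heuristic_kwargs):
        return False
    if score_fn is None:
        return True
    return score_fn(text) >= threshold


good = ("Deduplication keeps one copy of each document. "
        "It also reduces memorisation.")
junk_menu = "Home\nHome\nHome\nAbout\nHome"
junk_symbols = "$$$ 1234 #### 5678 !!!"

for doc in (good, junk_menu, junk_symbols):
    print(heuristic_flags(doc, min_words=5))
```

Running the cheap rules before the classifier reflects the ordering in the text: the heuristics discard the worst offenders so the more expensive scorer only sees plausible documents. Moving `threshold` up or down is the strict-versus-loose trade-off discussed above.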

Stage two: deduplication

Second, deduplication — and do not skip it, because it matters far more than it sounds.

The web is enormously repetitive: a single article copied across hundreds of sites, the same licence text and boilerplate on millions of pages. Deduplication finds exact and near-duplicate documents (near-duplicates are spotted efficiently with techniques like MinHash) and keeps just one copy of each. Why bother? Two reasons, both serious. Duplicates waste compute: you are paying to learn the same passage a hundred times. Worse, they cause memorisation: when text appears over and over, the model stops generalising and starts reproducing it word-for-word, which both wastes capacity and can leak private or copyrighted text verbatim. Studies such as “Deduplicating Training Data Makes Language Models Better” (Lee et al., 2021) have shown that aggressive deduplication measurably improves models and reduces this regurgitation. It is one of the highest-value, lowest-glamour steps in the entire process.
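To make the MinHash idea concrete, here is a minimal sketch using only the standard library. Each document becomes a set of word shingles; for each of `num_hashes` seeded hash functions we keep the smallest hash value, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the shingle sets. The shingle size, the number of hashes, the 0.8 threshold and the all-pairs comparison are assumptions chosen for readability; at web scale the pairwise loop is usually replaced by locality-sensitive hashing so that only likely matches are compared.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word windows (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash; the seed selects one 'hash function'."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, keep the minimum over all shingles."""
    sh = shingles(text, k)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8, num_hashes=64):
    """Keep the first copy of each cluster of (near-)duplicates.

    Compares against every kept document, which is O(n^2): fine for a
    demo, not for a web-scale corpus.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc, num_hashes)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "training data quality matters more than raw token count in practice"
print(deduplicate([doc_a, doc_a, doc_b]))
```

An exact copy produces an identical signature and is always dropped; documents with no shingles in common are kept. How a partially edited copy fares depends on the threshold, which plays the same strict-versus-loose role here as it does in quality filtering.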

Stage three: the mix

Third, the mix — because you never train on one source alone.

A real pretraining set blends several cleaned sources (web pages, books, Wikipedia, code, academic papers) and, crucially, weights them on purpose. You might upweight trusted books and Wikipedia, deliberately include a large slice of code, and choose the balance of languages. The mix is a powerful, underappreciated lever: adding more code is widely reported to improve a model’s reasoning (even on non-code tasks), and more multilingual text broadens its language coverage. The exact recipes are closely guarded secrets, because two models trained on the same total number of tokens but a different mix come out noticeably different. What you feed, and in what ratio, becomes part of the model’s character.
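A small sketch of what "weighting on purpose" can look like in practice. The weights and token counts below are invented for illustration (real recipes are not public), and the function names are hypothetical. The first function draws training documents source by source according to the weights; the second shows a consequence that is easy to miss: upweighting a small source means repeating it.

```python
import random

# Hypothetical mixing weights, for illustration only.
MIX_WEIGHTS = {"web": 0.6, "code": 0.2, "books": 0.1, "wiki": 0.1}


def sample_mixture(sources, weights, n, seed=0):
    """Draw n documents: pick a source by weight, then a document from it."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append((name, rng.choice(sources[name])))
    return batch


def epochs_per_source(weights, source_tokens, total_tokens):
    """How many times each source is seen, given its share of the budget."""
    return {
        name: weights[name] * total_tokens / source_tokens[name]
        for name in weights
    }


sources = {
    "web": ["web doc 1", "web doc 2", "web doc 3"],
    "code": ["def f(): pass"],
    "books": ["chapter one"],
    "wiki": ["encyclopedia entry"],
}
batch = sample_mixture(sources, MIX_WEIGHTS, n=10)

# Made-up corpus sizes (in tokens) and a made-up training budget.
sizes = {"web": 2000, "code": 400, "books": 100, "wiki": 50}
print(epochs_per_source(MIX_WEIGHTS, sizes, total_tokens=1000))
```

With these invented numbers the small "wiki" source is seen twice while less than a third of the "web" source is used at all, which is the kind of trade-off a mixing recipe has to make deliberately.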

Quality can beat quantity

All of this leads to a finding that flips the simple “more is better” reading of Part 10.

Scaling laws push for more tokens, but a run of results has shown that better tokens can beat simply more tokens. The landmark example is Microsoft’s phi models, introduced in the paper “Textbooks Are All You Need” (2023): small models trained on a modest amount of carefully curated, textbook-quality data matched or beat far bigger models trained on raw web data on the benchmarks reported. The lesson is that data curation is a genuine competitive moat. It also explains a neat connection back to Part 2: the tokenizer’s vocabulary (its BPE merges) is learned from this curated data, so your data choices even shape how well the model tokenizes code, maths and other languages. Data quality echoes through everything.

The shift in mindset: the field has moved from "scrape as much as possible" to "curate the best you can." With architecture settled, data is increasingly where models are won or lost.

The data wall — and from knowledge to behaviour

Which brings us back to the wall that Part 10 warned about — now with the pipeline to understand it.

High-quality human-written text is finite, and the largest models, hungry for tens of trillions of quality tokens, are starting to approach the limit of what the open web can supply. There are several responses. Squeeze harder: better filtering and deduplication extract more value from the text that exists (so the pipeline only grows in importance). Make more: synthetic data, using existing models to generate fresh training text (partly how the phi models were built). But synthetic data carries a real danger, model collapse, where training repeatedly on AI-generated output slowly degrades quality as the model learns from echoes of itself. And there are new frontiers: more code, multimodal data (images, audio, video), and licensed private datasets. How this plays out is one of the open questions of the field.
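The "echoes of itself" intuition can be illustrated with a deliberately tiny toy, which is an assumption-laden stand-in and not a simulation of LLM training. The "model" here is just a table of token frequencies. Each generation samples a finite corpus from the previous model and refits the frequencies by counting. Any rare token that happens not to be sampled gets probability zero and can never return, so the set of tokens the model can produce only shrinks. The vocabulary size, sample size and Zipf-like starting distribution are all arbitrary choices.

```python
import random
from collections import Counter


def next_generation(dist, n_samples, rng):
    """Sample a corpus from `dist`, then refit frequencies by counting."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    corpus = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(corpus)
    return {tok: c / n_samples for tok, c in counts.items()}


def simulate(vocab_size=1000, n_samples=2000, generations=10, seed=0):
    """Return the number of distinct tokens the model can emit, per generation."""
    rng = random.Random(seed)
    # Zipf-like "human" distribution: a few common tokens, a long rare tail.
    raw = [1 / rank for rank in range(1, vocab_size + 1)]
    total = sum(raw)
    dist = {f"tok{rank}": w / total for rank, w in enumerate(raw, start=1)}

    support = [len(dist)]
    for _ in range(generations):
        dist = next_generation(dist, n_samples, rng)
        support.append(len(dist))
    return support


support_sizes = simulate()
print(support_sizes)
```

The printed list never increases from one generation to the next: the long tail is what disappears first, which loosely mirrors the concern that models trained on their own output lose the rare, diverse material that made the original data valuable.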

With Parts 9–11 we have done the whole “Training & Scale” story: a base model, pretrained on a vast, carefully-built corpus, that knows an enormous amount. And yet — as Part 9 kept warning — it is still not a helpful assistant. It will happily continue your question with more questions. Everything so far has built a brilliant text-continuer with knowledge but no manners. Part 12 begins the next era: fine-tuning and instruction tuning, where we take that base model and teach it to actually follow instructions and behave like the assistant you talk to.
