AI/LLM Mastery · Part 5 of 20 — the plain neural network from Part 4 cannot really handle language, because language is a sequence. This is exactly how it falls short, why the old fix (RNNs) hit a wall, and the one idea that broke everything open: attention.
A network that cannot read
In Part 4 we built a neural network from zero and watched it learn. It was powerful, but if you look closely, it has a shape problem for language. That network takes a fixed number of inputs, all at once, and has no built-in notion that one input comes before or after another. That is fine for “here are 10 measurements, predict the price.” It is quietly disastrous for language, because language is not a fixed-size batch of numbers. It is a sequence: words in a particular order, of varying length, where a word can lean on another word far away.
This part is the hinge of the whole series. We are about to start building the Transformer — the architecture every modern LLM uses — and you cannot appreciate why it is shaped the way it is until you feel the problem it was invented to solve. So first we will make that problem concrete, then meet the clever-but-flawed older fix, and finally arrive at the idea that changed everything: attention. As always, no prior knowledge assumed.
Order carries meaning
Start with the most basic property of language: order carries meaning.
“Dog bites man” and “man bites dog” use the identical three words, yet mean opposite things — one is a normal Tuesday, the other makes the news. The only difference is the order. Now here is the trap: if a model just gathers up the words and ignores their arrangement (a so-called “bag of words”), then both sentences look like the same unordered set {dog, bites, man}, and it literally cannot tell them apart. A plain Part-4 network, fed word-vectors as an unordered lump, has exactly this blind spot. Any real model of language must somehow respect word order.
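To make the blind spot concrete, here is a minimal sketch of a bag-of-words view. The helper name bag_of_words is ours, purely for illustration; it simply counts words and throws the order away.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each word, discarding the order in which the words appeared."""
    return Counter(sentence.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences collapse to the same counts, so they cannot be told apart.
print(a == b)  # True
```

Anything that only ever sees these counts has lost the one thing that separates the two sentences.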
Words reach across distances
Order is only the first difficulty. The second is distance — words depend on other words that can sit far away in the sentence.
Take: “The trophy did not fit in the suitcase because it was too big.” What does “it” refer to — the trophy or the suitcase? You resolve it instantly: the trophy was too big. But notice what you did — you connected “it” back to “trophy,” a word eight positions earlier. Change one word (“too small”) and the answer flips to “suitcase.” Meaning hinges on a link across a gap. Language is full of these long-range dependencies, and on top of it all, sentences come in wildly different lengths. So our wish-list for a language model is now: respect order, connect distant words, and handle any length. That is a tall order for the fixed-shape network of Part 4.
The old fix: read word by word (RNNs)
For years, the standard answer was an elegant tweak to the neural network called the Recurrent Neural Network, or RNN. Its idea is simply how you read: one word at a time, left to right, keeping a running sense of the story so far.
Concretely, the RNN feeds in the first word and produces a little summary called the hidden state — think of it as its memory so far. Then it takes the next word together with that memory and produces an updated memory, and so on down the sentence. Each word folds into one running memory that is carried forward. This neatly solves two items on our wish-list: it respects order (words come in sequence) and it handles any length (just keep going). For short sentences, RNNs worked genuinely well. The trouble shows up when sentences get long — and it shows up in two different, fatal ways.
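The word-by-word update can be sketched in a few lines. This is a toy under stated assumptions: the function names, the tiny one-hot word vectors, and the weight values are all made up for illustration (a real RNN learns its weights and usually adds a bias term). The shape of the idea is the point: new memory = tanh(weights × old memory + weights × current word).

```python
import math

def rnn_step(hidden, word_vec, w_h, w_x):
    """One update: mix the previous memory with the current word, then squash with tanh."""
    new_hidden = []
    for i in range(len(hidden)):
        total = sum(w_h[i][j] * hidden[j] for j in range(len(hidden)))
        total += sum(w_x[i][k] * word_vec[k] for k in range(len(word_vec)))
        new_hidden.append(math.tanh(total))
    return new_hidden

def run_rnn(word_vecs, w_h, w_x, hidden_size):
    """Read the sentence left to right, carrying a single memory forward."""
    hidden = [0.0] * hidden_size  # empty memory before the first word
    for vec in word_vecs:
        hidden = rnn_step(hidden, vec, w_h, w_x)
    return hidden

# Hypothetical toy weights and one-hot word vectors (not learned).
w_h = [[0.5, -0.3], [0.2, 0.4]]
w_x = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
dog, bites, man = [1, 0, 0], [0, 1, 0], [0, 0, 1]

print(run_rnn([dog, bites, man], w_h, w_x, hidden_size=2))
print(run_rnn([man, bites, dog], w_h, w_x, hidden_size=2))
```

The two printed memories differ, because the same words arrived in a different order. The loop also runs for any number of words, which is the "any length" property in action.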
Problem one: the memory bottleneck
The first problem is forgetting.
Everything the RNN has read is crammed into that single, fixed-size memory. It is like trying to summarise a whole paragraph into one short sticky note that you have to keep rewriting as new words arrive. There is only so much room, so as the sentence runs on, the earliest words get overwritten and fade. By the time the RNN reaches “it” near the end of our trophy sentence, the far-back “trophy” is already a blur — precisely the long-range link we needed. Smarter RNN variants (the LSTM, introduced by Hochreiter and Schmidhuber in 1997, and the later GRU) added gates to hold onto important things longer, which helped a lot — but the fundamental squeeze never fully went away. The further apart two related words sit, the more likely the connection is lost.
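One way to watch the fading happen is with a deliberately tiny RNN whose memory is a single number. Everything here is an assumption chosen for illustration: the weights 0.5 and 1.0, the filler input 0.3, and the function names. We run the same sentence twice, changing only the first word, and measure how much that change still shows up in the final memory.

```python
import math

def final_memory(first_input, n_later_words, w_h=0.5, w_x=1.0, filler=0.3):
    """Memory after reading one first word followed by n identical filler words."""
    h = math.tanh(w_x * first_input)
    for _ in range(n_later_words):
        h = math.tanh(w_h * h + w_x * filler)
    return h

def first_word_influence(n_later_words):
    """How far apart the final memories are when only the first word differs."""
    return abs(final_memory(1.0, n_later_words) - final_memory(-1.0, n_later_words))

for n in (0, 2, 5, 10, 20):
    print(n, first_word_influence(n))
```

In this toy the gap shrinks by at least half with every extra word, so after twenty words the first word has left almost no trace. Real RNNs with learned weights behave in more varied ways, but this is the flavour of the squeeze.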
Problem two: strictly sequential, so slow
The second problem is speed — and it turned out to be the dealbreaker.
Because each step needs the previous step’s memory, an RNN is forced to go strictly in order: finish word 1 before starting word 2, and so on. The words line up in a single-file queue. That means you cannot spread the work across all your hardware at once — most of it sits idle, waiting its turn. Now recall the lesson of Part 4: models get good by training on staggering amounts of data. If you must process text one word at a time, in sequence, then training on trillions of words is hopelessly slow. The sequential design that made RNNs sensible was also the thing throttling them. The obvious dream — process every word at the same time — is impossible as long as each word depends on the memory handed down from the word before it.
The breakthrough: attention
So here is the leap. What if we throw out the single running memory altogether — and instead let every word look directly at every other word, and decide for itself which ones matter?
This is attention. Take the word “bank” in “the river bank was muddy.” On its own it is ambiguous (money? river?). With attention, “bank” looks across the whole sentence and scores how relevant each other word is to it. It finds “river” highly relevant and leans on it heavily, while barely drawing on the rest, and so “bank” updates itself toward the river meaning. That is exactly the contextual embedding we promised back in Part 3; now you can see the machinery behind it. And crucially, every word does this looking-around at the same time: a full web of weighted links, computed in parallel, where any word can reach any other in a single hop no matter how far apart.
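Here is a stripped-down sketch of that looking-around for the word “bank.” It is a simplification, not the full mechanism: it scores words by a plain dot product of their vectors and skips the query, key, and value vectors that Part 6 introduces. The two-number word vectors are invented for illustration (first number: how watery, second number: how money-like), and the sketch lets a word score itself as well as the others.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that add up to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(index, vectors):
    """Return (weights, updated vector) for the word at position `index`."""
    me = vectors[index]
    # Relevance score: dot product between this word and every word.
    scores = [sum(a * b for a, b in zip(me, other)) for other in vectors]
    weights = softmax(scores)
    # Updated vector: weighted blend of all the word vectors.
    updated = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(me))]
    return weights, updated

words = ["the", "river", "bank", "was", "muddy"]
# Hypothetical toy vectors: [watery, money-like].
vectors = [[0.1, 0.1], [2.0, 0.0], [1.0, 0.8], [0.1, 0.1], [1.5, 0.0]]

weights, new_bank = attend(words.index("bank"), vectors)
for word, weight in zip(words, weights):
    print(word, round(weight, 3))
print("bank before:", vectors[2], "after:", [round(x, 3) for x in new_bank])
```

With these toy numbers, “river” receives the largest weight, and the updated “bank” vector ends up far more watery than money-like. Nothing in attend depends on how far apart the words sit, and each word's update could be computed independently of the others.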
Why attention won — and what is next
Line attention up against the RNN’s two fatal flaws and you can see why it took over completely.
The forgetting problem vanishes: with direct word-to-word links, distance no longer matters, so “it” reaches “trophy” in one hop. The speed problem vanishes too: because no word waits on another’s memory, every position is computed in parallel — which is the very thing that finally made training on Part 4’s trillions of words practical on modern hardware. There is one loose end — attention, looking at all words at once, ignores order by default — but that is fixed cleanly by adding position information back in a separate step (we will see exactly how). In 2017, Vaswani and colleagues put this together in a paper boldly titled “Attention Is All You Need,” showing you could drop recurrence entirely and build a model from attention alone. That model is the Transformer.
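The loose end about order is easy to see in a toy. The sketch below uses the same simplified dot-product attention idea (no query, key, or value vectors, and no position information), with made-up word vectors and hypothetical function names. It computes the updated vector for “bites” in both of our sentences from earlier.

```python
import math

def attention_output(index, vectors):
    """Simplified attention: softmax over dot-product scores, then a weighted blend."""
    me = vectors[index]
    scores = [sum(a * b for a, b in zip(me, other)) for other in vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(len(me))]

# Hypothetical toy word vectors.
toy = {"dog": [1.0, 0.2], "bites": [0.3, 1.0], "man": [0.1, 0.6]}

def bites_in(sentence):
    """Updated vector for the word 'bites' within the given sentence."""
    words = sentence.split()
    vectors = [toy[w] for w in words]
    return attention_output(words.index("bites"), vectors)

a = bites_in("dog bites man")
b = bites_in("man bites dog")

# Compare with a tolerance, since summing in a different order can shift the last digit.
print(all(math.isclose(x, y) for x, y in zip(a, b)))  # True
```

Without position information, “bites” comes out the same whether the dog or the man did the biting, which is the bag-of-words trap all over again. That is why position has to be added back in.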
We now have the why. What we do not yet have is the how: when a word “scores how relevant each other word is,” what is the actual calculation? That scoring is done with three vectors per word — the query, the key, and the value — and in Part 6 we open up self-attention and work that mechanism through, step by step, with numbers. The magic word is about to become plain machinery.