The Transformer Block

Part 7 of the AI/LLM mastery series. One self-attention is not a Transformer — this assembles the full block: positional encoding (so order survives), multi-head attention (many relationships at once), residual connections and layer normalization (the glue that makes deep stacks trainable), and the feed-forward network where each word is processed. Then we stack the blocks into the core of every modern LLM.

AI/LLM Mastery · Part 7 of 20

From one attention step to a real block

In Part 6 we computed one self-attention pass by hand and watched “bank” turn into river-bank. That was the engine. But an engine is not a car. A real Transformer wraps that attention step in several more pieces — each one fixing a specific weakness — and the result is a single, repeatable unit called the Transformer block. In this part we add those pieces one at a time and snap them together, so by the end you can read a full block diagram and know exactly what every box does and why it is there.

There is a loose end to tidy first. Back in Part 5 we insisted that word order carries meaning. Then in Part 6 we built attention — and, quietly, attention threw that order away. Let us start there, because the fix is the first component of the block.

The loose end: attention is order-blind

Here is the uncomfortable truth about the attention you just learned: it does not know what order the words came in.

Remember the final step from Part 6: a word’s output is a weighted sum of the words’ values, and the weights come from comparing queries with keys. Neither step looks at position. The weights depend only on what the words are, and a sum does not care about order: a + b + c equals c + a + b. So if you shuffled the input words, each word would still compute the exact same result. Pure attention treats the sentence as an unordered bag, the very “bag of words” failure we mocked in Part 5. Attention bought us long-range links and parallelism, but it cost us word order. We have to put order back.
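The order-blindness can be checked directly. The sketch below is a minimal single-head attention in NumPy with random stand-in embeddings and weights (the sizes, the seed, and the helper names softmax and self_attention are all assumptions for illustration, not anything from Part 6). It shuffles the words and confirms that each word still receives the same output vector.

```python
import numpy as np

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head self-attention with no positional information.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
n_words, d = 4, 8                      # toy sizes, chosen for illustration
x = rng.normal(size=(n_words, d))      # stand-in word embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

order = np.array([2, 0, 3, 1])         # an arbitrary shuffle of the words
out = self_attention(x, Wq, Wk, Wv)
out_shuffled = self_attention(x[order], Wq, Wk, Wv)

# Each word gets the same output vector; only the row order changed.
same = np.allclose(out[order], out_shuffled)
print(same)  # True
```

Shuffling the input only shuffles the rows of the output. Nothing inside a row changes, which is exactly the problem.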

Fix: positional encoding

The fix is refreshingly direct: tag each word with where it sits.

A word’s embedding (Part 3) says what it means but nothing about where it is. So for every position (slot 0, 1, 2, and so on) we build a unique vector called a positional encoding, and simply add it to the word’s embedding. The original Transformer used fixed sine and cosine patterns for these vectors; later models such as BERT and GPT-2 learn them as ordinary weights instead. Either way the effect is the same: after the addition, each vector carries both meaning and position, so “river” at slot 1 is no longer identical to “river” at slot 5. Attention can finally use order. Notice it is an addition, not a new mechanism: cheap, and it keeps the parallelism intact.
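As a rough sketch of the sine and cosine variant, the code below builds one encoding row per position and adds it to a stand-in embedding. The function name sinusoidal_positions, the toy width of 8, and the all-ones "river" vector are assumptions made for illustration; the formula follows the pattern used in the original Transformer and assumes an even width.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    # One row per position; even columns use sine, odd columns use cosine.
    pos = np.arange(n_positions)[:, None]            # shape (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)    # slower waves in later dims
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 8                                          # assumed even, toy size
pe = sinusoidal_positions(6, d_model)

river = np.ones(d_model)                             # stand-in embedding for "river"
river_at_1 = river + pe[1]
river_at_5 = river + pe[5]
print(np.allclose(river_at_1, river_at_5))           # False
```

The same word now produces a different vector at each slot, so attention has something position-dependent to work with.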

Upgrade: multi-head attention

Next upgrade. A single attention learns just one way of relating words — but language has many relationships running at once.

So instead of one attention, a Transformer runs several in parallel, called heads. Each head has its own Wq, Wk, Wv matrices (its own dials), so each learns to look for something different. When researchers inspect trained models, heads really do specialise: one tends to link a pronoun to the noun it refers to, another an adjective to what it describes, another a verb to its subject. After all heads run, their outputs are concatenated (stuck together) and passed through one more learned matrix that mixes them into a single, richer vector per word. That bundle — many heads, then a projection — is multi-head attention, and it is what the block actually uses in place of the single attention from Part 6.

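Why many heads: one head is one lens. Multi-head attention looks at the sentence through several lenses at once, then combines what each saw. It is far more expressive at almost no extra cost, because each head works on a smaller slice of the vector (the model width divided by the number of heads), so all the heads together do roughly the work of one full-width attention.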
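Here is one way the heads, the concatenation, and the final mixing matrix might fit together in NumPy. This is a sketch under assumptions: the weights are random rather than trained, the sizes are toy values, and the per-head matrices are stored as column slices of one larger Wq, Wk, and Wv (a common implementation shortcut that is equivalent to giving each head its own smaller matrices). Wo is the name assumed here for the output projection.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    n_words, d_model = x.shape
    d_head = d_model // n_heads                      # each head gets a slice

    def split(t):
        # (n_words, d_model) -> (n_heads, n_words, d_head)
        return t.reshape(n_words, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # one score grid per head
    per_head = softmax(scores) @ v                        # (n_heads, n_words, d_head)
    # Concatenate the heads back into one vector per word.
    concat = per_head.transpose(1, 0, 2).reshape(n_words, d_model)
    return concat @ Wo                                    # final mixing projection

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 5, 16, 4                      # toy sizes
x = rng.normal(size=(n_words, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # (5, 16)
```

With 4 heads and a width of 16, each head attends over 4-dimensional slices, so the total work is about the same as one 16-dimensional head.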

Glue 1: residual connections

Now the glue: two unglamorous pieces that add no new ability of their own, but without which a deep stack simply will not train. The first is the residual connection.

Stack many layers and the signal — and the gradients from Part 4 — tend to fade as they pass through, so deep networks struggle to learn (an echo of the RNN forgetting in Part 5). The residual connection (or skip connection) fixes it by keeping a copy of a sub-layer’s input and adding it back to the output:

a residual connection
output = x + sublayer(x)
# the original x gets a clean "highway" straight through;
# the sub-layer only has to learn a small change to add on top.

That clean path lets gradients flow freely all the way down, and means each layer only learns a small edit rather than rebuilding everything from scratch. This idea (the ResNet, He et al., 2015) is precisely what makes networks dozens or hundreds of layers deep trainable at all.
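A small experiment can make the fading visible. The sketch below pushes one vector through 50 toy layers twice, once without and once with the skip connection, and compares how much signal is left. The layers are untrained random matrices with a tanh bend, which is an assumption chosen only to mimic a deep stack at the start of training, and it looks at the forward signal rather than the gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 50
# Small random weights stand in for untrained sub-layers (an assumption).
weights = [rng.normal(size=(d, d)) * 0.5 / np.sqrt(d) for _ in range(n_layers)]

def sublayer(x, W):
    return np.tanh(x @ W)

x0 = rng.normal(size=d)
plain, residual = x0.copy(), x0.copy()
for W in weights:
    plain = sublayer(plain, W)                   # output = sublayer(x)
    residual = residual + sublayer(residual, W)  # output = x + sublayer(x)

plain_norm = np.linalg.norm(plain)
residual_norm = np.linalg.norm(residual)
print(plain_norm, residual_norm)
```

In the plain stack the vector shrinks at every layer until almost nothing remains, while the residual stack keeps the original signal alive through all 50 layers.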

Glue 2: layer normalization

The second piece of glue keeps the numbers from spiralling.

As vectors pass through layer after layer, their values can swell or shrink unevenly, and wild numbers make training unstable. Layer normalization rescales each vector so its values sit in a tidy, consistent range every time (centred at zero with a spread of about one). It then applies two sets of learned dials, a scale and a shift for each dimension, so the model can still stretch the range when it helps. Dropped in at each sub-layer, it keeps the whole deep stack numerically calm (Ba, Kiros and Hinton, 2016). Together, residuals and layer norm are why a 96-layer Transformer trains smoothly instead of falling apart.
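The rescaling itself is only a few lines. In this sketch the names gamma and beta for the learned scale and shift are assumed (they are conventional, not taken from this series), and both are left at their starting values of one and zero so only the normalization is visible.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each word vector (last axis) to mean 0 and variance 1,
    # then apply the learned per-dimension scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 200.0, 300.0, 400.0]])   # two vectors on very different scales
gamma, beta = np.ones(4), np.zeros(4)          # starting values, before any learning
y = layer_norm(x, gamma, beta)
print(np.round(y, 3))
```

Both rows come out at roughly [-1.342, -0.447, 0.447, 1.342], even though one input was a hundred times larger than the other.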

Component: the feed-forward network

One real component left. After attention has mixed context between words, each word gets a moment to think on its own.

Every word’s vector is passed through a small feed-forward network, the same plain neural net from Part 4, applied to each word independently. Its shape is simple: a linear layer expands the vector to a much larger hidden size (commonly about four times the width), an activation (the ReLU/GELU bend from Part 4) adds non-linearity, and a second linear layer shrinks it back. If attention is where words talk to each other, the feed-forward network is where each word is processed on its own. It is also thought to be where a large fraction of a model’s learned facts live, since with a fourfold expansion these layers hold about two thirds of each block’s weights.
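The expand, bend, shrink shape could look like this in NumPy. The fourfold hidden size, the tanh approximation of GELU, and the random untrained weights are all assumptions for the sake of a runnable sketch. The last lines check the "each word independently" claim by running one word on its own.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    hidden = gelu(x @ W1 + b1)     # expand to d_hidden and bend
    return hidden @ W2 + b2        # shrink back to d_model

rng = np.random.default_rng(0)
n_words, d_model = 5, 16           # toy sizes
d_hidden = 4 * d_model             # 4x expansion: a common choice, assumed here
W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)) * 0.1
b2 = np.zeros(d_model)

x = rng.normal(size=(n_words, d_model))
out = feed_forward(x, W1, b1, W2, b2)

# "Applied to each word independently": one word alone gives the same row.
alone = feed_forward(x[2:3], W1, b1, W2, b2)
print(out.shape, np.allclose(out[2], alone[0]))  # (5, 16) True
```

No word ever sees another word inside this sub-layer. All mixing between words happens in attention.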

Assembling the Transformer block

Time to snap the pieces together. A Transformer block is just two sub-layers, each wrapped in the layer-norm-and-residual glue.

one Transformer block, in essence
# x = word vectors (with positional encoding added)
x = x + MultiHeadAttention( LayerNorm(x) )   # sub-layer 1: words mix context
x = x + FeedForward(        LayerNorm(x) )   # sub-layer 2: each word is processed
# out: refined word vectors, SAME shape as went in

Read it top to bottom: normalize, attend, add the residual; then normalize, feed-forward, add the residual. That is the entire block. (This normalize-first ordering is the one GPT-style models use; the original 2017 design normalized after adding the residual instead.) The detail that matters most for what comes next is the last comment: the output has the same shape as the input. Vectors go in, better vectors of the identical size come out.
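For readers who want to run the two lines above, here is a compact NumPy sketch of one block. Everything outside those two lines is an assumption: the parameter dictionary, the hypothetical init_block helper, the ReLU feed-forward, the toy sizes, and the random untrained weights. There is no masking, so every word can see every other word.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def multi_head_attention(x, p, n_heads):
    n_words, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (n_words, d_model) -> (n_heads, n_words, d_head)
        return t.reshape(n_words, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = (weights @ v).transpose(1, 0, 2).reshape(n_words, d_model)
    return heads @ p["Wo"]

def feed_forward(x, p):
    hidden = np.maximum(0.0, x @ p["W1"] + p["b1"])   # ReLU bend
    return hidden @ p["W2"] + p["b2"]

def transformer_block(x, p, n_heads):
    # Sub-layer 1: normalize, attend, add the residual.
    x = x + multi_head_attention(layer_norm(x, p["g1"], p["s1"]), p, n_heads)
    # Sub-layer 2: normalize, feed-forward, add the residual.
    x = x + feed_forward(layer_norm(x, p["g2"], p["s2"]), p)
    return x

def init_block(rng, d_model, d_hidden):
    # Hypothetical helper: small random weights, not a trained model.
    def w(rows, cols):
        return rng.normal(size=(rows, cols)) * 0.1
    return {
        "Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
        "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
        "W1": w(d_model, d_hidden), "b1": np.zeros(d_hidden),
        "W2": w(d_hidden, d_model), "b2": np.zeros(d_model),
        "g1": np.ones(d_model), "s1": np.zeros(d_model),
        "g2": np.ones(d_model), "s2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 5, 16, 4
params = init_block(rng, d_model, 4 * d_model)
x = rng.normal(size=(n_words, d_model))   # stand-in for position-aware word vectors
out = transformer_block(x, params, n_heads)
print(x.shape, out.shape)  # (5, 16) (5, 16)
```

The shapes match on the way in and on the way out, which is the property the rest of this part relies on.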

Stacking blocks — and what is next

That same-shape property is the whole point, because it means blocks are stackable like identical bricks.

Feed the output of one block straight into another, then another — 12 of them for a small model, 96 or more for a large one. Each block refines the representations a little further: the early blocks tend to capture surface patterns like grammar, while deeper blocks build up abstract meaning and the beginnings of reasoning. Position-aware vectors go in the bottom, a tall stack of identical blocks does its work, and refined vectors come out the top. That stack is the core of every modern LLM — the architecture Vaswani and colleagues introduced in 2017, now scaled to the sizes from Part 4.
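Because the blocks are identical, the size of the stack is simple arithmetic. The sketch below is a back-of-envelope count of weight matrices only. It assumes a fourfold feed-forward expansion and widths of 768 and 12288, chosen to resemble publicly described small and large GPT-style configurations, and it ignores biases, layer-norm dials, and the embedding table. The function names are made up for this example.

```python
def block_params(d_model, ffn_mult=4):
    # Weight matrices only; biases and layer-norm dials are tiny by comparison.
    attention = 4 * d_model * d_model                # Wq, Wk, Wv and the output projection
    feed_forward = 2 * d_model * ffn_mult * d_model  # expand + shrink
    return attention, feed_forward

def stack_params(n_blocks, d_model):
    attention, feed_forward = block_params(d_model)
    return n_blocks * (attention + feed_forward)

small = stack_params(12, 768)      # assumed width for a 12-block model
large = stack_params(96, 12288)    # assumed width for a 96-block model
attn, ffn = block_params(768)

print(f"{small:,}")                  # 84,934,656
print(f"{large:,}")                  # 173,946,175,488
print(round(ffn / (attn + ffn), 3))  # 0.667
```

Under these assumptions a 12-block stack holds about 85 million weights and a 96-block stack about 174 billion, with two thirds of them sitting in the feed-forward layers.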

We have built the machine. What we have not done is make it do anything — turn this stack of refined vectors back into the one thing an LLM actually produces: the next word. In Part 8 we wire the Transformer into GPT — how it reads only leftward (so it cannot peek at the answer), and how the vector at the end becomes a probability over the whole vocabulary, looping back to the next-token prediction from Part 1. The pieces are about to become a working language model.
