AI/LLM Mastery · Part 8 of 20 — we wire the Transformer stack into a working GPT: decoder-only, a "no peeking" mask so it can only look left, and a final step that turns a vector back into the next-word probabilities we opened the series with. The black box becomes glass.
A machine that does not speak yet
Part 7 left us holding a beautiful machine that does nothing useful yet. We have a tall stack of Transformer blocks that takes word vectors in the bottom and hands back refined word vectors out the top — same shape, richer meaning. But an LLM is not supposed to output vectors; it is supposed to output the next word. This part closes that gap, and in doing so it closes the entire loop back to Part 1, where we said an LLM is just a very good next-token predictor. Today we finish building exactly that.
The jump from “Transformer” to “GPT” needs only three additions: keep the right half of the architecture, add a rule that stops the model peeking at the future, and add a final step that converts the top vector into a probability for every word in the vocabulary. Three pieces, and the autocomplete from Part 1 is fully assembled.
Step 1: keep only the decoder
First, GPT does not use the whole original Transformer.
The 2017 Transformer was built for translation, so it had two stacks: an encoder that reads a full input sentence, and a decoder that writes the output one word at a time. GPT (from “Generative Pre-trained Transformer,” Radford and colleagues, 2018) does not translate between two languages; it just continues a sequence. For that you only need the writing half, so GPT keeps a stack of decoder blocks and drops the encoder entirely. With no encoder to consult, each block also loses the cross-attention layer that used to read the encoder's output, leaving just the self-attention and feed-forward layers from Part 7. This is called a decoder-only model, and it is the design behind GPT and most modern LLMs: simpler, and it scales superbly.
Step 2: a word may only look left
Second, and most important: GPT must never look at the future.
Because GPT generates left to right, when it is predicting the next word, the words after it do not exist yet. And during training, where the future words are sitting right there in the example, letting a word peek at them would be cheating — it could just copy the answer it is meant to predict, and learn nothing. So we impose a rule: each word may attend only to words at or before its own position, never to the right. Earlier words shape later ones; never the reverse. This is causal (or masked) self-attention — “causal” because, like cause and effect, only the past can influence the present.
Step 2, in maths: the triangular mask
How do you forbid looking right, in actual maths? With a wonderfully cheap trick on the attention scores from Part 6.
Recall the score grid: row i, column j is how much word i attends to word j. The forbidden “look right” cells are the ones above the diagonal (column later than row). Just before the softmax, we set every one of those future cells to minus infinity. Softmax does the rest: it exponentiates each score, and e^(−∞) = 0, so every future cell gets exactly zero attention weight while the remaining cells in each row still sum to 1. The weight grid becomes lower-triangular, and the model cannot see ahead.
    scores = Q Kᵀ / √d                    # from Part 6
    scores[i][j] = -∞  for all j > i      # blank out the future (upper triangle)
    weights = softmax(scores)             # -∞ -> 0, so no word sees ahead
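To make the mask concrete, here is a minimal sketch in plain Python. The 3x3 score grid is made up for illustration, and the helper names `softmax` and `causal_attention_weights` are our own, not taken from any library; a real model would do the same thing with tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]   # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Set every cell with j > i to -inf, then softmax each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else -math.inf for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Made-up score grid: row i is the word doing the looking, column j the word looked at.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first row comes out as [1.0, 0.0, 0.0]: the first word has nothing to its left, so all of its attention lands on itself. Every cell above the diagonal is exactly zero, however large its original score was.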
Step 3: the output head — vector to next word
Third addition: turn the stack’s output vector back into a word. This is the output head.
After the block stack, every position holds a refined vector. To predict the next word we take the one at the final position, a rich summary of everything read so far. We multiply it by an unembedding matrix (the embedding step from Part 3, run in reverse) to get one score, called a logit, for every single token in the vocabulary: tens of thousands of numbers, one per possible next word. Then we apply softmax (Part 6) over those logits to get a probability for every possible next token, all summing to 1.
And look what comes out for “The cat sat on the ___”: mat 41%, floor 19%, sofa 12%… — the exact next-token distribution we opened Part 1 with. We have now built, end to end, the machine that produces it.
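The output head can be sketched in a few lines of plain Python. Everything here is a toy assumption: a four-word vocabulary, a three-number final vector `h`, and a made-up unembedding matrix `W_unembed`. A trained model would have learned weights and a vocabulary of tens of thousands of tokens, but the arithmetic is the same.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_head(h, W_unembed):
    """h: final-position vector of length d. W_unembed: d rows, one column per token."""
    vocab_size = len(W_unembed[0])
    logits = [
        sum(h[k] * W_unembed[k][v] for k in range(len(h)))   # one logit per token
        for v in range(vocab_size)
    ]
    return softmax(logits)

vocab = ["mat", "floor", "sofa", "moon"]   # toy vocabulary (assumption)
h = [1.0, -0.5, 2.0]                       # made-up final-position vector
W_unembed = [                              # made-up weights, not learned
    [0.9,  0.4, 0.1, -0.3],
    [0.2, -0.1, 0.3,  0.5],
    [0.6,  0.5, 0.4, -0.2],
]
probs = output_head(h, W_unembed)
for word, p in sorted(zip(vocab, probs), key=lambda wp: -wp[1]):
    print(f"{word}: {p:.1%}")
```

With these made-up numbers "mat" receives the largest logit and therefore the largest probability; the exact percentages mean nothing until the weights have been trained.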
Generating text: the autoregressive loop
With a distribution over the next word, generating text is a simple loop.
Pick a token from the distribution (often a likely one — exactly how you pick is Part 14), append it to the text, then feed the whole, longer sequence back into the model and predict again. Repeat, token by token. Because the model consumes its own previous outputs as input, this is called autoregressive generation. It is precisely the loop we sketched in Part 1, now standing on a fully-built engine:
    context = "The cat sat on the"
    while not done:
        probs = gpt(context)            # full forward pass -> next-token probabilities
        next_tok = pick(probs)          # choose a token (Part 14)
        context = context + next_tok    # append it, then loop
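Here is a runnable version of that loop, as a sketch only. `toy_gpt` is a hypothetical stand-in that looks up a hand-written table using just the last token, whereas a real GPT runs the full stack over the whole context. `pick` is a simple greedy choice, one of several options, and the `<end>` token and `max_new_tokens` limit are assumptions added so the loop knows when to stop.

```python
# Hypothetical stand-in for the model: hand-written next-token probabilities
# keyed on the last token only. A real GPT conditions on the whole context.
TOY_TABLE = {
    "the": {"mat": 0.6, "cat": 0.4},
    "cat": {"sat": 0.9, "<end>": 0.1},
    "sat": {"on": 1.0},
    "on":  {"the": 1.0},
    "mat": {"<end>": 1.0},
}

def toy_gpt(context):
    """Return next-token probabilities for a list of tokens."""
    return TOY_TABLE.get(context[-1], {"<end>": 1.0})

def pick(probs):
    """Greedy choice: take the single most likely token."""
    return max(probs, key=probs.get)

def generate(context, max_new_tokens=10):
    """Autoregressive loop: predict, append, feed the longer context back in."""
    context = list(context)
    for _ in range(max_new_tokens):
        probs = toy_gpt(context)
        next_tok = pick(probs)
        if next_tok == "<end>":
            break
        context.append(next_tok)
    return context

print(" ".join(generate(["the", "cat", "sat", "on", "the"])))
# prints: the cat sat on the mat
```

Swapping `toy_gpt` for a real forward pass and `pick` for a sampling rule changes the quality of the text, not the shape of the loop.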
Training: every position predicts at once
One more thing the causal mask quietly unlocks: astonishingly efficient training.
Take any sentence from the data — no human labels needed, because the “answer” for each position is simply the next word (the self-supervision from Part 1). Thanks to the mask, position 1 sees only word 1 and predicts word 2; position 2 sees words 1–2 and predicts word 3; and so on — all positions at once, none able to peek ahead. So a six-word sentence yields five next-word predictions in a single parallel forward pass; a long document, thousands. Compare each prediction to the true next word with a cross-entropy loss (a standard “how wrong was this probability guess” measure), then run the forward/loss/backward/update loop from Part 4. Do that over trillions of words and the random dials slowly become GPT.
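The bookkeeping behind "every position predicts at once" can be sketched as follows. The helper names `next_token_pairs` and `cross_entropy` are ours, and the probabilities fed to the loss are made up. The code lists the five prefix/target pairs one by one for readability; in a real model the causal mask lets all five be computed in a single parallel pass.

```python
import math

def next_token_pairs(tokens):
    """Pair what each position can see (its prefix) with the token it must predict."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

def cross_entropy(probs, target):
    """Loss for one prediction: minus the log of the probability given to the true token."""
    return -math.log(probs[target])

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(sentence)
for prefix, target in pairs:
    print(prefix, "->", target)

# Made-up prediction for the last position; the true next word is "mat".
probs = {"mat": 0.41, "floor": 0.19, "sofa": 0.12, "other": 0.28}
loss = cross_entropy(probs, "mat")
print(f"loss at the last position: {loss:.3f}")
```

The loss shrinks toward zero as the probability on the true word approaches 1, and grows without bound as that probability approaches zero, which is what pushes the weights toward better guesses.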
The whole GPT, in one pass
Let us assemble the complete GPT in one picture — every box is something you have already built.
    text -> tokens -> IDs                        # Part 2
      -> embeddings + positional encoding       # Parts 3, 7
      -> N decoder blocks                       # Part 7: causal multi-head attn + feed-forward
      -> final LayerNorm -> unembed -> logits   # this part
      -> softmax -> next-token probabilities    # Part 6 + Part 1
Your prompt enters as tokens at the bottom, flows up through the stack, and a probability over the next word comes out the top — then the loop runs again. There is no remaining mystery box: every stage is a piece you understand.
Eight parts in: you have built a GPT
Take a breath and look at what you have done across eight parts.
Text became tokens (Part 2), tokens became embeddings in a map of meaning (Part 3), we built the neural network and its training loop from a single neuron (Part 4), saw why sequences need attention (Part 5) and worked its Q/K/V maths by hand (Part 6), wrapped it into the Transformer block and stacked it deep (Part 7), and today added decoder-only structure, causal masking and a next-token head to make a working GPT. The “well-read autocomplete” from Part 1 is no longer a metaphor — it is a machine you can describe gear by gear.
But there is one enormous caveat we have glossed over the whole time: the dials still start out random. Everything we have built is the architecture — an empty engine. A freshly-built GPT produces pure gibberish. What turns those random weights into something that knows grammar, facts and reasoning is pretraining: running that training loop over a substantial slice of the internet. That is Part 9 — the data, the objective, the scale, and what the loss curve looks like as a model learns to read the world.