Inside Self-Attention

Part 6 of the AI/LLM mastery series — the math of attention, worked by hand. Query, key and value vectors; the dot-product score; scaling by √d; softmax into attention weights; and the weighted sum of values that gives a word its context — all carried through one running example, "the river bank", end to end, plus the famous softmax(QKᵀ/√d)V in code.

AI/LLM Mastery · Part 6 of 20 — the magic word "attention" becomes plain machinery. Query, key, value; a dot product; a divide; a softmax; a weighted sum. We work the whole thing end to end on a real example, with numbers.

From a magic word to a calculation

Part 5 left us with a powerful idea and a deliberate gap. The idea: every word looks at every other word and decides which ones matter, then rebuilds itself as a blend of the relevant ones. The gap: when a word “decides which ones matter,” what is the actual calculation? We promised three vectors per word — a query, a key and a value — and today we open the box and turn that magic word into arithmetic you can do by hand.

This is a math-forward part, like Part 4, because the only way to truly own self-attention is to compute one. So we will use a single tiny example all the way through — the phrase “the river bank,” focusing on the word “bank” — and carry the same little numbers from the first step to the last. Nothing here needs maths you do not already have; the only two operations are “multiply and add” (which you met in the neuron) and a normaliser called softmax, which we will define when we reach it.

Step 1: three vectors — query, key, value

Self-attention’s first move is to make three new vectors from each word’s embedding (the meaning-vector from Part 3).

Each word’s embedding is multiplied by three different learned weight matrices — called Wq, Wk and Wv — producing a Query, a Key and a Value. (A weight matrix is just a grid of the same kind of dials from Part 4; multiplying a vector by it produces a new vector. The network learns these grids during training.) The three roles are best felt through a library search:

Query: what this word is looking for (“what’s relevant to me?”), like the search you type.

Key: how each word advertises itself, like the label on a library shelf; a query matches against keys.

Value: the actual content a word hands over if it gets attended to, like the book you pull off the shelf.

Every word produces all three, because every word both asks questions (query) and answers other words’ questions (key + value).
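A minimal sketch of Step 1 in plain Python may help make "multiply the embedding by three learned matrices" concrete. The embedding `x_bank` and the matrices `Wq`, `Wk` and `Wv` below are made-up numbers (the article does not give them), chosen only so that the query comes out as bank's Query [2, 1] from the running example. A real model learns these matrices during training.

```python
def matvec(x, W):
    """Multiply row vector x by matrix W (a list of rows): out[j] = sum_i x[i] * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Hypothetical 2-dimensional embedding for "bank" (assumption, not from the article).
x_bank = [1.0, 1.0]

# Hypothetical learned weight matrices (assumptions, not from the article).
Wq = [[1.0, 0.0], [1.0, 1.0]]
Wk = [[0.5, 0.0], [0.5, 0.0]]
Wv = [[1.0, 0.0], [0.0, 1.0]]

q_bank = matvec(x_bank, Wq)  # what "bank" is looking for
k_bank = matvec(x_bank, Wk)  # how "bank" advertises itself
v_bank = matvec(x_bank, Wv)  # what "bank" hands over when attended to

print(q_bank, k_bank, v_bank)  # [2.0, 1.0] [1.0, 0.0] [1.0, 1.0]
```

The same three matrices are applied to every word's embedding, so one set of learned dials produces the queries, keys and values for the whole sentence.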

Step 2: score with a dot product

To find how relevant word B is to word A, we compare A’s Query with B’s Key using a dot product. If you have not met it, the dot product is simple:

A dot product multiplies two vectors position-by-position and adds the results into a single number. Take bank’s Query [2, 1] and river’s Key [3, 2]: that is 2×3 + 1×2 = 6 + 2 = 8. That is a big number, because the two vectors point in a similar direction, which is exactly what “these two words are relevant to each other” looks like in vector form (recall the meaning-map from Part 3: aligned vectors mean related things). Do the same against every word’s Key and you get a row of raw scores; here, “the” = 1, “river” = 8, “bank” = 2. River wins, as it should.
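The scoring step can be sketched in a few lines of plain Python. Only bank's Query [2, 1] and river's Key [3, 2] come from the article. The Keys for "the" and "bank" are assumptions, picked so the scores match the 1 and 2 quoted above.

```python
def dot(a, b):
    """Multiply position-by-position, then add everything up."""
    return sum(x * y for x, y in zip(a, b))

q_bank = [2, 1]  # bank's Query, from the article

keys = {
    "the":   [0, 1],  # assumed Key, chosen to give a score of 1
    "river": [3, 2],  # river's Key, from the article
    "bank":  [1, 0],  # assumed Key, chosen to give a score of 2
}

# One raw score per word: how well bank's Query matches that word's Key.
scores = {word: dot(q_bank, k) for word, k in keys.items()}
print(scores)  # {'the': 1, 'river': 8, 'bank': 2}
```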

Step 3: scale by √d

One small but important adjustment before we go on — it is literally why the technique is named scaled dot-product attention.

Dot products grow as the vectors gain dimensions (d): more positions means more terms in the sum. If we let raw scores get huge, the next step (softmax) would dump almost all attention onto a single word and the gradients from Part 4 would shrink to nothing, so training stalls. The fix is trivial: divide every score by √d, the square root of the number of dimensions in each key vector. Our vectors have d = 2, so that is a divide by √2 ≈ 1.4, turning [1, 8, 2] into about [0.7, 5.7, 1.4]. Same ranking, gentler numbers, healthy training.
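As a quick check of the arithmetic, here is the scaling step in plain Python. It assumes d = 2, the number of dimensions in the two-number vectors used throughout the running example.

```python
import math

scores = [1, 8, 2]  # raw scores for "the", "river", "bank"
d = 2               # dimensions per key vector in the running example

# Divide every score by the square root of d.
scaled = [s / math.sqrt(d) for s in scores]
print([round(s, 1) for s in scaled])  # [0.7, 5.7, 1.4]
```

The order of the scores is unchanged; only their spread shrinks, which keeps the softmax in the next step from saturating.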

Step 4: softmax into attention weights

Now turn those scaled scores into actual attention weights with softmax.

Softmax is a small recipe that converts any list of numbers into a set of positive weights that add up to exactly 1. It does two things: raise e (about 2.718) to the power of each score — which makes everything positive and stretches the gaps so the biggest score pulls ahead — then divide each by the total so they sum to one.

softmax, in one line
# turn scores into weights that sum to 1
weight_i = e^(score_i) / sum_over_all_j( e^(score_j) )

Applied to [0.7, 5.7, 1.4] it gives roughly [0.01, 0.98, 0.01]. Read that as plain English: “bank” is paying 98% of its attention to “river” and almost none elsewhere. Because the weights always sum to 1, attention is literally a way of splitting 100% of a word’s focus across the sentence.
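The softmax recipe fits in a short function. This is a sketch rather than a library implementation; subtracting the largest score before exponentiating is a common numerical-stability trick that is not mentioned in the article and does not change the resulting weights.

```python
import math

def softmax(scores):
    """Turn any list of numbers into positive weights that sum to 1."""
    m = max(scores)                            # stability trick: shift so the largest score is 0
    exps = [math.exp(s - m) for s in scores]   # raise e to each (shifted) score
    total = sum(exps)
    return [e / total for e in exps]           # divide each by the total

weights = softmax([0.7, 5.7, 1.4])
print([round(w, 2) for w in weights])  # [0.01, 0.98, 0.01]
```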

Step 5: blend the values

Final step. We have the weights; now we use them to blend the Values.

Each word offers its Value vector, and the output is the weighted sum: output = 0.01×V(the) + 0.98×V(river) + 0.01×V(bank). Since river holds 98% of the weight, the result is almost entirely river’s Value — about [0.99, 0.02]. That output vector becomes bank’s new representation: “bank” has absorbed “river” and now means river-bank. This is exactly the contextual embedding we promised in Part 3, no longer a promise but a computed result. And note the elegance: the query/key step decides how much to listen to each word, while the value step decides what you actually get from it — two separate jobs, which is why we needed three vectors and not one.
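The weighted sum can be sketched as follows. The article does not list the three Value vectors, so the ones below are assumptions, chosen so the blend reproduces the quoted output of about [0.99, 0.02].

```python
weights = [0.01, 0.98, 0.01]  # attention weights for "the", "river", "bank"

# Assumed Value vectors (not given in the article).
values = [
    [0.0, 1.0],  # V(the)
    [1.0, 0.0],  # V(river)
    [1.0, 1.0],  # V(bank)
]

# Weighted sum, one output dimension at a time.
output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]
print([round(x, 2) for x in output])  # [0.99, 0.02]
```

Because river carries 98% of the weight, the output sits almost on top of river's Value [1, 0].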

The whole thing in one equation

Steps 2 to 5 (score, scale, softmax, blend), applied to the Q, K and V built in Step 1, collapse into a single line, the most quoted equation in modern AI (from Vaswani et al., 2017):

scaled dot-product attention
Attention(Q, K, V) = softmax( Q Kᵀ / √d ) V

# Q, K, V : matrices stacking the query/key/value of EVERY word
# Q Kᵀ    : all the dot-product scores at once
# / √d   : the scaling step
# softmax : turn scores into weights (each row sums to 1)
# ... V   : blend the values -> one new vector per word

The crucial practical point: because Q, K and V stack every word’s vectors into matrices, this one expression computes the new vector for every word at the same time — it is just matrix multiplication, which hardware does massively in parallel. That is the speed win over RNNs from Part 5, made concrete. And it is called self-attention because Q, K and V are all derived from the same sentence: the sentence attends to itself.

In code, stripped to its essence, the whole mechanism is only a few lines — which is remarkable for something this powerful:

self-attention, in essence
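import numpy as np

def attention(Q, K, V, d):
    scores  = Q @ K.T / np.sqrt(d)                                  # steps 2-3: score (dot products) + scale
    exps    = np.exp(scores - scores.max(axis=-1, keepdims=True))   # step 4: softmax, shifted for stability
    weights = exps / exps.sum(axis=-1, keepdims=True)               #         each row now sums to 1
    return weights @ V                                              # step 5: blend the values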
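To watch the matrix form handle the whole phrase at once, here is a dependency-free sketch using plain Python lists in place of an array library. The helper names (`matmul`, `transpose`, `softmax_rows`, `attention_all_words`) are illustrative, and most of the numbers are assumptions: only bank's Query [2, 1] and river's Key [3, 2] come from the article. The remaining rows were picked so that bank's row reproduces the running example.

```python
import math

def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    """Apply softmax to each row independently."""
    out = []
    for row in M:
        m = max(row)                          # shift for numerical stability
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_all_words(Q, K, V):
    d = len(K[0])                                            # dimensions per key
    scores = matmul(Q, transpose(K))                         # every query against every key
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)                           # each row sums to 1
    return matmul(weights, V), weights                       # one new vector per word

# Rows are "the", "river", "bank". Values marked assumed are not from the article.
Q = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]  # first two rows assumed
K = [[0.0, 1.0], [3.0, 2.0], [1.0, 0.0]]  # first and last rows assumed
V = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # all rows assumed

out, weights = attention_all_words(Q, K, V)
print([round(w, 2) for w in weights[2]])  # bank's weights: [0.01, 0.98, 0.01]
print([round(x, 2) for x in out[2]])      # bank's new vector: [0.99, 0.02]
```

One call returns three output rows, one per word, which is the "every word at the same time" property in miniature.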

The full pass — and what is still missing

Let us run the complete pass for “bank” one more time, top to bottom, so the pipeline is burned in.

Query [2, 1] → scores [1, 8, 2] → scaled [0.7, 5.7, 1.4] → softmax [0.01, 0.98, 0.01] → output ≈ [0.99, 0.02], which is essentially river’s Value. Five steps turned an ambiguous “bank” into a context-aware river-bank — and the model does this for every word in the sentence, all at once.

What you can now do: read the line softmax(QKᵀ/√d)V and narrate every symbol — what it is, why it is there, and what number comes out. That equation is the engine room of every LLM you have ever used.

But one self-attention pass is not the whole Transformer. Real models run many attention calculations side by side (each learning to look for a different kind of relationship), stack the whole thing dozens of layers deep, and — because attention on its own is blind to word order — add position information back in. Those pieces, the ones that turn a single attention step into a full Transformer block, are Part 7.
