Giving Words Meaning

Part 3 of the AI/LLM mastery series, for complete beginners. A token ID is just a name tag — this is how a model turns each word into an "embedding": a list of numbers that places it in a map of meaning, where "cat" and "dog" sit close and "cat" and "Tuesday" sit far apart. Includes the famous king − man + woman ≈ queen, why it works, and the jump from static to contextual embeddings.

AI/LLM Mastery · Part 3 of 20 — a token ID is just a name tag with no meaning. This is how a model turns each word into a list of numbers that genuinely captures what it means — so that “cat” and “dog” end up close, and “cat” and “Tuesday” far apart.

A number that means nothing yet

At the end of Part 2 we had turned your text into a row of numbers — token IDs like [464, 3797, 3332, 13]. That felt like progress, and it was, but we flagged a catch: those numbers are not meaning. The ID 3797 for “·cat” is simply that token’s slot in a list. It says nothing about cats being furry, being animals, or having anything in common with “dog.” It is a locker number, not an idea.

That is a real problem, because the whole point of a language model is to work with meaning. This article is about the elegant fix — one of the most beautiful ideas in the whole field, and one you can fully picture without any maths. By the end you will understand why a model can tell that “king” relates to “queen” the way “man” relates to “woman,” even though nobody ever told it so.

Why a plain ID cannot work

Let us be precise about why a plain ID is useless for meaning, because the fix falls out directly from the problem.

The vocabulary list’s order is essentially arbitrary — often just the order tokens happened to be learned. So the token sitting at 3798, right next to “cat,” might be “Tuesday.” Being one number away means nothing. You cannot ask “is 464 closer in meaning to 3797 than to 3332?” — the question is gibberish, because the IDs carry no relationships at all. A bigger ID is not bigger or better; it is just a different label. What we actually want is the opposite of arbitrary: numbers arranged so that closeness equals similar meaning.
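To make the problem concrete, here is a minimal sketch that treats ID gaps as if they meant something. The `toy_vocab` numbers are illustrative assumptions, not the IDs of any real tokenizer.

```python
# Toy vocabulary: these IDs are made up for illustration only.
toy_vocab = {"cat": 3797, "Tuesday": 3798, "dog": 9703}

def id_gap(word_a, word_b):
    """Absolute difference between two token IDs."""
    return abs(toy_vocab[word_a] - toy_vocab[word_b])

print(id_gap("cat", "Tuesday"))  # 1
print(id_gap("cat", "dog"))      # 5906
```

Judged by ID alone, “cat” would be a near neighbour of “Tuesday” and a distant stranger to “dog,” which is exactly backwards.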

The fix: describe a word with a list of numbers

Here is the move. The reason one ID fails is that it is a single number — one dial — and meaning is far too rich for one dial. So we stop describing a word with one number and describe it with a whole list of numbers instead.

That list of numbers is called a vector. Do not let the word scare you: a vector is just an ordered list of numbers, exactly like map coordinates (latitude, longitude) are a list of two numbers that pin down a place. We give “cat” a vector like [0.21, -0.73, 0.88, …]. Each position in the list is called a dimension, and you can loosely imagine each one capturing some aspect of the word (how “animal-like” it is, how “big,” and so on), though in reality the model invents its own hidden aspects that rarely map to neat human labels. This list of numbers is the word’s embedding: the model’s internal representation of it. Real embeddings are not a handful of numbers long but hundreds or even thousands, which gives them enormous room to encode subtle meaning.
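In code, the step from token ID to vector is commonly a plain table lookup: one row of numbers per ID. The sketch below assumes a tiny vocabulary and fills the table with random numbers, where a real model would hold learned values. `VOCAB_SIZE`, `EMBED_DIM`, and `embed` are names invented for this example.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

VOCAB_SIZE = 10  # assumption: real vocabularies hold tens of thousands of tokens
EMBED_DIM = 4    # assumption: real embeddings have hundreds or thousands of dimensions

# One row of EMBED_DIM numbers per token ID. Random here; learned in a real model.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Replace each token ID with its row from the table."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([4, 7, 2])
print(len(vectors), len(vectors[0]))  # 3 4
```

The ID is still just a locker number; what matters now is the list of numbers stored in that locker.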

The meaning map: close means similar

Now for the picture that makes it all click. If every word is a vector — a set of coordinates — then every word is a point in space. And here is the property the whole idea hangs on: similar words get placed near each other.

Think of it as a map of meaning. Animals (cat, dog, kitten, puppy) cluster in one neighbourhood. Vehicles (car, truck, bus) sit in another, far away. Days of the week form their own little group somewhere else. On this map, “cat” and “dog” are close because they mean similar things, while “cat” and “Tuesday” are far apart because they do not. The model can now do something it never could with IDs: measure meaning as distance. Close points mean similar things; distant points mean different things.

One honest catch, so the picture does not mislead you: real embedding space is not flat with two axes. It has hundreds of dimensions, which no human can visualise. Any 2D map of it is a flattened shadow of that high-dimensional space. But the core intuition survives the flattening: near means similar, far means different.

The big idea in one line: an embedding turns each word into coordinates in a “meaning space,” where distance between points is distance between meanings.
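Here is a small sketch of “measuring meaning as distance.” The three-number vectors are hand-picked for illustration, not learned, and the labels on their dimensions are an assumption made for readability. Two common measures are shown: straight-line (Euclidean) distance, and cosine similarity, which compares the directions of two vectors and is a frequent choice in practice.

```python
import math

# Hand-picked toy vectors (not learned): [animal-ness, vehicle-ness, time-ness]
toy_embeddings = {
    "cat":     [0.9, 0.1, 0.0],
    "dog":     [0.8, 0.2, 0.1],
    "Tuesday": [0.0, 0.1, 0.9],
}

def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, dog, tuesday = (toy_embeddings[w] for w in ("cat", "dog", "Tuesday"))
print(euclidean_distance(cat, dog) < euclidean_distance(cat, tuesday))  # True
print(cosine_similarity(cat, dog) > cosine_similarity(cat, tuesday))    # True
```

Both measures agree here: “cat” sits next to “dog” and a long way from “Tuesday.”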

Where the coordinates come from: context

A fair question at this point: who decides where each word goes on the map? Nobody places them by hand — that would be impossible for a vocabulary of tens of thousands of words. As with everything in Part 1, the positions are learned from data. The clue the model uses is beautifully simple: the company a word keeps.

Look at a sentence with a gap: “The ___ purred on my lap.” You instantly know only a few words fit — cat, kitten. Now “The ___ chased the ball” invites dog or puppy. Each blank is a kind of fingerprint of meaning: the set of words that can fill it. The insight is that words which keep appearing in the same kinds of contexts must mean similar things — so during training the model nudges their vectors closer together. Do this across millions of sentences and the whole map arranges itself, with no human ever defining a single word.
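The “fingerprint” idea can be sketched by simply collecting the neighbours of each word in a tiny made-up corpus. Note the swap: this counts shared contexts directly, whereas a trained model adjusts vectors gradually during training. The corpus, the `window` size, and the function names are all assumptions for illustration.

```python
from collections import defaultdict

# Tiny made-up corpus; real training uses millions of sentences.
corpus = [
    "the cat purred on my lap",
    "the kitten purred on my lap",
    "the dog chased the ball",
    "the puppy chased the ball",
    "the cat chased the ball",
]

def context_fingerprints(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            contexts[word].update(words[lo:i] + words[i + 1:hi])
    return contexts

def overlap(a, b, contexts):
    """Jaccard overlap of two words' context sets (1.0 = identical company)."""
    shared = contexts[a] & contexts[b]
    combined = contexts[a] | contexts[b]
    return len(shared) / len(combined)

contexts = context_fingerprints(corpus)
print(overlap("dog", "puppy", contexts))   # 1.0
print(overlap("cat", "kitten", contexts))  # 0.75
print(overlap("cat", "lap", contexts))     # 0.2
```

Words that keep the same company score high, and that signal is what training turns into closeness on the map.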

This is not a new idea: the linguist J. R. Firth wrote in 1957, “you shall know a word by the company it keeps.” What was new was making it work at scale. In 2013 a system called word2vec (Mikolov and colleagues at Google) learned exactly these word vectors from huge amounts of text and showed the results were uncannily good, which brings us to the result that made everyone pay attention.

Relationships become directions: king − man + woman ≈ queen

If meaning really is geometry, then relationships between meanings should be geometry too — and they are. This is the famous one.

Place “man,” “king,” “woman” and “queen” by their learned vectors and they form, roughly, a parallelogram. The arrow that takes you from “man” to “king” points in a consistent direction; loosely, it adds the idea of “royalty.” Take that same arrow and start it from “woman,” and you land close to “queen.” Written as maths, that is the now-legendary result:

meaning as arithmetic
# vectors can be added and subtracted like coordinates
king - man + woman ≈ queen

# the "royalty" direction and the "gender" direction
# emerged on their own — nobody programmed them.

Sit with how strange that is. The model was never told that “queen” is to “woman” as “king” is to “man.” That relationship simply fell out of the geometry once words were placed by their contexts. The match is approximate rather than exact (hence the ≈): the sum lands near “queen,” and in practice you pick the nearest word other than the three you started with. Even so, meaning relationships had quietly become directions in space, and that is striking evidence that these vectors capture something real about meaning, not just bookkeeping.
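The arithmetic can be tried on hand-built toy vectors. These are an assumption chosen so the directions are easy to see. In learned embeddings the same trick works only approximately, and the dimensions do not come with tidy labels like these.

```python
import math

# Hand-built toy vectors: [royalty, maleness, femaleness]. Real vectors are learned.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, table):
    """Return the word nearest to a - b + c, skipping the three input words."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(target, table[w]))

print(analogy("king", "man", "woman", vectors))  # queen
```

Subtracting “man” removes the male component, adding “woman” puts the female one in, and the royalty component rides along untouched.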

One word, two meanings: static vs contextual

Before we move on, one honest complication — because it is exactly what the next few parts of this series are built to solve.

The word “bank” means a riverside in “sat on the river bank,” and a place for money in “money in the bank.” A single fixed vector for “bank” is stuck averaging both meanings — blurry and often wrong. The early word2vec embeddings had exactly this limitation: one fixed vector per word, the same no matter the sentence. We call those static embeddings.

Modern language models do something smarter. They compute a fresh, contextual embedding for each word that depends on the whole sentence around it — so “bank” lands near “river” in one sentence and near “money” in another. How a model blends the surrounding words into each word’s vector is precisely the job of the Transformer’s “attention” mechanism — the engine we start taking apart in Parts 5 to 7. For now, just hold the upgrade in mind: from one fixed vector per word, to a fresh vector shaped by context.
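As a rough picture of that upgrade, the sketch below mixes each word's fixed vector with the average of its neighbours' vectors. This plain averaging is a deliberately crude stand-in chosen for illustration; it is not how attention works. The vectors and the `mix` weight are assumptions.

```python
# Hand-picked toy static vectors: [nature-ness, finance-ness]
static_vectors = {
    "bank":  [0.5, 0.5],  # one fixed vector, stuck between both senses
    "river": [0.9, 0.1],
    "money": [0.1, 0.9],
    "sat":   [0.4, 0.3],
    "on":    [0.3, 0.3],
    "in":    [0.3, 0.3],
    "the":   [0.3, 0.3],
}

def contextual_vector(words, index, mix=0.5):
    """Blend a word's static vector with the average of the other words' vectors."""
    own = static_vectors[words[index]]
    others = [static_vectors[w] for i, w in enumerate(words) if i != index]
    context = [sum(dim) / len(others) for dim in zip(*others)]
    return [(1 - mix) * o + mix * c for o, c in zip(own, context)]

def dot(a, b):
    """Dot product; a fair comparison here because river and money have equal length."""
    return sum(x * y for x, y in zip(a, b))

river_bank = contextual_vector("sat on the river bank".split(), 4)
money_bank = contextual_vector("money in the bank".split(), 3)

print(dot(river_bank, static_vectors["river"]) > dot(river_bank, static_vectors["money"]))  # True
print(dot(money_bank, static_vectors["money"]) > dot(money_bank, static_vectors["river"]))  # True
```

The same word now ends up with two different vectors, each pulled toward the sense its sentence suggests.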

Why this matters beyond LLMs: measuring similarity as distance between embeddings is also how semantic search, recommendations, and the “retrieval” in Part 17’s RAG all work — you embed everything, then find the nearest points. Embeddings are quietly everywhere.
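“Embed everything, then find the nearest points” fits in a few lines. In this sketch the document vectors and the query vector are written by hand as stand-ins for what an embedding model would produce; the titles and the `search` function are hypothetical.

```python
import math

# Hypothetical pre-computed document embeddings (hand-written, not from a real model).
documents = {
    "How to groom a long-haired cat": [0.9, 0.1, 0.1],
    "Choosing a puppy-safe garden":   [0.8, 0.2, 0.1],
    "Changing a truck tyre":          [0.1, 0.9, 0.1],
    "Bus timetable for Tuesday":      [0.1, 0.7, 0.6],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vector, index, top_k=2):
    """Return the top_k titles whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda title: cosine_similarity(query_vector, index[title]),
        reverse=True,
    )
    return ranked[:top_k]

query = [0.85, 0.1, 0.0]  # stands in for the embedding of a pet-care question
results = search(query, documents)
print(results)
```

No keywords are matched anywhere: the two pet-related titles rank first purely because their points sit closest to the query's point.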

Where this leaves us

Step back and see how far the picture has come. In Part 1 an LLM was a well-read autocomplete. In Part 2 your words became token IDs — bare numbers. Now, in Part 3, each of those tokens becomes a rich vector of meaning, a point on a map where distance is similarity and even relationships are directions. The model finally has something meaningful to compute with.

But notice the verb in that last sentence: compute. We keep saying the model “learns,” “nudges vectors,” and “blends context” — and we have been treating the machine that does all this, the neural network, as a black box full of little dials. It is time to open that box. In Part 4 we build a neural network from absolute zero — one dial at a time — so that when we reach the Transformer, there is no magic left in it, only machinery you have seen built with your own eyes.
