AI/LLM Mastery · Part 14 of 20 — the model hands you a probability for every next word, but it must commit to one. The knobs that control that choice — greedy, temperature, top-k, top-p — decide whether it sounds robotic, creative, or unhinged. Plus the KV cache that makes it all fast.
From a distribution to a single word
Everything since Part 8 has been about getting the model to produce a good probability distribution over the next token — mat 41%, floor 19%, sofa 12%, and so on — and Part 13 made that distribution well-aligned with what humans want. But there is a final, very practical step we have skated over: the model cannot output a distribution. It has to commit to one token, append it, and move on (the autoregressive loop). How it makes that pick — called decoding — turns out to matter enormously. The exact same model can sound robotic, brilliant, or completely unhinged depending purely on these settings.
This part is about the knobs you actually turn when you use an LLM — the ones sitting in every API call: temperature, top_p, top_k. None of them change the model’s weights or what it knows; they only change how it reads that probability distribution to pick each word. We will also meet the KV cache, the trick that makes generation fast enough to be usable at all. As always, no prior knowledge assumed.
The choice: greedy or sample
Start with the fork in the road.
At each step the model gives a probability to every token in its vocabulary. To produce text it must choose one. There are two fundamental strategies, and everything else is a refinement of them. Greedy: always take the single highest-probability token. Sampling: roll a weighted die — pick each token in proportion to its probability, so the 41% token comes up about 41% of the time. Greedy is safe and predictable; sampling is varied and lively. Each has a failure mode, which is why we need the tuning knobs.
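To make the fork concrete, here is a minimal sketch of both strategies on a toy distribution. The first three probabilities echo the example above; the remaining tokens and their values are made up so the total comes to 1, and the helper names `greedy_pick` and `sample_pick` are purely illustrative.

```python
import random

# Toy next-token distribution. "rug", "roof" and "bed" are invented filler
# so that the probabilities sum to 1.0.
probs = {"mat": 0.41, "floor": 0.19, "sofa": 0.12, "rug": 0.15, "roof": 0.08, "bed": 0.05}


def greedy_pick(probs):
    """Return the single most probable token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed only so the demo is repeatable
greedy_runs = [greedy_pick(probs) for _ in range(5)]
sampled_runs = [sample_pick(probs, rng) for _ in range(10_000)]
mat_share = sampled_runs.count("mat") / len(sampled_runs)

print(greedy_runs)          # the same token every time
print(round(mat_share, 2))  # should land close to 0.41
```

Greedy returns the same token on every call, while the sampled share of "mat" hovers around its probability, which is the weighted die in action.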
Greedy: always the top (and it loops)
Take greedy first, because its weakness motivates everything else.
Greedy decoding picks the most-likely token at every step. Its great virtue is that it is deterministic: the same prompt yields the same output every time (in practice, hosted services can still show tiny run-to-run differences from floating-point and batching effects), which is invaluable when you need reproducibility. But always choosing the safest next word makes the text bland, and worse, it frequently falls into loops: “the situation is very, very, very, very…” The most probable continuation of a repetitive phrase is often more of the same, so greedy digs itself into a rut. For open-ended writing, pure greedy is rarely what you want. We need controlled randomness.
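The rut is easy to reproduce on a toy scale. The sketch below uses a hypothetical hand-written bigram table (each token maps to made-up probabilities for the next one) in place of a real model. It is only meant to show the mechanism: once "very" is its own most likely successor, greedy never escapes.

```python
# Hypothetical bigram "model": current token -> probabilities of the next token.
# All numbers are invented for illustration.
bigram = {
    "the": {"situation": 0.6, "very": 0.4},
    "situation": {"is": 0.9, "very": 0.1},
    "is": {"very": 0.7, "bad": 0.3},
    "very": {"very": 0.5, "bad": 0.3, "serious": 0.2},
    "bad": {".": 1.0},
    "serious": {".": 1.0},
}


def greedy_generate(start, steps):
    """Extend `start` by always taking the most probable next token."""
    out = [start]
    for _ in range(steps):
        options = bigram.get(out[-1])
        if options is None:  # no known continuation, so stop
            break
        out.append(max(options, key=options.get))
    return out


text = greedy_generate("the", 8)
print(" ".join(text))  # the situation is very very very very very very
```

Note that "bad" and "serious" together hold half the probability after "very", yet greedy never picks either of them, because neither is individually the top choice.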
Sampling: varied, but the tail bites
So we sample — but raw sampling has its own problem.
Sampling picks tokens in proportion to their probability, so output becomes varied, natural, and creative — and the same prompt can now give a different answer each run. That is exactly why ChatGPT does not repeat itself verbatim. The danger lives in the long tail: the distribution has thousands of low-probability tokens, and occasionally the die lands on one, sending the text careening into nonsense. So the practical art of decoding is to tame that distribution before sampling — reshape it, or trim the tail. That is what the next three knobs do.
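A little arithmetic shows why the tail bites even when it looks tiny. The sketch below assumes, purely for illustration, that the tail holds the same total probability at every step and that steps are independent. Real distributions change from token to token, so treat the numbers as a rough feel rather than a measurement.

```python
def chance_of_tail_hit(tail_mass, num_tokens):
    """Probability that at least one of `num_tokens` independent draws lands in the tail."""
    return 1.0 - (1.0 - tail_mass) ** num_tokens


# tail_mass values are assumptions, not figures from any real model
for tail_mass in (0.01, 0.05):
    for num_tokens in (10, 100, 500):
        p = chance_of_tail_hit(tail_mass, num_tokens)
        print(f"tail={tail_mass:.2f} tokens={num_tokens:4d} -> {p:.3f}")
```

Under these assumptions, a tail holding just 5% of the mass is hit at least once in a 100-token reply with probability of roughly 0.99, so over a long answer a rare pick is close to guaranteed.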
Temperature: the creativity dial
The most important knob is temperature, and it is the closest thing an LLM has to a creativity dial.
probabilities = softmax( logits / T )   # T is the temperature

T -> 0  : sharpest -> always the top token (= greedy, deterministic)
T = 0.2 : sharp    -> focused, safe, reliable
T = 1.0 : the raw, unmodified distribution
T = 1.5 : flat     -> creative, surprising, risky
Before sampling, the logits (the raw scores behind the probabilities) are divided by a number T, then softmaxed. A low temperature makes the distribution sharper — the top token dominates, output is focused and near-deterministic (as T→0 it becomes greedy). A high temperature flattens it — unlikely tokens get a real chance, output turns creative and wild. So the rule of thumb is simple: low temperature for facts, code and anything that must be correct; higher temperature for brainstorming and creative writing.
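Here is a minimal sketch of that formula in plain Python. The logits are made-up scores for four tokens, and the function name `softmax_with_temperature` is illustrative. Rejecting T = 0 is a choice made for this sketch, since dividing by zero is undefined and that case is simply greedy.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T = 0 as greedy instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]  # invented raw scores for four tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as T rises: it takes nearly all the mass at T = 0.2 and noticeably less at T = 1.5, while the ranking of the tokens never changes.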
Top-k: keep only the top few
Temperature reshapes the whole distribution; the next two knobs instead trim it. First, top-k.
Top-k sampling keeps only the k most-likely tokens, discards the rest, renormalises, and samples from what remains. With k=40, the long tail is simply gone, so sampling can never pick anything outside the 40 most likely tokens. The limitation is that k is fixed: sometimes the cut-off is too tight (the model is genuinely uncertain and you have just thrown away many reasonable options), and sometimes too loose (the model is confident in one answer, yet you still keep 40 candidates). A constant k cannot adapt to how sure the model is, which is what the next knob fixes.
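The keep, renormalise, sample recipe can be sketched in a few lines. The distribution is invented, k = 3 is chosen only to keep the output readable, and `top_k_filter` is an illustrative name rather than any library's function.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise so they sum to 1."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Invented distribution; "xylophone" plays the role of the absurd tail token.
probs = {"mat": 0.41, "floor": 0.19, "sofa": 0.12, "rug": 0.15, "roof": 0.08, "xylophone": 0.05}

filtered = top_k_filter(probs, k=3)
rng = random.Random(0)  # fixed seed only so the demo is repeatable
choice = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]

print(filtered)  # only "mat", "floor" and "rug" survive, rescaled to sum to 1
print(choice)
```

After filtering, "xylophone" has probability zero, so no amount of bad luck can select it. The surviving three tokens keep their relative proportions.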
Top-p (nucleus): keep a share, not a count
Top-p, also called nucleus sampling, is the smarter, adaptive version.
Top-p keeps the smallest set of top tokens whose probabilities add up to at least p (say 0.9), drops the rest, and renormalises before sampling. You specify a probability share, not a fixed count, and that makes it adapt to the model’s confidence automatically. When the model is sure (one token already holds 90% of the mass), the nucleus is just one or two tokens. When it is uncertain (the mass is spread thin), the nucleus grows to include more options. Because it tracks the model’s own confidence, top-p is usually preferred over a fixed top-k (it was introduced by Holtzman et al., 2019). In practice you combine these knobs, typically a temperature plus a top_p, which is exactly what you see in an API call:
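Before that call, a minimal sketch of the nucleus filter itself may help. Both distributions are invented, `top_p_filter` is an illustrative name, and real libraries differ slightly in how they treat the token that crosses the threshold. This version keeps it.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # nucleus is complete once it holds a share of p
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Invented examples of a confident and an uncertain model
confident = {"Paris": 0.92, "Lyon": 0.04, "Rome": 0.03, "banana": 0.01}
uncertain = {"red": 0.25, "blue": 0.22, "green": 0.20, "black": 0.18, "white": 0.15}

print(len(top_p_filter(confident, 0.9)))  # 1
print(len(top_p_filter(uncertain, 0.9)))  # 5
```

The same p = 0.9 yields a one-token nucleus for the confident distribution and a five-token nucleus for the uncertain one, which is the adaptivity a fixed k lacks. With that picture in mind, a request that sets both a temperature and a top_p looks like this: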
# `client` stands in for whichever LLM SDK you use; parameter names vary by provider
response = client.generate(
    prompt="Write a tagline for a coffee shop",
    temperature=0.9,   # fairly creative
    top_p=0.95,        # nucleus: keep the top 95% of probability mass
    max_tokens=30,     # stop after at most 30 tokens
)
# for code or facts you'd instead use temperature ~ 0.0-0.3
The KV cache: why it is fast
One more piece — not a quality knob, but the reason generation is fast enough to use at all: the KV cache.
Generation is autoregressive (Part 8): to produce token N you feed the model all N−1 tokens so far. Done naively, every single new token would re-run attention over the entire sequence, recomputing the key and value vectors (Part 6) for every earlier token, even though those tokens have not changed. That is enormously wasteful. The fix: a KV cache that stores the keys and values of all past tokens. Each new token computes only its own query, key and value, then attends against the cache, with no recomputation of the past. This is why the first token is slow (“prefill”, where the model digests your whole prompt) while every token after it is fast (“decode”). The price is memory: the cache grows linearly with the length of the context (and with the number of layers), which is a major reason very long contexts are slow and expensive to run.
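The saving can be seen in a stripped-down sketch: a single attention head in a single layer, with random toy weights and random vectors standing in for token embeddings. Everything here (`W_Q`, `W_K`, `W_V`, `KVCache`, the dimension of 4) is an assumption made for illustration. A real transformer stacks many such layers and heads, but the bookkeeping idea is the same.

```python
import math
import random

DIM = 4
rng = random.Random(0)  # fixed seed only so the demo is repeatable


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]


W_Q, W_K, W_V = rand_matrix(), rand_matrix(), rand_matrix()  # toy projection weights


def project(matrix, vec):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values)) for i in range(DIM)]


def last_output_naive(tokens, counter):
    """No cache: recompute keys and values for every token seen so far."""
    keys = [project(W_K, t) for t in tokens]
    values = [project(W_V, t) for t in tokens]
    counter["kv_projections"] += len(tokens)
    return attend(project(W_Q, tokens[-1]), keys, values)


class KVCache:
    """Stores past keys and values so each step projects only the new token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token, counter):
        self.keys.append(project(W_K, token))
        self.values.append(project(W_V, token))
        counter["kv_projections"] += 1
        return attend(project(W_Q, token), self.keys, self.values)


tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]
naive_count = {"kv_projections": 0}
cached_count = {"kv_projections": 0}
cache = KVCache()
for n in range(1, len(tokens) + 1):
    naive_out = last_output_naive(tokens[:n], naive_count)
    cached_out = cache.step(tokens[n - 1], cached_count)
    # Both paths must agree: the cache changes the cost, not the result.
    assert all(abs(a - b) < 1e-9 for a, b in zip(naive_out, cached_out))

print(naive_count["kv_projections"], cached_count["kv_projections"])  # 21 6
```

For six tokens the naive path performs 1 + 2 + ... + 6 = 21 key/value projections against 6 with the cache, and the gap widens quadratically as the sequence grows. The cache's `keys` and `values` lists are also exactly the memory that grows with context length.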
The knobs — and the limit no knob can fix
The decoding knobs, in one view:
Greedy (or temperature = 0) is deterministic and focused — reach for it with code and facts. Temperature is your creativity dial. Top-k and top-p trim the tail before sampling, with top-p adapting to the model’s confidence and usually winning. Any sampling means the same prompt can give different answers, so set temperature to zero when you need repeatability. And the KV cache is a speed mechanism, not a quality one. Master these and you can dial the same model from a precise tool to a wild brainstorming partner.
But notice the hard boundary on all of this: every one of these knobs changes how the model writes, never what it knows. No temperature setting, no clever sampling, can stop a model from producing beautifully fluent, supremely confident text that is simply, factually wrong — and saying it with exactly the same certainty as the truth. Why these models make things up, what is actually happening when they do, and the other hard limits you can never tune away — that is hallucination, and it is Part 15.