An LLM on its own is a brain in a jar — it reads text and writes text, nothing more. Give it tools and a loop, though, and it turns into an agent: something that can calculate, search, run code, and act in the world. Here is how tool use and function calling actually work, the reason–act–observe loop behind every agent, and why the real challenge turns out to be reliability and safety, not raw capability.
From reading to doing
Retrieval let the model read facts it never memorised. But reading is not doing. Plenty of real tasks need action: work out an exact sum (which, as we have seen, it is bad at), run a snippet of code, look up something happening right now, query a database, or carry out a multi-step job from start to finish. A pure text-predictor cannot do any of that. This is about giving it the ability to act — the leap from a question-answerer to an agent.
We will build it up cleanly: why an LLM on its own is so limited, how tool use (function calling) works, the ReAct loop that chains tools together, what an agent actually is, and, crucially, why making agents reliable and safe is far harder than making them capable. As always, every term is defined as we go.
A brain in a jar
Start with what the model cannot do alone.
On its own, an LLM is a brain in a jar: it reads text and writes text, and that is the entirety of its powers. It cannot reliably do arithmetic (it predicts tokens, it does not calculate), it cannot see today’s news, run code, or affect anything in the world. Ask it “what is 1234 × 5678?” and it confidently produces a plausible but usually wrong number. The fix is not a bigger brain — it is giving the brain hands: tools it can reach for. A calculator for maths, a search engine for fresh facts, a code runner, APIs to act. An LLM plus tools is a fundamentally different system: one that can decide what to do and then actually do it.
Function calling: the model asks, your code does
The mechanism that gives it hands is called tool use, or function calling.
You give the model a set of tools it is allowed to use, each described by a name, what it does, and its parameters. Faced with “1234 × 5678?”, instead of guessing, the model emits a structured call rather than an answer:
You expose: calculator(expr) -> evaluates a maths expression
USER: What is 1234 x 5678?
MODEL: { "tool": "calculator", "args": { "expr": "1234*5678" } } # not an answer, a request
YOU: run calculator("1234*5678") -> 7006652 # YOUR code runs it
MODEL: "1234 x 5678 = 7,006,652." # reads result, answers
The point that trips everyone up: the model does not run the tool. It only decides to call it and produces the request. Your program, the harness around the model, actually executes the calculator, gets the real result, and feeds it back into the context. The model then reads that result and produces the final answer. Because a genuine calculator did the maths, the answer is correct: the model’s arithmetic weakness, neatly fixed. The LLM is the decider; the tools are the doers.
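To make the division of labour concrete, here is a minimal sketch of the harness side in Python. It assumes the model's request arrives as a JSON string in the shape shown above; the names `TOOLS` and `run_tool_call` are hypothetical, and the calculator is a deliberately small one that only handles `+`, `-`, `*` and `/`. Real function-calling APIs differ in their exact message formats.

```python
import ast
import json
import operator

# Arithmetic operators this illustrative calculator tool accepts.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expr: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expr, mode="eval").body)


# The tools the harness is willing to run, keyed by the name the model uses.
TOOLS = {"calculator": calculator}


def run_tool_call(model_output: str):
    """Parse the model's structured request and execute it in our own code."""
    call = json.loads(model_output)
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise KeyError(f"unknown tool: {call['tool']}")
    return tool(**call["args"])


# The model only produced this text; the harness does the actual work.
model_output = '{"tool": "calculator", "args": {"expr": "1234*5678"}}'
print(run_tool_call(model_output))  # 7006652
```

The result would then be appended to the conversation so the model can read it and write the final answer. Parsing the expression instead of passing it to `eval()` is a design choice: the string comes from the model, so the harness treats it as untrusted input.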
The toolbox
A “tool” is just any function you choose to let the model call. The common toolbox covers its weak spots and extends its reach.
To patch the weaknesses: a calculator for exact maths, web search for facts beyond the training cutoff, and a code runner to compute and transform data. To reach into the world: database queries, and external APIs to send an email, book a slot, fetch a live price, or post a message. Note that retrieval itself is just a tool: RAG becomes “search the docs,” one more callable function. You hand the model exactly the tools the task needs, no more.
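As a rough illustration of what "a tool is just a function you choose to expose" can look like, the sketch below declares a toolbox in Python. Everything here is hypothetical: the `Tool` dataclass, the `describe` helper, and the toy `DOCS` dictionary standing in for a real retrieval system. It only shows the idea of pairing each function with a name, a description, and its parameters.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    parameters: dict[str, str]  # parameter name -> short description
    func: Callable[..., str]


# A toy in-memory "document store" standing in for a real retrieval system.
DOCS = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}


def search_docs(query: str) -> str:
    """Naive keyword lookup: return documents whose key appears in the query."""
    hits = [text for key, text in DOCS.items() if key in query.lower()]
    return "\n".join(hits) if hits else "no matching documents"


# Retrieval exposed as one more callable tool; add others only as the task needs.
TOOLBOX = [
    Tool("search_docs", "Search the internal docs.", {"query": "search text"}, search_docs),
]


def describe(tools: list[Tool]) -> str:
    """Render the tool descriptions that would be shown to the model."""
    lines = []
    for tool in tools:
        params = ", ".join(f"{name}: {desc}" for name, desc in tool.parameters.items())
        lines.append(f"{tool.name}({params}) -> {tool.description}")
    return "\n".join(lines)


print(describe(TOOLBOX))
print(TOOLBOX[0].func(query="How long do refunds take?"))
```

Keeping the list short is the point: a tool that is not in `TOOLBOX` is one the model cannot ask for.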
ReAct: reason, act, observe, repeat
One tool call is rarely enough; real tasks need several, in sequence. The standard pattern for chaining them is ReAct — Reason + Act.
The model alternates between two modes. It Reasons about what to do next (the chain-of-thought reasoning we met earlier), then Acts by calling a tool. It reads the tool’s result — the Observation — folds it back into its reasoning, and loops:
Thought: I need today's weather in Tokyo. I should search.
Action: search("weather in Tokyo today")
Observe: "Tokyo: 18C, light rain"
Thought: Got it. Now I can answer the user.
Answer: "It's 18C and lightly raining in Tokyo right now."
Reason → Act → Observe → Reason, round and round, until the model has enough to answer (ReAct, Yao et al., 2022). The reasoning keeps it on track and stops it firing tools blindly; the acting gathers real information the model could never have known on its own.
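The loop itself is short enough to sketch in Python. This is an illustration under stated assumptions, not the implementation from the ReAct paper: `scripted_model` is a hypothetical stand-in that replays the Tokyo example instead of calling a real LLM, `fake_search` returns a canned result, and the step dictionaries are an invented format.

```python
def fake_search(query: str) -> str:
    """Stand-in for a real search tool; always returns a canned result."""
    return "Tokyo: 18C, light rain"


TOOLS = {"search": fake_search}


def scripted_model(transcript: list[str]) -> dict:
    """Hypothetical stand-in for an LLM call: picks the next step from the transcript."""
    if not any(line.startswith("Observe:") for line in transcript):
        return {
            "thought": "I need today's weather in Tokyo. I should search.",
            "action": "search",
            "input": "weather in Tokyo today",
        }
    return {
        "thought": "Got it. Now I can answer the user.",
        "answer": "It's 18C and lightly raining in Tokyo right now.",
    }


def react(question: str, model, tools, max_steps: int = 5):
    """Run Reason -> Act -> Observe until the model answers or the step cap is hit."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)  # Reason: the model decides what to do next
        transcript.append(f"Thought: {step['thought']}")
        if "answer" in step:
            transcript.append(f"Answer: {step['answer']}")
            return step["answer"], transcript
        observation = tools[step["action"]](step["input"])  # Act: our code runs the tool
        transcript.append(f"Action: {step['action']}({step['input']!r})")
        transcript.append(f"Observe: {observation}")  # Observe: result goes back in context
    raise RuntimeError("step limit reached without an answer")


answer, transcript = react("What's the weather in Tokyo?", scripted_model, TOOLS)
print("\n".join(transcript))
```

The transcript is the model's only memory of what has happened so far, which is why each observation is appended to it before the next call.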
An agent: LLM, tools, loop, goal
Put the loop, the tools and a goal together and you have an agent.
An agent is an LLM running in that ReAct loop, equipped with tools and pointed at a goal — and crucially, you give it a task, not step-by-step instructions. Say “research the top three EVs of 2024 and email me a summary.” The agent plans the steps itself: search the web, read the results, write a summary, call the email tool — executing each, observing the outcome, and deciding the next, without you intervening at every move. That autonomy is the whole point: an agent does not merely answer, it pursues a goal across many tool-using steps. That is the real meaning of the word.
The catch: errors compound
But that autonomy comes with a sharp catch, and it is the single biggest reason agents are hard.
Errors compound. Each step can go wrong, and across a chain the failures multiply. A step that succeeds 95% of the time sounds excellent, until you chain ten of them: if the steps fail independently, 0.95^10 ≈ 60%. Twenty steps and you are down near 36%. Long agent runs get fragile alarmingly fast. On top of that, every step is a full LLM call, so agents are far slower and more expensive than a single answer; a real task can take dozens of calls. And they can loop forever, get stuck, or take a wrong turn. So in practice you cap the number of steps, add retries and sanity checks, and keep tasks bounded. The challenge of agents is reliability, not capability.
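The arithmetic is easy to check for yourself. The Python sketch below assumes every step succeeds with the same probability and that steps fail independently, which is a simplification. The `with_retries` helper adds a second assumption: that a failed step is always detected and can be retried. Both function names are made up for this example.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that n independent steps all succeed."""
    return p_step ** n_steps


def with_retries(p_step: float, retries: int) -> float:
    """Per-step success rate if a detected failure can be retried `retries` times."""
    return 1 - (1 - p_step) ** (retries + 1)


for n in (1, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
# 1 0.95
# 10 0.599
# 20 0.358

# One retry per step lifts a 95% step to 99.75%, so a ten-step chain
# succeeds roughly 97.5% of the time under these assumptions.
print(round(chain_success(with_retries(0.95, 1), 10), 3))
```

This is why retries and sanity checks matter so much: small gains in per-step reliability are amplified across the chain just as errors are.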
An acting agent can be dangerous
And there is a deeper concern than reliability, one that should make you cautious.
A question-answerer can only be wrong. An agent that can act can be dangerous — it can delete files, send money, post publicly, or run arbitrary code. Combine that power with hallucination and prompt injection (both of which we come to next), and a single bad decision can cause real-world damage, not just a wrong sentence. So the discipline is non-negotiable: sandbox what the agent can touch, scope its permissions to the bare minimum, and require human approval for risky or irreversible actions — sending, deleting, paying, publishing.
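Two of those three safeguards, scoped permissions and human approval, can be sketched as a thin gate in the harness. The Python below is a minimal illustration with hypothetical names throughout (`ALLOWED_ACTIONS`, `RISKY_ACTIONS`, `guarded_call`, `console_approval`). It does not show sandboxing, which needs operating-system or container-level isolation, not application code.

```python
# Hypothetical policy for one agent run.
ALLOWED_ACTIONS = {"search", "send_email"}  # least authority: only what this task needs
RISKY_ACTIONS = {"send_email", "delete_file", "make_payment", "publish_post"}


class ActionDenied(Exception):
    """Raised when the harness refuses to execute a requested action."""


def guarded_call(action, func, args, approve):
    """Run func(**args) only if the action is permitted and, when risky, approved."""
    if action not in ALLOWED_ACTIONS:
        raise ActionDenied(f"{action} is outside this agent's permissions")
    if action in RISKY_ACTIONS and not approve(action, args):
        raise ActionDenied(f"{action} was not approved by a human")
    return func(**args)


def console_approval(action, args) -> bool:
    """One possible approver: ask a person at the terminal (not called below)."""
    reply = input(f"Allow {action} with {args}? [y/N] ")
    return reply.strip().lower() == "y"


def send_email(to: str, body: str) -> str:
    """Stand-in for a real email tool."""
    return f"sent to {to}"


# Demo with an approver that always refuses, so nothing risky runs.
try:
    guarded_call("send_email", send_email, {"to": "me@example.com", "body": "hi"},
                 approve=lambda action, args: False)
except ActionDenied as err:
    print(err)  # send_email was not approved by a human
```

The permission check comes first on purpose: an action outside the allow-list is refused even if a human would have said yes.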
The toolkit complete — but does it work?
Agents, in one view:
Tools give the brain hands — calculate, search, run code, act — via function calling, where the model emits a structured request and your code runs it. Chain that with Reason → Act → Observe and an agent is simply an LLM in that loop with tools and a goal, taking many steps on its own. The hard parts are not capability but reliability (errors compound; it is slow and costly) and safety (an acting agent can do real harm). Use least authority, sandbox actions, and keep a human on the risky steps.
That completes the toolkit: prompting, RAG, and now agents. But across all three a single question looms that we have kept deferring — how do you actually know it works? Your prompt, your RAG system, your agent: are they correct often enough to trust? And can an attacker turn them against you, slipping instructions into a document your agent reads? Measuring quality with evaluations, and defending these systems through red-teaming and prompt-injection awareness, is exactly where we head next.