Agentbrisk

Tool Calling in LLMs Explained: How AI Agents Use External Functions

April 23, 2026 · Editorial Team · 10 min read · explainerfundamentalstool-use

Most people who use AI tools every day have no idea what is actually happening when an agent browses a website, runs a code snippet, or queries a database on their behalf. The model is not doing those things directly. It is calling tools. Understanding how that works is one of the most useful mental models you can have if you build with or on top of large language models.

This guide covers tool calling from first principles: what it is, how JSON schema defines tools, how the model decides to use them, what structured outputs have to do with it, and where the whole mechanism shows up in real systems.

What tool calling actually is

A language model, at its core, generates tokens. It does not execute code. It does not talk to APIs. It does not read your filesystem. It predicts what text should come next given the text that came before.

Tool calling is the mechanism that bridges that limitation. Instead of generating free-form prose, the model is trained to recognize situations where calling an external function would be more useful than writing a direct answer, and to produce a structured description of that call. The surrounding infrastructure reads that structured output, executes the actual function, and feeds the result back to the model. The model then continues generating with the new information.

From the model's perspective, calling a tool means producing a specific chunk of structured text. From the system's perspective, that structured text is a set of instructions for running real code. The effect is that the model can reach outside itself to do things it could never do by generating tokens alone.

This is why tool calling is the dividing line between a chatbot and an agent. A chatbot generates answers. An agent generates actions. Tool calling is the mechanism that makes the second category real.

How the model learns which tools exist

Before a conversation begins, the developer passes the model a list of available tools. Each tool has three things: a name, a description, and a parameter schema defined using JSON Schema.

The description is written in plain language and tells the model what the tool does and when to use it. The schema tells the model what arguments the tool expects and what type each argument should be. Together they give the model enough context to decide whether to call the tool and, if so, exactly how to call it.

Here is a minimal example of what a tool definition looks like:

{
  "name": "search_web",
  "description": "Search the web for recent information on a topic. Use this when you need facts you may not have or when the user asks about recent events.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query to send"
      }
    },
    "required": ["query"]
  }
}

The model never sees the actual function implementation. It only sees the name, the description, and the parameter schema. Writing good descriptions is therefore one of the most impactful things a developer can do when building tool-using systems. A vague description leads to the model calling the wrong tool or not calling any tool when it should.

JSON Schema and why it matters

JSON Schema is the standard used to describe the shape of the arguments a tool accepts. It specifies data types, which fields are required, what values are valid, and how nested objects are structured. Most LLM providers use a subset of the full JSON Schema specification, but the core vocabulary is consistent.

The key types you will use are string, number, integer, boolean, array, and object. Nested objects follow the same pattern recursively. Enums let you restrict a field to a fixed set of allowed values, which is useful when you want to control the model's choices.

Here is a more complete example for a calendar scheduling tool:

{
  "name": "create_calendar_event",
  "description": "Create a new calendar event for the user.",
  "parameters": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "start_time": {
        "type": "string",
        "description": "ISO 8601 datetime string"
      },
      "duration_minutes": { "type": "integer", "minimum": 5 },
      "attendees": {
        "type": "array",
        "items": { "type": "string", "format": "email" }
      },
      "priority": {
        "type": "string",
        "enum": ["low", "normal", "high"]
      }
    },
    "required": ["title", "start_time", "duration_minutes"]
  }
}

The model uses this schema to construct a valid call. Because the schema is precise, the infrastructure can validate the model's output before executing anything. A model that hallucinates a field that does not exist in the schema will fail validation, and the system can catch that before any damage is done.

The tool calling lifecycle

When you send a message to a model that has tools available, the lifecycle runs like this:

  1. The model receives the conversation history plus the tool definitions.
  2. It generates a response. That response is either regular text, a tool call, or both.
  3. If the response contains a tool call, the infrastructure extracts the function name and arguments, validates them against the schema, and executes the function.
  4. The result is inserted back into the conversation as a new message with role "tool".
  5. The model receives the updated conversation and generates the next response.
  6. This continues until the model produces a final text answer with no more tool calls.

The model never waits. It generates a complete response that either contains a tool call or does not. The execution happens outside the model. The model only sees the result when execution is finished and the result is handed back as a new context window update.

One practical consequence: the model cannot "check" whether a tool call is going to work before making it. It makes its best prediction about what arguments to pass, and the system handles the outcome. Good error handling in the tool layer, and good instructions to the model about how to handle failures, matter a lot in production.

Parallel and sequential tool calls

Modern LLMs support calling multiple tools in a single generation. The model can decide that two lookups are independent and issue both calls simultaneously. The infrastructure runs them in parallel, collects both results, and returns them together. This significantly reduces latency in agentic workflows where multiple pieces of information are needed.

Sequential tool calling happens when the output of one call is needed to determine the arguments for the next. The model calls the first tool, waits for the result, then decides whether to call another tool and what to pass it. This is the pattern you see in agents that do research: search for sources, then fetch the content of specific sources, then synthesize.

LangGraph handles this distinction explicitly at the workflow level, letting you define which nodes run in parallel and which depend on upstream results. At the model level, the decision is made by the LLM itself based on whether it needs intermediate information or not.

Structured outputs vs tool calling

These two concepts are related but not identical, and the terminology gets conflated often enough that it is worth separating them clearly.

Tool calling is specifically about the model emitting a function invocation with named arguments. The model is saying "run this function with these parameters." The output is structured as part of the mechanism.

Structured outputs is a broader capability where you constrain the model's generation to follow a specific JSON schema regardless of whether a tool is being called. You might use structured outputs to get the model to return a parsed result in a fixed format, like a list of extracted entities from a document, without necessarily calling any external function.

In practice, many providers implement both on the same underlying mechanism. When you specify a response format using JSON Schema, the model uses constrained decoding or fine-tuning to guarantee the output matches the schema. The same basic approach, generate tokens that conform to a schema, underlies both.

The distinction matters for system design. If you want the model to return data in a shape you can reliably parse, use structured outputs. If you want the model to trigger side effects by calling real functions, use tool calling. Many systems use both at different stages of the same pipeline.

How Anthropic and OpenAI implement this differently

The mechanics of tool calling work the same way across providers at a conceptual level, but the API surface differs.

OpenAI uses the term "function calling" in older documentation and "tools" in newer versions. You pass tools as a list of objects with type: "function" and a function object containing the name, description, and parameters. The model can either call a tool or not, and you can force it to call a specific one or prevent it from calling any.

Anthropic uses the term "tool use" and follows the same general pattern, but with some differences in how tool results are formatted in the conversation and how the model signals it is done. Claude Code is built on Anthropic's tool use API and exposes its reasoning through an extended thinking mechanism that runs before tool calls, letting you see why the model chose a particular tool before it executes.

The Model Context Protocol takes this a step further by standardizing tool definitions at the infrastructure level rather than the API level. MCP servers publish their available tools using a discovery mechanism, and any MCP-compatible host can find and use those tools without the developer manually writing out tool definitions for each host separately. It is tool calling made portable.

What makes a good tool definition

Most problems in tool-using agents trace back to poor tool definitions rather than model failures. A few principles that hold up in practice:

Write descriptions like documentation, not like naming conventions. The model uses the description to decide when to call the tool. "search_db" tells it nothing useful. "Query the internal product database by SKU, category, or keyword. Use this when the user asks about inventory, pricing, or product details" gives it a real decision criterion.

Be explicit about what the tool cannot do. If a search tool only covers a specific date range, say so. The model will otherwise use it in situations where it will return nothing useful, and then have to recover.

Use enums wherever possible. If a parameter has a fixed set of valid values, enumerate them. This reduces the chance of the model inventing a value that looks plausible but breaks downstream logic.

Keep tools single-purpose. A tool that does five things based on a mode parameter is harder for the model to reason about than five separate tools. The model makes better decisions when each tool has a clear, narrow purpose.

Tool calling in multi-agent systems

In single-agent systems, tools are external: the model calls the web, a database, or a code execution environment. In multi-agent systems, the tools can be other agents.

One agent can expose a well-defined interface and be called by another agent exactly like an external function. The calling agent passes structured arguments. The called agent processes the task and returns a structured result. From the calling agent's perspective, it made a tool call. From the system's perspective, a whole separate agent ran and returned output.

This is how complex workflows get composed. A research agent can call a summarization agent as a tool. A planning agent can call execution agents for each step of a plan. The same tool calling mechanism, structured arguments, structured results, handles both scenarios.

Understanding this means the concepts in how AI agents work apply recursively. Each agent in a multi-agent system is itself running a loop, calling tools, observing results. Those tools can be real functions or other agents.

Where to go from here

Tool calling is the technical foundation of most of what makes modern AI agents useful. It is worth understanding at this level even if you are not writing model infrastructure, because it directly shapes how you design systems, write prompts, and debug problems when agents do the wrong thing.

If you want to see tool calling in action in a real system, the Claude Code agent exposes its tool calls transparently as it works through coding tasks. If you want to understand the protocol layer that standardizes tools across hosts and models, MCP is the place to start. And if you want to build agentic workflows where multiple agents and tools compose into larger systems, LangGraph is one of the most capable frameworks for orchestrating that kind of architecture.

The mechanism is simple in principle. The complexity lives in the details of schema design, error handling, and knowing when to call a tool versus when to just answer directly. Those are learnable skills, and getting them right is what separates agents that work reliably from agents that hallucinate confidently.

Search